High-Level Architecture
The high-level architecture for Sphincs, derived from the sphinx4 architecture, is relatively straightforward. As shown in the following figure, it consists of the Front End, the Decoder, a Knowledge Base, and the application.
[Figure: HA.jpg]
The Front End gathers, annotates, and processes the input data. It also extracts features from the input data for the Decoder to read. The annotations provided by the Front End include the beginning and end of a data segment. Operations performed by the Front End include pre-emphasis, noise cancellation, automatic gain control, endpointing, Fourier analysis, Mel spectrum filtering, and cepstral extraction.
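
As a rough illustration of one Front End stage, the sketch below applies a first-order pre-emphasis filter, y[n] = x[n] - a * x[n-1], and wraps the resulting frames in start and end annotations. The names `Signal`, `preemphasize`, and `front_end`, the 0.97 default coefficient, and the frame size are assumptions made for this example; they are not taken from the Sphincs or sphinx4 source.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class Signal:
    """Hypothetical annotation marking the start or end of a data segment."""
    kind: str  # "DATA_START" or "DATA_END"


def preemphasize(samples: List[float], coeff: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample passes through."""
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def front_end(samples: List[float], frame_size: int = 4,
              coeff: float = 0.97) -> Iterator[Union[Signal, List[float]]]:
    """Yield a start annotation, pre-emphasized frames, then an end annotation."""
    yield Signal("DATA_START")
    emphasized = preemphasize(samples, coeff)
    for i in range(0, len(emphasized), frame_size):
        yield emphasized[i:i + frame_size]
    yield Signal("DATA_END")
```

A real Front End would chain further stages (windowing, Fourier analysis, Mel filtering, cepstral extraction) after pre-emphasis; this sketch stops at framing to stay short.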

The Knowledge Base provides the information the Decoder needs to do its job. This information includes the acoustic model and the language model. The Knowledge Base can also receive feedback from the Decoder, which lets it modify itself dynamically based on successive search results. The modifications can include switching acoustic and/or language models, as well as updating parameters such as mean and variance transformations for the acoustic models.
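
The feedback loop described above might be sketched as follows. Everything here is hypothetical: the `KnowledgeBase` class, its fields, the additive mean shift, and the coverage-based rule for choosing a language model are illustrative assumptions, not the Sphincs or sphinx4 API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KnowledgeBase:
    # Hypothetical layout: named language models (word -> probability)
    # and one mean vector per acoustic unit.
    language_models: Dict[str, Dict[str, float]]
    acoustic_means: Dict[str, List[float]]
    active_lm: str

    def language_model(self) -> Dict[str, float]:
        """Return the currently active language model."""
        return self.language_models[self.active_lm]

    def shift_means(self, offset: List[float]) -> None:
        """Apply a simple additive mean transformation to every acoustic unit."""
        self.acoustic_means = {
            unit: [m + o for m, o in zip(mean, offset)]
            for unit, mean in self.acoustic_means.items()
        }

    def on_result(self, words: List[str]) -> None:
        """Feedback hook: switch to the model whose vocabulary covers
        the most words in the latest search result."""
        def coverage(name: str) -> int:
            return sum(w in self.language_models[name] for w in words)

        self.active_lm = max(self.language_models, key=coverage)
```

The Decoder would call `on_result` after each search; the application could call `shift_means` or set `active_lm` directly at any time.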

The Decoder performs the bulk of the work. It reads features from the Front End, combines them with data from the Knowledge Base and feedback from the application, and searches for the most likely sequences of words that a series of features could represent. The term "search space" describes the most likely sequences of words; the Decoder updates it dynamically during the decoding process.
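
To make the search concrete, here is a toy Viterbi-style search that combines per-frame acoustic scores with bigram language model scores. It is a deliberately simplified sketch: it assumes exactly one word per feature frame and strictly positive probabilities, and the function name `decode`, its inputs, and the `floor` value for unseen bigrams are all assumptions for illustration rather than the actual Sphincs search.

```python
import math
from typing import Dict, List, Tuple


def decode(frames: List[Dict[str, float]],
           bigram: Dict[Tuple[str, str], float],
           floor: float = 1e-6) -> List[str]:
    """Return the most likely word sequence, one word per frame.

    frames: acoustic probability of each candidate word, per frame.
    bigram: language model probability of (previous word, word).
    floor:  probability used for bigrams missing from the model.
    """
    if not frames:
        return []
    # best[w] = (log score of the best path ending in w, that path)
    best = {w: (math.log(p), [w]) for w, p in frames[0].items()}
    for frame in frames[1:]:
        new_best = {}
        for w, p in frame.items():
            # Extend every surviving path by w and keep the best one.
            score, path = max(
                (prev_score + math.log(bigram.get((pw, w), floor)), prev_path)
                for pw, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score + math.log(p), path + [w])
        best = new_best  # the pruned "search space" for the next frame
    return max(best.values())[1]
```

For example, with frames `[{"recognize": 0.6, "wreck": 0.4}, {"speech": 0.5, "beach": 0.5}]` and bigrams `{("recognize", "speech"): 0.9, ("wreck", "beach"): 0.5}`, the language model tips the otherwise tied second frame toward `["recognize", "speech"]`.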

Unlike many speech architectures, the sphinx4 and Sphincs architectures allow the application to control various features of the speech engine, which permits more sophisticated speech application development. As depicted in the previous figure, the application can receive events from the Front End and can also exert some control over the Front End. The control can be as simple as turning the audio input on or off, but may also include more sophisticated operations.

During the decoding process, the application may also receive events from the Decoder while the Decoder is working on a search. These events allow the application to monitor decoding progress and to affect the decoding process before it completes. The application can also update the Knowledge Base at any time.
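
One way to picture this event flow is a listener interface on the Decoder. The sketch below is an assumption-laden stand-in: the `Decoder` class, `add_listener`, and `request_stop` are hypothetical names, and the "search" merely walks a list of already hypothesized words so that the event and control paths are easy to see.

```python
from typing import Callable, List

PartialListener = Callable[[List[str]], None]


class Decoder:
    """Hypothetical decoder that reports partial results to listeners."""

    def __init__(self) -> None:
        self._listeners: List[PartialListener] = []
        self._stop_requested = False

    def add_listener(self, listener: PartialListener) -> None:
        self._listeners.append(listener)

    def request_stop(self) -> None:
        """Let the application end the search before it completes."""
        self._stop_requested = True

    def decode(self, hypotheses: List[str]) -> List[str]:
        self._stop_requested = False
        partial: List[str] = []
        for word in hypotheses:
            if self._stop_requested:
                break
            partial.append(word)
            # Notify the application of progress; pass a copy so
            # listeners cannot mutate the decoder's own state.
            for listener in self._listeners:
                listener(list(partial))
        return partial
```

An application could register a listener that logs each partial result and calls `request_stop()` once it has heard enough, for instance after a keyword appears.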

Last edited Oct 12, 2014 at 6:30 PM by DxN, version 8