Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications, such as document classification and statistical machine translation.

Hidden Markov models

Main article: Hidden Markov model
Modern general-purpose speech recognition systems are based on hidden Markov models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary or short-time stationary signal. Over a short time scale (for example, 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought of as a Markov model for many stochastic purposes.

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, a hidden Markov model outputs a sequence of n-dimensional real-valued vectors (where n is a small integer, such as 10), emitting one of them every 10 milliseconds. The vectors consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, and then keeping the first (most significant) coefficients. A hidden Markov model will typically have in each state a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.
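The per-state likelihood described above can be sketched as follows. This is a minimal illustration rather than any particular recognizer's implementation: the function names, the toy two-component mixture and the 2-dimensional feature vector are all assumptions made for the example.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm_diag(x, weights, means, variances):
    """Log likelihood of x under a mixture of diagonal-covariance Gaussians."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical state distribution: two components over 2-dimensional vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]

frame = [0.2, -0.1]  # one observed feature vector (made-up values)
score = log_gmm_diag(frame, weights, means, variances)
```

Working in the log domain avoids underflow when such scores are accumulated over the many frames of an utterance.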

Described above are the core elements of the most common HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of standard techniques to improve on this basic approach. A typical large-vocabulary system would need context dependency for the phonemes (so that phonemes with different left and right contexts have different realizations as HMM states); it would use cepstral normalization to normalize for different speakers and recording conditions; for further speaker normalization, it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and, in addition, might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as a maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE).
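Two of the feature-level techniques named above, cepstral mean normalization and delta and delta-delta coefficients, can be sketched in a few lines. The regression formula for the deltas and the edge handling (repeating the first or last frame) are common choices assumed here for illustration; the function names and the window size are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every coefficient."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def delta(frames, window=2):
    """Regression-based time derivative of each coefficient.

    Frames beyond the utterance edges are approximated by repeating
    the first or last frame.
    """
    n_frames, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_dynamics(frames, window=2):
    """Stack normalized static, delta and delta-delta coefficients."""
    static = cepstral_mean_normalize(frames)
    d1 = delta(static, window)
    d2 = delta(d1, window)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# A one-dimensional "cepstral" track that rises linearly over five frames.
features = add_dynamics([[0.0], [1.0], [2.0], [3.0], [4.0]])
```

Each output frame is three times as long as the input frame: the normalized static coefficients followed by their first and second time derivatives.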

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. Here there is a choice between dynamically creating a combined hidden Markov model that includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
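The best-path search can be sketched with a small log-domain Viterbi routine. This is a toy version over a hand-built two-state model, assuming the per-frame emission likelihoods have already been computed; the function names and all probabilities are made up for the example, and a real decoder would add pruning and language model scores.

```python
import math

def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Return (most likely state path, its log score).

    log_init[s]      log probability of starting in state s
    log_trans[r][s]  log probability of moving from state r to state s
    log_emit[t][s]   log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
        score = new_score
        backptr.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for ptr in reversed(backptr):  # follow back-pointers to recover the path
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Hypothetical left-to-right model with two states and three frames.
log_init = [safe_log(1.0), safe_log(0.0)]
log_trans = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_emit = [
    [safe_log(0.9), safe_log(0.1)],
    [safe_log(0.8), safe_log(0.2)],
    [safe_log(0.1), safe_log(0.9)],
]
path, best = viterbi(log_init, log_trans, log_emit)  # path is [0, 0, 1]
```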

A possible improvement to decoding is to keep a set of good candidates instead of just the best candidate, and to use a better scoring function (rescoring) to rate these candidates so that the best one can be picked according to the refined score. The set of candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Rescoring is usually done by trying to minimize the Bayes risk (or an approximation of it): instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectation of a given loss function with regard to all possible transcriptions (i.e., we take the sentence that minimizes the average distance to other possible sentences weighted by their estimated probability). The loss function is usually the Levenshtein distance, though it can be a different distance for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to rescore lattices represented as weighted finite state transducers, with the edit distances themselves represented as a finite state transducer verifying certain assumptions.
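The N-best variant of this rescoring can be sketched as follows. The hypotheses and their probabilities are invented for the example, the function names are hypothetical, and posteriors are simply renormalized over the list, which is one possible approximation of the Bayes risk.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def mbr_select(nbest):
    """Pick the hypothesis with the lowest expected edit distance.

    nbest is a list of (word_list, probability) pairs; probabilities
    are renormalized over the list.
    """
    total = sum(p for _, p in nbest)

    def risk(hyp):
        return sum(p / total * levenshtein(hyp, other) for other, p in nbest)

    return min((hyp for hyp, _ in nbest), key=risk)

# Made-up N-best list: the single most probable hypothesis is an outlier.
nbest = [
    (["wreck", "a", "nice", "beach"], 0.40),
    (["recognize", "speech"], 0.35),
    (["recognize", "peach"], 0.25),
]
choice = mbr_select(nbest)  # ["recognize", "speech"]
```

In this toy list the maximum-probability hypothesis is not selected, because the two lower-ranked hypotheses are close to each other and together carry more probability mass.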

Speech recognition based on dynamic time warping (DTW)
Main article: Dynamic time warping
Dynamic time warping is an approach that has historically been used for speech recognition, but is now largely superseded by the more successful HMM-based approach.

Dynamic time warping is an algorithm for measuring the similarity between two sequences that may vary in time or speed. For example, similarities in walking patterns would be detected even if a person walks slowly in one video and more quickly in another, or even if there are accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.

A well-known application is automatic speech recognition, where DTW is used to cope with different speaking rates. In general, it is a method that allows a computer to find an optimal match between two given sequences (such as time series) under certain constraints. That is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
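The alignment idea can be sketched with the classic dynamic-programming recurrence. This minimal version assumes one-dimensional sequences, an absolute-difference local cost and no window constraint; real systems compare feature vectors and usually restrict the warping path. The function name is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment of sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

# The same pattern "spoken" at half speed aligns with zero cost.
fast = [1, 2, 3, 4]
slow = [1, 1, 2, 2, 3, 3, 4, 4]
same_shape = dtw_distance(fast, slow)
```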

Neural networks

Main article: Artificial neural network
Neural networks emerged as an attractive approach to acoustic modeling in automatic speech recognition (ASR) in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition, such as phoneme classification, phoneme classification using multi-objective evolutionary algorithms, isolated word recognition, audiovisual speech recognition, audiovisual speaker recognition, and speaker adaptation.

Neural networks make fewer explicit assumptions about the statistical properties of features than HMMs do, and they have several qualities that make them attractive models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. However, despite their effectiveness in classifying short-time units such as individual phonemes and isolated words, early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies.

One approach to this limitation was to use neural networks as a pre-processing, feature transformation, or dimensionality reduction step prior to HMM-based recognition. More recently, however, long short-term memory (LSTM) networks and related recurrent neural networks (RNNs), as well as time-delay neural networks (TDNNs), have shown improved performance in this area.

Deep feedforward and recurrent neural networks

Main article: Deep learning
Deep neural networks and denoising autoencoders are also under investigation. A deep feedforward neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers. Similar to shallow neural networks, DNNs can model complex nonlinear relationships. DNN architectures generate compositional models, where extra layers enable the composition of features from lower layers, giving a huge learning capacity and thus the potential to model complex patterns in speech data.

The success of DNNs in large-vocabulary speech recognition came in 2010, achieved by industrial researchers in collaboration with academic researchers, who adopted large DNN output layers based on context-dependent HMM states constructed by decision trees. Comprehensive reviews of this development and of the state of the art as of October 2014 appear in a Springer book from Microsoft Research. Recent overview articles also cover the related background of automatic speech recognition and the impact of various machine learning paradigms, including deep learning.

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in a deep autoencoder architecture on the “raw” spectrogram or linear filter-bank features, showing its superiority over mel-cepstral features, which involve several stages of fixed transformation from spectrograms. The true “raw” features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.

End-to-end automatic speech recognition

Since 2014, there has been much research interest in “end-to-end” ASR. Traditional phonetic-based approaches (i.e., all HMM-based models) required separate components and training for the pronunciation, acoustic, and language models. End-to-end models jointly learn all the components of the speech recognizer. This is valuable because it simplifies both the training process and the deployment process. For example, an n-gram language model is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes of memory, making it impractical to deploy on mobile devices. Consequently, commercial ASR systems from Google and Apple (as of 2017) are deployed in the cloud and require a network connection, as opposed to running locally on the device.

The first attempt at end-to-end ASR was with Connectionist Temporal Classification (CTC)-based systems, introduced by Alex Graves of Google DeepMind and Navdeep Jaitly of the University of Toronto in 2014. The model consisted of recurrent neural networks and a CTC layer. Jointly, the RNN-CTC model learns the pronunciation and acoustic model together; however, it is incapable of learning the language because of conditional independence assumptions similar to those of an HMM. Consequently, CTC models can directly learn to map speech acoustics to English characters, but the models make many common spelling mistakes and must rely on a separate language model to clean up the transcripts. Later, Baidu expanded on the work with extremely large datasets and demonstrated some commercial success in Mandarin Chinese and English. In 2016, the University of Oxford presented LipNet, the first end-to-end sentence-level lipreading model, which uses spatiotemporal convolutions coupled with an RNN-CTC architecture and surpasses human-level performance on a restricted grammar dataset. A large-scale CNN-RNN-CTC architecture was presented in 2018 by Google DeepMind, achieving 6 times better performance than human experts.
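How a CTC output is turned into characters can be sketched with best-path (greedy) decoding: take the most probable label in each frame, merge consecutive repeats, then drop the blank symbol. This is a simplified illustration; practical systems often use beam search, optionally combined with a language model. The function name, the alphabet layout (blank at index 0) and the frame probabilities are assumptions made for the example.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(alphabet[k])
        prev = k
    return "".join(out)

# Hypothetical per-frame label probabilities; index 0 is the blank symbol.
alphabet = ["_", "c", "a", "t"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.70, 0.10],  # a (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # t
]
text = ctc_greedy_decode(frames, alphabet)  # "cat"
```

Because each frame is decided independently of the others, nothing in this procedure prefers plausible spellings, which is one way to see why a separate language model helps.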

An alternative to CTC-based models is attention-based models. Attention-based ASR models were introduced simultaneously by Chan et al. of Carnegie Mellon University and Google Brain and by Bahdanau et al. of the University of Montreal in 2016. The model, named “Listen, Attend and Spell” (LAS), literally “listens” to the acoustic signal, pays “attention” to different parts of the signal, and “spells” out the transcript one character at a time. Unlike CTC-based models, attention-based models do not make conditional independence assumptions and can learn all the components of a speech recognizer, including the pronunciation, acoustic, and language models, directly. This means that there is no need to carry a language model around during deployment, which makes such models very practical for applications with limited memory. By the end of 2016, attention-based models had seen considerable success, including outperforming CTC models (with or without an external language model). Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by Carnegie Mellon University, MIT, and Google Brain to directly emit sub-word units, which are more natural than English characters; the University of Oxford and Google DeepMind extended LAS to “Watch, Listen, Attend and Spell” (WLAS) to handle lip reading, surpassing human-level performance.
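The “attention” step can be sketched for a single decoder step: each encoded audio frame is scored against the current decoder state, the scores are normalized with a softmax, and the frames are averaged using the resulting weights. Dot-product scoring is an assumption made here for brevity; published models typically use a learned scoring function. The function name and all vectors are made up.

```python
import math

def attend(query, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    peak = max(scores)  # subtract the maximum for a stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical encoder outputs for three audio frames and one decoder state.
encoder_states = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
query = [1.0, 0.0]
weights, context = attend(query, encoder_states)
```

The weights sum to one, and here almost all of the weight falls on the first frame, whose encoding is most similar to the query.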