2acoustic modelUsed in TTS to predict acoustic features; Tacotron 2 and FastSpeech 2 are examples. The input transcript is one input feature; prosody and linguistic features (phoneme-level, word-level and hierarchical prosodies; syntactic graphs; word embeddings) can be others. L1/L2 loss assumes the distribution of the acoustic feature is unimodal, but the real distribution is much more complex; normalizing flows (FlowTTS, GlowTTS) are one approach. Txt2vec instead predicts a self-supervised VQ acoustic feature rather than a mel spectrogram.
3activation functionnonlinear function applied to a unit's pre-activation; sigmoid, tanh and ReLU are examples
4active learningparadigm where the learner interacts with the environment at training time
5ADALINElinear model (adaptive linear element)
6adaptive learning rateMelGAN
7ADMAblated diffusion model
8adversarial training
9AI2Allen Institute open language model
10anthropic principleThe "observation selection effect" suggests that the range of observations we could make is limited by the fact that observations could only happen in a universe capable of developing intelligent life in the first place. The universe has properties that accommodate life because we wouldn't be around to make observations otherwise. There are more than 30 forms of anthropic principle. The weak form states the universe's apparent fine-tuning is due to selection/survivorship bias. The strong anthropic principle considers the universe compelled to eventually have conscious and sapient life emerge within it. The participatory anthropic principle, due to Wheeler, suggests the universe must be observed. The final anthropic principle suggests the universe's structure is expressible as bits of information.
11APaperiodic parameter used in speech synthesis
12attentionConnects related words using dot products between word vectors
13autoencoderA neural network which combines an encoder and a decoder and is trained to reconstruct its input. Autoencoders try to preserve information, but create new representations with useful properties. Variants include undercomplete and regularized autoencoders.
14AutoModelHuggingface library for loading pretrained models
15autoregressive
16AutoTokenizerHuggingface tool
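Example (AutoTokenizer/AutoModel): a minimal sketch of how the two Huggingface classes are typically used together; it assumes the transformers and torch packages are installed, and "bert-base-uncased" is just one example checkpoint.
    from transformers import AutoTokenizer, AutoModel

    # Load a pretrained tokenizer and model by checkpoint name.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Attention connects related words.", return_tensors="pt")
    outputs = model(**inputs)                # hidden states for each token
    print(outputs.last_hidden_state.shape)   # (batch, sequence length, hidden size)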
17backpropagationAlgorithm for training neural networks by propagating error gradients backward through the layers; popularized by Hinton and colleagues. Did not live up to high expectations in the 1980s.
18bagging
19Bard
20batch normalizationDL technique
21batch size, head (number)3M-6M, 1.5M-3M
22Bayesian networks
23beam search decoding
24BERTBidirectional Encoder Representations from Transformers; powers Google Search and builds on the original Transformer. BERT-base had 110M parameters, 12 attention heads, 768 hidden units and 12 layers; BERT-large had 24 layers, 1024 hidden units, 16 attention heads and 340M parameters.
25BERT-Large
26bidirectional LSTM (BiLSTM)Processes the input sequence in both the forward and backward directions and combines the two passes, so each position sees past and future context; see the sketch below.
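Example (bidirectional LSTM): a minimal PyTorch sketch; the sizes are arbitrary illustration values.
    import torch
    import torch.nn as nn

    # Bidirectional LSTM over a batch of embedded sequences.
    bilstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True, bidirectional=True)

    x = torch.randn(8, 20, 128)      # (batch, time steps, features)
    output, (h_n, c_n) = bilstm(x)
    print(output.shape)              # (8, 20, 128): forward and backward hidden states concatenated (2 x 64)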
27BLEUmachine translation scoring method
28BLOOMOpen multilingual language model from the Hugging Face-led BigScience project
29BoolQNLP test
30BOSSCMU car which won DARPA Urban Challenge
31brightness normalizationtechnique mentioned by Krizhevsky, Sutskever and Hinton in 2012 ImageNet paper
32CaffeML framework
33Caltech-101/256Early database of labeled images
34CBNLP test
35ChinchillaCompute-optimal DeepMind LLM, 70B parameters
36CIFAR
37CIFAR-10/100Early database of labeled images
38classificationtypical ML goal
39classifierML technique that decides whether an input belongs in class A or B. (in LLMs): CLS is the special classification token that begins a tokenized sequence (see SEP).
40ClaudeAnthropic LLM
41clusteringcanonical ML task of partitioning a set into homogeneous subsets
42CLVPContrastive Language-Voice Pretraining. Scores autoregressive TTS samples to find the best candidates.
43CNTKML framework
44cognitive scienceinterdisciplinary study of the mind and intelligence; related to, but distinct from, neuroscience
45Cognitronearly neural network by Fukushima, predecessor of the Neocognitron
46Common Crawl
47configuratorIn Le Cun's model, a module which performs executive control. Given a task to be executed, it preconfigures the perception module.
48connectionism1980s movement, also known as parallel distributed processing, which emphasized distributed representations
49constrained
50contrast normalization
51contrastive loss
52Convolutional Neural NetworksInvented by Le Cun in the 1980s. Stacks of learned convolutional filters gather information layer by layer from local regions of the input. They have fewer connections and parameters than standard fully connected NNs, and are easier to train.
53cosine similarity functioncompares word embeddings by the cosine of the angle between them: the dot product divided by the product of the Euclidean norms, i.e., the dot product of the vectors projected onto the unit sphere
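Example (cosine similarity): a small NumPy sketch of the definition above.
    import numpy as np

    def cosine_similarity(u, v):
        # Dot product scaled by the product of the Euclidean norms.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])
    print(cosine_similarity(a, b))   # 1.0: same direction regardless of magnitude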
54cost function
55Cycold knowledge base AI approach. Used CycL language. Demonstrates flaws of knowledge base approach.
56D_modelDimensionality of the output of every sublayer (and of the embeddings) in the transformer architecture; 512 in the original Google paper.
57Daedalus
58DALL-E
59DARTDARPA logistics program
60DavinciEarly OpenAI model
61decoderConstructs the output sequence from the representation received from the encoder. In transformers, the decoder produces text sequences and has multi-headed attention layers, add & norm layers, feed-forward layers, and a final linear layer giving probabilities over the output vocabulary. In representation learning, and in general, decoders convert encoded representations back into their original formats.
62Deep BlueIBM chess playing program, defeated Kasparov in 1997
63DeepMindGoogle-owned NN company
64deep belief networktrained by greedy layer-wise pre-training
65deep learningDeep learning is a subgroup of machine learning that allows representations to be composed of other, simpler representations. The quintessential example is the feedforward deep network, or multilayer perceptron.
66deep neural networks
67DENDRALMass spectrometry program for predicting compounds in 1969.
68differentiablecan compute gradient estimates of some objective function with respect to its own input and propagate to upstream modules
69DiffSVCDiffusion probabilistic model for SVC
70diffusion modelAlso known as diffusion probabilistic models, a class of latent variable models which outperform GANs. Also known as denoising diffusion models or score-based generative models. Diffusion models are parameterized Markov chains using two processes: forward diffusion and parametrized reverse. The forward diffusion process passes randomly sample noise with Gaussian noise and then reverses through the "learned denoising process". Diffusion models map latent space using a fixed Markov chain, which adds noise. Training a diffusion model requires finding reverse Markov transitions that maximize the likelihood of the training data. Diffusion models scale well to large datasets for image synthesis.
71diffusion processData are progressively noised
72DiffwaveNeural-based vocoder
73dimensionality reductionAlso known as manifold learning, a canonical ML task which transforms an initial representation into a lower-dimensional one with preserved properties
74DistBeliefframework
75distributed representationeach input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs
76diversity loss
77DreamFusionGoogle text-to-3D
78dropoutRegularization method that limits overfitting by setting to zero a subset of features by multiplying them with a Bernoulli random variable.
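Example (dropout): a minimal sketch of inverted dropout with an explicit Bernoulli mask, assuming PyTorch; in practice torch.nn.Dropout provides the same behavior.
    import torch

    def dropout(x, p=0.5, training=True):
        # Zero each feature with probability p and rescale the survivors.
        if not training or p == 0.0:
            return x
        mask = torch.bernoulli(torch.full_like(x, 1.0 - p))
        return x * mask / (1.0 - p)

    x = torch.ones(4, 8)
    print(dropout(x))                # roughly half the entries zeroed, survivors scaled by 2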
79dualismCartesian idea that the mind is outside the laws of nature.
80echo state networks
81ELMo
82embeddingTokens go from single integers to vectors of d_model dimensions using learned embeddings in the model
83empiricismFrancis Bacon and John Locke's movement held the senses are responsible for understanding
84encoderEncodes text (or any input data) into a vector (or different representation) and transmits to the decoder.
85ensemble methods
86ERMEmpirical Risk Minimization learning rules
87espeakrobotic TTS
88Euclidean normvector distance from origin
89expert systemsee knowledge base. A rule-based system for determining answers, an older AI methodology
90exploration vs exploitation dilemmaReinforcement learning issue
91F0Fundamental frequency variable used in speech synthesis
92FairseqFacebook AI research sequence-to-sequence toolkit written in Python
93FastSpeechA transformer block acoustic model
94FastSpeech2
95FastSVCFast cross-domain SVC
96featureElements of data for ML algorithms
97feed-forward networkNormal neural network
98FIDFréchet Inception Distance, a metric for judging the quality of generated images
99Fifth GenerationJapanese project announced in 1981
100FlashAttention (tiling, recomputation)
101forward diffusionIn diffusion models, maps data to noise by gradually perturbing the input data with a Gaussian diffusion kernel.
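Example (forward diffusion): an illustrative sketch of the closed-form noising step x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps with a linear beta schedule; the schedule values and tensor shapes are arbitrary illustration choices, and PyTorch is assumed.
    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(x0, t):
        # Sample x_t ~ q(x_t | x_0) directly, without running t individual noising steps.
        eps = torch.randn_like(x0)
        a_bar = alpha_bars[t]
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    x0 = torch.randn(1, 3, 32, 32)               # stand-in for a training image
    x_noised = q_sample(x0, t=500)               # heavily perturbed version of x0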
102forward propagation
103framesMinsky idea
104Galatea
105gated recurrent neural networks
106GeminiGoogle effort to regain supremacy post-OpenAI
107generative adversarial network (GAN)
108GET3DNvidia library
109GloVeVector encoding scheme
110GlowTTS
111GLUEGeneral Language Understanding Evaluation test. BERT set records at 80.5% when it came out 5/24/19.
112GM
113GopherDeepMind 280B parameter LLM
114GPSGeneral Problem Solver, created in 1957.
115GPT-21.5 billion parameters
116GPT-3175 billion parameters
117gradient boosting machine
118gradient descentClass of line-search optimization algorithms for neural network learning.
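Example (gradient descent): a minimal sketch on a one-dimensional quadratic loss f(w) = (w - 3)^2; the step size and iteration count are arbitrary illustration values.
    def grad(w):
        return 2.0 * (w - 3.0)        # derivative of (w - 3)^2

    w, learning_rate = 0.0, 0.1
    for _ in range(100):
        w -= learning_rate * grad(w)  # step against the gradient
    print(w)                          # converges toward the minimizer w = 3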
119greedilylearning representations one piece at a time, in succession, rather than jointly optimizing all of them
120greedy layer-wise pre-trainingcan train DBN
121greedy search decoding
122GRUgated recurrent unit, a gating mechanism in RNNs
123gumbel softmaxdifferentiable, continuous relaxation of sampling from a categorical distribution, used to backpropagate through discrete choices
124Hadamard productUsed in LSTMs, element-wise multiplication.
125headEach individual process of self-attention. 128 for Gopher, 64 for Chinchilla.
126Hephaestus
127Hessian
128Heuristic Programming Projecteffort at Stanford by Feigenbaum to extend expert systems
129hidden layersbetween input and output layers, introduce nonlinearity
130hidden Markov models
131Hifi-GANNeural, GAN-based vocoder
132HuBERThidden-unit BERT
133HuggingfaceML model hub
134hyperbolic tangentActivation function used in hidden layers. Also known as tanh = (exp(2a)-1) / (exp(2a)+1), equivalent to sinh(a)/cosh(a); bounds output to [-1, 1]. A saturating non-linearity, and therefore slower to train than ReLU. Tanh has a stronger gradient than sigmoid, so it can learn faster, but still suffers from the vanishing gradient problem. Tanh is zero-centered, which makes it easier to learn from data centered around zero, especially in non-output layers.
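Example (activation functions): a NumPy sketch comparing the sigmoid, tanh and ReLU formulas mentioned in these entries.
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))                    # output in (0, 1)

    def tanh(a):
        return (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)   # output in (-1, 1), zero-centered

    def relu(a):
        return np.maximum(0.0, a)                          # non-saturating for positive inputs

    a = np.linspace(-4, 4, 9)
    print(sigmoid(a), tanh(a), relu(a), sep="\n")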
135hyperparametersettings/parameters which do not change during training and are not related to training data, and therefore, external to the learning algorithm; for example, learning rate.
136hypothesis setset of functions mapping features to labels
137ILSVRC-2010, -2012ImageNet Large-Scale Visual Recognition Challenge, began in 2010. Top-1 error was 47.1% and improved to 37.5% in 2012.
138ImageNetImage recognition database (and contest) with 15 million high-resolution labeled images in 22k categories. Krizhevsky, Sutskever and Hinton's CNN advanced the state of the art in 2012 with a 60M-parameter, 650k-neuron model of 5 convolutional and 3 fully-connected layers.
139inductive biasincorporation of prior knowledge biases learning mechanisms
140instruction fine-tuningSupervised learning technique endows LLM with instruction-following capability. Widely-used example is Alpaca 52k.
141InterSpeechSpeech conference
142kernel methodsgroup of classification algorithms
143knowledge baseClassical AI approach hard-codes world knowledge, eg Cyc, DENDRAL
144knowledge representationStoring knowledge and understanding for computers
145Kullback-Leibler divergenceStatistical distance for probability distributions
150LabelMeimage database with hundreds of thousands of labeled segmented images
151LaMDAGoogle conversational LLM (Language Model for Dialogue Applications)
152Langevin dynamics
153LarynxTTS system
154latent variablecannot be observed directly, must be inferred from statistical models
155layer normalizationOperation performed on output of residual connection
156LayerNorm
157Layers80 for Chinchilla, Gopher
158leaky units
159learning ratecritical hyperparameter when training networks
160LeNet-51990s CNN by Le Cun.
161lexical field similaritydo the words go together without respect to position
162LibriTTS
163line search
164linear regressionA learning algorithm which optimizes a line (slope + intercept)
165LispJohn McCarthy's AI programming language
166LJSpeech
167local average pooling
168Loebner PrizeTuring Test competition
169Logic TheoristNewell, Simon early AI program
170logical positivismRudolf Carnap idea holding all knowledge can be characterized by logical theories connected to observation sentences that correspond to sensory inputs
171logistic regressionA simple ML classification algorithm which fits a logistic (sigmoid) curve to the data.
172LogitsRaw, unnormalized model outputs, fed to a softmax to produce probabilities
173long short-term memory (LSTM)A type of RNN that is designed to remember information over extended time intervals using a memory cell; the cell state carries the long-term memory and the hidden state is the cell's output. Four gates (input, forget, cell, output) are computed from the current input, the prior hidden state and a bias. Sigmoid and tanh add non-linearity. LSTMs can be multilayered or "stacked", where output layers become inputs to the next layer. LSTMs can incorporate projections. See also bidirectional LSTM, the gate equations below and the sketch that follows them.
174Input gate: i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)Sigmoid σ bounds output between 0 and 1.
175Forget gate: f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)Sigmoid σ bounds output between 0 and 1.
176Cell gate: g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)Creates a vector of new candidate values for updating the cell state. Tanh normalizes values between -1 and 1.
177Output gate: o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)Sigmoid σ bounds output between 0 and 1.
178Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_tCombines the old cell state and the new candidate values using the Hadamard product.
179Hidden state update: h_t = o_t ⊙ tanh(c_t)
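Example (LSTM cell): a minimal sketch implementing the gate equations above for one time step, assuming PyTorch; the stacked weight layout (input, forget, cell, output) is an illustration choice, and torch.nn.LSTMCell provides the same computation in practice.
    import torch

    def lstm_cell(x_t, h_prev, c_prev, W_i, W_h, b):
        # W_i: (4H, D) input weights, W_h: (4H, H) recurrent weights, b: (4H,) bias.
        H = h_prev.shape[-1]
        gates = x_t @ W_i.T + h_prev @ W_h.T + b
        i_t = torch.sigmoid(gates[..., 0 * H:1 * H])   # input gate
        f_t = torch.sigmoid(gates[..., 1 * H:2 * H])   # forget gate
        g_t = torch.tanh(gates[..., 2 * H:3 * H])      # cell (candidate) gate
        o_t = torch.sigmoid(gates[..., 3 * H:4 * H])   # output gate
        c_t = f_t * c_prev + i_t * g_t                 # cell state update (Hadamard products)
        h_t = o_t * torch.tanh(c_t)                    # hidden state update
        return h_t, c_t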
180loss functionLeast squared error, logistic loss, hinge loss, cross-entropy, sum of the squared residuals. Calculates error.
181M-AILABS
182machine learningsubgroup of AI where machines can extract patterns from raw data
183Make-A-VideoFacebook library
184manifold tangent classifier
185masked language modelingBERT technique
186maskingUsed by decoder to avoid computing attention for future words
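Example (masking): a sketch of a causal (look-ahead) mask, assuming PyTorch; position i may only attend to positions up to i.
    import torch

    seq_len = 5
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = torch.randn(seq_len, seq_len)             # raw attention scores
    scores = scores.masked_fill(mask, float("-inf"))   # future positions get -inf
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions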
187materialistThe view that the brain, and with it the mind, operates according to the laws of physics.
188Max Learning Rate4E-5, 1E-4
189MCCMicroelectronics and Computer Technology Corporation, a US research consortium
190MDLMinimum Description Length learning rule
191MDMTel Aviv library on human motion diffusion model
192Megatron-LM530B parameter system by Microsoft and Nvidia. Repository includes training pipeline. https://github.com/NVIDIA/Megatron-LM
193mel-spectrogramThe main acoustic feature for acoustic models.
194MelGANGAN-based vocoder, multi-band.
195minimaFinding whether a minimum is local or global in a NN is challenging
196MMLUTest for LLMs
197MNISThandwriting recognition
198momentum
199MOSMean Opinion Score, used to rate TTS
200multi-task learning
201multihead attention layerUses self-attention to capture different kinds of relationships, one per attention head. Divides d_model into 8 heads of dimension d_k = 64, each representing a different subspace of how each word relates to the others. Each head outputs a matrix Z_i of width d_k, and the concatenated heads are projected back to d_model.
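Example (multi-head attention): a sketch of splitting d_model = 512 into 8 heads of d_k = 64, assuming PyTorch; the projection matrices and shapes are illustration values rather than any particular model's weights.
    import torch

    batch, seq_len, d_model, n_heads = 2, 10, 512, 8
    d_k = d_model // n_heads                           # 64 dimensions per head

    x = torch.randn(batch, seq_len, d_model)
    W_q, W_k, W_v, W_o = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(4))

    def split_heads(t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        return t.view(batch, seq_len, n_heads, d_k).transpose(1, 2)

    q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, heads, seq, seq)
    z = torch.softmax(scores, dim=-1) @ v              # per-head outputs Z_i, each d_k wide
    z = z.transpose(1, 2).reshape(batch, seq_len, d_model)
    out = W_o(z)                                       # concatenated heads projected back to d_model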
202MultiNLIBERT set standard with 86.7%
203MultiRCNLP test
204MXNetML framework
205MYCIN450 rule expert system for medicine
206naïve BayesSimple ML algorithm which classifies.
207NarrativeQATest for LLMs
208natural language processing (NLP)The field of studying how computers can be made to communicate in human language
209NaturalQuestionstest for LLMs
210NeocognitronPredecessor to CNN
211NetTalk
212neural network
213Newton direction
214next sentence prediction (NSP)
215NIPS 2017Neural Information Processing Systems conference
216noise
217NORBEarly database of labeled images
218norm
219NPLM
220online learningMultiple rounds of mixed training/testing phases
221OPTFacebook Open Pre-Trained Transformer
222optimization
223output layerfinal layer which requires modulation to achieve correct form of inference
224overfittingpoor generalization due to a small sample size and complex function
225PACProbably Approximately Correct learning model/framework
226PaLM540B parameters
227Pandora
228parameter
229parametrized reverse processIn diffusion models, undoes forward diffusion and performs iterative denoising. Converts random noise into realistic data.
230Pascal Visual Object Challengerelated to ImageNet
231passive learningparadigm where the learner does not interact with the environment at training time
232PDDL+Planning/modeling language
233penalty
234perceptronRosenblatt's single-layer linear classifier. The multilayer perceptron, which maps inputs to outputs by composing simpler functions, is the quintessential example of a DL model.
235perceptron convergence theoremlearning algorithm can adjust the connection strengths of a perceptron to match any input data
236physical symbol systemany system exhibiting intelligence is manipulating symbol-based data structures
237PixelCNN
238PLANNERKnowledge representation scheme
239pre-activationthe weighted sum of a unit's inputs, computed prior to applying the activation function
240pooling
241positional encodingUsed in transformers to encode the position of specifics words in a sentence
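Example (positional encoding): a NumPy sketch of the sinusoidal encoding used in the original transformer, where even dimensions use sine and odd dimensions use cosine; the sequence length and d_model are illustration values.
    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions
        pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions
        return pe

    pe = positional_encoding(seq_len=50, d_model=512)   # added to the token embeddings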
242principal component analysisLinear dimensionality-reduction technique; variants include probabilistic PCA and independent component analysis.
243prior
244projectionReducing dimensionality to simplify data, improve performance, etc. See PCA, LDA.
245Prolog
246pseudolabelingself-training, a semi-supervised learning technique
247PygmalionMythological sculptor who created Galatea, a sculpture which came to life
248PyLearn2older framework
249Pytorchaka Torch, leading ML framework introduced 9/2016 by Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan
250R1Early expert system deployed by DEC in the 1980s
251random forest
252rankingtypical ML task
253Rectified Linear Activation FunctionNeuron activation max(0, a). A non-saturating nonlinearity that trains faster than tanh and does not require input normalization to prevent saturation. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron.
254recurrent neural networks (RNN)Neural networks with feedback connections for processing sequences; Hopfield's 1982 associative network is an early recurrent architecture. LSTMs emerged from RNNs. Can be bidirectional. Can be gated (LSTM, GRU).
255recursive network
256regressioncanonical ML task
257regularizationDL technique
258relusee Rectified Linear Activation Unit/Function
259representation learningML approach which learns the representation of data. See autoencoder. Representation learning has a critical issue: representations are often as hard to obtain as solving the original problem. Deep learning is the solution.
260residual connectionAdds a sublayer's input (e.g., the positional input embedding) to that sublayer's output (e.g., the multi-headed attention output vector) before layer normalization.
261residual neural networkaka ResNet, continuous
262reverse diffusion process
263Rhasspygithub.com/rhasspy
264RMSNorm
265RoBERTaOptimized BERT. 88.5 on GLUE.
266RTENLP test
267RVCRetrieval-based voice conversion. Example: https://github.com/w-okada/voice-changer
268SAINT1963 program by James Slagle for integration
269self-attentionAllows association of each word in the input with other words in the same sentence. Based on query, key and value vectors.
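Example (self-attention): a NumPy sketch of scaled dot-product self-attention built from query, key and value projections; the matrices are random illustration values.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values from the same input
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each word attends to every other word
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 64))                     # 6 tokens, 64-dimensional embeddings
    W_q, W_k, W_v = (rng.normal(size=(64, 64)) for _ in range(3))
    out = self_attention(X, W_q, W_k, W_v)           # (6, 64) context-aware representations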
270semantic role labeling
271semi-supervised training/learningtraining set includes both labeled and unlabeled data
272SEPspecial separator token that ends a tokenized segment (see CLS)
273sequence-to-sequence (Seq2Seq)A type of NN that converts one sequence of components into another
274sequence modelinguses recurrent and recursive networks
275SHRDLU1972 program by Terry Winograd
276sigmoidtype of activation function = 1/(1+exp(-a)), transforms input to [0, 1]
277singing voice conversionConverts source voice to target voice
278skip-gramFocuses on the center word in a window of words and predicts the context words around it. Generally contains an input layer, weights, a hidden layer and an output layer containing the word embeddings, resulting in a d_model-dimensional embedding for each word. The word2vec embedding approach uses the skip-gram architecture; see the sketch below.
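Example (skip-gram): a usage sketch with gensim's word2vec implementation, where sg=1 selects the skip-gram architecture (sg=0 would be CBOW); the toy sentences and hyperparameters are illustration values, and the gensim package is assumed to be installed.
    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
    print(model.wv["cat"].shape)          # (50,) embedding vector for "cat"
    print(model.wv.most_similar("cat"))   # nearest words by cosine similarity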
279sliding-block puzzlesnxn matrices with one blank tile, initial state is random, goal state is ordered
280SOARagent architecture by Newell, Laird and Rosenbloom
281softmaxOutput layer activation function that converts a vector of logits into a probability distribution.
282SPSpectral envelope variable used in speech synthesis
283sparse representation
284SQuADStanford Question Answering Dataset. Reading comprehension. V1.1, V2.0. BERT set standard with 93.2 on v1.1 (F1) and 83.1 on v2.0 (F1)
285squared lossloss function
286SRMStructural Risk Minimization learning rule
287Stanford Sentiment Treebank
288STANLEYWinner of DARPA Grand Challenge
289steepest descent
290stochastic gradient descenttraining algorithm
291structured prediction problemsContext-free parsing, dependency parsing, named-entity recognition, part-of-speech tagging
292SuperGLUELeaderboard
293supervised learningLearn a predictor from labeled (x, y) pairs; examples include linear regression, logistic regression, SVM and naïve Bayes.
294SupervisionU Toronto DNN for image recognition
295support vector machinebest known kernel method
296SVCCSinging Voice Conversion challenge
297SWAGSituations with Adversarial Generations
298Switchboardspeech recognition
299t-SNEnonlinear dimensionality-reduction technique for visualizing high-dimensional data as 2D/3D clusters
300T5Text-to-Text Transfer Transformer model from Google, available through Hugging Face
301Tacotron2An encoder-attention-decoder acoustic model
302Talos
303tangent distance
304tangent prop
305tanhsee hyperbolic tangent
306Taylor expansion
307Theanoolder ML framework
308Theseus
309tokenizerepresent each word with a token
310TorchLua-based predecessor to PyTorch
311total Turing testa modified Turing Test requiring computer vision
312transductive inferenceTraining data is labeled and unlabeled test points, but the objective is to predict labels only for unlabeled test points and not unseen test set data.
313transfer learning
314transformerNN architecture introduced by Google in 2017, with a 6-layer encoder stack and a 6-layer decoder stack. Each encoder layer has two major sublayers: a multi-headed attention mechanism and a fully connected position-wise feedforward network. See the sketch below.
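Example (transformer): a rough sketch of the original 6-layer encoder / 6-layer decoder shape using PyTorch's built-in module (batch_first requires a reasonably recent PyTorch); the inputs stand in for embedded, position-encoded sequences.
    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6,
                           dim_feedforward=2048, batch_first=True)

    src = torch.randn(2, 10, 512)   # embedded + position-encoded source sequence
    tgt = torch.randn(2, 7, 512)    # embedded + position-encoded target sequence
    out = model(src, tgt)           # (2, 7, 512)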
315Transformer-TTS
316TransformersHuggingface library
317Traxlibrary by Google Brain
318trust regionnumerical optimization idea
319txt2vecAcoustic model using vector-quantized acoustic feature instead of mel-spectrogram. A classification model rather than a traditional regression model. Uses labeled phoneme-level prosodies for all phonemes in advance. The text encoder consists of 6 Conformer blocks, which encode the input phonemes into hidden states h.
320TTStext-to-speech. Current models include neural approaches (WaveNet, Tacotron) and statistical parametric speech synthesis (SPSS).
321Turing TestProposed in 1950
322U-Net
323underfittingfunction too simple to achieve sufficient predictive power
324unsupervised learninglearning from unlabeled data; used less often than supervised learning
325unsupervised pre-trainingDL technique
326VAEVariational AutoEncoder
327vanishing gradient problemGradients become very small during backpropagation.
328vec2wavVocoder used in VQTTS, uses an additional feature encoder before HifiGAN generation to smooth discontinuous quantized feature.
329vector quantizationcan be applied to self-supervised feature extraction
330vectorizeconvert tokens to vectors
331visible layerThe input or output layers. So called because we can see and understand them.
332VITSConditional variational autoencoder with adversarial learning. TTS Model. https://github.com/jaywalnut310/vits
333vllmCompetitor to Huggingface Transformers for inference
334vocoderGenerates waveform for TTS. GAN-based (MelGAN, HifiGAN). Universal-Vocoder
335VoxCelebWith VoxCeleb2, celebrity voice/video training set. https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
336VQTTSTTS approach consisting of acoustic model txt2vec and vocoder vec2wav. Adds log pitch, energy and probability of voice features for prosody.
337wav2vecMulti-layer convolutional network optimized via contrastive loss. Extracts features to predict successive frames.
338WaveNetNeural-based vocoder by Deepmind in 2016.
339WaveRNNNeural-based vocoder
340weight decayregularizer which penalizes large weights, reducing the model's generalization error
341WiCNLP test
342word embedding layerLookup table which represents each word as a vector
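Example (word embedding layer): a PyTorch sketch of the lookup table; the vocabulary size, dimensions and token ids are illustration values.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 10000, 512
    embedding = nn.Embedding(vocab_size, d_model)   # learned lookup table: token id -> vector

    token_ids = torch.tensor([[12, 847, 3, 99]])    # hypothetical ids for one tokenized sentence
    vectors = embedding(token_ids)                  # (1, 4, 512)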
343word embedding vector
344word2vecEmbedding scheme.
345WORLDVocoder
346Wu Dao 2.01.75T parameters
347X-CLIPMicrosoft video-to-text
348XLNet-Largeformer GLUE leader
349XORexclusive OR cannot be approximated by linear models
350zero-one lossalso known as the misclassification loss function
352HumanEvalOpenAI benchmark of hand-written programming problems for evaluating code generation