| acoustic model | Used in TTS to predict acoustic features; Tacotron 2 and FastSpeech 2 are examples. The input transcript is the main feature, but prosodic and linguistic features can also be used: phoneme-level, word-level, and hierarchical prosodies; syntactic graphs; word embeddings. L1/L2 losses assume the distribution of the acoustic features is unimodal, while the real distribution is much more complex; normalizing flows are one response (Flow-TTS, Glow-TTS). Txt2vec predicts a self-supervised, vector-quantized acoustic feature instead of a mel-spectrogram. | | | | | | | | | | | | | | | | | | |
| activation function | Nonlinear function applied to a unit's pre-activation; examples include sigmoid, tanh, and ReLU. | | | | | | | | | | | | | | | | | | |
| active learning | paradigm where the learner interacts with the environment at training time | | | | | | | | | | | | | | | | | | |
| ADALINE | linear model (adaptive linear element) | | | | | | | | | | | | | | | | | | |
| adaptive learning rate | Optimization techniques (e.g., AdaGrad, RMSProp, Adam) that adjust the learning rate during training. | | | | | | | | | | | | | | | | | | |
| ADM | Ablated diffusion model | | | | | | | | | | | | | | | | | | |
| adversarial training | | | | | | | | | | | | | | | | | | | |
| AI2 | Allen Institute for AI, which releases open language models | | | | | | | | | | | | | | | | | | |
| anthropic principle | The "observation selection effect": the range of observations we could make is limited by the fact that observations can only happen in a universe capable of developing intelligent life in the first place. The universe appears to have properties that accommodate life because we would not be around to make observations otherwise. More than 30 forms of the anthropic principle have been proposed. The weak form attributes the universe's apparent fine-tuning to selection/survivorship bias. The strong anthropic principle considers the universe compelled to eventually have conscious and sapient life emerge within it. Wheeler's participatory anthropic principle suggests the universe must be observed. The final anthropic principle frames the universe's structure as expressible in bits of information. | | | | | | | | | | | | | | | | | | |
| AP | aperiodic parameter used in speech synthesis | | | | | | | | | | | | | | | | | | |
| attention | Connects related words using dot products between word vectors | | | | | | | | | | | | | | | | | | |
| autoencoder | A neural network that combines an encoder and a decoder and is trained to reconstruct its input. Autoencoders try to preserve information, but create new representations with useful properties. Variants: undercomplete, regularized. | | | | | | | | | | | | | | | | | | |
| AutoModel | Huggingface library for loading pretrained models | | | | | | | | | | | | | | | | | | |
| autoregressive | | | | | | | | | | | | | | | | | | | |
| AutoTokenizer | Huggingface tool | | | | | | | | | | | | | | | | | | |
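A minimal sketch of the usual AutoClass loading pattern; the checkpoint name `bert-base-uncased` is only an illustrative choice.

```python
# Hedged sketch: load a pretrained encoder and tokenizer with Hugging Face AutoClasses.
# "bert-base-uncased" is an illustrative checkpoint name.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the model.
inputs = tokenizer("Attention connects related words.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```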
| backpropagation | Gradient-computation algorithm for training neural networks, popularized by Rumelhart, Hinton, and Williams (1986). Did not live up to high expectations in the 1980s. | | | | | | | | | | | | | | | | | | |
| bagging | | | | | | | | | | | | | | | | | | | |
| Bard | | | | | | | | | | | | | | | | | | | |
| batch normalization | DL technique that normalizes each layer's inputs over a mini-batch to stabilize and speed up training. | | | | | | | | | | | | | | | | | | |
| batch size (tokens) | 3M rising to 6M for Gopher, 1.5M rising to 3M for Chinchilla | | | | | | | | | | | | | | | | | | |
| Bayesian networks | | | | | | | | | | | | | | | | | | | |
| beam search decoding | | | | | | | | | | | | | | | | | | | |
| BERT | Bidirectional Encoder Representations from Transformers; built on the transformer encoder and powers Google Search. BERT-Base has 110M parameters, 12 layers, 12 attention heads, and 768 hidden units; BERT-Large has 340M parameters, 24 layers, 16 attention heads, and 1024 hidden units. | | | | | | | | | | | | | | | | | | |
| BERT-Large | | | | | | | | | | | | | | | | | | | |
| bidirectional LSTM (BiLSTM) | BiLSTM processes both forward and backward directions simultaneously using forward and backward pass. | | | | | | | | | | | | | | | | | | |
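A minimal PyTorch sketch of a bidirectional LSTM; the sizes are illustrative. The forward and backward hidden states are concatenated, so the output width is twice the hidden size.

```python
# Hedged sketch of a bidirectional LSTM in PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=1,
                 batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 16)          # (batch, time steps, features)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                     # (4, 10, 64): forward and backward states concatenated
```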
| BLEU | machine translation scoring method | | | | | | | | | | | | | | | | | | |
| BLOOM | Open multilingual language model (176B parameters) from the BigScience project coordinated by Hugging Face | | | | | | | | | | | | | | | | | | |
| BoolQ | NLP test | | | | | | | | | | | | | | | | | | |
| BOSS | CMU car which won DARPA Urban Challenge | | | | | | | | | | | | | | | | | | |
| brightness normalization | technique mentioned by Krizhevsky, Sutskever and Hinton in 2012 ImageNet paper | | | | | | | | | | | | | | | | | | |
| Caffe | ML framework | | | | | | | | | | | | | | | | | | |
| Caltech-101/256 | Early database of labeled images | | | | | | | | | | | | | | | | | | |
| CB | NLP test | | | | | | | | | | | | | | | | | | |
| Chinchilla | Compute-optimal LLM from DeepMind, 70B parameters | | | | | | | | | | | | | | | | | | |
| CIFAR | | | | | | | | | | | | | | | | | | | |
| CIFAR-10/100 | Early database of labeled images | | | | | | | | | | | | | | | | | | |
| classification | typical ML goal | | | | | | | | | | | | | | | | | | |
| classifier | ML model that decides whether an input belongs to class A or B. (In LLMs): [CLS] is a special classification token that begins a tokenized sequence in BERT-style models. | | | | | | | | | | | | | | | | | | |
| Claude | Anthropic LLM | | | | | | | | | | | | | | | | | | |
| clustering | canonical ML task of partitioning a set into homogeneous subsets | | | | | | | | | | | | | | | | | | |
| CLVP | Contrastive Language-Voice Pretraining. Scores candidates sampled from an autoregressive model so the best ones can be selected. | | | | | | | | | | | | | | | | | | |
| CNTK | ML framework | | | | | | | | | | | | | | | | | | |
| cognitive science | interdisciplinary study of the mind, drawing on psychology, neuroscience, linguistics, philosophy, and AI | | | | | | | | | | | | | | | | | | |
| Cognitron | early neural network by Fukushima, predecessor of the Neocognitron | | | | | | | | | | | | | | | | | | |
| Common Crawl | | | | | | | | | | | | | | | | | | | |
| configurator | In Le Cun's model, a module which performs executive control. Given a task to be executed, it preconfigures the perception module. | | | | | | | | | | | | | | | | | | |
| connectionism | 1980s movement also known as parallel distributed processing emphasized distributed representation | | | | | | | | | | | | | | | | | | |
| constrained | | | | | | | | | | | | | | | | | | | |
| contrast normalization | | | | | | | | | | | | | | | | | | | |
| contrastive loss | | | | | | | | | | | | | | | | | | | |
| Convolutional Neural Networks | Invented by LeCun in the 1980s. Built from convolutional layers with local receptive fields and shared weights that gather information layer by layer. They have fewer connections and parameters than standard fully connected networks and are easier to train. | | | | | | | | | | | | | | | | | | |
| cosine similarity function | compares word embeddings as the dot product of the vectors divided by the product of their Euclidean norms, i.e., the cosine of the angle between them | | | | | | | | | | | | | | | | | | |
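A small NumPy sketch of the cosine similarity just described; the vectors are toy values.

```python
# Hedged sketch: cosine similarity of two word embeddings with NumPy.
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])
print(cosine_similarity(a, b))  # close to 1.0 for similar directions
```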
| cost function | | | | | | | | | | | | | | | | | | | |
| Cyc | old knowledge base AI approach. Used CycL language. Demonstrates flaws of knowledge base approach. | | | | | | | | | | | | | | | | | | |
| d_model | Dimension of the output of every sublayer in the transformer architecture; 512 in the original Google paper. | | | | | | | | | | | | | | | | | | |
| Daedalus | | | | | | | | | | | | | | | | | | | |
| DALL-E | | | | | | | | | | | | | | | | | | | |
| DART | DARPA logistics program | | | | | | | | | | | | | | | | | | |
| Davinci | Early OpenAI model | | | | | | | | | | | | | | | | | | |
| decoder | Constructs the output sequence based on the vector received from the encoder. In transformers, the decoder produces text sequences and has multi-headed attention layers, add-and-norm layers, feed-forward layers, and a final linear layer giving probabilities over outputs. In representation learning, and in general, decoders convert encoded representations back into their original formats. | | | | | | | | | | | | | | | | | | |
| Deep Blue | IBM chess playing program, defeated Kasparov in 1997 | | | | | | | | | | | | | | | | | | |
| DeepMind | Google-owned NN company | | | | | | | | | | | | | | | | | | |
| deep belief network | trained by greedy layer-wise pre-training | | | | | | | | | | | | | | | | | | |
| deep learning | Deep learning is a subgroup of machine learning that allows representations to be composed of other, simpler representations. The quintessential example is the feedforward deep network, or multilayer perceptron. | | | | | | | | | | | | | | | | | | |
| deep neural networks | | | | | | | | | | | | | | | | | | | |
| DENDRAL | Mass spectrometry program for predicting compounds in 1969. | | | | | | | | | | | | | | | | | | |
| differentiable | can compute gradient estimates of some objective function with respect to its own input and propagate to upstream modules | | | | | | | | | | | | | | | | | | |
| DiffSVC | Diffusion probabilistic model for SVC | | | | | | | | | | | | | | | | | | |
| diffusion model | Also known as diffusion probabilistic models, denoising diffusion models, or score-based generative models; a class of latent variable models that can outperform GANs. Diffusion models are parameterized Markov chains with two processes: a fixed forward diffusion process that gradually corrupts the data with Gaussian noise, and a learned, parameterized reverse process that denoises step by step. Training a diffusion model requires finding reverse Markov transitions that maximize the likelihood of the training data. Diffusion models scale well to large datasets for image synthesis. | | | | | | | | | | | | | | | | | | |
| diffusion process | Data are progressively noised | | | | | | | | | | | | | | | | | | |
| Diffwave | Neural-based vocoder | | | | | | | | | | | | | | | | | | |
| dimensionality reduction | Also known as manifold learning, a canonical ML task which transforms an initial representation into a lower-dimensional one with preserved properties | | | | | | | | | | | | | | | | | | |
| DistBelief | Google's early distributed deep-learning framework, predecessor to TensorFlow | | | | | | | | | | | | | | | | | | |
| distributed representation | each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs | | | | | | | | | | | | | | | | | | |
| diversity loss | | | | | | | | | | | | | | | | | | | |
| DreamFusion | Google text-to-3D | | | | | | | | | | | | | | | | | | |
| dropout | Regularization method that limits overfitting by setting to zero a subset of features by multiplying them with a Bernoulli random variable. | | | | | | | | | | | | | | | | | | |
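A minimal NumPy sketch of dropout as a Bernoulli mask, assuming the common "inverted dropout" scaling so the expected activation is unchanged.

```python
# Hedged sketch of dropout as a Bernoulli mask (training-time behaviour, "inverted dropout").
import numpy as np

def dropout(x, p=0.5, rng=np.random.default_rng(0)):
    # Each feature is kept with probability 1-p and scaled so the expectation is unchanged.
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask / (1 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))
```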
| dualism | Cartesian idea that the mind is outside the laws of nature. | | | | | | | | | | | | | | | | | | |
| echo state networks | | | | | | | | | | | | | | | | | | | |
| ELMo | | | | | | | | | | | | | | | | | | | |
| embedding | Tokens go from single integers to vectors of d_model dimensions using learned embeddings in the model | | | | | | | | | | | | | | | | | | |
| empiricism | Francis Bacon and John Locke's movement held the senses are responsible for understanding | | | | | | | | | | | | | | | | | | |
| encoder | Encodes text (or any input data) into a vector (or different representation) and transmits to the decoder. | | | | | | | | | | | | | | | | | | |
| ensemble methods | | | | | | | | | | | | | | | | | | | |
| ERM | Empirical Risk Minimization learning rules | | | | | | | | | | | | | | | | | | |
| espeak | robotic TTS | | | | | | | | | | | | | | | | | | |
| Euclidean norm | vector distance from origin | | | | | | | | | | | | | | | | | | |
| expert system | see knowledge base. A rule-based system for determining answers, an older AI methodology | | | | | | | | | | | | | | | | | | |
| exploration vs exploitation dilemma | Reinforcement learning issue | | | | | | | | | | | | | | | | | | |
| F0 | Fundamental frequency variable used in speech synthesis | | | | | | | | | | | | | | | | | | |
| Fairseq | Facebook AI research sequence-to-sequence toolkit written in Python | | | | | | | | | | | | | | | | | | |
| FastSpeech | A transformer block acoustic model | | | | | | | | | | | | | | | | | | |
| FastSpeech2 | | | | | | | | | | | | | | | | | | | |
| FastSVC | Fast cross-domain SVC | | | | | | | | | | | | | | | | | | |
| feature | Elements of data for ML algorithms | | | | | | | | | | | | | | | | | | |
| feed-forward network | Normal neural network | | | | | | | | | | | | | | | | | | |
| FID | | | | | | | | | | | | | | | | | | | |
| Fifth Generation | Japanese project announced in 1981 | | | | | | | | | | | | | | | | | | |
| FlashAttention (tiling, recomputation) | | | | | | | | | | | | | | | | | | | |
| forward diffusion | In diffusion models, maps data to noise by gradually perturbing the input data with a Gaussian diffusion kernel. | | | | | | | | | | | | | | | | | | |
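A hedged DDPM-style sketch of sampling x_t directly from x_0 with the closed-form forward process; the linear beta schedule and sizes are illustrative assumptions.

```python
# Hedged DDPM-style sketch: sample x_t directly from x_0 using the closed-form forward process
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I). Schedule values are illustrative.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)           # cumulative product of (1 - beta_t)

def forward_diffuse(x0, t, rng=np.random.default_rng(0)):
    eps = rng.standard_normal(x0.shape)       # Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(8)
print(forward_diffuse(x0, t=500))             # heavily noised version of x0
```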
| forward propagation | | | | | | | | | | | | | | | | | | | |
| frames | Minsky idea | | | | | | | | | | | | | | | | | | |
| Galatea | | | | | | | | | | | | | | | | | | | |
| gated recurrent neural networks | | | | | | | | | | | | | | | | | | | |
| Gemini | Google effort to regain supremacy post-OpenAI | | | | | | | | | | | | | | | | | | |
| generative adversarial network (GAN) | | | | | | | | | | | | | | | | | | | |
| GET3D | Nvidia library | | | | | | | | | | | | | | | | | | |
| GloVe | Vector encoding scheme | | | | | | | | | | | | | | | | | | |
| GlowTTS | | | | | | | | | | | | | | | | | | | |
| GLUE | General Language Understanding Evaluation test. BERT set records at 80.5% when it came out 5/24/19. | | | | | | | | | | | | | | | | | | |
| GM | | | | | | | | | | | | | | | | | | | |
| Gopher | DeepMind 280B parameter LLM | | | | | | | | | | | | | | | | | | |
| GPS | General Problem Solver, created in 1957. | | | | | | | | | | | | | | | | | | |
| GPT-2 | 1.5 billion parameters | | | | | | | | | | | | | | | | | | |
| GPT-3 | 175 billion parameters | | | | | | | | | | | | | | | | | | |
| gradient boosting machine | | | | | | | | | | | | | | | | | | | |
| gradient descent | Class of line-search optimization algorithms for neural network learning. | | | | | | | | | | | | | | | | | | |
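A minimal sketch of gradient descent on a one-dimensional quadratic; the learning rate and step count are illustrative.

```python
# Hedged sketch: plain gradient descent on f(w) = (w - 3)^2.
def gradient_descent(lr=0.1, steps=50):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # move against the gradient
    return w

print(gradient_descent())    # converges toward the minimum at w = 3
```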
| greedily | learning representations one component (e.g., one layer) at a time, in succession, rather than jointly | | | | | | | | | | | | | | | | | | |
| greedy layer-wise pre-training | can train DBN | | | | | | | | | | | | | | | | | | |
| greedy search decoding | | | | | | | | | | | | | | | | | | | |
| GRU | gated recurrent unit, a gating mechanism in RNNs | | | | | | | | | | | | | | | | | | |
| gumbel softmax | | | | | | | | | | | | | | | | | | | |
| Hadamard product | Used in LSTMs, element-wise multiplication. | | | | | | | | | | | | | | | | | | |
| head | Each individual process of self-attention. 128 for Gopher, 64 for Chinchilla. | | | | | | | | | | | | | | | | | | |
| Hephaestus | | | | | | | | | | | | | | | | | | | |
| Hessian | | | | | | | | | | | | | | | | | | | |
| Heuristic Programming Project | effort at Stanford by Feigenbaum to extend expert systems | | | | | | | | | | | | | | | | | | |
| hidden layers | between input and output layers, introduce nonlinearity | | | | | | | | | | | | | | | | | | |
| hidden Markov models | | | | | | | | | | | | | | | | | | | |
| Hifi-GAN | Neural, GAN-based vocoder | | | | | | | | | | | | | | | | | | |
| HuBERT | hidden-unit BERT | | | | | | | | | | | | | | | | | | |
| Huggingface | ML model hub | | | | | | | | | | | | | | | | | | |
| hyperbolic tangent | Activation function used in hidden layers. tanh(a) = (exp(2a)-1) / (exp(2a)+1), equivalent to sinh(a)/cosh(a), bounded in [-1, 1]; a saturating non-linearity that is slower than ReLU. Tanh has a stronger gradient than sigmoid, so it can be faster, but it still suffers from the vanishing gradient problem. Tanh is zero-centered, which makes it easier to learn data centered around zero, especially in non-output layers. | | | | | | | | | | | | | | | | | | |
| hyperparameter | settings/parameters which do not change during training and are not related to training data, and therefore, external to the learning algorithm; for example, learning rate. | | | | | | | | | | | | | | | | | | |
| hypothesis set | set of functions mapping features to labels | | | | | | | | | | | | | | | | | | |
| ILSVRC-2010, -2012 | ImageNet Large-Scale Visual Recognition Challenge, began in 2010. Top-1 error was 47.1% and improved to 37.5% in 2012. | | | | | | | | | | | | | | | | | | |
| ImageNet | Image-recognition database and contest with 15 million high-resolution labeled images in 22k categories. Krizhevsky, Sutskever, and Hinton's CNN advanced the state of the art in 2012 with a 60M-parameter, 650k-neuron model with 5 convolutional and 3 fully connected layers. | | | | | | | | | | | | | | | | | | |
| inductive bias | incorporation of prior knowledge biases learning mechanisms | | | | | | | | | | | | | | | | | | |
| instruction fine-tuning | Supervised learning technique that endows an LLM with instruction-following capability. A widely used example is the Alpaca 52k instruction dataset. | | | | | | | | | | | | | | | | | | |
| InterSpeech | Speech conference | | | | | | | | | | | | | | | | | | |
| kernel methods | group of classification algorithms | | | | | | | | | | | | | | | | | | |
| knowledge base | Classical AI approach hard-codes world knowledge, eg Cyc, DENDRAL | | | | | | | | | | | | | | | | | | |
| knowledge representation | Storing knowledge and understanding for computers | | | | | | | | | | | | | | | | | | |
| Kullback-Leibler divergence | Statistical distance for probability distributions | | | | | | | | | | | | | | | | | | |
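A small NumPy sketch of the KL divergence between two discrete distributions (assuming strictly positive probabilities).

```python
# Hedged sketch: KL divergence between two discrete distributions, D_KL(P || Q) = sum p * log(p / q).
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))   # non-negative; zero only when P == Q
```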
| LabelMe | image database with hundreds of thousands of labeled segmented images | | | | | | | | | | | | | | | | | | |
| LaMDA | | | | | | | | | | | | | | | | | | | |
| Langevin dynamics | | | | | | | | | | | | | | | | | | | |
| Larynx | TTS system | | | | | | | | | | | | | | | | | | |
| latent variable | cannot be observed directly, must be inferred from statistical models | | | | | | | | | | | | | | | | | | |
| layer normalization | Operation performed on output of residual connection | | | | | | | | | | | | | | | | | | |
| LayerNorm | | | | | | | | | | | | | | | | | | | |
| Layers | 80 for Chinchilla, Gopher | | | | | | | | | | | | | | | | | | |
| leaky units | | | | | | | | | | | | | | | | | | | |
| learning rate | critical hyperparameter when training networks | | | | | | | | | | | | | | | | | | |
| LeNet-5 | 1990s CNN by Le Cun. | | | | | | | | | | | | | | | | | | |
| lexical field similarity | do the words go together without respect to position | | | | | | | | | | | | | | | | | | |
| LibriTTS | | | | | | | | | | | | | | | | | | | |
| line search | | | | | | | | | | | | | | | | | | | |
| linear regression | A learning algorithm which optimizes a line (slope + intercept) | | | | | | | | | | | | | | | | | | |
| Lisp | John McCarthy's AI programming language | | | | | | | | | | | | | | | | | | |
| LJSpeech | | | | | | | | | | | | | | | | | | | |
| local average pooling | | | | | | | | | | | | | | | | | | | |
| Loebner Prize | Turing Test competition | | | | | | | | | | | | | | | | | | |
| Logic Theorist | Newell, Simon early AI program | | | | | | | | | | | | | | | | | | |
| logical positivism | Rudolf Carnap idea holding all knowledge can be characterized by logical theories connected to observation sentences that correspond to sensory inputs | | | | | | | | | | | | | | | | | | |
| logistic regression | A simple ML classification algorithm that models class probability with the logistic (sigmoid) function and optimizes the logistic (cross-entropy) loss. | | | | | | | | | | | | | | | | | | |
| Logits | | | | | | | | | | | | | | | | | | | |
| long short-term memory (LSTM) | A type of RNN that is designed to remember information over extended time intervals. The cell state acts as the LSTM's memory; the hidden state is its output. Four gates (input, forget, cell, output) are computed from the current input, the prior hidden state, and biases. Sigmoid and tanh add non-linearity. LSTMs can be multilayered or "stacked", where output layers become inputs to the next layer, and can incorporate projections. See also bidirectional LSTM. The gate equations follow, with a minimal NumPy step sketched after them. | | | | | | | | | | | | | | | | | | |
| | Input gate: i_t = σ(W_ii x_t + b_ii + W_hi h_(t-1) + b_hi) | Sigmoid σ bounds output between 0 and 1. | | | | | | | | | | | | | | | | |
| | Forget gate: f_t = σ(W_if x_t + b_if + W_hf h_(t-1) + b_hf) | Sigmoid σ bounds output between 0 and 1. | | | | | | | | | | | | | | | | |
| | Cell gate: g_t = tanh(W_ig x_t + b_ig + W_hg h_(t-1) + b_hg) | Creates vector of new candidate values for updating cell state. Tanh normalizes values between -1 and 1. | | | | | | | | | | | | | | | | |
| | Output gate: o_t = σ(W_io x_t + b_io + W_ho h_(t-1) + b_ho) | Sigmoid σ bounds output between 0 and 1. | | | | | | | | | | | | | | | | |
| | Cell state update: c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t | Combines the old state and new candidate values using the Hadamard product. | | | | | | | | | | | | | | | | |
| | Hidden state update: h_t = o_t ⊙ tanh(c_t) | | | | | | | | | | | | | | | | | |
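A minimal NumPy sketch of one LSTM cell step following the gate equations above; for brevity it folds each gate's two bias terms into one, and all sizes and weights are illustrative.

```python
# Hedged NumPy sketch of one LSTM cell step following the gate equations above.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# One weight matrix and one (merged) bias per gate, for the input x_t and the prior hidden state h_{t-1}.
W_x = {g: rng.standard_normal((d_hid, d_in)) for g in "ifgo"}
W_h = {g: rng.standard_normal((d_hid, d_hid)) for g in "ifgo"}
b = {g: np.zeros(d_hid) for g in "ifgo"}

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(W_x["i"] @ x_t + W_h["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W_x["f"] @ x_t + W_h["f"] @ h_prev + b["f"])   # forget gate
    g = np.tanh(W_x["g"] @ x_t + W_h["g"] @ h_prev + b["g"])   # cell (candidate) gate
    o = sigmoid(W_x["o"] @ x_t + W_h["o"] @ h_prev + b["o"])   # output gate
    c = f * c_prev + i * g                                     # cell state update (Hadamard products)
    h = o * np.tanh(c)                                         # hidden state update
    return h, c

h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid))
print(h.shape, c.shape)   # (3,) (3,)
```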
| loss function | Least squared error, logistic loss, hinge loss, cross-entropy, sum of the squared residuals. Calculates error. | | | | | | | | | | | | | | | | | | |
| M-AILABS | | | | | | | | | | | | | | | | | | | |
| machine learning | subgroup of AI where machines can extract patterns from raw data | | | | | | | | | | | | | | | | | | |
| Make-A-Video | Facebook library | | | | | | | | | | | | | | | | | | |
| manifold tangent classifier | | | | | | | | | | | | | | | | | | | |
| masked language modeling | BERT technique | | | | | | | | | | | | | | | | | | |
| masking | Used by decoder to avoid computing attention for future words | | | | | | | | | | | | | | | | | | |
| materialist | The view that the mind, like the rest of the brain, operates according to the laws of physics. | | | | | | | | | | | | | | | | | | |
| max learning rate | 4e-5 for Gopher, 1e-4 for Chinchilla | | | | | | | | | | | | | | | | | | |
| MCC | Microelectronics and Computer Technology Corporation, a US research consortium | | | | | | | | | | | | | | | | | | |
| MDL | Minimum Description Length learning rule | | | | | | | | | | | | | | | | | | |
| MDM | Tel Aviv library on human motion diffusion model | | | | | | | | | | | | | | | | | | |
| Megatron-LM | 530B parameter system by Microsoft and Nvidia. Repository includes training pipeline. https://github.com/NVIDIA/Megatron-LM | | | | | | | | | | | | | | | | | | |
| mel-spectrogram | The main acoustic feature for acoustic models. | | | | | | | | | | | | | | | | | | |
| MelGAN | GAN-based vocoder, multi-band. | | | | | | | | | | | | | | | | | | |
| minima | Determining whether a minimum of a NN loss surface is local or global is challenging | | | | | | | | | | | | | | | | | | |
| MMLU | Test for LLMs | | | | | | | | | | | | | | | | | | |
| MNIST | handwriting recognition | | | | | | | | | | | | | | | | | | |
| momentum | | | | | | | | | | | | | | | | | | | |
| MOS | Mean Opinion Score, used to rate TTS | | | | | | | | | | | | | | | | | | |
| multi-task learning | | | | | | | | | | | | | | | | | | | |
| multihead attention layer | Runs self-attention several times in parallel, one per head, to capture different kinds of attention. Divides d_model into 8 "heads" of 64 dimensions each, representing different subspaces of how each word relates to the others; each head produces a matrix Z_i of shape (sequence length × d_k). | | | | | | | | | | | | | | | | | | |
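A small NumPy sketch of splitting d_model = 512 into 8 heads of 64 dimensions and concatenating the per-head outputs back together; shapes only, no learned weights.

```python
# Hedged sketch: dividing d_model = 512 into 8 heads of d_k = 64, as described above.
import numpy as np

seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                                      # 64

x = np.random.default_rng(0).standard_normal((seq_len, d_model))
heads = x.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)   # (8, 5, 64): one Z_i per head
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # concatenated back to (5, 512)
print(heads.shape, concat.shape)
```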
| MultiNLI | BERT set standard with 86.7% | | | | | | | | | | | | | | | | | | |
| MultiRC | NLP test | | | | | | | | | | | | | | | | | | |
| MXNet | ML framework | | | | | | | | | | | | | | | | | | |
| MYCIN | 450 rule expert system for medicine | | | | | | | | | | | | | | | | | | |
| naïve Bayes | Simple probabilistic classifier that assumes features are conditionally independent given the class. | | | | | | | | | | | | | | | | | | |
| NarrativeQA | Test for LLMs | | | | | | | | | | | | | | | | | | |
| natural language processing (NLP) | The field of studying how computers can be made to communicate in human language | | | | | | | | | | | | | | | | | | |
| NaturalQuestions | test for LLMs | | | | | | | | | | | | | | | | | | |
| Neocognitron | Predecessor to CNN | | | | | | | | | | | | | | | | | | |
| NetTalk | | | | | | | | | | | | | | | | | | | |
| neural network | | | | | | | | | | | | | | | | | | | |
| Newton direction | | | | | | | | | | | | | | | | | | | |
| next sentence prediction (NSP) | | | | | | | | | | | | | | | | | | | |
| NIPS 2017 | Neural Information Processing Systems conference | | | | | | | | | | | | | | | | | | |
| noise | | | | | | | | | | | | | | | | | | | |
| NORB | Early database of labeled images | | | | | | | | | | | | | | | | | | |
| norm | | | | | | | | | | | | | | | | | | | |
| NPLM | | | | | | | | | | | | | | | | | | | |
| online learning | Multiple rounds of mixed training/testing phases | | | | | | | | | | | | | | | | | | |
| OPT | Facebook Open Pre-Trained Transformer | | | | | | | | | | | | | | | | | | |
| optimization | | | | | | | | | | | | | | | | | | | |
| output layer | final layer which requires modulation to achieve correct form of inference | | | | | | | | | | | | | | | | | | |
| overfitting | poor generalization due to a small sample size and complex function | | | | | | | | | | | | | | | | | | |
| PAC | Probably Approximately Correct learning model/framework | | | | | | | | | | | | | | | | | | |
| PaLM | 540B parameters | | | | | | | | | | | | | | | | | | |
| Pandora | | | | | | | | | | | | | | | | | | | |
| parameter | | | | | | | | | | | | | | | | | | | |
| parametrized reverse process | In diffusion models, undoes forward diffusion and performs iterative denoising. Converts random noise into realistic data. | | | | | | | | | | | | | | | | | | |
| Pascal Visual Object Challenge | related to ImageNet | | | | | | | | | | | | | | | | | | |
| passive learning | paradigm where the learner does not interact with the environment at training time | | | | | | | | | | | | | | | | | | |
| PDDL+ | Planning/modeling language | | | | | | | | | | | | | | | | | | |
| penalty | | | | | | | | | | | | | | | | | | | |
| perceptron | Rosenblatt's single-layer linear model of a neuron. The multilayer perceptron (MLP), which composes many such units and maps inputs to outputs by composing simpler functions, is the quintessential example of a DL model. | | | | | | | | | | | | | | | | | | |
| perceptron convergence theorem | the learning algorithm can adjust the connection strengths of a perceptron to match any input data, provided such a match exists (i.e., the data are linearly separable) | | | | | | | | | | | | | | | | | | |
| physical symbol system | any system exhibiting intelligence is manipulating symbol-based data structures | | | | | | | | | | | | | | | | | | |
| PixelCNN | | | | | | | | | | | | | | | | | | | |
| PLANNER | Knowledge representation scheme | | | | | | | | | | | | | | | | | | |
| pre-activation | the weighted sum (plus bias) computed prior to applying the activation function | | | | | | | | | | | | | | | | | | |
| pooling | | | | | | | | | | | | | | | | | | | |
| positional encoding | Used in transformers to encode the position of specifics words in a sentence | | | | | | | | | | | | | | | | | | |
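A hedged sketch of the sinusoidal positional encoding used in the original transformer paper; other encodings (e.g., learned) are also common.

```python
# Hedged sketch of sinusoidal positional encoding:
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)   # (50, 512)
```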
| principal component analysis | Linear dimensionality-reduction technique; probabilistic PCA and independent component analysis (ICA) are related variants. | | | | | | | | | | | | | | | | | | |
| prior | | | | | | | | | | | | | | | | | | | |
| projection | Reducing dimensionality to simplify data, improve performance, etc. See PCA, LDA. | | | | | | | | | | | | | | | | | | |
| Prolog | | | | | | | | | | | | | | | | | | | |
| pseudolabeling | self-training, a semi-supervised learning technique | | | | | | | | | | | | | | | | | | |
| Pygmalion | Mythological sculptor who created Galatea, a sculpture which came to life | | | | | | | | | | | | | | | | | | |
| PyLearn2 | older framework | | | | | | | | | | | | | | | | | | |
| PyTorch | Leading ML framework, successor to the Lua-based Torch, introduced 9/2016 by Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan | | | | | | | | | | | | | | | | | | |
| R1 | Early expert system deployed by DEC in the 1980s | | | | | | | | | | | | | | | | | | |
| random forest | | | | | | | | | | | | | | | | | | | |
| ranking | typical ML task | | | | | | | | | | | | | | | | | | |
| Rectified Linear Activation Function | Activation function max(0, a), modeling a neuron's firing threshold. A non-saturating nonlinearity that is faster to train than tanh and does not require input normalization to prevent saturation. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. | | | | | | | | | | | | | | | | | | |
| recurrent neural networks (RNN) | Neural networks with recurrent (feedback) connections for processing sequences. Hopfield's associative networks (1982) were an early recurrent architecture; LSTMs emerged from RNNs. Can be bidirectional. Can be gated (LSTM, GRU). | | | | | | | | | | | | | | | | | | |
| recursive network | | | | | | | | | | | | | | | | | | | |
| regression | canonical ML task | | | | | | | | | | | | | | | | | | |
| regularization | DL/ML techniques intended to reduce generalization (test) error, e.g., weight decay or dropout | | | | | | | | | | | | | | | | | | |
| relu | see Rectified Linear Activation Unit/Function | | | | | | | | | | | | | | | | | | |
| representation learning | ML approach which learns the representation of data. See autoencoder. Representation learning has a critical issue: representations are often as hard to obtain as solving the original problem. Deep learning is the solution. | | | | | | | | | | | | | | | | | | |
| residual connection | Skip connection that adds a sublayer's input (e.g., the positional input embedding) to that sublayer's output (e.g., the multi-headed attention output vector) before layer normalization | | | | | | | | | | | | | | | | | | |
| residual neural network | aka ResNet, continuous | | | | | | | | | | | | | | | | | | |
| reverse diffusion process | | | | | | | | | | | | | | | | | | | |
| Rhasspy | github.com/rhasspy | | | | | | | | | | | | | | | | | | |
| RMSNorm | | | | | | | | | | | | | | | | | | | |
| RoBERTa | Optimized BERT. 88.5 on GLUE. | | | | | | | | | | | | | | | | | | |
| RTE | NLP test | | | | | | | | | | | | | | | | | | |
| RVC | Retrieval-based voice conversion. Example: https://github.com/w-okada/voice-changer | | | | | | | | | | | | | | | | | | |
| SAINT | 1963 program by James Slagle for integration | | | | | | | | | | | | | | | | | | |
| self-attention | Allows association of each word in the input with other words in the same sentence. Based on query, key and value vectors. | | | | | | | | | | | | | | | | | | |
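A minimal NumPy sketch of single-head self-attention with query, key, and value projections of the same input; sizes and weights are illustrative.

```python
# Hedged sketch of single-head self-attention: project the same input to queries, keys, and
# values, then apply scaled dot-product attention.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.standard_normal((seq_len, d_model))          # one token embedding per row

W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_k)                      # how much each word attends to every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V
print(output.shape)                                  # (4, 8)
```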
| semantic role labeling | | | | | | | | | | | | | | | | | | | |
| semi-supervised training/learning | training set includes both labeled and unlabeled data | | | | | | | | | | | | | | | | | | |
| SEP | special separator token that ends a tokenized segment in BERT-style models | | | | | | | | | | | | | | | | | | |
| sequence-to-sequence (Seq2Seq) | A type of NN that converts one sequence of components into another | | | | | | | | | | | | | | | | | | |
| sequence modeling | uses recurrent and recursive networks | | | | | | | | | | | | | | | | | | |
| SHRDLU | 1972 program by Terry Winograd | | | | | | | | | | | | | | | | | | |
| sigmoid | type of activation function = 1/(1+exp(-a)), transforms input to [0, 1] | | | | | | | | | | | | | | | | | | |
| singing voice conversion | Converts source voice to target voice | | | | | | | | | | | | | | | | | | |
| skip-gram | Focuses on a center word in a window of words and predicts the context words around it. Generally contains an input layer, weights, a hidden layer, and an output containing the word embeddings, resulting in a d_model-dimensional vector embedding for each word. Skip-gram is one of the two word2vec architectures (the other is CBOW). | | | | | | | | | | | | | | | | | | |
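A small sketch of how (center word, context word) training pairs are generated for a skip-gram model; the sentence and window size are illustrative.

```python
# Hedged sketch: generating (center word, context word) training pairs with a window of 2.
tokens = "the cat sat on the mat".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:6])   # e.g. ('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...
```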
| sliding-block puzzles | nxn matrices with one blank tile, initial state is random, goal state is ordered | | | | | | | | | | | | | | | | | | |
| SOAR | agent architecture by Newell, Laird and Rosenbloom | | | | | | | | | | | | | | | | | | |
| softmax | Output layer activation function. | | | | | | | | | | | | | | | | | | |
| SP | Spectral envelope variable used in speech synthesis | | | | | | | | | | | | | | | | | | |
| sparse representation | | | | | | | | | | | | | | | | | | | |
| SQuAD | Stanford Question Answering Dataset. Reading comprehension. V1.1, V2.0. BERT set standard with 93.2 on v1.1 (F1) and 83.1 on v2.0 (F1) | | | | | | | | | | | | | | | | | | |
| squared loss | loss function | | | | | | | | | | | | | | | | | | |
| SRM | Structural Risk Minimization learning rule | | | | | | | | | | | | | | | | | | |
| Stanford Sentiment Treebank | | | | | | | | | | | | | | | | | | | |
| STANLEY | Winner of DARPA Grand Challenge | | | | | | | | | | | | | | | | | | |
| steepest descent | | | | | | | | | | | | | | | | | | | |
| stochastic gradient descent | training algorithm: gradient descent using a single example or mini-batch per update | | | | | | | | | | | | | | | | | | |
| structured prediction problems | Context-free parsing, dependency parsing, named-entity recognition, part-of-speech tagging | | | | | | | | | | | | | | | | | | |
| SuperGLUE | Leaderboard | | | | | | | | | | | | | | | | | | |
| supervised learning | Learn a predictor from labeled (x, y) pairs; examples include linear regression, logistic regression, SVMs, and naïve Bayes | | | | | | | | | | | | | | | | | | |
| Supervision | U Toronto DNN for image recognition | | | | | | | | | | | | | | | | | | |
| support vector machine | best known kernel method | | | | | | | | | | | | | | | | | | |
| SVCC | Singing Voice Conversion challenge | | | | | | | | | | | | | | | | | | |
| SWAG | Situations with Adversarial Generations | | | | | | | | | | | | | | | | | | |
| Switchboard | speech recognition | | | | | | | | | | | | | | | | | | |
| t-SNE | nonlinear dimensionality-reduction technique for visualizing high-dimensional data as clusters in 2-D or 3-D | | | | | | | | | | | | | | | | | | |
| T5 | Text-to-Text Transfer Transformer from Google, available through Hugging Face | | | | | | | | | | | | | | | | | | |
| Tacotron2 | An encoder-attention-decoder acoustic model | | | | | | | | | | | | | | | | | | |
| Talos | | | | | | | | | | | | | | | | | | | |
| tangent distance | | | | | | | | | | | | | | | | | | | |
| tangent drop | | | | | | | | | | | | | | | | | | | |
| tanh | see hyperbolic tangent | | | | | | | | | | | | | | | | | | |
| Taylor expansion | | | | | | | | | | | | | | | | | | | |
| Theano | older ML framework | | | | | | | | | | | | | | | | | | |
| Theseus | | | | | | | | | | | | | | | | | | | |
| tokenize | represent each word with a token | | | | | | | | | | | | | | | | | | |
| Torch | Lua-based predecessor to PyTorch | | | | | | | | | | | | | | | | | | |
| total Turing test | a modified Turing Test requiring computer vision | | | | | | | | | | | | | | | | | | |
| transductive inference | Training data is labeled and unlabeled test points, but the objective is to predict labels only for unlabeled test points and not unseen test set data. | | | | | | | | | | | | | | | | | | |
| transfer learning | | | | | | | | | | | | | | | | | | | |
| transformer | NN architecture introduced by Google in 2017, with a 6-layer encoder stack and a 6-layer decoder stack. Each encoder layer has two major sublayers: a multi-headed attention mechanism and a fully connected position-wise feedforward network. | | | | | | | | | | | | | | | | | | |
| Transformer-TTS | | | | | | | | | | | | | | | | | | | |
| Transformers | Huggingface library | | | | | | | | | | | | | | | | | | |
| Trax | library by Google Brain | | | | | | | | | | | | | | | | | | |
| trust region | numerical optimization idea | | | | | | | | | | | | | | | | | | |
| txt2vec | Acoustic model using vector-quantized acoustic feature instead of mel-spectrogram. A classification model rather than a traditional regression model. Uses labeled phoneme-level prosodies for all phonemes in advance. The text encoder consists of 6 Conformer blocks, which encode the input phonemes into hidden states h. | | | | | | | | | | | | | | | | | | |
| TTS | text-to-speech. Current models include neural (wavenet, tacotron) and statistical parametric speech synthesis (spss). | | | | | | | | | | | | | | | | | | |
| Turing Test | Proposed in 1950 | | | | | | | | | | | | | | | | | | |
| U-Net | | | | | | | | | | | | | | | | | | | |
| underfitting | a function too simple to achieve sufficient predictive power | | | | | | | | | | | | | | | | | | |
| unsupervised learning | learning from unlabeled data (no x-y pairs); used in a minority of applications compared with supervised learning | | | | | | | | | | | | | | | | | | |
| unsupervised pre-training | DL technique | | | | | | | | | | | | | | | | | | |
| VAE | Variational AutoEncoder | | | | | | | | | | | | | | | | | | |
| vanishing gradient problem | Gradients become very small during backpropagation. | | | | | | | | | | | | | | | | | | |
| vec2wav | Vocoder used in VQTTS; uses an additional feature encoder before HifiGAN generation to smooth the discontinuous quantized features. | | | | | | | | | | | | | | | | | | |
| vector quantization | can be applied to self-supervised feature extraction | | | | | | | | | | | | | | | | | | |
| vectorize | convert tokens to vectors | | | | | | | | | | | | | | | | | | |
| visible layer | The input or output layers. So called because we can see and understand them. | | | | | | | | | | | | | | | | | | |
| VITS | Conditional variational autoencoder with adversarial learning. TTS Model. https://github.com/jaywalnut310/vits | | | | | | | | | | | | | | | | | | |
| vllm | Competitor to Huggingface Transformers for inference | | | | | | | | | | | | | | | | | | |
| vocoder | Generates waveform for TTS. GAN-based (MelGAN, HifiGAN). Universal-Vocoder | | | | | | | | | | | | | | | | | | |
| VoxCeleb | With VoxCeleb2, celebrity voice/video training set. https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html | | | | | | | | | | | | | | | | | | |
| VQTTS | TTS approach consisting of acoustic model txt2vec and vocoder vec2wav. Adds log pitch, energy and probability of voice features for prosody. | | | | | | | | | | | | | | | | | | |
| wav2vec | Multi-layer convolutional network optimized via contrastive loss. Extracts features to predict successive frames. | | | | | | | | | | | | | | | | | | |
| WaveNet | Neural-based vocoder by Deepmind in 2016. | | | | | | | | | | | | | | | | | | |
| WaveRNN | Neural-based vocoder | | | | | | | | | | | | | | | | | | |
| weight decay | regularizer that penalizes large weights to reduce overfitting | | | | | | | | | | | | | | | | | | |
| WiC | NLP test | | | | | | | | | | | | | | | | | | |
| word embedding layer | Lookup table which represents each word as a vector | | | | | | | | | | | | | | | | | | |
| word embedding vector | | | | | | | | | | | | | | | | | | | |
| word2vec | Embedding scheme. | | | | | | | | | | | | | | | | | | |
| WORLD | Vocoder | | | | | | | | | | | | | | | | | | |
| Wu Dao 2.0 | 1.75T parameters | | | | | | | | | | | | | | | | | | |
| X-CLIP | Microsoft video-to-text | | | | | | | | | | | | | | | | | | |
| XLNet-Large | former GLUE leader | | | | | | | | | | | | | | | | | | |
| XOR | exclusive OR cannot be approximated by linear models | | | | | | | | | | | | | | | | | | |
| zero-one loss | also known as the misclassification loss function | | | | | | | | | | | | | | | | | | |
| HumanEval | | | | | | | | | | | | | | | | | | | |