2acoustic modelUsed in TTS to predict acoustic features; Tacotron 2 and FastSpeech 2 are examples. The input transcript is one input feature; prosody and linguistic features (phoneme-level, word-level and hierarchical prosodies; syntactic graphs; word embeddings) can be others. L1/L2 loss assumes the distribution of the acoustic feature is unimodal, but the real distribution is much more complex; normalizing flows (FlowTTS, GlowTTS) are one approach. Txt2vec instead predicts a self-supervised VQ acoustic feature rather than a mel spectrogram.
3activation functionnonlinear function applied to a unit's pre-activation; sigmoid, tanh and ReLU are examples
4active learningparadigm where the learner interacts with the environment at training time
5ADALINElinear model (adaptive linear element)
6adaptive learning rateMelGAN
7ADMAblated diffusion model
8adversarial training
9AI2Allen Institute open language model
10anthropic principleThe "observation selection effect" suggests that the range of observations we could make is limited by the fact that observations could only happen in a universe capable of developing intelligent life in the first place. The universe has properties that accommodate life because we wouldn't be around to make observations otherwise. There are more than 30 forms of anthropic principle. The weak form states the universe's apparent fine-tuning is due to selection/survivorship bias. The strong anthropic principle considers the universe compelled to eventually have conscious and sapient life emerge within it. The participatory anthropic principle, due to Wheeler, suggests the universe must be observed. The final anthropic principle suggests the universe's structure is expressible as bits of information.
11APaperiodic parameter used in speech synthesis
12attentionConnects related words using dot products between word vectors
13autoencoderA neural network which combines an encoder and a decoder and is trained to reconstruct its input. Autoencoders try to preserve information, but create new representations with useful properties. Variants include undercomplete and regularized autoencoders.
14AutoModelHuggingface library for loading pretrained models
15autoregressive
16AutoTokenizerHuggingface tool
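Example (AutoTokenizer/AutoModel): a minimal sketch of how the two Huggingface classes are typically used together; it assumes the transformers and torch packages are installed, and "bert-base-uncased" is just one example checkpoint.
    from transformers import AutoTokenizer, AutoModel

    # Load a pretrained tokenizer and model by checkpoint name.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Attention connects related words.", return_tensors="pt")
    outputs = model(**inputs)                # hidden states for each token
    print(outputs.last_hidden_state.shape)   # (batch, sequence length, hidden size)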
17backpropagationAlgorithm for training neural networks by propagating error gradients backward through the layers; popularized by Hinton and colleagues. Did not live up to high expectations in the 1980s.
18bagging
19Bard
20batch normalizationDL technique
21batch size, head (number)3M-6M, 1.5M-3M
22Bayesian networks
23beam search decoding
24BERTBidirectional Encoder Representations from Transformers; powers Google Search and builds on the original Transformer. BERT-base had 110M parameters, 12 attention heads, 768 hidden units and 12 layers; BERT-large had 24 layers, 1024 hidden units, 16 attention heads and 340M parameters.
25BERT-Large
26bidirectional LSTM (BiLSTM)Processes the input sequence in both the forward and backward directions and combines the two passes, so each position sees past and future context; see the sketch below.
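Example (bidirectional LSTM): a minimal PyTorch sketch; the sizes are arbitrary illustration values.
    import torch
    import torch.nn as nn

    # Bidirectional LSTM over a batch of embedded sequences.
    bilstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True, bidirectional=True)

    x = torch.randn(8, 20, 128)      # (batch, time steps, features)
    output, (h_n, c_n) = bilstm(x)
    print(output.shape)              # (8, 20, 128): forward and backward hidden states concatenated (2 x 64)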
27BLEUmachine translation scoring method
28BLOOMOpen multilingual language model from the Hugging Face-led BigScience project
29BoolQNLP test
30BOSSCMU car which won DARPA Urban Challenge
31brightness normalizationtechnique mentioned by Krizhevsky, Sutskever and Hinton in 2012 ImageNet paper
32CaffeML framework
33Caltech-101/256Early database of labeled images
34CBNLP test
35ChinchillaCompute-optimal DeepMind LLM, 70B parameters
36CIFAR
37CIFAR-10/100Early database of labeled images
38classificationtypical ML goal
39classifierML technique that decides whether an input belongs in class A or B. (in LLMs): CLS is the special classification token that begins a tokenized sequence (see SEP).
40ClaudeAnthropic LLM
41clusteringcanonical ML task of partitioning a set into homogeneous subsets
42CLVPContrastive Language-Voice Pretraining. Scores autoregressive TTS samples to find the best candidates.
43CNTKML framework
44cognitive scienceinterdisciplinary study of the mind and intelligence; related to, but distinct from, neuroscience
45Cognitronearly neural network by Fukushima, predecessor of the Neocognitron
46Common Crawl
47configuratorIn Le Cun's model, a module which performs executive control. Given a task to be executed, it preconfigures the perception module.
48connectionism1980s movement, also known as parallel distributed processing, which emphasized distributed representations
49constrained
50contrast normalization
51contrastive loss
52Convolutional Neural NetworksInvented by Le Cun in the 1980s. Stacks of learned convolutional filters gather information layer by layer from local regions of the input. They have fewer connections and parameters than standard fully connected NNs, and are easier to train.
53cosine similarity functioncompares word embeddings by the cosine of the angle between them: the dot product divided by the product of the Euclidean norms, i.e., the dot product of the vectors projected onto the unit sphere
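Example (cosine similarity): a small NumPy sketch of the definition above.
    import numpy as np

    def cosine_similarity(u, v):
        # Dot product scaled by the product of the Euclidean norms.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])
    print(cosine_similarity(a, b))   # 1.0: same direction regardless of magnitude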
54cost function
55Cycold knowledge base AI approach. Used CycL language. Demonstrates flaws of knowledge base approach.
56D_modelDimensionality of the output of every sublayer (and of the embeddings) in the transformer architecture; 512 in the original Google paper.
57Daedalus
58DALL-E
59DARTDARPA logistics program
60DavinciEarly OpenAI model
61decoderConstructs the output sequence from the representation received from the encoder. In transformers, the decoder produces text sequences and has multi-headed attention layers, add & norm layers, feed-forward layers, and a final linear layer giving probabilities over the output vocabulary. In representation learning, and in general, decoders convert encoded representations back into their original formats.
62Deep BlueIBM chess playing program, defeated Kasparov in 1997
63DeepMindGoogle-owned NN company
64deep belief networktrained by greedy layer-wise pre-training
65deep learningDeep learning is a subgroup of machine learning that allows representations to be composed of other, simpler representations. The quintessential example is the feedforward deep network, or multilayer perceptron.
66deep neural networks
67DENDRALMass spectrometry program for predicting compounds in 1969.
68differentiablecan compute gradient estimates of some objective function with respect to its own input and propagate to upstream modules
69DiffSVCDiffusion probabilistic model for SVC
70diffusion modelAlso known as diffusion probabilistic models, a class of latent variable models which outperform GANs. Also known as denoising diffusion models or score-based generative models. Diffusion models are parameterized Markov chains using two processes: forward diffusion and parametrized reverse. The forward diffusion process passes randomly sample noise with Gaussian noise and then reverses through the "learned denoising process". Diffusion models map latent space using a fixed Markov chain, which adds noise. Training a diffusion model requires finding reverse Markov transitions that maximize the likelihood of the training data. Diffusion models scale well to large datasets for image synthesis.
71diffusion processData are progressively noised
72DiffwaveNeural-based vocoder
73dimensionality reductionAlso known as manifold learning, a canonical ML task which transforms an initial representation into a lower-dimensional one with preserved properties
74DistBeliefframework
75distributed representationeach input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs
76diversity loss
77DreamFusionGoogle text-to-3D
78dropoutRegularization method that limits overfitting by setting to zero a subset of features by multiplying them with a Bernoulli random variable.
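Example (dropout): a minimal sketch of inverted dropout with an explicit Bernoulli mask, assuming PyTorch; in practice torch.nn.Dropout provides the same behavior.
    import torch

    def dropout(x, p=0.5, training=True):
        # Zero each feature with probability p and rescale the survivors.
        if not training or p == 0.0:
            return x
        mask = torch.bernoulli(torch.full_like(x, 1.0 - p))
        return x * mask / (1.0 - p)

    x = torch.ones(4, 8)
    print(dropout(x))                # roughly half the entries zeroed, survivors scaled by 2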
79dualismCartesian idea that the mind is outside the laws of nature.
80echo state networks
81ELMo
82embeddingTokens go from single integers to vectors of d_model dimensions using learned embeddings in the model
83empiricismFrancis Bacon and John Locke's movement held the senses are responsible for understanding
84encoderEncodes text (or any input data) into a vector (or different representation) and transmits to the decoder.
85ensemble methods
86ERMEmpirical Risk Minimization learning rules
87espeakrobotic TTS
88Euclidean normvector distance from origin
89expert systemsee knowledge base. A rule-based system for determining answers, an older AI methodology
90exploration vs exploitation dilemmaReinforcement learning issue
91F0Fundamental frequency variable used in speech synthesis
92FairseqFacebook AI research sequence-to-sequence toolkit written in Python
93FastSpeechA transformer block acoustic model
94FastSpeech2
95FastSVCFast cross-domain SVC
96featureElements of data for ML algorithms
97feed-forward networkNormal neural network
98FIDFréchet Inception Distance, a metric for judging the quality of generated images
99Fifth GenerationJapanese project announced in 1981
100FlashAttention (tiling, recomputation)
101forward diffusionIn diffusion models, maps data to noise by gradually perturbing the input data with a Gaussian diffusion kernel.
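Example (forward diffusion): an illustrative sketch of the closed-form noising step x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps with a linear beta schedule; the schedule values and tensor shapes are arbitrary illustration choices, and PyTorch is assumed.
    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(x0, t):
        # Sample x_t ~ q(x_t | x_0) directly, without running t individual noising steps.
        eps = torch.randn_like(x0)
        a_bar = alpha_bars[t]
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    x0 = torch.randn(1, 3, 32, 32)               # stand-in for a training image
    x_noised = q_sample(x0, t=500)               # heavily perturbed version of x0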
102forward propagation
103framesMinsky idea
104Galatea
105gated recurrent neural networks
106GeminiGoogle effort to regain supremacy post-OpenAI
107generative adversarial network (GAN)
108GET3DNvidia library
109GloVeVector encoding scheme
110GlowTTS
111GLUEGeneral Language Understanding Evaluation test. BERT set records at 80.5% when it came out 5/24/19.
112GM
113GopherDeepMind 280B parameter LLM
114GPSGeneral Problem Solver, created in 1957.
115GPT-21.5 billion parameters
116GPT-3175 billion parameters
117gradient boosting machine
118gradient descentClass of line-search optimization algorithms for neural network learning.
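Example (gradient descent): a minimal sketch on a one-dimensional quadratic loss f(w) = (w - 3)^2; the step size and iteration count are arbitrary illustration values.
    def grad(w):
        return 2.0 * (w - 3.0)        # derivative of (w - 3)^2

    w, learning_rate = 0.0, 0.1
    for _ in range(100):
        w -= learning_rate * grad(w)  # step against the gradient
    print(w)                          # converges toward the minimizer w = 3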
119greedilylearning representations one piece at a time, in succession, rather than jointly optimizing all of them
120greedy layer-wise pre-trainingcan train DBN
121greedy search decoding
122GRUgated recurrent unit, a gating mechanism in RNNs
123gumbel softmaxdifferentiable, continuous relaxation of sampling from a categorical distribution, used to backpropagate through discrete choices
124Hadamard productUsed in LSTMs, element-wise multiplication.
125headEach individual process of self-attention. 128 for Gopher, 64 for Chinchilla.
126Hephaestus
127Hessian
128Heuristic Programming Projecteffort at Stanford by Feigenbaum to extend expert systems
129hidden layersbetween input and output layers, introduce nonlinearity
130hidden Markov models
131Hifi-GANNeural, GAN-based vocoder
132HuBERThidden-unit BERT
133HuggingfaceML model hub
134hyperbolic tangentActivation function used in hidden layers. Also known as tanh = (exp(2a)-1) / (exp(2a)+1), equivalent to sinh(a)/cosh(a); bounds output to [-1, 1]. A saturating non-linearity, and therefore slower to train than ReLU. Tanh has a stronger gradient than sigmoid, so it can learn faster, but still suffers from the vanishing gradient problem. Tanh is zero-centered, which makes it easier to learn from data centered around zero, especially in non-output layers.
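Example (activation functions): a NumPy sketch comparing the sigmoid, tanh and ReLU formulas mentioned in these entries.
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))                    # output in (0, 1)

    def tanh(a):
        return (np.exp(2 * a) - 1) / (np.exp(2 * a) + 1)   # output in (-1, 1), zero-centered

    def relu(a):
        return np.maximum(0.0, a)                          # non-saturating for positive inputs

    a = np.linspace(-4, 4, 9)
    print(sigmoid(a), tanh(a), relu(a), sep="\n")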
135hyperparametersettings/parameters which do not change during training and are not related to training data, and therefore, external to the learning algorithm; for example, learning rate.
136hypothesis setset of functions mapping features to labels
137ILSVRC-2010, -2012ImageNet Large-Scale Visual Recognition Challenge, began in 2010. Top-1 error was 47.1% and improved to 37.5% in 2012.
138ImageNetImage recognition database (and contest) with 15 million high-resolution labeled images in 22k categories. Krizhevsky, Sutskever and Hinton's CNN advanced the state of the art in 2012 with a 60M-parameter, 650k-neuron model of 5 convolutional and 3 fully-connected layers.
139inductive biasincorporation of prior knowledge biases learning mechanisms
140instruction fine-tuningSupervised learning technique endows LLM with instruction-following capability. Widely-used example is Alpaca 52k.
141InterSpeechSpeech conference
142kernel methodsgroup of classification algorithms
143knowledge baseClassical AI approach hard-codes world knowledge, eg Cyc, DENDRAL
144knowledge representationStoring knowledge and understanding for computers
145Kullback-Leibler divergenceStatistical distance for probability distributions
150LabelMeimage database with hundreds of thousands of labeled segmented images
151LaMDAGoogle conversational LLM (Language Model for Dialogue Applications)
152Langevin dynamics
153LarynxTTS system
154latent variablecannot be observed directly, must be inferred from statistical models
155layer normalizationOperation performed on output of residual connection
156LayerNorm
157Layers80 for Chinchilla, Gopher
158leaky units
159learning ratecritical hyperparameter when training networks
160LeNet-51990s CNN by Le Cun.
161lexical field similaritydo the words go together without respect to position
162LibriTTS
163line search
164linear regressionA learning algorithm which optimizes a line (slope + intercept)
165LispJohn McCarthy's AI programming language
166LJSpeech
167local average pooling
168Loebner PrizeTuring Test competition
169Logic TheoristNewell, Simon early AI program
170logical positivismRudolf Carnap idea holding all knowledge can be characterized by logical theories connected to observation sentences that correspond to sensory inputs
171logistic regressionA simple ML classification algorithm which fits a logistic (sigmoid) curve to the data.
172LogitsRaw, unnormalized model outputs, fed to a softmax to produce probabilities
173long short-term memory (LSTM)A type of RNN that is designed to remember information over extended time intervals using a memory cell; the cell state carries the long-term memory and the hidden state is the cell's output. Four gates (input, forget, cell, output) are computed from the current input, the prior hidden state and a bias. Sigmoid and tanh add non-linearity. LSTMs can be multilayered or "stacked", where output layers become inputs to the next layer. LSTMs can incorporate projections. See also bidirectional LSTM, the gate equations below and the sketch that follows them.
174Input gate: i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)Sigmoid σ bounds output between 0 and 1.
175Forget gate: f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)Sigmoid σ bounds output between 0 and 1.
176Cell gate: g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)Creates a vector of new candidate values for updating the cell state. Tanh normalizes values between -1 and 1.
177Output gate: o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)Sigmoid σ bounds output between 0 and 1.
178Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_tCombines the old cell state and the new candidate values using the Hadamard product.
179Hidden state update: h_t = o_t ⊙ tanh(c_t)
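Example (LSTM cell): a minimal sketch implementing the gate equations above for one time step, assuming PyTorch; the stacked weight layout (input, forget, cell, output) is an illustration choice, and torch.nn.LSTMCell provides the same computation in practice.
    import torch

    def lstm_cell(x_t, h_prev, c_prev, W_i, W_h, b):
        # W_i: (4H, D) input weights, W_h: (4H, H) recurrent weights, b: (4H,) bias.
        H = h_prev.shape[-1]
        gates = x_t @ W_i.T + h_prev @ W_h.T + b
        i_t = torch.sigmoid(gates[..., 0 * H:1 * H])   # input gate
        f_t = torch.sigmoid(gates[..., 1 * H:2 * H])   # forget gate
        g_t = torch.tanh(gates[..., 2 * H:3 * H])      # cell (candidate) gate
        o_t = torch.sigmoid(gates[..., 3 * H:4 * H])   # output gate
        c_t = f_t * c_prev + i_t * g_t                 # cell state update (Hadamard products)
        h_t = o_t * torch.tanh(c_t)                    # hidden state update
        return h_t, c_t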
180loss functionLeast squared error, logistic loss, hinge loss, cross-entropy, sum of the squared residuals. Calculates error.
181M-AILABS
182machine learningsubgroup of AI where machines can extract patterns from raw data
183Make-A-VideoFacebook library
184manifold tangent classifier
185masked language modelingBERT technique
186maskingUsed by decoder to avoid computing attention for future words
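Example (masking): a sketch of a causal (look-ahead) mask, assuming PyTorch; position i may only attend to positions up to i.
    import torch

    seq_len = 5
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = torch.randn(seq_len, seq_len)             # raw attention scores
    scores = scores.masked_fill(mask, float("-inf"))   # future positions get -inf
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions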
187materialistThe view that the brain, and with it the mind, operates according to the laws of physics.
188Max Learning Rate4E-5, 1E-4
189MCCMicroelectronics and Computer Technology Corporation, a US research consortium
190MDLMinimum Description Length learning rule
191MDMTel Aviv library on human motion diffusion model
192Megatron-LM530B parameter system by Microsoft and Nvidia. Repository includes training pipeline. https://github.com/NVIDIA/Megatron-LM
193mel-spectrogramThe main acoustic feature for acoustic models.
194MelGANGAN-based vocoder, multi-band.
195minimaFinding whether a minimum is local or global in a NN is challenging
196MMLUTest for LLMs
197MNISThandwriting recognition
198momentum
199MOSMean Opinion Score, used to rate TTS
200multi-task learning
201multihead attention layerUses self-attention to capture different kinds of relationships, one per attention head. Divides d_model into 8 heads of dimension d_k = 64, each representing a different subspace of how each word relates to the others. Each head outputs a matrix Z_i of width d_k, and the concatenated heads are projected back to d_model.
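Example (multi-head attention): a sketch of splitting d_model = 512 into 8 heads of d_k = 64, assuming PyTorch; the projection matrices and shapes are illustration values rather than any particular model's weights.
    import torch

    batch, seq_len, d_model, n_heads = 2, 10, 512, 8
    d_k = d_model // n_heads                           # 64 dimensions per head

    x = torch.randn(batch, seq_len, d_model)
    W_q, W_k, W_v, W_o = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(4))

    def split_heads(t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        return t.view(batch, seq_len, n_heads, d_k).transpose(1, 2)

    q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, heads, seq, seq)
    z = torch.softmax(scores, dim=-1) @ v              # per-head outputs Z_i, each d_k wide
    z = z.transpose(1, 2).reshape(batch, seq_len, d_model)
    out = W_o(z)                                       # concatenated heads projected back to d_model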
202MultiNLIBERT set standard with 86.7%
203MultiRCNLP test
204MXNetML framework
205MYCIN450 rule expert system for medicine
206naïve BayesSimple ML algorithm which classifies.
207NarrativeQATest for LLMs
208natural language processing (NLP)The field of studying how computers can be made to communicate in human language
209NaturalQuestionstest for LLMs
210NeocognitronPredecessor to CNN
211NetTalk
212neural network
213Newton direction
214next sentence prediction (NSP)
215NIPS 2017Neural Information Processing Systems conference
216noise
217NORBEarly database of labeled images
218norm
219NPLM
220online learningMultiple rounds of mixed training/testing phases
221OPTFacebook Open Pre-Trained Transformer
222optimization
223output layerfinal layer which requires modulation to achieve correct form of inference
224overfittingpoor generalization due to a small sample size and complex function
225PACProbably Approximately Correct learning model/framework
226PaLM540B parameters
227Pandora
228parameter
229parametrized reverse processIn diffusion models, undoes forward diffusion and performs iterative denoising. Converts random noise into realistic data.
230Pascal Visual Object Challengerelated to ImageNet
231passive learningparadigm where the learner does not interact with the environment at training time
232PDDL+Planning/modeling language
233penalty
234perceptronRosenblatt's single-layer linear classifier. The multilayer perceptron, which maps inputs to outputs by composing simpler functions, is the quintessential example of a DL model.
235perceptron convergence theoremlearning algorithm can adjust the connection strengths of a perceptron to match any input data
236physical symbol systemany system exhibiting intelligence is manipulating symbol-based data structures
237PixelCNN
238PLANNERKnowledge representation scheme
239pre-activationthe weighted sum of a unit's inputs, computed prior to applying the activation function
240pooling
241positional encodingUsed in transformers to encode the position of specifics words in a sentence
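Example (positional encoding): a NumPy sketch of the sinusoidal encoding used in the original transformer, where even dimensions use sine and odd dimensions use cosine; the sequence length and d_model are illustration values.
    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions
        pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions
        return pe

    pe = positional_encoding(seq_len=50, d_model=512)   # added to the token embeddings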
242principal component analysisLinear dimensionality-reduction technique; variants include probabilistic PCA and independent component analysis.
243prior
244projectionReducing dimensionality to simplify data, improve performance, etc. See PCA, LDA.
245Prolog
246pseudolabelingself-training, a semi-supervised learning technique
247PygmalionMythological sculptor who created Galatea, a sculpture which came to life
248PyLearn2older framework
249Pytorchaka Torch, leading ML framework introduced 9/2016 by Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan
250R1Early expert system deployed by DEC in the 1980s
251random forest
252rankingtypical ML task
253Rectified Linear Activation FunctionNeuron activation max(0, a). A non-saturating nonlinearity that trains faster than tanh and does not require input normalization to prevent saturation. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron.
254recurrent neural networks (RNN)Neural networks with feedback connections for processing sequences; Hopfield's 1982 associative network is an early recurrent architecture. LSTMs emerged from RNNs. Can be bidirectional. Can be gated (LSTM, GRU).
255recursive network
256regressioncanonical ML task
257regularizationDL technique
258relusee Rectified Linear Activation Unit/Function
259representation learningML approach which learns the representation of data. See autoencoder. Representation learning has a critical issue: representations are often as hard to obtain as solving the original problem. Deep learning is the solution.
260residual connectionAdds a sublayer's input (e.g., the positional input embedding) to that sublayer's output (e.g., the multi-headed attention output vector) before layer normalization.
261residual neural networkaka ResNet, continuous
262reverse diffusion process
263Rhasspygithub.com/rhasspy
264RMSNorm
265RoBERTaOptimized BERT. 88.5 on GLUE.
266RTENLP test
267RVCRetrieval-based voice conversion. Example: https://github.com/w-okada/voice-changer
268SAINT1963 program by James Slagle for integration
269self-attentionAllows association of each word in the input with other words in the same sentence. Based on query, key and value vectors.
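Example (self-attention): a NumPy sketch of scaled dot-product self-attention built from query, key and value projections; the matrices are random illustration values.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values from the same input
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each word attends to every other word
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 64))                     # 6 tokens, 64-dimensional embeddings
    W_q, W_k, W_v = (rng.normal(size=(64, 64)) for _ in range(3))
    out = self_attention(X, W_q, W_k, W_v)           # (6, 64) context-aware representations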
270semantic role labeling
271semi-supervised training/learningtraining set includes both labeled and unlabeled data
272SEPspecial separator token that ends a tokenized segment (see CLS)
273sequence-to-sequence (Seq2Seq)A type of NN that converts one sequence of components into another
274sequence modelinguses recurrent and recursive networks
275SHRDLU1972 program by Terry Winograd
276sigmoidtype of activation function = 1/(1+exp(-a)), transforms input to [0, 1]
277singing voice conversionConverts source voice to target voice
278skip-gramFocuses on the center word in a window of words and predicts the context words around it. Generally contains an input layer, weights, a hidden layer and an output layer containing the word embeddings, resulting in a d_model-dimensional embedding for each word. The word2vec embedding approach uses the skip-gram architecture; see the sketch below.
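Example (skip-gram): a usage sketch with gensim's word2vec implementation, where sg=1 selects the skip-gram architecture (sg=0 would be CBOW); the toy sentences and hyperparameters are illustration values, and the gensim package is assumed to be installed.
    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
    print(model.wv["cat"].shape)          # (50,) embedding vector for "cat"
    print(model.wv.most_similar("cat"))   # nearest words by cosine similarity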
279sliding-block puzzlesnxn matrices with one blank tile, initial state is random, goal state is ordered
280SOARagent architecture by Newell, Laird and Rosenbloom
281softmaxOutput layer activation function that converts a vector of logits into a probability distribution.
282SPSpectral envelope variable used in speech synthesis
283sparse representation
284SQuADStanford Question Answering Dataset. Reading comprehension. V1.1, V2.0. BERT set standard with 93.2 on v1.1 (F1) and 83.1 on v2.0 (F1)
285squared lossloss function
286SRMStructural Risk Minimization learning rule
287Stanford Sentiment Treebank
288STANLEYWinner of DARPA Grand Challenge
289steepest descent
290stochastic gradient descenttraining algorithm
291structured prediction problemsContext-free parsing, dependency parsing, named-entity recognition, part-of-speech tagging
292SuperGLUELeaderboard
293supervised learningLearn a predictor from labeled (x, y) pairs; examples include linear regression, logistic regression, SVM and naïve Bayes.
294SupervisionU Toronto DNN for image recognition
295support vector machinebest known kernel method
296SVCCSinging Voice Conversion challenge
297SWAGSituations with Adversarial Generations
298Switchboardspeech recognition
299t-SNEnonlinear dimensionality-reduction technique for visualizing high-dimensional data as 2D/3D clusters
300T5Text-to-Text Transfer Transformer model from Google, available through Hugging Face
301Tacotron2An encoder-attention-decoder acoustic model
302Talos
303tangent distance
304tangent prop
305tanhsee hyperbolic tangent
306Taylor expansion
307Theanoolder ML framework
308Theseus
309tokenizerepresent each word with a token
310TorchLua-based predecessor to PyTorch
311total Turing testa modified Turing Test requiring computer vision
312transductive inferenceTraining data is labeled and unlabeled test points, but the objective is to predict labels only for unlabeled test points and not unseen test set data.
313transfer learning
314transformerNN architecture introduced by Google in 2017, with a 6-layer encoder stack and a 6-layer decoder stack. Each encoder layer has two major sublayers: a multi-headed attention mechanism and a fully connected position-wise feedforward network. See the sketch below.
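Example (transformer): a rough sketch of the original 6-layer encoder / 6-layer decoder shape using PyTorch's built-in module (batch_first requires a reasonably recent PyTorch); the inputs stand in for embedded, position-encoded sequences.
    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6,
                           dim_feedforward=2048, batch_first=True)

    src = torch.randn(2, 10, 512)   # embedded + position-encoded source sequence
    tgt = torch.randn(2, 7, 512)    # embedded + position-encoded target sequence
    out = model(src, tgt)           # (2, 7, 512)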
315Transformer-TTS
316TransformersHuggingface library
317Traxlibrary by Google Brain
318trust regionnumerical optimization idea
319txt2vecAcoustic model using vector-quantized acoustic feature instead of mel-spectrogram. A classification model rather than a traditional regression model. Uses labeled phoneme-level prosodies for all phonemes in advance. The text encoder consists of 6 Conformer blocks, which encode the input phonemes into hidden states h.
320TTStext-to-speech. Current models include neural approaches (WaveNet, Tacotron) and statistical parametric speech synthesis (SPSS).
321Turing TestProposed in 1950
322U-Net
323underfittingfunction too simple to achieve sufficient predictive power
324unsupervised learninglearning from unlabeled data; used less often than supervised learning
325unsupervised pre-trainingDL technique
326VAEVariational AutoEncoder
327vanishing gradient problemGradients become very small during backpropagation.
328vec2wavVocoder used in VQTTS, uses an additional feature encoder before HifiGAN generation to smooth discontinuous quantized feature.
329vector quantizationcan be applied to self-supervised feature extraction
330vectorizeconvert tokens to vectors
331visible layerThe input or output layers. So called because we can see and understand them.
332VITSConditional variational autoencoder with adversarial learning. TTS Model. https://github.com/jaywalnut310/vits
333vllmCompetitor to Huggingface Transformers for inference
334vocoderGenerates waveform for TTS. GAN-based (MelGAN, HifiGAN). Universal-Vocoder
335VoxCelebWith VoxCeleb2, celebrity voice/video training set. https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
336VQTTSTTS approach consisting of acoustic model txt2vec and vocoder vec2wav. Adds log pitch, energy and probability of voice features for prosody.
337wav2vecMulti-layer convolutional network optimized via contrastive loss. Extracts features to predict successive frames.
338WaveNetNeural-based vocoder by Deepmind in 2016.
339WaveRNNNeural-based vocoder
340weight decayregularizer which penalizes large weights, reducing the model's generalization error
341WiCNLP test
342word embedding layerLookup table which represents each word as a vector
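Example (word embedding layer): a PyTorch sketch of the lookup table; the vocabulary size, dimensions and token ids are illustration values.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 10000, 512
    embedding = nn.Embedding(vocab_size, d_model)   # learned lookup table: token id -> vector

    token_ids = torch.tensor([[12, 847, 3, 99]])    # hypothetical ids for one tokenized sentence
    vectors = embedding(token_ids)                  # (1, 4, 512)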
343word embedding vector
344word2vecEmbedding scheme.
345WORLDVocoder
346Wu Dao 2.01.75T parameters
347X-CLIPMicrosoft video-to-text
348XLNet-Largeformer GLUE leader
349XORexclusive OR cannot be approximated by linear models
350zero-one lossalso known as the misclassification loss function
352HumanEvalOpenAI benchmark of hand-written programming problems for evaluating code generation