Perplexity
The most basic intrinsic measure of a language model's performance is its perplexity on a given text corpus: It measures how well a model is able to predict the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity.
Perplexity is closely related to the cross-entropy loss function that is used to train neural language models: it is the exponential of the (mean) cross-entropy.
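A minimal sketch of computing perplexity in practice, assuming the Hugging Face transformers library and the public gpt2 checkpoint (any causal language model would work the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Perplexity measures how well a model predicts a text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy loss
    # over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()  # perplexity = exp(cross-entropy)
print(f"cross-entropy: {outputs.loss.item():.3f}  perplexity: {perplexity:.1f}")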
TODO
Context size/length: the number of tokens taken into account to predict the next token.
Components of a transformer LLM
A transformer LLM consists of a
- tokenizer, a
- stack of transformer blocks, and a
- language modeling (LM) head
Tokens and embeddings
The central concepts for using LLMs are tokens and embeddings.
In a technical sense, language is a sequence of tokens.
When a model is trained, it (hopefully) captures the important token patterns that appear or emerge in the training data.
Some tokenization algorithms include
- Byte Pair Encoding (BPE) - typically used by GPT models (see the example after this list)
- WordPiece (typically used by BERT)
- SentencePiece (used by Flan-T5)
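A minimal sketch of BPE tokenization, assuming the Hugging Face transformers library and the public gpt2 checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses a BPE tokenizer

text = "Tokenization splits text into subword units."
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)                # subword tokens; GPT-2 marks a leading space with 'Ġ'
print(token_ids)             # the corresponding IDs in the vocabulary
print(tokenizer.vocab_size)  # 50257 for GPT-2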
Important factors for tokenizers are
- The size of its vocabulary (as of 2024: typically between 30K and 50K tokens, but moving towards 100K)
- The set of special tokens (beginning of text (<s>), end of text, padding, unknown, classification ([CLS]), separation ([SEP]), masking, …); see the example after this list. Models like Galactica, which focus on scientific knowledge, also include tokens for citations, reasoning, mathematics, amino acid sequences and DNA sequences. Note that [CLS] and <s> are similar in nature.
- How to numerically represent each token (Is this really a thing?)
- How to deal with capitalization
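A minimal sketch for inspecting these properties of a tokenizer, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)          # 30522 entries in BERT's WordPiece vocabulary
print(tokenizer.all_special_tokens)  # includes [CLS], [SEP], [PAD], [UNK], [MASK]
print(tokenizer.tokenize("Capitalization Matters"))  # this tokenizer lower-cases its input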
For each token in the tokenizer's vocabulary, there is an associated embedding which is a vector of numerical values.
The values of these vectors are initialized with random numbers before the model is trained and are updated as the model is trained.
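A minimal sketch for inspecting the token-embedding matrix of a pretrained model, assuming the Hugging Face transformers library and the public gpt2 checkpoint:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
embeddings = model.get_input_embeddings()  # an nn.Embedding layer

# One row per vocabulary entry, one column per embedding dimension.
print(embeddings.weight.shape)  # torch.Size([50257, 768]) for GPT-2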
Transformer blocks
Each token (and, in the case of autoregressive models, also each generated token) is passed through the transformer blocks. The last transformer block hands its output to the LM head.
A transformer block basically consists of an
- attention layer and a
- feedforward neural network (see the sketch after this list).
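A minimal PyTorch sketch of such a block (a pre-norm variant with residual connections is assumed; the dimensions are GPT-2-like defaults, not taken from a specific model):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        # Self-attention over the token sequence (an autoregressive model
        # would additionally apply a causal mask here).
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # residual connection
        x = x + self.ff(self.norm2(x))     # feedforward network + residual
        return x

x = torch.randn(1, 5, 768)                # (batch, sequence length, embedding dim)
print(TransformerBlock()(x).shape)        # same shape in, same shape out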
LM Head
The LM head calculates the probability distribution for the next token.
Decoding: Choosing next token, temperature
After the LM head has calculated the probability distribution for the next token, the model needs to choose the next token.
A simple strategy is to always pick the token with the highest probability score (greedy decoding).
The temperature parameter defines how often the decoder should deviate from that rule.
If the temperature is 0, it never deviates; the higher the temperature, the more often it deviates.
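A minimal sketch of greedy decoding versus temperature sampling, using plain PyTorch; the logits tensor stands in for the LM head's output over a toy 4-token vocabulary:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # toy scores for a 4-token vocabulary

# Greedy decoding: always pick the highest-scoring token
# (the limit of temperature -> 0).
greedy_token = torch.argmax(logits).item()

def sample(logits, temperature=1.0):
    # Dividing the logits by the temperature before the softmax controls how
    # peaked the distribution is: low temperature concentrates probability on
    # the top token, high temperature flattens it and deviates more often.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

print(greedy_token, sample(logits, temperature=0.7), sample(logits, temperature=1.5))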
Context size
The context size is the number of tokens that the model can operate on in parallel.
Autoregressive models
An autoregressive model predicts the next token based on the input plus the already generated tokens.
BERT is not an autoregressive model (the B stands for bidirectional).
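A minimal sketch of the autoregressive loop, assuming the Hugging Face transformers library and the public gpt2 checkpoint (greedy decoding, no caching, for clarity):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The transformer is", return_tensors="pt")

for _ in range(10):  # generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits                  # (batch, seq_len, vocab_size)
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_id], dim=-1)   # feed everything back in

print(tokenizer.decode(input_ids[0]))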
GGUF
GGUF (GGML Universal File) is a file format that stores both tensors and metadata in a single file; its goal is to make saving and loading model data fast.
Another focus of GGUF is quantization (which reduces memory usage and increases speed by reducing precision in the model weights, which, of course, also lowers the model's accuracy).
GGUF was introduced by the llama.cpp project in August 2023 as the successor to file formats such as GGML.
GGUF files are typically created by converting models from machine learning libraries such as PyTorch.
gguf-tools (antirez)
git clone https://github.com/antirez/gguf-tools   # fetch antirez's gguf-tools
cd gguf-tools
make                                               # build the gguf-tools binary
# download a quantized GPT-2 model in GGUF format and inspect its metadata and tensors
curl -OL https://huggingface.co/aisuko/gpt2-117M-gguf/resolve/main/ggml-model-Q4_K_M.gguf
./gguf-tools show ggml-model-Q4_K_M.gguf
Models
Cerebras-GPT
Cerebras-GPT is a family of seven GPT models ranging from 111 million to 13 billion parameters.
These models were trained on CS-2 systems (Andromeda AI supercomputer) using the Chinchilla formula.
Their weights and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license.
GPT-4
GPT-4 was released without any information about its model architecture, training data, training hardware or hyperparameters.
Chinchilla (DeepMind)
Chinchilla was trained with the same compute budget as Gopher, but with 70B parameters and 4 times more data.
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. It uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.
Gemini
Gemini was created from the ground up to be multimodal, highly efficient at tool and API integration, and built to enable future innovations like memory and planning.
Scaling laws
Empirical scaling laws describe how language model performance, measured as cross-entropy loss, changes with model size, dataset size and the compute used for training.
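A rough sketch of the form such laws take, following the power-law formulation commonly attributed to Kaplan et al. (2020); the symbols here are placeholders rather than exact fitted values:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

where L is the cross-entropy loss, N the number of (non-embedding) parameters, and N_c and \alpha_N fitted constants; analogous power laws hold for dataset size and training compute.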
Benchmarks
BIG-Bench
BIG-Bench is a collaborative benchmark with over 150 tasks designed to be challenging for LLMs.
The tasks include
- logical reasoning,
- translation,
- question answering,
- mathematics, and
- others
Finding and downloading LLMs / PTMs
Many LLMs and/or PTMs are exchanged on registries (quite similar to how software components are shared on NPM, PyPI, Maven etc.), most notably the Hugging Face Hub.
As of 2024-10, this hub hosts 900k models, 200k datasets and 300k demos.
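A minimal sketch of downloading a model from such a registry, assuming the huggingface_hub library; gpt2 stands in for any model ID on the Hugging Face Hub:

from huggingface_hub import snapshot_download

# Download (or reuse from the local cache) all files of the model repository.
local_dir = snapshot_download(repo_id="gpt2")
print(local_dir)  # local path of the downloaded model files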