Perplexity
The most basic intrinsic measure of a language model's performance is its perplexity on a given text corpus: It measures how well a model is able to predict the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity.
Perplexity is closely related to the cross-entropy loss function that is used to train neural language models: it is the exponential of the (mean) cross-entropy.
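A minimal sketch of computing perplexity in practice, assuming the Hugging Face transformers library and the public gpt2 checkpoint (any causal language model would work the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Perplexity measures how well a model predicts a text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy loss
    # over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()  # perplexity = exp(cross-entropy)
print(f"cross-entropy: {outputs.loss.item():.3f}  perplexity: {perplexity:.1f}")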
TODO
Context size/length: the number of tokens taken into account to predict the next token.
Components of a transformer LLM
A transformer LLM consists of a
- tokenizer, a
- stack of transformer blocks, and a
- language modeling (LM) head
Tokens and embeddings
The central concepts for using LLMs are tokens and embeddings.
In a technical sense, language is a sequence of tokens.
When a model is trained, it (hopefully) captures the important token patterns that appear or emerge in the training data.
Some tokenization algorithms include
- Byte Pair Encoding (BPE) - typically used by GPT models (see the example after this list)
- WordPiece (typically used by BERT)
- SentencePiece (used by Flan-T5)
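A minimal sketch of BPE tokenization, assuming the Hugging Face transformers library and the public gpt2 checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses a BPE tokenizer

text = "Tokenization splits text into subword units."
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)                # subword tokens; GPT-2 marks a leading space with 'Ġ'
print(token_ids)             # the corresponding IDs in the vocabulary
print(tokenizer.vocab_size)  # 50257 for GPT-2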
Important factors for tokenizers are
- The size of its vocabulary (as of 2024: typically between 30K and 50K tokens, but moving towards 100K)
- The set of special tokens (beginning of text (<s>), end of text, padding, unknown, classification ([CLS]), separation ([SEP]), masking, …); see the example after this list. Models like Galactica, which focus on scientific knowledge, also include tokens for citations, reasoning, mathematics, amino acid sequences and DNA sequences. Note that [CLS] and <s> are similar in nature.
- How to numerically represent each token (Is this really a thing?)
- How to deal with capitalization
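A minimal sketch for inspecting these properties of a tokenizer, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.vocab_size)          # 30522 entries in BERT's WordPiece vocabulary
print(tokenizer.all_special_tokens)  # includes [CLS], [SEP], [PAD], [UNK], [MASK]
print(tokenizer.tokenize("Capitalization Matters"))  # this tokenizer lower-cases its input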
For each token in the tokenizer's vocabulary, there is an associated embedding which is a vector of numerical values.
The values of these vectors are initialized with random numbers before the model is trained and are updated as the model is trained.
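A minimal sketch for inspecting the token-embedding matrix of a pretrained model, assuming the Hugging Face transformers library and the public gpt2 checkpoint:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
embeddings = model.get_input_embeddings()  # an nn.Embedding layer

# One row per vocabulary entry, one column per embedding dimension.
print(embeddings.weight.shape)  # torch.Size([50257, 768]) for GPT-2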
Transformer blocks
Each token (and, in the case of autoregressive models, also each generated token) is passed through the transformer blocks. The last transformer block hands its output to the LM head.
A transformer block basically consists of an
- attention layer and a
- feedforward neural network (see the sketch after this list).
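A minimal PyTorch sketch of such a block (a pre-norm variant with residual connections is assumed; the dimensions are GPT-2-like defaults, not taken from a specific model):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        # Self-attention over the token sequence (an autoregressive model
        # would additionally apply a causal mask here).
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                   # residual connection
        x = x + self.ff(self.norm2(x))     # feedforward network + residual
        return x

x = torch.randn(1, 5, 768)                # (batch, sequence length, embedding dim)
print(TransformerBlock()(x).shape)        # same shape in, same shape out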
LM Head
The LM head calculates the probability distribution for the next token.
Decoding: Choosing next token, temperature
After the LM head has calculated the probability distribution for the next token, the model needs to choose the next token.
A simple strategy is to always pick the token with the highest probability score (greedy decoding).
The temperature parameter defines how often the decoder should deviate from that rule.
If the temperature is 0, it never deviates; the higher the temperature, the more often it deviates.
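A minimal sketch of greedy decoding versus temperature sampling, using plain PyTorch; the logits tensor stands in for the LM head's output over a toy 4-token vocabulary:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # toy scores for a 4-token vocabulary

# Greedy decoding: always pick the highest-scoring token
# (the limit of temperature -> 0).
greedy_token = torch.argmax(logits).item()

def sample(logits, temperature=1.0):
    # Dividing the logits by the temperature before the softmax controls how
    # peaked the distribution is: low temperature concentrates probability on
    # the top token, high temperature flattens it and deviates more often.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

print(greedy_token, sample(logits, temperature=0.7), sample(logits, temperature=1.5))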
Context size
The context size is the number of tokens that the model can operate on in parallel.
Autoregressive models
An autoregressive model predicts the next token based on the input plus the already generated tokens.
BERT is not an autoregressive model (the B stands for bidirectional).
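A minimal sketch of the autoregressive loop, assuming the Hugging Face transformers library and the public gpt2 checkpoint (greedy decoding, no caching, for clarity):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The transformer is", return_tensors="pt")

for _ in range(10):  # generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits                  # (batch, seq_len, vocab_size)
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_id], dim=-1)   # feed everything back in

print(tokenizer.decode(input_ids[0]))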
GGUF
GGUF (GGML Universal File) is a file format that stores both tensors and metadata in a single file; its goal is to make saving and loading model data fast.
Another focus of GGUF is quantization (which reduces memory usage and increases speed by reducing precision in the model weights, which, of course, also lowers the model's accuracy).
GGUF was introduced by the llama.cpp project in August 2023 as the successor to file formats such as GGML.
GGUF files are typically created by converting models from machine learning libraries such as PyTorch.
gguf-tools (antirez)
git clone https://github.com/antirez/gguf-tools   # fetch antirez's gguf-tools
cd gguf-tools
make                                               # build the gguf-tools binary
# download a quantized GPT-2 model in GGUF format and inspect its metadata and tensors
curl -OL https://huggingface.co/aisuko/gpt2-117M-gguf/resolve/main/ggml-model-Q4_K_M.gguf
./gguf-tools show ggml-model-Q4_K_M.gguf
Models
Cerebras-GPT
Cerebras-GPT is a family of seven GPT models ranging from 111 million to 13 billion parameters.
These models were trained on CS-2 systems (Andromeda AI supercomputer) using the Chinchilla formula.
Their weights and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license.
GPT-4
GPT-4 was released without any information about its model architecture, training data, training hardware or hyperparameters.
Chinchilla (DeepMind)
Chinchilla was trained with the same compute budget as Gopher, but with 70B parameters and 4 times more data.
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. It uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.
Gemini
Gemini was created from the ground up to be multimodal, highly efficient at tool and API integration, and built to enable future innovations like memory and planning.
Scaling laws
Empirical scaling laws describe how language model performance, measured as cross-entropy loss, changes with model size, dataset size and the compute used for training.
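A rough sketch of the form such laws take, following the power-law formulation commonly attributed to Kaplan et al. (2020); the symbols here are placeholders rather than exact fitted values:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

where L is the cross-entropy loss, N the number of (non-embedding) parameters, and N_c and \alpha_N fitted constants; analogous power laws hold for dataset size and training compute.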
Benchmarks
BIG-Bench
BIG-Bench is a collaborative benchmark with over 150 tasks designed to be challenging for LLMs.
The tasks include
- logical reasoning,
- translation,
- question answering,
- mathematics, and
- others
Finding and downloading LLMs / PTMs
Many LLMs and/or PTMs are exchanged on registries (quite similar to how software components are shared on NPM, PyPI, Maven etc.), most notably the Hugging Face Hub.
As of 2024-10, this hub hosts 900k models, 200k datasets and 300k demos.
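A minimal sketch of downloading a model from such a registry, assuming the huggingface_hub library; gpt2 stands in for any model ID on the Hugging Face Hub:

from huggingface_hub import snapshot_download

# Download (or reuse from the local cache) all files of the model repository.
local_dir = snapshot_download(repo_id="gpt2")
print(local_dir)  # local path of the downloaded model files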