llama.cpp

The goal of llama.cpp is to run the LLaMA model on a MacBook with a C/C++ only implementation.

Building the tools

Get the sources:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Debian

On Debian, I was able to compile the sources as indicated in the repository's README.md with
cmake -B build
cmake --build build
The -B option specifies (and creates) the build directory; --build then builds inside that directory.
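For an optimized build, I would probably add the standard CMake release options; the following is only a sketch (the -DGGML_CUDA=ON option for an NVIDIA GPU build is my reading of the build documentation and is untested here):
cmake -B build -DCMAKE_BUILD_TYPE=Release    # optionally: -DGGML_CUDA=ON for an NVIDIA GPU build
cmake --build build --config Release -j $(nproc)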

Windows, with MinGW

Note the -G "MinGW Makefiles" option:
P:\ath\to\llama.cpp> cmake -B build -G "MinGW Makefiles"
P:\ath\to\llama.cpp> cmake --build build --config Release

Built executables

After building build 4589, I found the following executables under build/bin (the common prefix llama- is removed):
batched-bench Benchmark the batched decoding performance
batched Demonstration of batched generation from a given prompt. A swift clone is found under examples/batched.swift
bench Benchmark the performance of the inference for various parameters.
cli A CLI tool for accessing and experimenting with most of llama.cpp's functionality. TODO: for the --grammar, --grammar-file and --json-schema command line options, see grammars/ below.
convert-llama2c-to-ggml This example reads weights from Andrej Karpathy's llama2.c project and saves them in a ggml-compatible format. The vocab that is available in models/ggml-vocab.bin is used by default.
cvector-generator Demonstration of how to generate a control vector using gguf models.
embedding Demonstrates the generation of a high-dimensional embedding vector for a given text
eval-callback Uses callbacks during inference to print all operations and tensor data to the console.
export-lora Apply LoRA adapters to base models and export the resulting model
gen-docs
gguf-hash Hash GGUF files to detect differences on a per-model and per-tensor level.
gguf-split CLI to split and merge GGUF files
gguf
gritlm Example for Generative Representational Instruction Tuning (GRIT). A gritlm model can generate embeddings as well as "normal" text generation depending on the instructions in the prompt.
imatrix Computes an importance matrix for a model and a given text dataset, which can be used during quantization to enhance the quality of the quantized models (see Pull Request 4861)
infill Demonstrates infill mode with Code Llama models that support it.
llava-cli CLI for multimodal inference with LLaVA (Large Language and Vision Assistant) models
lookahead Demonstration of lookahead decoding technique, see also Break the Sequential Dependency of LLM Inference Using Lookahead Decoding and Pull Request 4207
lookup-create Demonstration of Prompt Lookup Decoding, see also apoorvumang/prompt-lookup-decoding, Pull Request 4484 and Issue 4226.
lookup-merge
lookup-stats
lookup
minicpmv-cli
parallel (Simplified) simulation of serving incoming requests in parallel
passkey A passkey retrieval task is an evaluation method used to measure a language model's ability to recall information from long contexts. See also Pull Requests 3856 and 4810.
perplexity A tool for measuring the perplexity (and other quality metrics) of a model over a given text.
q8dot
quantize
qwen2vl-cli
retrieval Demonstration of a simple retrieval technique based on cosine similarity.
run A comprehensive example for (minimally) running llama.cpp models. Useful for inference. Used by RamaLama.
save-load-state
server The LLaMA.cpp HTTP Server: A lightweight, OpenAI API compatible, HTTP server for serving LLMs. Based on cpp-httplib and nlohmann/json. (llama-server -m model.gguf --port 11434). See also grammars/ below.
simple-chat Demonstration of a simple chat program using the chat template from the GGUF file.
simple A minimal example for implementing apps with llama.cpp. Useful for developers.
speculative-simple Demonstration of basic greedy speculative decoding
speculative Demonstration of speculative decoding and tree-based speculative decoding techniques
tokenize
tts
vdot
In addition, I found the following test executables (again, with the common prefix test- omitted):
arg-parser
autorelease
backend-ops
barrier
c
chat-template
gguf
log
model-load-cancel
quantize-fns
quantize-perf
rope
tokenizer-0
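The listings can be reproduced roughly like this (just a sketch; the grep/sed part merely strips the common prefixes):
$ ls build/bin | grep '^llama-' | sed 's/^llama-//'
$ ls build/bin | grep '^test-'  | sed 's/^test-//'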

Using the deprecated Makefile

cd llama.cpp
make -j

Trying a model

Under models/, I found some *.gguf files which I believed I could use for a first test:
$ build/bin/llama-cli -m models/ggml-vocab-bert-bge.gguf --prompt "tell me a nice story" --predict 100
…
llama_model_load: error loading model: missing tensor 'token_embd.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/ggml-vocab-bert-bge.gguf'
main: error: unable to load model
Apparently, the files under models/ are not real models, just the vocabulary part. The actual models need to be downloaded.
So, I download a couple of (small) models…
$ mkdir ~/llm-models
$ curl -L https://huggingface.co/aisuko/gpt2-117M-gguf/resolve/main/ggml-model-Q4_K_M.gguf                                -o ~/llm-models/ggml-model-Q4_K_M.gguf
$ curl -L https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -o ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
… show the sizes of the downloaded models …
$ (cd ~/llm-models; stat --printf '%s\t%n\n' *)
4368450304      dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
112858624       ggml-model-Q4_K_M.gguf
… and test them with the CLI tool:
$ ./build/bin/llama-cli -m ~/llm-models/ggml-model-Q4_K_M.gguf                -p "Tell me about programmers and coffee" -n 200
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf  -p "Tell me about programmers and coffee" -n 200
In Windows, with PowerShell, a model can be downloaded with
PS:> $progressPreference = 'SilentlyContinue'
PS:> Invoke-WebRequest https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -OutFile ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf

Testing the webserver

A model can be run in a webserver and then accessed from a browser:
$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
By default, the server listens on port 8080, but this can be changed like so:
$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf --port 8888
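Besides the browser UI, the server also exposes an HTTP API. Since it is advertised as OpenAI API compatible (see the server entry above), a chat completion should be obtainable with something like the following (request shape as I understand the OpenAI-style API):
$ curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [
            { "role": "user", "content": "Tell me about programmers and coffee" }
          ]
        }' | jq -r '.choices[0].message.content'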

grammars/

GBNF (GGML BNF) is a format to constrain the output produced by llama.cpp (for example: only valid JSON, or only emojis).
GBNF grammars can be used in
  • llama-server (where the grammar is passed in the grammar field of the request body)
  • llama-cli, using the --grammar and --grammar-file flags
  • llama-gbnf-validator
The -j (--json-schema) flag in action:
llama-cli \
  -hfr bartowski/Phi-3-medium-128k-instruct-GGUF \
  -hff Phi-3-medium-128k-instruct-Q8_0.gguf \
  -j '{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string",
                "minLength": 1,
                "maxLength": 100
            },
            "age": {
                "type": "integer",
                "minimum": 0,
                "maximum": 150
            }
        },
        "required": ["name", "age"],
        "additionalProperties": false
    },
    "minItems": 10,
    "maxItems": 100
  }' \
  -p 'Generate a {name, age}[] JSON array with famous actors of all ages.'
Any JSON schema can be converted to a grammar on the command line with
examples/json_schema_to_grammar.py name-age-schema.json
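A hand-written grammar can be passed with --grammar-file. As a minimal sketch (file name and prompt are made up; the model is the one downloaded above), the following grammar restricts the output to a plain yes/no answer:
$ cat > /tmp/yesno.gbnf <<'EOF'
root ::= ("yes" | "no")
EOF
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    --grammar-file /tmp/yesno.gbnf \
    -p "Is coffee popular among programmers? Answer with yes or no." -n 8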

scripts

build-info.sh
check-requirements.sh checks all requirements files for each top-level convert*.py script.
ci-run.sh What's the difference to ci/run.sh?
compare-commits.sh Checks out two different commits from the repository, builds the project and then runs compare-llama-bench.py
compare-llama-bench.py
debug-test.sh
gen-authors.sh Adds new authors to the AUTHORS file
gen-unicode-data.py
get-flags.mk
get-hellaswag.sh
get_chat_template.py Fetches the Jinja chat template of a HuggingFace model.
get-pg.sh
get-wikitext-103.sh
get-wikitext-2.sh
get-winogrande.sh
hf.sh Download a Hugging Face model (for example ./llama-cli -m $(./scripts/hf.sh https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf))
install-oneapi.bat
qnt-all.sh
run-all-perf.sh
run-all-ppl.sh
sync-ggml-am.sh Synchronize ggml changes to llama.cpp
sync-ggml.last
sync-ggml.sh
verify-checksum-models.py
xxd.cmake

cmake/x64-windows-llvm.cmake

The cmake directory contains a file named x64-windows-llvm.cmake.
Can this file explicitly be used for an LLVM build?
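My guess (not verified) is that it is a CMake toolchain-style file that could be passed explicitly, along the lines of:
cmake -B build -DCMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake
cmake --build build --config Release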

TODO

llama build number, build commit, build compiler and build target

scripts/build-info.sh determines the following values:
Value               Command                      Example value
llama build number  git rev-list --count HEAD    4589
build commit        git rev-parse --short HEAD   eb7cf15a
build compiler      $CC --version | head -1      gcc (Debian 10.2.1-6) 10.2.1 20210110
build target        $CC --dumpmachine            x86_64-linux-gnu
These values then seem to be used to produce common/build-info.cpp from common/build-info.cpp.in.
However, I don't find build-info.sh referenced in any other file except the Makefile, which is deprecated.
Therefore, I now believe that these figures are determined in cmake/build-info.cmake (which in turn seems to be invoked or referenced in common/cmake/build-info-gen-cpp.cmake).

Environment variables

LLAMA_ARG_MODEL Model path? (for example models/7B/ggml-model-f16.gguf)
GG_BUILD_CUDA ?
GG_BUILD_SYCL ?
GG_BUILD_VULKAN ?
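Assuming that LLAMA_ARG_MODEL is indeed the environment-variable counterpart of the -m/--model option (my reading, not verified), the server could be started like so:
$ LLAMA_ARG_MODEL=$HOME/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf build/bin/llama-server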

models/7B/ggml-model-f16.gguf

models/7B/ggml-model-f16.gguf is the default path where llama.cpp looks for a model if none is explicitly specified, see #define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf" in common/common.h.

ci/run.sh

In addition to llama.cpp's GitHub actions, a commit to the repository triggers the execution of ci/run.sh on dedicated cloud instances, which permits heavier workloads than plain GitHub actions.
What's the difference to scripts/ci-run.sh?

examples/jeopardy

examples/jeopardy is pretty much just a straight port of aigoopy/llm-jeopardy with an added graph viewer.

examples/llama.android

examples/quantize-stats

examples/rpc

The RPC server allows running a ggml backend on a remote host.
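As far as I understand the example (the flags and the address below are assumptions, not verified here), an rpc-server is started on the remote host and llama-cli is then pointed at it with --rpc:
# on the remote host (hypothetical address 192.168.1.10)
$ build/bin/rpc-server -p 50052
# on the local host
$ build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    --rpc 192.168.1.10:50052 -ngl 99 \
    -p "Tell me about programmers and coffee" -n 100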

examples/save-load-state

examples/sycl

examples/simple-cmake-pkg

examples/llama.swiftui

Local inference of llama.cpp on an iPhone.
See also this discussion

Windows: PrefetchVirtualMemory unavailable

Apparently, when compiling/running on Windows, _WIN32_WINNT must be at least 0x602, otherwise, the constructor of llama_mmap::impl throws PrefetchVirtualMemory unavailable.
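I have not tried it, but if the toolchain does not define a recent enough _WIN32_WINNT on its own, defining it on the compiler command line should be a possible workaround, for example:
P:\ath\to\llama.cpp> cmake -B build -G "MinGW Makefiles" -DCMAKE_CXX_FLAGS="-D_WIN32_WINNT=0x602"
P:\ath\to\llama.cpp> cmake --build build --config Release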

main

main was renamed to llama-cli (see also examples/deprecation-warning/README.md).

Options

Parameter Description
-h --help Show this help message and exit
-i --interactive Run in interactive mode
--interactive-first Run in interactive mode and wait for input right away
-ins, --instruct Run in instruction mode (use with Alpaca models)
-r --reverse-prompt PROMPT Run in interactive mode and poll user input upon seeing PROMPT (can be specified more than once for multiple prompts).
--color Colorise output to distinguish prompt and user input from generations
-s --seed SEED Seed for random number generator (default: -1, use random seed for <= 0)
-t --threads N Number of threads to use during computation (default: 12)
-p --prompt PROMPT Prompt to start generation with (default: empty)
--random-prompt Start with a randomized prompt.
--in-prefix STRING String to prefix user inputs with (default: empty)
-f --file FNAME Prompt file to start generation.
-n --n_predict N Number of tokens to predict (default: 128, -1 = infinity)
--top_k N Top-k sampling (default: 40)
--top_p N Top-p sampling (default: 0.9)
--repeat_last_n N Last n tokens to consider for penalize (default: 64)
--repeat_penalty N Penalize repeat sequence of tokens (default: 1.1)
-c --ctx_size N Size of the prompt context (default: 512)
--ignore-eos Ignore end of stream token and continue generating
--memory_f32 Use f32 instead of f16 for memory key+value
--temp N Temperature (default: 0.8)
--n_parts N Number of model parts (default: -1 = determine from dimensions)
-b --batch_size N Batch size for prompt processing (default: 8)
--perplexity Compute perplexity over the prompt
--keep Number of tokens to keep from the initial prompt (default: 0, -1 = all)
--mlock Force system to keep model in RAM rather than swapping or compressing
--mtest Determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters (uncomment the "used_mem" line in llama.cpp to see the results)
--verbose-prompt Print prompt before generation
-m --model FNAME Model path (default: models/llama-7B/ggml-model.bin)
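For illustration, a sketch of an interactive session combining a few of these options (the model path is the download from above; the prompt and the reverse prompt are made up and would need to match the model's expected chat format):
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    -i --color -r "User:" --in-prefix " " \
    -p "A chat between a curious User and a helpful Assistant. User: Hello, who are you? Assistant:" \
    -n 256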

Ollama

Ollama is based on llama.cpp.

(Manual) installation

The instructions for a manual installation point to a self-contained archive that can be downloaded and extracted:
$ curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o /tmp/ollama-linux-amd64.tgz
Inspecting the downloaded archive file:
$ tar tf /tmp/ollama-linux-amd64.tgz | grep -v '/$'
./bin/ollama
./lib/ollama/libcudart.so.12
./lib/ollama/libcublasLt.so.12.4.5.8
./lib/ollama/libcudart.so.11.0
./lib/ollama/libcublas.so.11
./lib/ollama/libcublas.so.12
./lib/ollama/libcudart.so.11.3.109
./lib/ollama/libcublas.so.12.4.5.8
./lib/ollama/libcudart.so.12.4.127
./lib/ollama/libcublasLt.so.11.5.1.109
./lib/ollama/libcublasLt.so.11
./lib/ollama/runners/cuda_v11_avx/ollama_llama_server
./lib/ollama/runners/cuda_v11_avx/libggml_cuda_v11.so
./lib/ollama/runners/rocm_avx/ollama_llama_server
./lib/ollama/runners/rocm_avx/libggml_rocm.so
./lib/ollama/runners/cuda_v12_avx/ollama_llama_server
./lib/ollama/runners/cuda_v12_avx/libggml_cuda_v12.so
./lib/ollama/runners/cpu_avx2/ollama_llama_server
./lib/ollama/runners/cpu_avx/ollama_llama_server
./lib/ollama/libcublas.so.11.5.1.109
./lib/ollama/libcublasLt.so.12
These files need to be extracted under /usr:
$ sudo tar -C /usr -xzf /tmp/ollama-linux-amd64.tgz

Testing the installation

Start the Ollama server:
$ ollama serve
In another terminal:
$ ollama -v
ollama version is 0.5.7
As can be seen in the terminal where the server was started, ollama -v basically accesses the API endpoint /api/version, i.e. something like
$ curl -s localhost:11434/api/version
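The endpoint returns a small JSON object from which the version string can be extracted with jq:
$ curl -s localhost:11434/api/version | jq -r '.version'     # prints 0.5.7, cf. ollama -v above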

Web API

$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffe"
}' | jq

{
  "error": "model 'llama2' not found"
}
The model is not available, we need to pull it:
$ curl -s http://localhost:11434/api/pull -d '{
  "model": "llama2"
}' | jq
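Equivalently, the model can be pulled with the CLI:
$ ollama pull llama2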
After downloading the model, query the available models:
$ curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
Try again:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee"
}' | jq

{
  "error": "model requires more system memory (8.4 GiB) than is available (1.8 GiB)"
}
Same thing, but on a machine with more memory:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee"
}' | jq

{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.867393127Z",
  "response": "\n",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.947556123Z",
  "response": "Program",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.996489966Z",
  "response": "mers",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:47.044551309Z",
  "response": " and",
  "done": false
}

   …

{
  "model": "llama2",
  "created_at": "2025-02-01T14:28:57.985358597Z",
  "response": "!",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:28:58.038683996Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "context": [
    518,
    25580,
    …
    26529,
    29991
  ],
  "total_duration": 33946247078,
  "load_duration": 7990899,
  "prompt_eval_count": 29,
  "prompt_eval_duration": 558000000,
  "eval_count": 703,
  "eval_duration": 33378000000
}
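By default, /api/generate streams the answer as one JSON object per token, as shown above. As far as I can tell from Ollama's API documentation, streaming can be disabled with "stream": false, in which case a single object is returned whose response field contains the complete answer:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee",
   "stream": false
}' | jq -r '.response'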

Checking if GPU is used

While a model is being used, ollama ps lists, among other things, whether the GPU is being used:
$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL              
llama2:latest    78e26419b446    5.6 GB    100% GPU     4 minutes from now    

Links

ggml is a tensor library, written in C, that is used in llama.cpp. In fact, the description of ggml reads: Note that this project is under development and not ready for production use. Some of the development is currently happening in the llama.cpp and whisper.cpp repos
Python bindings for llama.cpp are provided by the llama-cpp-python project.
