llama.cpp

The goal of llama.cpp is to run the LLaMA model on a MacBook with a C/C++ only implementation.

Building the tools

Get the sources:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Debian

On Debian, I was able to compile the sources as indicated in the repository's README.md with
cmake -B build
cmake --build build
The -B option specifies (and creates) the build directory; --build then builds inside that directory.
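For an optimized build, I would probably add the standard CMake release options; the following is only a sketch (the -DGGML_CUDA=ON option for an NVIDIA GPU build is my reading of the build documentation and is untested here):
cmake -B build -DCMAKE_BUILD_TYPE=Release    # optionally: -DGGML_CUDA=ON for an NVIDIA GPU build
cmake --build build --config Release -j $(nproc)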

Windows, with MinGW

Note the -G "MinGW Makefiles" option:
P:\ath\to\llama.cpp> cmake -B build -G "MinGW Makefiles"
P:\ath\to\llama.cpp> cmake --build build --config Release

Built executables

After building build 4589, I found the following executables under build/bin (the common prefix llama- is removed):
batched-bench Benchmark the batched decoding performance
batched Demonstration of batched generation from a given prompt. A swift clone is found under examples/batched.swift
bench Benchmark the performance of the inference for various parameters.
cli A CLI tool for accessing and experimenting with most of llama.cpp's functionality. TODO: for the --grammar, --grammar-file and --json-schema command line options, see grammars/ below.
convert-llama2c-to-ggml This example reads weights from Andrej Karpathy's llama2.c project and saves them in a ggml-compatible format. The vocab that is available in models/ggml-vocab.bin is used by default.
cvector-generator Demonstration of how to generate a control vector using gguf models.
embedding Demonstrates the generation of a high-dimensional embedding vector for a given text
eval-callback Uses callbacks during inference to print all operations and tensor data to the console.
export-lora Apply LoRA adapters to base models and export the resulting model
gen-docs
gguf-hash Hash GGUF files to detect differences on a per-model and per-tensor level.
gguf-split CLI to split and merge GGUF files
gguf
gritlm Example for Generative Representational Instruction Tuning (GRIT). A gritlm model can generate embeddings as well as "normal" text generation depending on the instructions in the prompt.
imatrix Computes an importance matrix for a model and a given text dataset, which can be used during quantization to enhance the quality of the quantized models (see Pull Request 4861)
infill Demonstrates infill mode with Code Llama models that support it.
llava-cli CLI for multimodal inference with LLaVA (Large Language and Vision Assistant) models
lookahead Demonstration of lookahead decoding technique, see also Break the Sequential Dependency of LLM Inference Using Lookahead Decoding and Pull Request 4207
lookup-create Demonstration of Prompt Lookup Decoding, see also apoorvumang/prompt-lookup-decoding, Pull Request 4484 and Issue 4226.
lookup-merge
lookup-stats
lookup
minicpmv-cli
parallel (Simplified) simulation of serving incoming requests in parallel
passkey A passkey retrieval task is an evaluation method used to measure a language model's ability to recall information from long contexts. See also Pull Requests 3856 and 4810.
perplexity A tool for measuring the perplexity (and other quality metrics) of a model over a given text.
q8dot
quantize
qwen2vl-cli
retrieval Demonstration of a simple retrieval technique based on cosine similarity.
run A comprehensive example for (minimally) running llama.cpp models. Useful for inference. Used by RamaLama.
save-load-state
server The LLaMA.cpp HTTP Server: A lightweight, OpenAI API compatible, HTTP server for serving LLMs. Based on cpp-httplib and nlohmann/json. (llama-server -m model.gguf --port 11434). See also grammars/ below.
simple-chat Demonstration of a simple chat program using the chat template from the GGUF file.
simple A minimal example for implementing apps with llama.cpp. Useful for developers.
speculative-simple Demonstration of basic greedy speculative decoding
speculative Demonstration of speculative decoding and tree-based speculative decoding techniques
tokenize
tts
vdot
In addition, I found the following test executables (again, with the common prefix test- omitted):
arg-parser
autorelease
backend-ops
barrier
c
chat-template
gguf
log
model-load-cancel
quantize-fns
quantize-perf
rope
tokenizer-0
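The listings can be reproduced roughly like this (just a sketch; the grep/sed part merely strips the common prefixes):
$ ls build/bin | grep '^llama-' | sed 's/^llama-//'
$ ls build/bin | grep '^test-'  | sed 's/^test-//'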

Using the deprecated Makefile

cd llama.cpp
make -j

Trying a model

Under models/, I found some *.gguf files which I believed I could use for a first test:
$ build/bin/llama-cli -m models/ggml-vocab-bert-bge.gguf --prompt "tell me a nice story" --predict 100
…
llama_model_load: error loading model: missing tensor 'token_embd.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/ggml-vocab-bert-bge.gguf'
main: error: unable to load model
Apparently, the files under models/ are not real models, just the vocabulary part. The actual models need to be downloaded.
So, I download a couple of (small) models…
$ mkdir ~/llm-models
$ curl -L https://huggingface.co/aisuko/gpt2-117M-gguf/resolve/main/ggml-model-Q4_K_M.gguf                                -o ~/llm-models/ggml-model-Q4_K_M.gguf
$ curl -L https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -o ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
… show the sizes of the downloaded models …
$ (cd ~/llm-models; stat --printf '%s\t%n\n' *)
4368450304      dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
112858624       ggml-model-Q4_K_M.gguf
… and test them with the CLI tool:
$ ./build/bin/llama-cli -m ~/llm-models/ggml-model-Q4_K_M.gguf                -p "Tell me about programmers and coffee" -n 200
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf  -p "Tell me about programmers and coffee" -n 200
In Windows, with PowerShell, a model can be downloaded with
PS:> $progressPreference = 'SilentlyContinue'
PS:> Invoke-WebRequest https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -OutFile ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf

Testing the webserver

A model can be run in a webserver and then accessed from a browser:
$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
By default, the server listens on port 8080, but this can be changed like so:
$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf --port 8888
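Besides the browser UI, the server also exposes an HTTP API. Since it is advertised as OpenAI API compatible (see the server entry above), a chat completion should be obtainable with something like the following (request shape as I understand the OpenAI-style API):
$ curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [
            { "role": "user", "content": "Tell me about programmers and coffee" }
          ]
        }' | jq -r '.choices[0].message.content'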

grammars/

GBNF (GGML BNF) is a format to constrain the output produced by llama.cpp (for example: only valid JSON, or only emojis).
GBNF grammars can be used in
  • llama-server (where the grammar is passed in the grammar field of the request body)
  • llama-cli, using the --grammar and --grammar-file flags
  • llama-gbnf-validator
The -j (--json-schema) flag in action:
llama-cli \
  -hfr bartowski/Phi-3-medium-128k-instruct-GGUF \
  -hff Phi-3-medium-128k-instruct-Q8_0.gguf \
  -j '{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string",
                "minLength": 1,
                "maxLength": 100
            },
            "age": {
                "type": "integer",
                "minimum": 0,
                "maximum": 150
            }
        },
        "required": ["name", "age"],
        "additionalProperties": false
    },
    "minItems": 10,
    "maxItems": 100
  }' \
  -p 'Generate a {name, age}[] JSON array with famous actors of all ages.'
Any JSON schema can be converted to a grammar on the command line with
examples/json_schema_to_grammar.py name-age-schema.json
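A hand-written grammar can be passed with --grammar-file. As a minimal sketch (file name and prompt are made up; the model is the one downloaded above), the following grammar restricts the output to a plain yes/no answer:
$ cat > /tmp/yesno.gbnf <<'EOF'
root ::= ("yes" | "no")
EOF
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    --grammar-file /tmp/yesno.gbnf \
    -p "Is coffee popular among programmers? Answer with yes or no." -n 8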

scripts

build-info.sh
check-requirements.sh checks all requirements files for each top-level convert*.py script.
ci-run.sh What's the difference to ci/run.sh?
compare-commits.sh Checks out two different commits from the repository, builds the project and then runs compare-llama-bench.py
compare-llama-bench.py
debug-test.sh
gen-authors.sh Adds new authors to the AUTHORS file
gen-unicode-data.py
get-flags.mk
get-hellaswag.sh
get_chat_template.py Fetches the Jinja chat template of a HuggingFace model.
get-pg.sh
get-wikitext-103.sh
get-wikitext-2.sh
get-winogrande.sh
hf.sh Download a Hugging Face model (for example ./llama-cli -m $(./scripts/hf.sh https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf))
install-oneapi.bat
qnt-all.sh
run-all-perf.sh
run-all-ppl.sh
sync-ggml-am.sh Synchronize ggml changes to llama.cpp
sync-ggml.last
sync-ggml.sh
verify-checksum-models.py
xxd.cmake

cmake/x64-windows-llvm.cmake

The cmake directory contains a file named x64-windows-llvm.cmake.
Can this file explicitly be used for an LLVM build?
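My guess (not verified) is that it is a CMake toolchain-style file that could be passed explicitly, along the lines of:
cmake -B build -DCMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake
cmake --build build --config Release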

TODO

llama build number, build commit, build compiler and build target

scripts/build-info.sh determines the following values:
Value               Command                      Example value
llama build number  git rev-list --count HEAD    4589
build commit        git rev-parse --short HEAD   eb7cf15a
build compiler      $CC --version | head -1      gcc (Debian 10.2.1-6) 10.2.1 20210110
build target        $CC --dumpmachine            x86_64-linux-gnu
These values then seem to be used to produce common/build-info.cpp from common/build-info.cpp.in.
However, I don't find build-info.sh referenced in any other file except the Makefile, which is deprecated.
Therefore, I now believe that these figures are determined in cmake/build-info.cmake (which in turn seems to be invoked or referenced in common/cmake/build-info-gen-cpp.cmake).

Environment variables

LLAMA_ARG_MODEL Model path? (for example models/7B/ggml-model-f16.gguf)
GG_BUILD_CUDA ?
GG_BUILD_SYCL ?
GG_BUILD_VULKAN ?
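Assuming that LLAMA_ARG_MODEL is indeed the environment-variable counterpart of the -m/--model option (my reading, not verified), the server could be started like so:
$ LLAMA_ARG_MODEL=$HOME/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf build/bin/llama-server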

models/7B/ggml-model-f16.gguf

models/7B/ggml-model-f16.gguf is the default path where llama.cpp looks for a model if none is explicitly specified, see #define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf" in common/common.h.

ci/run.sh

In addition to llama.cpp's GitHub actions, a commit to the repository triggers the execution of ci/run.sh on dedicated cloud instances, which permits heavier workloads than plain GitHub actions.
What's the difference to scripts/ci-run.sh?

examples/jeopardy

examples/jeopardy is pretty much just a straight port of aigoopy/llm-jeopardy with an added graph viewer.

examples/llama.android

examples/quantize-stats

examples/rpc

The RPC server allows running a ggml backend on a remote host.
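As far as I understand the example (the flags and the address below are assumptions, not verified here), an rpc-server is started on the remote host and llama-cli is then pointed at it with --rpc:
# on the remote host (hypothetical address 192.168.1.10)
$ build/bin/rpc-server -p 50052
# on the local host
$ build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    --rpc 192.168.1.10:50052 -ngl 99 \
    -p "Tell me about programmers and coffee" -n 100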

examples/save-load-state

examples/sycl

examples/simple-cmake-pkg

examples/llama.swiftui

Local inference of llama.cpp on an iPhone.
See also this discussion

Windows: PrefetchVirtualMemory unavailable

Apparently, when compiling/running on Windows, _WIN32_WINNT must be at least 0x602, otherwise, the constructor of llama_mmap::impl throws PrefetchVirtualMemory unavailable.
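I have not tried it, but if the toolchain does not define a recent enough _WIN32_WINNT on its own, defining it on the compiler command line should be a possible workaround, for example:
P:\ath\to\llama.cpp> cmake -B build -G "MinGW Makefiles" -DCMAKE_CXX_FLAGS="-D_WIN32_WINNT=0x602"
P:\ath\to\llama.cpp> cmake --build build --config Release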

main

main was renamed to llama-cli (see also examples/deprecation-warning/README.md).

Options

Parameter Description
-h --help Show this help message and exit
-i --interactive Run in interactive mode
--interactive-first Run in interactive mode and wait for input right away
-ins, --instruct Run in instruction mode (use with Alpaca models)
-r --reverse-prompt PROMPT Run in interactive mode and poll user input upon seeing PROMPT (can be specified more than once for multiple prompts).
--color Colorise output to distinguish prompt and user input from generations
-s --seed SEED Seed for random number generator (default: -1, use random seed for <= 0)
-t --threads N Number of threads to use during computation (default: 12)
-p --prompt PROMPT Prompt to start generation with (default: empty)
--random-prompt Start with a randomized prompt.
--in-prefix STRING String to prefix user inputs with (default: empty)
-f --file FNAME Prompt file to start generation.
-n --n_predict N Number of tokens to predict (default: 128, -1 = infinity)
--top_k N Top-k sampling (default: 40)
--top_p N Top-p sampling (default: 0.9)
--repeat_last_n N Last n tokens to consider for penalize (default: 64)
--repeat_penalty N Penalize repeat sequence of tokens (default: 1.1)
-c --ctx_size N Size of the prompt context (default: 512)
--ignore-eos Ignore end of stream token and continue generating
--memory_f32 Use f32 instead of f16 for memory key+value
--temp N Temperature (default: 0.8)
--n_parts N Number of model parts (default: -1 = determine from dimensions)
-b --batch_size N Batch size for prompt processing (default: 8)
--perplexity Compute perplexity over the prompt
--keep Number of tokens to keep from the initial prompt (default: 0, -1 = all)
--mlock Force system to keep model in RAM rather than swapping or compressing
--mtest Determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters (uncomment the "used_mem" line in llama.cpp to see the results)
--verbose-prompt Print prompt before generation
-m --model FNAME Model path (default: models/llama-7B/ggml-model.bin)
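For illustration, a sketch of an interactive session combining a few of these options (the model path is the download from above; the prompt and the reverse prompt are made up and would need to match the model's expected chat format):
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
    -i --color -r "User:" --in-prefix " " \
    -p "A chat between a curious User and a helpful Assistant. User: Hello, who are you? Assistant:" \
    -n 256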

Ollama

Ollama is based on llama.cpp.

(Manual) installation

The instructions for a manual installation point to a self-contained archive that can be downloaded and extracted:
$ curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o /tmp/ollama-linux-amd64.tgz
Inspecting the downloaded archive file:
$ tar tf /tmp/ollama-linux-amd64.tgz | grep -v '/$'
./bin/ollama
./lib/ollama/libcudart.so.12
./lib/ollama/libcublasLt.so.12.4.5.8
./lib/ollama/libcudart.so.11.0
./lib/ollama/libcublas.so.11
./lib/ollama/libcublas.so.12
./lib/ollama/libcudart.so.11.3.109
./lib/ollama/libcublas.so.12.4.5.8
./lib/ollama/libcudart.so.12.4.127
./lib/ollama/libcublasLt.so.11.5.1.109
./lib/ollama/libcublasLt.so.11
./lib/ollama/runners/cuda_v11_avx/ollama_llama_server
./lib/ollama/runners/cuda_v11_avx/libggml_cuda_v11.so
./lib/ollama/runners/rocm_avx/ollama_llama_server
./lib/ollama/runners/rocm_avx/libggml_rocm.so
./lib/ollama/runners/cuda_v12_avx/ollama_llama_server
./lib/ollama/runners/cuda_v12_avx/libggml_cuda_v12.so
./lib/ollama/runners/cpu_avx2/ollama_llama_server
./lib/ollama/runners/cpu_avx/ollama_llama_server
./lib/ollama/libcublas.so.11.5.1.109
./lib/ollama/libcublasLt.so.12
These files need to be extracted under /usr:
$ sudo tar -C /usr -xzf /tmp/ollama-linux-amd64.tgz

Testing the installation

Start the Ollama server:
$ ollama serve
In another terminal:
$ ollama -v
ollama version is 0.5.7
As can be seen in the terminal where the server was started, ollama -v basically accesses the API endpoint /api/version, i.e. something like
$ curl -s localhost:11434/api/version
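The endpoint returns a small JSON object from which the version string can be extracted with jq:
$ curl -s localhost:11434/api/version | jq -r '.version'     # prints 0.5.7, cf. ollama -v above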

Web API

$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffe"
}' | jq

{
  "error": "model 'llama2' not found"
}
The model is not available, we need to pull it:
$ curl -s http://localhost:11434/api/pull -d '{
  "model": "llama2"
}' | jq
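Equivalently, the model can be pulled with the CLI:
$ ollama pull llama2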
After downloading the model, query the available models:
$ curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
Try again:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee"
}' | jq

{
  "error": "model requires more system memory (8.4 GiB) than is available (1.8 GiB)"
}
Same thing, but on a machine with more memory:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee"
}' | jq

{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.867393127Z",
  "response": "\n",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.947556123Z",
  "response": "Program",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:46.996489966Z",
  "response": "mers",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:27:47.044551309Z",
  "response": " and",
  "done": false
}

   …

{
  "model": "llama2",
  "created_at": "2025-02-01T14:28:57.985358597Z",
  "response": "!",
  "done": false
}
{
  "model": "llama2",
  "created_at": "2025-02-01T14:28:58.038683996Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "context": [
    518,
    25580,
    …
    26529,
    29991
  ],
  "total_duration": 33946247078,
  "load_duration": 7990899,
  "prompt_eval_count": 29,
  "prompt_eval_duration": 558000000,
  "eval_count": 703,
  "eval_duration": 33378000000
}
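By default, /api/generate streams the answer as one JSON object per token, as shown above. As far as I can tell from Ollama's API documentation, streaming can be disabled with "stream": false, in which case a single object is returned whose response field contains the complete answer:
$ curl -s http://localhost:11434/api/generate -d '{
   "model" : "llama2",
   "prompt": "tell me about programmers and coffee",
   "stream": false
}' | jq -r '.response'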

Checking if GPU is used

While a model is being used, ollama ps lists, among other things, whether the GPU is being used:
$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL              
llama2:latest    78e26419b446    5.6 GB    100% GPU     4 minutes from now    

Links

ggml is a tensor library, written in C, that is used in llama.cpp. In fact, the description of ggml reads: Note that this project is under development and not ready for production use. Some of the development is currently happening in the llama.cpp and whisper.cpp repos
Python bindings for llama.cpp are provided by the llama-cpp-python project.
