$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

According to README.md, the project is built with

$ cmake -B build
$ cmake --build build

The -B and --build options specify the build directory (which is created with -B).

On Windows (with MinGW), the -G "MinGW Makefiles" option is needed:

P:\ath\to\llama.cpp> cmake -B build -G "MinGW Makefiles"
P:\ath\to\llama.cpp> cmake --build build --config Release
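The build type can be chosen at configure time, and a parallel build speeds things up considerably. This is a sketch for Linux; the GGML_CUDA option mentioned in the comment is an assumption that only applies if the CUDA toolkit is installed:

# CPU-only release build; a GPU backend can be enabled with e.g. -DGGML_CUDA=ON (requires the CUDA toolkit)
$ cmake -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build -j $(nproc)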
The built executables are found under build/bin. In the following list, the common prefix llama- is removed:

batched-bench | Benchmark the batched decoding performance |
batched | Demonstration of batched generation from a given prompt. A Swift clone is found under examples/batched.swift |
bench | Benchmark the performance of the inference for various parameters. |
cli | A CLI tool for accessing and experimenting with most of llama.cpp's functionality. For the --grammar, --grammar-file and --json-schema command line options, see grammars below. |
convert-llama2c-to-ggml | This example reads weights from Andrej Karpathy's llama2.c project and saves them in a ggml-compatible format. The vocab that is available in models/ggml-vocab.bin is used by default. |
cvector-generator | Demonstration of how to generate a control vector using gguf models. |
embedding | Demonstrates the generation of a high-dimensional embedding vector for a given text |
eval-callback | Using callbacks during inference to print all operations and tensor data to the console. |
export-lora | Apply LoRA adapters to base models and export the resulting model |
gen-docs | |
gguf-hash | Hash GGUF files to detect differences on a per-model and per-tensor level. |
gguf-split | CLI to split and merge GGUF files |
gguf | |
gritlm | Example for Generative Representational Instruction Tuning (GRIT). A gritlm model can generate embeddings as well as "normal" text, depending on the instructions in the prompt. |
imatrix | Computes an importance matrix for a model and given text dataset, which can be used during quantization to enhance the quality of the quantized models (see Pull Request 4861) |
infill | Using the infill mode with Code Llama models that support it. |
llava-cli | LLaVA ? |
lookahead | Demonstration of lookahead decoding technique, see also Break the Sequential Dependency of LLM Inference Using Lookahead Decoding and Pull Request 4207 |
lookup-create | Demonstration of Prompt Lookup Decoding, see also apoorvumang/prompt-lookup-decoding, Pull Request 4484 and Issue 4226. |
lookup-merge | |
lookup-stats | |
lookup | |
minicpmv-cli | |
parallel | (Simplified) simulation of serving incoming requests in parallel |
passkey | A passkey retrieval task is an evaluation method used to measure a language model's ability to recall information from long contexts. See also Pull Requests 3856 and 4810. |
perplexity | A tool for measuring the perplexity (and other quality metrics) of a model over a given text. |
q8dot | |
quantize | … |
qwen2vl-cli | |
retrieval | Demonstration of a simple retrieval technique based on cosine similarity. |
run | A comprehensive example for (minimally) running llama.cpp models. Useful for inferencing. Used with RamaLama |
save-load-state | |
server | The LLaMA.cpp HTTP Server: a lightweight, OpenAI API compatible HTTP server for serving LLMs. Based on cpp-httplib and nlohmann/json. (llama-server -m model.gguf --port 11434). See also grammars/ below. |
simple-chat | Demonstration of a simple chat program using the chat template from the GGUF file. |
simple | A minimal example for implementing apps with llama.cpp. Useful for developers. |
speculative-simple | Demonstration of basic greedy speculative decoding |
speculative | Demonstration of speculative decoding and tree-based speculative decoding techniques |
tokenize | … |
tts | … |
vdot |
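As a quick sanity check that the build works, llama-bench can be pointed at any local GGUF file; the model path below refers to the small GPT-2 model downloaded further down in these notes and is only an example:

$ build/bin/llama-bench -m ~/llm-models/ggml-model-Q4_K_M.gguf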
Tests (the common prefix test_ is omitted below):

arg-parser | |
autorelease | |
backend-ops | |
barrier | |
c | |
chat-template | |
gguf | |
log | |
model-load-cancel | |
quantize-fns | |
quantize-perf | |
rope | |
tokenizer-0 |
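Assuming the tests were built, they can be run through CTest from the build directory:

$ (cd build && ctest --output-on-failure)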
$ cd llama.cpp
$ make -j

Under models/, I found some *.gguf files which I believed I could use for a first test:

$ build/bin/llama-cli -m models/ggml-vocab-bert-bge.gguf --prompt "tell me a nice story" --predict 100
…
llama_model_load: error loading model: missing tensor 'token_embd.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/ggml-vocab-bert-bge.gguf'
main: error: unable to load model

The *.gguf files under models/ are not real models, just the vocabulary part. The actual models need to be downloaded:

$ mkdir ~/llm-models
$ curl -L https://huggingface.co/aisuko/gpt2-117M-gguf/resolve/main/ggml-model-Q4_K_M.gguf -o ~/llm-models/ggml-model-Q4_K_M.gguf
$ curl -L https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -o ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
$ (cd ~/llm-models; stat --printf '%s\t%n\n' *)
4368450304	dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
112858624	ggml-model-Q4_K_M.gguf
$ ./build/bin/llama-cli -m ~/llm-models/ggml-model-Q4_K_M.gguf -p "Tell me about programmers and coffee" -n 200
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -p "Tell me about programmers and coffee" -n 200
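A few commonly used generation parameters can be added to such an invocation; the values below are arbitrary examples, not recommendations:

# -n: tokens to predict, -c: context size, -t: CPU threads, --temp: sampling temperature
$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
      -p "Tell me about programmers and coffee" -n 200 -c 4096 -t 8 --temp 0.7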
On Windows, the model can be downloaded with PowerShell:

PS:> $progressPreference = 'SilentlyContinue'
PS:> invoke-webRequest https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf -outfile ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf
$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf

A different port can be chosen with --port:

$ build/bin/llama-server -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf --port 8888
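Once the server is running, it can be queried over HTTP. The sketch below assumes the server started above (listening on port 8888) and uses the OpenAI-compatible chat completions endpoint:

$ curl -s http://localhost:8888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "messages": [ { "role": "user", "content": "Tell me about programmers and coffee" } ],
            "max_tokens": 200
          }' | jq -r '.choices[0].message.content'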
Grammars are used by

- llama-server (where the grammar is passed in the grammar body field)
- llama-cli, using the --grammar and --grammar-file flags
- llama-gbnf-validator
- tests/test-json-schema-to-grammar.cpp (to see which features are likely supported)

The -j (--json-schema) flag in action:

llama-cli \
  -hfr bartowski/Phi-3-medium-128k-instruct-GGUF \
  -hff Phi-3-medium-128k-instruct-Q8_0.gguf \
  -j '{
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": { "type": "string", "minLength": 1, "maxLength": 100 },
        "age":  { "type": "integer", "minimum": 0, "maximum": 150 }
      },
      "required": ["name", "age"],
      "additionalProperties": false
    },
    "minItems": 10,
    "maxItems": 100
  }' \
  -p 'Generate a {name, age}[] JSON array with famous actors of all ages.'

A JSON schema can also be converted to a grammar ahead of time:

$ examples/json_schema_to_grammar.py name-age-schema.json
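For comparison, a hand-written GBNF grammar can be passed directly with --grammar. The tiny yes/no grammar below is just an illustrative sketch:

$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
      -p "Is coffee popular among programmers? Answer with yes or no." \
      --grammar 'root ::= "yes" | "no"'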
The scripts/ directory contains the following files:

build-info.sh | |
check-requirements.sh | Checks all requirements files for each top-level convert*.py script. |
ci-run.sh | What's the difference to ci/run.sh? |
compare-commits.sh | Checks out two different commits from the repository, builds the project and then runs compare-llama-bench.py |
compare-llama-bench.py | |
debug-test.sh | |
gen-authors.sh | Adds new authors to the AUTHORS file. |
gen-unicode-data.py | |
get-flags.mk | |
get-hellaswag.sh | |
get_chat_template.py | Fetches the Jinja chat template of a HuggingFace model. |
get-pg.sh | |
get-wikitext-103.sh | |
get-wikitext-2.sh | |
get-winogrande.sh | |
hf.sh | Download a Hugging Face model (for example ./llama-cli -m $(./scripts/hf.sh https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf)) |
install-oneapi.bat | |
qnt-all.sh | |
run-all-perf.sh | |
run-all-ppl.sh | |
sync-ggml-am.sh | Synchronize ggml changes to llama.cpp |
sync-ggml.last | |
sync-ggml.sh | |
verify-checksum-models.py | |
xxd.cmake |
The cmake directory contains a file named x64-windows-llvm.cmake.

scripts/build-info.sh determines the following values:

Value | Command | Example value |
llama build number | git rev-list --count HEAD | 4589 |
build commit | git rev-parse --short HEAD | eb7cf15a |
build compiler | $CC --version | head -1 | gcc (Debian 10.2.1-6) 10.2.1 20210110 |
build target | $CC --dumpmachine | x86_64-linux-gnu |
These values are used to create common/build-info.cpp from common/build-info.cpp.in. build-info.sh does not seem to be referenced in any other file except the Makefile, which is deprecated; similar functionality is provided by cmake/build-info.cmake (which in turn seems to be invoked or referenced in common/cmake/build-info-gen-cpp.cmake).
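The following shell sketch reproduces what build-info.sh gathers; the variable names are illustrative, not the script's actual ones:

# illustrative: the four values from the table above, assuming CC points at the compiler
build_number=$(git rev-list --count HEAD)
build_commit=$(git rev-parse --short HEAD)
build_compiler=$($CC --version | head -1)
build_target=$($CC --dumpmachine)
echo "$build_number ($build_commit), built with $build_compiler for $build_target"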
Some environment variables:

LLAMA_ARG_MODEL | Model path? (for example models/7B/ggml-model-f16.gguf) |
GG_BUILD_CUDA | ? |
GG_BUILD_SYCL | ? |
GG_BUILD_VULKAN | ? |
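The GG_BUILD_* variables appear to be evaluated by the CI scripts, while LLAMA_ARG_MODEL is read by the command line tools themselves. A quick sketch, assuming a build recent enough to honor the LLAMA_ARG_* environment variables and the model downloaded above:

$ LLAMA_ARG_MODEL=~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
      ./build/bin/llama-cli -p "Tell me about programmers and coffee" -n 50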
models/7B/ggml-model-f16.gguf is the default path where llama.cpp looks for a model if not explicitly specified, see

#define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"

in common/common.h.

ci/run.sh runs the CI on dedicated cloud instances, which permits heavier workloads than just GitHub actions; see ci/README.md. Compare scripts/ci-run.sh.
examples/jeopardy is pretty much just a straight port of aigoopy/llm-jeopardy with an added graph viewer.

On Windows, _WIN32_WINNT must be at least 0x602, otherwise the constructor of llama_mmap::impl throws "PrefetchVirtualMemory unavailable".

Some command line options:

Option | Param value | Description |
-h, --help | | Show this help message and exit |
-i, --interactive | | Run in interactive mode |
--interactive-first | | Run in interactive mode and wait for input right away |
-ins, --instruct | | Run in instruction mode (use with Alpaca models) |
-r, --reverse-prompt | PROMPT | Run in interactive mode and poll user input upon seeing PROMPT (can be specified more than once for multiple prompts) |
--color | | Colorise output to distinguish prompt and user input from generations |
-s, --seed | SEED | Seed for random number generator (default: -1, use random seed for <= 0) |
-t, --threads | N | Number of threads to use during computation (default: 12) |
-p, --prompt | PROMPT | Prompt to start generation with (default: empty) |
--random-prompt | | Start with a randomized prompt |
--in-prefix | STRING | String to prefix user inputs with (default: empty) |
-f, --file | FNAME | Prompt file to start generation |
-n, --n_predict | N | Number of tokens to predict (default: 128, -1 = infinity) |
--top_k | N | Top-k sampling (default: 40) |
--top_p | N | Top-p sampling (default: 0.9) |
--repeat_last_n | N | Last n tokens to consider for penalization (default: 64) |
--repeat_penalty | N | Penalize repeated sequences of tokens (default: 1.1) |
-c, --ctx_size | N | Size of the prompt context (default: 512) |
--ignore-eos | | Ignore end of stream token and continue generating |
--memory_f32 | | Use f32 instead of f16 for memory key+value |
--temp | N | Temperature (default: 0.8) |
--n_parts | N | Number of model parts (default: -1 = determine from dimensions) |
-b, --batch_size | N | Batch size for prompt processing (default: 8) |
--perplexity | | Compute perplexity over the prompt |
--keep | | Number of tokens to keep from the initial prompt (default: 0, -1 = all) |
--mlock | | Force system to keep model in RAM rather than swapping or compressing |
--mtest | | Determine the maximum memory usage needed to do inference for the given n_batch and n_predict parameters (uncomment the "used_mem" line in llama.cpp to see the results) |
--verbose-prompt | | Print prompt before generation |
-m, --model | FNAME | Model path (default: models/llama-7B/ggml-model.bin) |
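A sketch of how a few of these options combine into an interactive session (the flag spellings are the ones listed above; newer versions may have renamed some of them):

$ ./build/bin/llama-cli -m ~/llm-models/dolphin-2.2.1-mistral-7b.Q4_K_M.gguf \
      -i --color -r "User:" --in-prefix " " \
      -p "Transcript of a chat between a User and an Assistant. User: Hello!"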
ollama is distributed as a tarball:

$ curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o /tmp/ollama-linux-amd64.tgz
$ tar tf /tmp/ollama-linux-amd64.tgz | grep -v '/$'
./bin/ollama
./lib/ollama/libcudart.so.12
./lib/ollama/libcublasLt.so.12.4.5.8
./lib/ollama/libcudart.so.11.0
./lib/ollama/libcublas.so.11
./lib/ollama/libcublas.so.12
./lib/ollama/libcudart.so.11.3.109
./lib/ollama/libcublas.so.12.4.5.8
./lib/ollama/libcudart.so.12.4.127
./lib/ollama/libcublasLt.so.11.5.1.109
./lib/ollama/libcublasLt.so.11
./lib/ollama/runners/cuda_v11_avx/ollama_llama_server
./lib/ollama/runners/cuda_v11_avx/libggml_cuda_v11.so
./lib/ollama/runners/rocm_avx/ollama_llama_server
./lib/ollama/runners/rocm_avx/libggml_rocm.so
./lib/ollama/runners/cuda_v12_avx/ollama_llama_server
./lib/ollama/runners/cuda_v12_avx/libggml_cuda_v12.so
./lib/ollama/runners/cpu_avx2/ollama_llama_server
./lib/ollama/runners/cpu_avx/ollama_llama_server
./lib/ollama/libcublas.so.11.5.1.109
./lib/ollama/libcublasLt.so.12
The tarball is extracted under /usr:

$ sudo tar -C /usr -xzf /tmp/ollama-linux-amd64.tgz
$ ollama serve
$ ollama -v
ollama version is 0.5.7
ollama -v basically accesses the API /api/version, i.e. something like

$ curl -s localhost:11434/api/version
$ curl -s http://localhost:11434/api/generate -d '{ "model" : "llama2", "prompt": "tell me about programmers and coffe" }' | jq
{
  "error": "model 'llama2' not found"
}

The model needs to be pulled first:

$ curl -s http://localhost:11434/api/pull -d '{ "model": "llama2" }' | jq
$ curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
$ curl -s http://localhost:11434/api/generate -d '{ "model" : "llama2", "prompt": "tell me about programmers and coffee" }' | jq
{
  "error": "model requires more system memory (8.4 GiB) than is available (1.8 GiB)"
}

$ curl -s http://localhost:11434/api/generate -d '{ "model" : "llama2", "prompt": "tell me about programmers and coffee" }' | jq
{ "model": "llama2", "created_at": "2025-02-01T14:27:46.867393127Z", "response": "\n", "done": false }
{ "model": "llama2", "created_at": "2025-02-01T14:27:46.947556123Z", "response": "Program", "done": false }
{ "model": "llama2", "created_at": "2025-02-01T14:27:46.996489966Z", "response": "mers", "done": false }
{ "model": "llama2", "created_at": "2025-02-01T14:27:47.044551309Z", "response": " and", "done": false }
…
{ "model": "llama2", "created_at": "2025-02-01T14:28:57.985358597Z", "response": "!", "done": false }
{
  "model": "llama2",
  "created_at": "2025-02-01T14:28:58.038683996Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "context": [ 518, 25580, … 26529, 29991 ],
  "total_duration": 33946247078,
  "load_duration": 7990899,
  "prompt_eval_count": 29,
  "prompt_eval_duration": 558000000,
  "eval_count": 703,
  "eval_duration": 33378000000
}
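Streaming can be turned off so that the whole answer arrives in a single JSON object:

$ curl -s http://localhost:11434/api/generate -d '{
      "model" : "llama2",
      "prompt": "tell me about programmers and coffee",
      "stream": false
  }' | jq -r '.response'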
ollama ps lists, among other things, whether the GPU is being used:

$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL
llama2:latest    78e26419b446    5.6 GB    100% GPU     4 minutes from now
ggml is a tensor library, written in C, that is used in llama.cpp. In fact, the description of ggml reads:

  Note that this project is under development and not ready for production use. Some of the development is currently happening in the llama.cpp and whisper.cpp repos