Llama.cpp Build and Usage Tutorial
Llama.cpp is a lightweight and fast C/C++ implementation for running LLaMA (Large Language Model Meta AI) and other large language models. It is designed to run efficiently even on CPUs, offering an alternative to heavier Python-based implementations.
1. Prerequisites
Before you start, ensure that you have the following installed:
- CMake (version 3.16 or higher)
- A C++ compiler (GCC, Clang, MSVC)
- git (for cloning the repository)
Installing Dependencies
For Linux/macOS, you can install these dependencies via a package manager such as apt or brew. On Windows, ensure you have the necessary tools through Visual Studio or MinGW.
Linux (Ubuntu/Debian)
sudo apt update
sudo apt install cmake g++ git
macOS (using Homebrew)
brew install cmake gcc git
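Windows (using winget)
On Windows, one option is the winget package manager; the package IDs below are assumptions, so verify them with winget search, and install the "Desktop development with C++" workload through the Visual Studio Installer for the compiler:
winget install Kitware.CMake Git.Git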
2. Cloning the Repository
First, clone the llama.cpp repository from GitHub:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
3. Building llama.cpp
3.1 Create Build Directory
Create a directory for building the project:
mkdir build
cd build
3.2 Run CMake
Run cmake to configure the build:
cmake ..
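Optionally, GPU acceleration can be enabled at this step by passing a backend flag. The exact option name depends on your llama.cpp version (recent checkouts use GGML-prefixed flags such as -DGGML_CUDA=ON for NVIDIA or -DGGML_METAL=ON on Apple Silicon, while older releases used names like -DLLAMA_CUBLAS=ON), so treat the command below as a sketch and check the repository's build documentation:
cmake .. -DGGML_CUDA=ON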
3.3 Compile the Code
Once the configuration is done, compile the project:
make -j
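If make is not available (for example when CMake generated a Visual Studio project on Windows), you can let CMake drive the build instead; this is the roughly equivalent, generator-agnostic invocation:
cmake --build . --config Release -j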
4. Download the LLaMA Model Weights
After building, you need to have the LLaMA model weights. These are typically large files (multiple GBs), so you may need to download them from Meta’s official model release or other repositories where the weights are shared.
Note that llama.cpp expects models in the GGUF format.
Steps:
- Download the model weights (such as the 7B or 13B variants).
- Convert or extract the weights as needed and place the resulting GGUF file in a models folder (for example, ./models/).
Note: Make sure you comply with the license terms for any model weights you download.
For example, to download a small, pre-quantized GGUF model (Qwen2 0.5B Instruct) from Hugging Face:
wget https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q4_k_m.gguf
mkdir -p ~/models
mv qwen2-0_5b-instruct-q4_k_m.gguf ~/models
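As an alternative to wget, the Hugging Face CLI can fetch a single file from a model repository. This is a sketch that assumes the huggingface_hub package is installed (pip install -U huggingface_hub):
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-GGUF qwen2-0_5b-instruct-q4_k_m.gguf --local-dir ~/models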
5. Running llama.cpp
Once the build is successful, you can run the model using the compiled executable.
5.1 Running the Model with a Text Prompt
The build produces several example executables in build/bin. The main one for text generation is llama-cli (called main in older versions). For example, you can run it like this:
./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -p "Hello, what's your name?" -n 128
Where:
- -m specifies the path to the model file.
- -p specifies the text prompt you want the model to continue.
- -n specifies the maximum number of tokens to generate.
This command will load the model, perform inference, and print the model's response to the given prompt.
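Recent builds of llama-cli also offer a chat-style conversation mode. The flag name may differ between versions (check ./bin/llama-cli --help), so treat this as a sketch:
./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -cnv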
5.2 Running with Additional Options
There are other options you can use for more control:
- -t: number of CPU threads to use.
- -ngl: number of model layers to offload to the GPU (only has an effect when llama.cpp is built with a GPU backend).
- -p: the prompt for the model.
Example command:
./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -t 4 -p "What is the weather today?"
In this example:
- The model is the quantized Qwen2 0.5B Instruct file downloaded earlier.
- The number of CPU threads is 4.
- The prompt asks about the weather.
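If you built with a GPU backend, you can combine this with -ngl to offload layers; as a rough sketch, a large value such as 99 simply asks for as many layers as possible to be offloaded:
./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -t 4 -ngl 99 -p "What is the weather today?"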
5.3 Running with batched-bench
To check LLM inference performance at different batch sizes, use the llama-batched-bench tool. Here -npp is the number of prompt tokens, -ntg the number of generated tokens per sequence, and -npl a comma-separated list of parallel sequence counts (the batch sizes).
Example command:
./bin/llama-batched-bench -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -npp 128 -ntg 128 -npl 1,2,4
The output looks like this:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 128 | 128 | 1 | 256 | 0.090 | 1429.62 | 1.692 | 75.63 | 1.782 | 143.67 |
| 128 | 128 | 2 | 512 | 0.171 | 1499.20 | 2.846 | 89.94 | 3.017 | 169.70 |
| 128 | 128 | 4 | 1024 | 0.338 | 1513.83 | 4.948 | 103.48 | 5.286 | 193.72 |
Here, B is the batch size (from -npl), S_PP t/s is the prompt-processing speed in tokens per second, and S_TG t/s is the token-generation speed in tokens per second.
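For quick single-run throughput numbers without the batching dimension, the build also produces a llama-bench executable; a minimal sketch:
./bin/llama-bench -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf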
6. Common Issues and Troubleshooting
Model loading issues: Ensure that the model file path is correct and that the file is a valid GGUF model (older GGML-format files are not supported by current versions).
Out of memory (OOM): If the model is too large for your available memory, you may need to use a smaller model or run on a machine with more resources.
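Besides switching to a smaller model or a more aggressive quantization, reducing the context size lowers memory use; as a sketch, -c limits the context window:
./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -c 512 -p "Hello" -n 64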
7. Optimizations
You can optimize llama.cpp for different use cases by adjusting the build options:
Use specific compiler flags for your platform to optimize for performance. For example, on Linux you can pass -O3 to speed up execution (at the cost of somewhat longer compile times):
cmake -DCMAKE_CXX_FLAGS="-O3" ..
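Alternatively, CMake's built-in Release configuration enables standard optimizations; the sketch below also targets the host CPU, assuming a GCC or Clang toolchain that understands -march=native:
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -march=native"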
8. Conclusion
This guide covers the basics of setting up and using llama.cpp. After following these steps, you should be able to build and run LLaMA models efficiently on your machine using this C++ implementation.
If you have any further questions or run into issues, checking the official GitHub repository for updates or opening an issue might help!