Llama.cpp Build and Usage Tutorial

Llama.cpp is a lightweight, fast C/C++ inference engine for LLaMA (Large Language Model Meta AI) and many other large language models. It is designed to run efficiently even on CPUs, offering an alternative to heavier Python-based implementations.

1. Prerequisites

Before you start, ensure that you have the following installed:

  • CMake (version 3.16 or higher)
  • A C++ compiler (GCC, Clang, MSVC)
  • git (for cloning the repository)

Installing Dependencies

For Linux/macOS, you can install these dependencies via package managers like apt or brew. On Windows, ensure you have the necessary tools through Visual Studio or MinGW (a winget-based sketch is shown below, after the macOS instructions).

Linux (Ubuntu/Debian)

sudo apt update
sudo apt install cmake g++ git

macOS (using Homebrew)

brew install cmake gcc git
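
Windows (using winget)

A sketch for Windows, assuming the winget package manager is available; the package IDs below (Kitware.CMake, Git.Git) are the commonly used ones, and the C++ compiler itself comes from Visual Studio or the Visual Studio Build Tools with the "Desktop development with C++" workload:

winget install Kitware.CMake Git.Git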

2. Cloning the Repository

First, clone the llama.cpp repository from GitHub:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

3. Building llama.cpp

3.1 Create Build Directory

Create a directory for building the project:

mkdir build
cd build

3.2 Run CMake

Run cmake to configure the build:

cmake ..

3.3 Compile the Code

Once the configuration is done, compile the project:

make -j
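
Alternatively, recent versions of llama.cpp document a one-step CMake workflow that can be run from the repository root without creating the build directory by hand (a sketch; the exact recommended commands may differ depending on the version you checked out):

cmake -B build
cmake --build build --config Release -j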

4. Download the LLaMA Model Weights

After building, you need to have the LLaMA model weights. These are typically large files (multiple GBs), so you may need to download them from Meta’s official model release or other repositories where the weights are shared.

Steps:

  • Download the LLaMA model weights (such as the 7B, 13B, etc.) or a ready-made GGUF model.
  • Extract the weights and place them in the correct folder (for example, ./models/).

Note: Make sure you comply with the terms of service for downloading the model weights.

For example, to download a small, ready-to-use GGUF model from Hugging Face:

wget https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q4_k_m.gguf
mkdir -p ~/models
mv qwen2-0_5b-instruct-q4_k_m.gguf ~/models
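
If you instead start from original (non-GGUF) weights, they must be converted to GGUF before llama.cpp can load them. A sketch, run from the repository root, assuming a Hugging Face-format model directory at /path/to/hf-model (a hypothetical path) and a recent checkout; the converter script and quantizer binary have been renamed across versions (e.g. convert-hf-to-gguf.py and ./quantize in older releases), and the conversion script needs its Python dependencies (see requirements.txt in the repository):

python3 convert_hf_to_gguf.py /path/to/hf-model --outfile ~/models/model-f16.gguf
./build/bin/llama-quantize ~/models/model-f16.gguf ~/models/model-q4_k_m.gguf Q4_K_M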

5. Running llama.cpp

Once the build is successful, you can run the model using the compiled executable.

5.1 Running the Model with a Text Prompt

The build produces several example executables under bin/ in your build directory; the main command-line tool for text generation is llama-cli. From the build directory, you can run it like this:

./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -p "Hello, what's your name?" -n 128

Where:

  • -m specifies the model path.
  • -p specifies the text prompt you want the model to continue.
  • -n specifies the maximum number of tokens to generate.

This command will load the model, perform inference, and print the model's response to the given prompt.

5.2 Running with Additional Options

There are other options you can use for advanced configuration:

  • -t: Number of CPU threads to use.
  • -ngl: Number of model layers to offload to the GPU (only has an effect when llama.cpp is built with GPU support, e.g. CUDA or Metal).
  • -p: Prompt for the model.

Example Command

./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -t 4 -p "What is the weather today?"

In this example:

The model is the quantized Qwen2 0.5B Instruct model downloaded earlier. The number of threads is 4. The prompt asks about the weather.
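
If you built llama.cpp with GPU support (for example CUDA or Metal), -ngl controls how many layers are offloaded. A sketch, assuming such a build; 99 simply requests offloading more layers than this small model has, i.e. all of them:

./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -ngl 99 -p "What is the weather today?"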

5.3 Running with batched-bench

To measure inference performance at different batch sizes, use the llama-batched-bench tool. Here, -npp sets the number of prompt tokens, -ntg the number of generated tokens per sequence, and -npl the list of parallel sequence counts (batch sizes) to test.

Example Command

./bin/llama-batched-bench -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf  -npp 128 -ntg 128 -npl 1,2,4

The output looks like this:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |    128 |    1 |    256 |    0.090 |  1429.62 |    1.692 |    75.63 |    1.782 |   143.67 |
|   128 |    128 |    2 |    512 |    0.171 |  1499.20 |    2.846 |    89.94 |    3.017 |   169.70 |
|   128 |    128 |    4 |   1024 |    0.338 |  1513.83 |    4.948 |   103.48 |    5.286 |   193.72 |

In this table, PP is the number of prompt tokens per sequence, TG the number of generated tokens per sequence, B the batch size (number of parallel sequences), and N_KV the total number of tokens in the KV cache. S_PP t/s is the prompt-processing speed (tokens per second), and S_TG t/s is the token-generation speed (tokens per second).

6. Common Issues and Troubleshooting

Model loading issues: Ensure that the path passed with -m is correct and that the file is a valid GGUF model; recent versions of llama.cpp no longer load the older GGML .bin format.

Out of memory (OOM): If the model is too large for your available memory, you may need to use a smaller model or run on a machine with more resources.
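
If you suspect memory pressure, one quick check (a sketch; -c sets the context size and the values here are only illustrative) is to load the model with a small context and a short generation, which keeps the KV cache small:

./bin/llama-cli -m ~/models/qwen2-0_5b-instruct-q4_k_m.gguf -c 512 -n 16 -p "test"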

7. Optimizations

You can optimize llama.cpp for different use cases by adjusting the build options:

Use specific compiler flags for your platform to optimize for performance. For example, on Linux you can pass -O3 so the compiler optimizes the generated code for speed (at the cost of somewhat longer compile times):

cmake -DCMAKE_CXX_FLAGS="-O3" ..
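
llama.cpp also exposes CMake options for hardware back ends. Treat the following as a sketch, since option names have changed across versions (recent releases use GGML_-prefixed options such as GGML_CUDA for NVIDIA GPUs, while older ones used LLAMA_CUBLAS or LLAMA_CUDA):

cmake .. -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
make -j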

8. Conclusion

This guide covers the basics of setting up and using llama.cpp. After following these steps, you should be able to build and run LLaMA models efficiently on your machine using this C++ implementation.

If you have any further questions or run into issues, checking the official GitHub repository for updates or opening an issue might help!