By Bernat Sampera

Migrating a local setup from Ollama to llama.cpp

Ollama is a great tool for running local LLM models, but it's not the fastest and it occasionally has bugs. For a smoother experience you can use llama.cpp directly. This guide shows how to reuse your existing Ollama models with llama.cpp.

1. How to match models to blobs in Ollama

Ollama stores all your models as blobs here:

ls ~/.ollama/models/blobs
sha256-0ba8f0e314b4264dfd19df045cde9d4c394a52474bfasdad6a3de22a4ca31a177	
sha256-11ce4ee3e170f6adebac9a991c22e22ab3f8530e154ee6242454c4bc73061c258	
sha256-1a4c3c319823fdabddb22479d0b10820a7a39fe49e45c40242228fbe83926dc14	
....

To see which blob corresponds to which model, run the following:

ollama show --modelfile llama3.1:latest

In the output you will see a FROM line like this, which points to the blob:

FROM /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
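
If you have many models, a small script can print this mapping for all of them at once. Here is a sketch (the file name map_models.py is just an example; it assumes ollama is on your PATH and that ollama list prints a header row followed by one model per line):

# map_models.py - print every ollama model together with the blob its FROM line points to
import subprocess


def blob_for(model: str) -> str:
    """Return the path after FROM in the model's Modelfile."""
    modelfile = subprocess.run(
        ["ollama", "show", "--modelfile", model],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in modelfile.splitlines():
        if line.startswith("FROM "):
            return line[len("FROM "):]
    return "(no FROM line found)"


# Skip the header row of `ollama list` and take the first column as the model name
rows = subprocess.run(
    ["ollama", "list"], capture_output=True, text=True, check=True
).stdout.splitlines()[1:]

for row in rows:
    if row.strip():
        name = row.split()[0]
        print(f"{name}: {blob_for(name)}")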

2. Link the blob into a models folder

Create a folder where you'll store the models you'll use with llama.cpp, and go into it:

mkdir -p ~/my-llm-models
cd ~/my-llm-models

Create a symbolic link to the Ollama blob found in the previous step, giving it a .gguf extension. This is the file llama.cpp will load.

ln -s /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe Llama-3.1-8B-Instruct.gguf
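
Before loading it with llama.cpp, you can sanity-check that the link resolves to the blob. A minimal Python sketch, assuming the folder and file names from above:

# check_link.py - confirm the .gguf symlink points at a real blob
from pathlib import Path

gguf = Path.home() / "my-llm-models" / "Llama-3.1-8B-Instruct.gguf"
target = gguf.resolve()  # follows the symlink back to the ollama blob

print(f"{gguf} -> {target}")
print(f"size: {target.stat().st_size / 1e9:.2f} GB")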

3. Use llama.cpp in Python

3.1 Installing llama.cpp

First, install the llama-cpp-python library.

uv add llama-cpp-python

Alert: Apple Silicon

On Apple Silicon, a few things have to be added before installing so that the Metal backend is used.

# pyproject.toml
[project]
dependencies = [
    "llama-cpp-python>=0.3.16", # or whatever version you need
]

[[tool.uv.index]]
name = "llama-cpp-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"

[tool.uv.sources]
llama-cpp-python = { index = "llama-cpp-metal" }

And then run:

CMAKE_ARGS="-DLLAMA_METAL=on" uv sync  
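
Once the sync finishes, a quick check confirms the package imports correctly. This is only a sanity check; whether llama_supports_gpu_offload is exported can depend on your llama-cpp-python version, so it is guarded below:

# check_install.py - verify llama-cpp-python is installed
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

# Reports whether GPU offload (Metal on Apple Silicon) was compiled in,
# if this low-level helper is exposed by your version.
if hasattr(llama_cpp, "llama_supports_gpu_offload"):
    print("GPU offload compiled in:", llama_cpp.llama_supports_gpu_offload())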

3.2 Running llama.cpp from Python

Finally, you can use it like this:

# index.py
from llama_cpp import Llama


def init_model(model_path: str, n_gpu_layers: int = -1, n_ctx: int = 8192) -> Llama:
    """Initializes a model from a GGUF file path.
    This function simply loads the model and returns the Llama object.
    """
    print(f"Loading model from: {model_path}")

    # Directly initialize and return the Llama object.
    # On Apple Silicon, n_gpu_layers=-1 offloads all possible layers to the Metal GPU.
    return Llama(
        model_path=model_path,
        n_gpu_layers=n_gpu_layers,
        n_ctx=n_ctx,
        verbose=False,  # Set to True for more detailed output
    )


# --- Main script ---
# Path to the symbolic link you created
my_model_path = "/Users/bsampera/my-llm-models/Llama-3.1-8B-Instruct.gguf"

# 1. Initialize the model using our simple function
model = init_model(model_path=my_model_path)

# 2. Prepare your chat messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a symbolic link in one short sentence?"},
]

# 3. Call the standard create_chat_completion method
#    Pass generation parameters like `temperature` here.
response = model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,  # It's good practice to set a max_tokens limit
)

# 4. Print the result
print(response["choices"][0]["message"]["content"])
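
If you would rather see tokens as they are generated instead of waiting for the full answer, create_chat_completion also accepts stream=True and then yields OpenAI-style chunks. A short sketch that reuses the model and messages from the script above:

# Streaming variant: iterate over chunks as they are produced
stream = model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    # The first chunk carries the role; later chunks carry pieces of the content
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()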

Let's connect!

Get in touch if you want updates, examples, and insights on how AI agents, Langchain and more are evolving and where they’re going next.