By Bernat Sampera

Migrating a local setup from Ollama to llama.cpp

Ollama is a great tool for running local LLM models, but it's not the fastest and it occasionally has bugs. For a smoother experience you can use llama.cpp directly. This guide shows how to reuse your existing Ollama models with llama.cpp.

1. How to match models to blobs in Ollama

Ollama stores all your models as blobs here:

ls ~/.ollama/models/blobs
sha256-0ba8f0e314b4264dfd19df045cde9d4c394a52474bfasdad6a3de22a4ca31a177	
sha256-11ce4ee3e170f6adebac9a991c22e22ab3f8530e154ee6242454c4bc73061c258	
sha256-1a4c3c319823fdabddb22479d0b10820a7a39fe49e45c40242228fbe83926dc14	
....

To see which blob corresponds to which model, run the following:

ollama show --modelfile llama3.1:latest

In the output you will see a FROM line like this, which points to the blob:

FROM /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
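
If you have many models, a small script can print this mapping for all of them at once. Here is a sketch (the file name map_models.py is just an example; it assumes ollama is on your PATH and that ollama list prints a header row followed by one model per line):

# map_models.py - print every ollama model together with the blob its FROM line points to
import subprocess


def blob_for(model: str) -> str:
    """Return the path after FROM in the model's Modelfile."""
    modelfile = subprocess.run(
        ["ollama", "show", "--modelfile", model],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in modelfile.splitlines():
        if line.startswith("FROM "):
            return line[len("FROM "):]
    return "(no FROM line found)"


# Skip the header row of `ollama list` and take the first column as the model name
rows = subprocess.run(
    ["ollama", "list"], capture_output=True, text=True, check=True
).stdout.splitlines()[1:]

for row in rows:
    if row.strip():
        name = row.split()[0]
        print(f"{name}: {blob_for(name)}")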

2. Link the blob into a models folder

Create a folder where you'll store the models you'll use with llama.cpp, and go into it:

mkdir -p ~/my-llm-models
cd ~/my-llm-models

Create a symbolic link to the Ollama blob found in the previous step, giving it a .gguf extension. This is the file llama.cpp will load.

ln -s /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe Llama-3.1-8B-Instruct.gguf
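
Before loading it with llama.cpp, you can sanity-check that the link resolves to the blob. A minimal Python sketch, assuming the folder and file names from above:

# check_link.py - confirm the .gguf symlink points at a real blob
from pathlib import Path

gguf = Path.home() / "my-llm-models" / "Llama-3.1-8B-Instruct.gguf"
target = gguf.resolve()  # follows the symlink back to the ollama blob

print(f"{gguf} -> {target}")
print(f"size: {target.stat().st_size / 1e9:.2f} GB")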

3. Use llama.cpp in Python

3.1 Installing llama.cpp

First, install the llama-cpp-python library.

uv add llama-cpp-python

Alert: Apple Silicon

On Apple Silicon, a few things have to be added before installing so that the Metal backend is used.

# pyproject.toml
[project]
dependencies = [
    "llama-cpp-python>=0.3.16", # or whatever version you need
]

[[tool.uv.index]]
name = "llama-cpp-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"

[tool.uv.sources]
llama-cpp-python = { index = "llama-cpp-metal" }

And then run:

CMAKE_ARGS="-DLLAMA_METAL=on" uv sync  
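
Once the sync finishes, a quick check confirms the package imports correctly. This is only a sanity check; whether llama_supports_gpu_offload is exported can depend on your llama-cpp-python version, so it is guarded below:

# check_install.py - verify llama-cpp-python is installed
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

# Reports whether GPU offload (Metal on Apple Silicon) was compiled in,
# if this low-level helper is exposed by your version.
if hasattr(llama_cpp, "llama_supports_gpu_offload"):
    print("GPU offload compiled in:", llama_cpp.llama_supports_gpu_offload())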

3.2 Running llama.cpp from Python

Finally, you can use it like this:

# index.py
from llama_cpp import Llama


def init_model(model_path: str, n_gpu_layers: int = -1, n_ctx: int = 8192) -> Llama:
    """Initializes a model from a GGUF file path.
    This function simply loads the model and returns the Llama object.
    """
    print(f"Loading model from: {model_path}")

    # Directly initialize and return the Llama object.
    # On Apple Silicon, n_gpu_layers=-1 offloads all possible layers to the Metal GPU.
    return Llama(
        model_path=model_path,
        n_gpu_layers=n_gpu_layers,
        n_ctx=n_ctx,
        verbose=False,  # Set to True for more detailed output
    )


# --- Main script ---
# Path to the symbolic link you created
my_model_path = "/Users/bsampera/my-llm-models/Llama-3.1-8B-Instruct.gguf"

# 1. Initialize the model using our simple function
model = init_model(model_path=my_model_path)

# 2. Prepare your chat messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a symbolic link in one short sentence?"},
]

# 3. Call the standard create_chat_completion method
#    Pass generation parameters like `temperature` here.
response = model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,  # It's good practice to set a max_tokens limit
)

# 4. Print the result
print(response["choices"][0]["message"]["content"])
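
If you would rather see tokens as they are generated instead of waiting for the full answer, create_chat_completion also accepts stream=True and then yields OpenAI-style chunks. A short sketch that reuses the model and messages from the script above:

# Streaming variant: iterate over chunks as they are produced
stream = model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    # The first chunk carries the role; later chunks carry pieces of the content
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()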

Let's connect!

Get in touch if you want updates, examples, and insights on how AI agents, Langchain and more are evolving and where they’re going next.