Migrating your local setup from Ollama to llama.cpp
Ollama is a great tool for running local LLMs, but it's not the fastest and it occasionally has bugs. For a smoother setup you can use llama.cpp directly. This guide shows how to reuse your existing Ollama models with llama.cpp.
1. How to match models to blobs in Ollama
Ollama stores all your models as blobs in this folder:
ls ~/.ollama/models/blobs
sha256-0ba8f0e314b4264dfd19df045cde9d4c394a52474bfasdad6a3de22a4ca31a177
sha256-11ce4ee3e170f6adebac9a991c22e22ab3f8530e154ee6242454c4bc73061c258
sha256-1a4c3c319823fdabddb22479d0b10820a7a39fe49e45c40242228fbe83926dc14
....
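The blob names alone don't tell you which model is which. For a quick, human-readable overview of what's installed (names and sizes), you can start with:
ollama list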
To see which blob corresponds to a given model, run:
ollama show --modelfile llama3.1:latest
The output includes a FROM line like this one, which points to the blob:
FROM /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
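If you only want the blob path itself, you can filter that output. This is just a convenience one-liner (not an Ollama feature), and it assumes the modelfile contains a single FROM line pointing into the blobs folder:
# assumes one FROM line that points into ~/.ollama/models/blobs
ollama show --modelfile llama3.1:latest | awk '/^FROM .*blobs/ {print $2}'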
2. Create the links for llama.cpp
Create and go to a folder where you'll store the models you want to use with llama.cpp:
mkdir -p ~/my-llm-models
cd ~/my-llm-models
Create a symbolic link to the Ollama blob found in the previous step, giving it a .gguf extension; this is the file llama.cpp will load:
ln -s /Users/your_user/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe Llama-3.1-8B-Instruct.gguf
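To double-check that the link resolves to a real model file, you can inspect it. GGUF files start with the magic bytes GGUF, so a quick sanity check (using the file name chosen above) looks like this:
ls -lh Llama-3.1-8B-Instruct.gguf    # should show the arrow pointing to the blob
head -c 4 Llama-3.1-8B-Instruct.gguf # should print "GGUF"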
3. Use llama.cpp in Python
3.1 Installing llama.cpp
First, install the llama-cpp-python bindings (the Python wrapper around llama.cpp):
uv add llama-cpp-python
Alert: Apple Silicon
On Apple Silicon, a few extra settings are needed before installing so the package is built with Metal support:
# pyproject.toml
[project]
dependencies = [
    "llama-cpp-python>=0.3.16", # Or whatever version
]
[[tool.uv.index]]
name = "llama-cpp-metal"
url = "https://abetlen.github.io/llama-cpp-python/whl/metal"
[tool.uv.sources]
llama-cpp-python = { index = "llama-cpp-metal" }
Then sync the dependencies with Metal enabled:
CMAKE_ARGS="-DLLAMA_METAL=on" uv sync
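To confirm the install worked, a quick import check from the project environment is enough (just a sanity check, nothing more):
# optional sanity check: the import should succeed without errors
uv run python -c "from llama_cpp import Llama; print('llama-cpp-python imported OK')"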
3.2 Running llama.cpp in Python
Finally, you can use it like this:
# index.py
from llama_cpp import Llama


def init_model(model_path: str, n_gpu_layers: int = -1, n_ctx: int = 8192) -> Llama:
    """Initializes a model from a GGUF file path.

    This function simply loads the model and returns the Llama object.
    """
    print(f"Loading model from: {model_path}")
    # Directly initialize and return the Llama object.
    # On Apple Silicon, n_gpu_layers=-1 offloads all possible layers to the Metal GPU.
    return Llama(
        model_path=model_path,
        n_gpu_layers=n_gpu_layers,
        n_ctx=n_ctx,
        verbose=False,  # Set to True for more detailed output
    )


# --- Main script ---
# Path to the symbolic link you created
my_model_path = "/Users/bsampera/my-llm-models/Llama-3.1-8B-Instruct.gguf"

# 1. Initialize the model using our simple function
model = init_model(model_path=my_model_path)

# 2. Prepare your chat messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a symbolic link in one short sentence?"},
]

# 3. Call the standard create_chat_completion method.
#    Pass generation parameters like `temperature` here.
response = model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,  # It's good practice to set a max_tokens limit
)

# 4. Print the result
print(response["choices"][0]["message"]["content"])
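If you prefer to print tokens as they arrive instead of waiting for the full answer, create_chat_completion also accepts stream=True and yields OpenAI-style chunks. A minimal sketch, reusing the model and messages defined above:
# Streamed variant: each chunk carries a "delta" with a partial piece of the reply.
for chunk in model.create_chat_completion(
    messages=messages,
    temperature=0.1,
    max_tokens=100,
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()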