Ollama
To run local LLMs, HammerAI uses Ollama. Ollama is an open-source tool that facilitates local deployment of, and interaction with, large language models (LLMs) on personal devices.
About Ollama
At its core, Ollama integrates with llama.cpp, a project that enables LLMs to run efficiently on local hardware. This integration lets Ollama leverage llama.cpp for text generation, providing powerful language modeling capabilities directly on the user's machine.
Originally, HammerAI was built directly on llama.cpp, and significant time was spent optimizing its performance. As Ollama continued to evolve, however, it became clear that its dedicated development team could maintain and improve the underlying runtime far more efficiently. HammerAI therefore transitioned to Ollama, benefiting from continuous updates, improved model handling, and better overall performance.
Local Execution and Privacy
One of Ollama's key advantages is its ability to run LLMs locally. This means all data processing occurs on the user's device, enhancing privacy and security by eliminating the need to send data to external servers. Users can interact with LLMs without concerns about data leakage or unauthorized access.
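As an illustration of this local-only flow, the minimal sketch below sends a prompt to Ollama's HTTP API at its default local address (http://localhost:11434). The model tag is an example only; substitute any model you have already downloaded.

```python
import requests

# Minimal sketch: all inference runs against the local Ollama server
# (default address http://localhost:11434), so prompts and responses
# never leave the machine. The model tag below is an example.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",               # example model tag
        "prompt": "Say hello in one sentence.",
        "stream": False,                   # return a single JSON object
    },
    timeout=300,
)
print(response.json()["response"])
```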
Performance Optimization & GPU Utilization
Ollama automatically prioritizes GPU computation whenever possible, which significantly speeds up model inference. When a supported GPU with sufficient VRAM is available, models are loaded and processed on the GPU for optimal performance. If VRAM is insufficient, Ollama falls back to CPU processing, which is typically slower because the full computational load lands on the processor. The Ollama development team continuously maintains and optimizes GPU acceleration, memory management, and model handling, improving stability and overall performance.
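For users who want finer control over offloading, Ollama exposes a num_gpu option (inherited from llama.cpp) that sets how many model layers are placed on the GPU. The sketch below passes it on a single request; the value shown is an arbitrary example, and HammerAI does not necessarily set this option.

```python
import requests

# Sketch: request a specific number of GPU-offloaded layers for one call.
# Lowering num_gpu keeps more of the model in system memory, which can
# avoid out-of-memory fallbacks on GPUs with limited VRAM.
# The model tag and the value 20 are examples only.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Explain why GPU offloading speeds up inference.",
        "stream": False,
        "options": {"num_gpu": 20},   # example: offload 20 layers to the GPU
    },
    timeout=300,
)
print(response.json()["response"])
```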
GPU vs. CPU Utilization in Model Generation
Unlike some frameworks that run the GPU and CPU concurrently, Ollama decides where computation happens based on available system resources. If the GPU has enough VRAM, it handles the computation entirely; if not, the workload falls back to CPU processing.
However, Ollama may also split a model between the CPU and GPU, resulting in mixed utilization such as 48% CPU / 52% GPU. This happens when only part of the model fits in VRAM: the layers that fit are offloaded to the GPU, and the remainder runs from system memory on the CPU. The split is generally decided when the model is loaded and remains fixed until the model is unloaded or reloaded, rather than being adjusted dynamically during generation.
The Ollama documentation provides the following information:
- 100% GPU Usage: The model is fully loaded into the GPU.
- 100% CPU Usage: The model is running entirely in system memory.
- Mixed Allocation (48% CPU / 52% GPU): The model is split between GPU and system memory, likely due to VRAM limitations or system heuristics.
Understanding these allocation behaviors helps users optimize performance and adjust their system configuration accordingly. A model that exceeds available VRAM is partially offloaded to the CPU, which can bottleneck generation speed. Monitoring how a model is allocated, and keeping GPU usage as high as possible, is a practical way to improve responsiveness when running HammerAI with Ollama.
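One practical way to check how a loaded model has been placed is to query the local API. The sketch below assumes Ollama's default address and that the /api/ps endpoint reports each running model's total size alongside the portion resident in VRAM, as current Ollama releases do (the ollama ps CLI command shows the same information).

```python
import requests

# Sketch: inspect how Ollama has placed the currently loaded models.
# If size_vram equals size, the model is fully on the GPU; if it is 0,
# the model is running entirely from system memory on the CPU.
ps = requests.get("http://localhost:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    size = m["size"]
    vram = m.get("size_vram", 0)
    gpu_pct = 100 * vram / size if size else 0
    print(f'{m["name"]}: {gpu_pct:.0f}% GPU / {100 - gpu_pct:.0f}% CPU')
```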
Memory Allocation
Ollama manages models efficiently by retaining the loaded model in memory while clearing only the chat context (e.g., prompt history and token buffers) when switching between characters. This means that when a user switches characters within the same model, the model remains in GPU memory, leading to consistent VRAM usage. However, when switching to a different model, the previous model is unloaded before the new one is loaded, resulting in GPU memory usage fluctuations. This behavior allows faster character switching without unnecessary reloading of models, improving responsiveness while conserving system resources.
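How long a model stays resident after a request is governed by Ollama's keep_alive setting. The sketch below is a minimal illustration of that setting, not HammerAI's actual configuration; the model tag and durations are examples.

```python
import requests

# Sketch: keep_alive controls how long Ollama retains a model in memory
# after a request ("30m" here; 0 unloads immediately, -1 keeps it loaded
# indefinitely). The values and model tag are examples only.
requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Warm up.",
        "stream": False,
        "keep_alive": "30m",   # retain the weights for 30 minutes
    },
    timeout=300,
)

# Sending a request with no prompt and keep_alive set to 0 asks Ollama to
# unload the model, freeing VRAM before a different model is loaded.
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "keep_alive": 0},
    timeout=60,
)
```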
Optimizing Performance for Better Speed & Stability
For users experiencing reduced performance, consider the following optimization strategies:
- Ensure the selected model size aligns with available system resources. Running models that exceed system capabilities may lead to slow performance or instability.
- Confirm that Ollama supports your GPU. Not all GPUs are compatible, so verifying support can help optimize acceleration.
- Adjust quantization levels for better speed. Lower precision models (e.g., Q4 instead of Q8) can significantly reduce computational overhead while maintaining reasonable output quality.
Users can achieve faster response times and improved efficiency by optimizing system configurations and leveraging GPU acceleration, ensuring a smoother HammerAI experience when running local language models.
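To check whether a model (or a particular quantization of it) fits your hardware, you can list what is installed locally and compare sizes. The sketch below queries Ollama's /api/tags endpoint at the default local address; the model names returned are simply whatever you have pulled.

```python
import requests

# Sketch: list locally installed models and their on-disk sizes so you can
# pick a variant (for example a Q4 quantization instead of Q8) that fits
# comfortably within your GPU's VRAM.
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
for m in sorted(tags.get("models", []), key=lambda m: m["size"]):
    print(f'{m["name"]}: {m["size"] / 1e9:.1f} GB')
```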
HDD vs. SSD – Does Storage Type Matter?
Yes. The type of storage drive you use can impact HammerAI's performance, particularly if your system relies on the page file to extend memory. When RAM is full, the operating system moves less frequently used data to the page file so that active processes can keep running. If both RAM and the page file reach capacity, however, performance degrades significantly, leading to lag, slow response times, and potential errors, because even a fast storage drive is far slower than RAM.
Using an SSD or M.2 drive instead of an HDD significantly reduces these performance issues because SSDs have faster read and write speeds than traditional spinning-platter hard drives. While an SSD will not replace the need for sufficient RAM, it will mitigate lag and improve responsiveness when system resources are stretched. For optimal performance, ensure your model’s size does not exceed your available RAM and storage capacity, regardless of the drive type.
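As a rough sanity check along these lines, the sketch below compares each installed model's on-disk size against currently available RAM using the third-party psutil package. The 80% headroom threshold is an arbitrary illustration, not a HammerAI or Ollama recommendation.

```python
import psutil    # third-party: pip install psutil
import requests

# Sketch: flag models whose size approaches available physical RAM, since
# exceeding it pushes the OS onto the page file and performance drops
# sharply, especially on an HDD. The 0.8 factor is an example threshold.
available = psutil.virtual_memory().available
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
for m in tags.get("models", []):
    if m["size"] > 0.8 * available:
        print(f'{m["name"]} ({m["size"] / 1e9:.1f} GB) may not fit in free RAM '
              f'({available / 1e9:.1f} GB) and could spill to the page file.')
```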