Docs

Quantizations

In HammerAI Desktop, you can browse a list of compatible models on the "Models" page. Because HammerAI uses Ollama to run LLMs locally, each of these models is an Ollama-compatible LLM hosted on a remote server, which you can choose to download to your local machine.

[Image: the Models page in HammerAI Desktop]

Quantization

Quantization is the process of compressing an LLM to reduce its size compared to the original model. Quantizing a model makes it suitable for running on everyday consumer hardware, like your personal computer. The main drawback is that heavier quantization (more compression) reduces output quality.

For most users, the “q4_K_M” or “q5_K_M” quantizations are recommended, as they provide a good balance of speed and quality. Here, “q4” refers to the quantization level (roughly 4 bits per weight), “K” indicates it’s a k-quant format, and “M” signifies that it’s a medium-sized k-quant.
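If you want a rough sense of how the quantization level translates to file size, here is a minimal Python sketch. The bits-per-weight values are approximations chosen for illustration, not exact llama.cpp figures:

```python
# Back-of-envelope file-size estimate from parameter count and an
# approximate bits-per-weight (bpw) for each quantization.
APPROX_BITS_PER_WEIGHT = {
    "q4_K_M": 4.8,  # rough assumption, not an exact llama.cpp figure
    "q5_K_M": 5.7,
    "q8_0": 8.5,
}

def estimated_file_size_gb(params_billions: float, quant: str) -> float:
    total_bits = params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at q4_K_M lands around 4 GB on disk:
print(f"{estimated_file_size_gb(7, 'q4_K_M'):.1f} GB")
```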

Can I run a given quantization?

You can read about memory and disk requirements at https://github.com/ggerganov/llama.cpp.

You can read about supported GPUs at https://github.com/ollama/ollama/blob/main/docs/gpu.md.

All models need to be fully loaded into RAM and/or VRAM, which requires both adequate disk space to store them and sufficient memory to load them. What matters is free RAM, not total installed RAM: you need enough memory left over to run a model without hang-ups.

As a general guideline, if a model’s file size is 6GB, you should ensure at least 6GB of available system RAM and/or VRAM, plus an additional 10–50% memory buffer to accommodate token generation and processing overhead (a quick way to check this is sketched after the list below). Several key factors influence the memory and performance demands of a model:

  • GPU Acceleration: A supported GPU is highly recommended for improved performance, reducing reliance on system RAM and CPU computation.
  • Memory Management: The amount of RAM and VRAM required depends on the model size and quantization level. Lower-bit quantizations (e.g., Q4) require less memory, while higher-bit quantizations (e.g., Q8) require more.
  • Additional Performance Factors: Settings such as batch size, context length, and token generation can further increase memory requirements, impacting model speed and responsiveness.
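As a quick programmatic check of the guideline above, here is a short Python sketch using the third-party psutil library (the 25% buffer is an assumed midpoint of the 10–50% range):

```python
import psutil  # third-party: pip install psutil

def fits_in_free_ram(model_file_gb: float, buffer: float = 0.25) -> bool:
    """Rule of thumb from above: free RAM should cover the model's file
    size plus a 10-50% overhead buffer (25% assumed here)."""
    free_gb = psutil.virtual_memory().available / 1e9
    return free_gb >= model_file_gb * (1 + buffer)

print(fits_in_free_ram(6.0))  # e.g., the 6GB model file from the example above
```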

Example Quantizations

Very Small Models (3B)

Q4 w/ ~2–6GB file size:

  • VRAM Needed: ~6–10GB
  • Ideal GPUs: RTX 2060 SUPER (8GB VRAM), RTX 3050 (8GB VRAM)

Q8 w/ ~6–10GB file size:

  • VRAM Needed: ~8–14GB
  • Ideal GPUs: RTX 3060 (12GB), RTX 4070 (12GB)

Small Models (7B)

Q4 w/ ~4–8GB file size:

  • VRAM Needed: ~6–10GB
  • Ideal GPUs: RTX 3060 (12GB), RTX 4070 (12GB)

Q8 w/ ~8–16GB file size:

  • VRAM Needed: ~10–18GB
  • Ideal GPUs: RTX 3090 (24GB), RTX 4080 (16GB)

Mid-Sized Models (13B)

Q4 w/ 10–12GB file size:

  • VRAM Needed: ~12–14GB
  • Ideal GPUs: RTX 4070 Ti (12GB+), RTX 4090 (24GB)

Q8 w/ 18–22GB file size:

  • VRAM Needed: ~20–24GB
  • Ideal GPUs: RTX 3090 (24GB), RTX A6000 (48GB), multi-GPU setups

Large Models (30B)

Q4 w/ 24–30GB file size:

  • VRAM Needed: ~26–36GB
  • Ideal GPUs: RTX A5000 (24GB+), multi-GPU setups

Q8 w/ 40–50GB file size:

  • VRAM Needed: ~44–60GB
  • Ideal GPUs: NVIDIA A100 (40GB+), H100 (80GB), multi-GPU setups
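
For convenience, here are the ranges above condensed into a small Python lookup. The values are the rough estimates listed in this doc, not guarantees:

```python
# The VRAM ranges above as (low, high) GB estimates.
VRAM_NEEDED_GB = {
    ("3B", "Q4"): (6, 10),   ("3B", "Q8"): (8, 14),
    ("7B", "Q4"): (6, 10),   ("7B", "Q8"): (10, 18),
    ("13B", "Q4"): (12, 14), ("13B", "Q8"): (20, 24),
    ("30B", "Q4"): (26, 36), ("30B", "Q8"): (44, 60),
}

def fits_comfortably(model: str, quant: str, vram_gb: float) -> bool:
    low, high = VRAM_NEEDED_GB[(model, quant)]
    return vram_gb >= high  # meeting only `low` may still work, just tighter

print(fits_comfortably("7B", "Q4", 12))  # True: e.g., an RTX 3060 (12GB)
```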

Is there an easy way to know if I can run a model?

Yes! While it can be fun to work out whether your computer can run a model, you can also just test it out:

  • Install a 2.7B_Q2 or 7B_Q4 language model and see how quickly it outputs text when chatting.
  • If it runs well and generates text quickly, install a larger model: 13B_Q4 or 7B_Q8.
  • If it runs slowly or not at all, move down to find the ideal model for your machine.

Selecting the appropriate model for HammerAI requires consideration of system memory (RAM), GPU VRAM, and storage space to ensure efficient performance. All models must be fully loaded into either system RAM or GPU VRAM before they can function properly. HammerAI prioritizes GPU-based inference whenever possible, as GPU acceleration significantly improves processing speed and responsiveness. However, if GPU memory is insufficient, the system will offload computations to system RAM, which may lead to slower performance due to the increased computational load on the processor.
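If you have an NVIDIA GPU, you can check how much VRAM is currently free using the nvidia-smi command-line tool; the short Python sketch below simply wraps that call (AMD and Apple hardware need different tools):

```python
import subprocess

# Ask the NVIDIA driver how much VRAM is currently free on each GPU.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    text=True,
)
free_vram_mb = [int(line) for line in out.strip().splitlines()]
print(f"Free VRAM per GPU (MB): {free_vram_mb}")
```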

Downloading the Model

Once you’ve found the model that fits your needs, initiate the download by clicking the download icon next to the model’s entry. During the download process, follow these important guidelines:

  • Do not navigate away from the Models page.
  • Avoid closing the application. 
  • Be patient. 

Closing the app during this process may lead to potential issues, so keeping the application open is recommended.

Troubleshooting

I see loading dots and get back no response

If you are unable to generate a response, try these troubleshooting steps:

  • Ensure there aren't any other large applications/programs/games running, as HammerAI takes a lot of system resources
  • Close and reopen the program
  • Check for leftover processes and end them in Task Manager (see the scripted equivalent after this list):
      ◦ Close the HammerAI app
      ◦ Open Task Manager (Ctrl + Shift + Esc)
      ◦ Go to the Details tab
      ◦ Find anything named "Ollama" or "Hammer"
      ◦ Right-click the task and select End process
  • Set a lower "Max Response Tokens" value
  • Disable the "MLock" and "Context Size" options under the "Settings" menu
  • Choose a different/smaller language model to download
  • Delete and reinstall the model
  • Delete and reinstall the app
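
For users comfortable with Python, here is a scripted equivalent of the Task Manager steps above, using the third-party psutil library. The process-name substrings matched here are assumptions based on the names mentioned in those steps:

```python
import psutil  # third-party: pip install psutil

# Close the HammerAI app first, then end any leftover processes.
for proc in psutil.process_iter(["name"]):
    name = (proc.info["name"] or "").lower()
    if "ollama" in name or "hammer" in name:  # assumed process names
        try:
            print(f"Ending {proc.info['name']} (pid {proc.pid})")
            proc.kill()
        except psutil.Error:
            pass  # already exited or access denied
```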

The language model will not download

  • If your internet connection is not great, try to ensure that you are doing nothing else online while downloading. That includes closing Discord and other similar apps that use the internet.
  • Close and reopen Hammer's app
  • Try a different language model
  • Use the "Download Model" function on the Models page up at the top and download it directly Reinstall the app and repeat the steps above, if needed.

The Screen is Blurry While Running the App

  • Hammer uses GPU acceleration to increase your chat's speed, which can affect graphics performance when several apps and programs run at once. Try running HammerAI by itself, with no other apps or programs alongside it; it uses many system resources to run the large language models.

Why HammerAI Asks for Network Permissions

HammerAI asks for network permissions in order to run the language model. It spins up a local server on a random port in a separate process (e.g., a server available at http://localhost:9841/), and the app then makes network requests against this server. The benefit is that the UI stays snappy even when much of your computer's processing power and memory is in use. You only need to allow local network requests, not public ones.
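To illustrate what such a local request looks like, here is a minimal Python sketch. The port is illustrative (it is random on each launch), the /api/generate route assumes the standard Ollama API that HammerAI runs under the hood, and the model name is a placeholder:

```python
import requests  # third-party: pip install requests

# Illustrative only: random port, assumed Ollama route, placeholder model.
resp = requests.post(
    "http://localhost:9841/api/generate",
    json={"model": "some-local-model", "prompt": "Hello!", "stream": False},
    timeout=120,
)
print(resp.json()["response"])  # the request never leaves your machine
```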

Models Download Location

In the HammerAI app, click "Models" in the left-side menu, then click the magnifying glass at the top of the screen; it will take you directly to the folder where the models are downloaded. Just below that, you can also set or change where the models are downloaded.

Do Language Models Require Updates or Redownloading?

No. Once a language model is downloaded, it is unnecessary to redownload or update it. Language models do not receive automatic updates like traditional software. The only exception would be if the model's author decides to delete and reupload a revised version, which is uncommon. In such a rare case, users must manually download the new version if they wish to use it. Otherwise, once a model is installed, it remains fully functional without further updates.

Do Models Learn as You Chat with Them?

No, they do not.

Language models within HammerAI remember information from earlier messages in the conversation. However, if the chat is closed or not saved, the model resets to its original state and retains nothing from that session.