Local AI Engine (llama.cpp)
What is llama.cpp?
llama.cpp is the engine that runs AI models locally on your computer. Think of it as the software that takes an AI model file and makes it actually work on your hardware - whether that's your CPU, graphics card, or Apple's M-series chips.
Originally created by Georgi Gerganov, llama.cpp is designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections.
Why This Matters
- Privacy: Your conversations never leave your computer
- Cost: No monthly subscription fees or API costs
- Speed: No internet required once models are downloaded
- Control: Choose exactly which models to run and how they behave
Accessing Engine Settings
Find llama.cpp settings at Settings > Local Engine > llama.cpp:
These are advanced settings. You typically only need to adjust them if models aren't working properly or you want to optimize performance for your specific hardware.
Engine Management
Feature | What It Does | When You Need It |
---|---|---|
Engine Version | Shows which version of llama.cpp you're running | Check compatibility with newer models |
Check Updates | Downloads newer engine versions | When new models require updated engine |
Backend Selection | Choose the version optimized for your hardware | After installing new graphics cards or when performance is poor |
Auto Update Engine | Automatically updates llama.cpp to latest version | Enable for automatic compatibility with new models |
Auto-Unload Old Models | Unloads unused models to free memory | Enable when running multiple models or low on memory |
Hardware Backends
Jan offers different backend versions optimized for your specific hardware. Think of these as different "drivers" - each one is tuned for particular processors or graphics cards.
Using the wrong backend can make models run slowly or fail to load. Pick the one that matches your hardware.
NVIDIA Graphics Cards (Recommended for Speed)
Choose based on your CUDA version (run `nvidia-smi` or check NVIDIA Control Panel):
For CUDA 12.0:
- `llama.cpp-avx2-cuda-12-0` (most common)
- `llama.cpp-avx512-cuda-12-0` (newer Intel/AMD CPUs)
For CUDA 11.7:
- `llama.cpp-avx2-cuda-11-7` (most common)
- `llama.cpp-avx512-cuda-11-7` (newer Intel/AMD CPUs)
CPU Only (No Graphics Card Acceleration)
- `llama.cpp-avx2` (most modern CPUs)
- `llama.cpp-avx512` (newer Intel/AMD CPUs)
- `llama.cpp-avx` (older CPUs)
- `llama.cpp-noavx` (very old CPUs)
Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel Arc, some others)
Quick Test: Start with `avx2-cuda-12-0` if you have an NVIDIA card, or `avx2` for CPU-only. If it doesn't work, try the `avx` variant.
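If you're unsure which CPU variant your machine supports, you can check which instruction sets the processor reports. The sketch below is illustrative only and not part of Jan; it assumes a Linux system where `/proc/cpuinfo` lists CPU flags (on Windows, tools such as CPU-Z show the same information).

```python
# Rough sketch: read CPU flags from /proc/cpuinfo (Linux) to see which
# llama.cpp CPU backend variant should work on this machine.
from pathlib import Path

cpuinfo = Path("/proc/cpuinfo")
flags = ""
if cpuinfo.exists():
    for line in cpuinfo.read_text().splitlines():
        if line.startswith("flags"):
            flags = line.lower()
            break

if "avx512f" in flags:
    print("avx512 variants should work (e.g. llama.cpp-avx512)")
elif "avx2" in flags:
    print("avx2 variants should work (e.g. llama.cpp-avx2)")
elif "avx" in flags:
    print("stick to the avx variant (llama.cpp-avx)")
else:
    print("use llama.cpp-noavx")
```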
Performance Settings
These control how efficiently models run:
Setting | What It Does | Recommended Value | Impact |
---|---|---|---|
Continuous Batching | Process multiple requests at once | Enabled | Faster when using multiple tools or having multiple conversations |
Threads | Number of CPU threads for generation | -1 (auto) | -1 uses all logical cores; lower it to reserve CPU for other work |
Threads (Batch) | Threads for batch and prompt processing | -1 (auto) | Usually same as Threads setting |
Batch Size | Logical maximum batch size | 2048 | Higher allows more parallel processing |
uBatch Size | Physical maximum batch size | 512 | Controls memory usage during batching |
GPU Split Mode | How to distribute model across GPUs | Layer | Layer mode is most common for multi-GPU setups |
Main GPU Index | Primary GPU for processing | 0 | Change if you want to use a different GPU |
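Under the hood, these map onto standard llama.cpp context parameters. As a point of reference, here is a minimal sketch using the third-party llama-cpp-python bindings rather than Jan itself; the model path is a placeholder, and parameter names may differ slightly between binding versions.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./model.gguf",  # placeholder path to any GGUF model
    n_threads=None,             # None = pick automatically, like Jan's -1 "auto"
    n_threads_batch=None,       # threads for prompt/batch processing
    n_batch=2048,               # logical batch size ("Batch Size")
    n_ubatch=512,               # physical batch size ("uBatch Size")
    main_gpu=0,                 # primary GPU index ("Main GPU Index")
    n_gpu_layers=-1,            # offload all layers to the GPU if one is available
)
```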
Memory Settings
These control how models use your computer's memory:
Setting | What It Does | Recommended Value | When to Change |
---|---|---|---|
Flash Attention | More efficient memory usage | Enabled | Leave enabled unless you have problems |
Disable mmap | Don't memory-map model files | Disabled | Enable if experiencing crashes or pageouts |
MLock | Keep model in RAM, prevent swapping | Disabled | Enable if you have enough RAM and want consistent performance |
Context Shift | Drops the oldest tokens to make room when the context window fills | Disabled | Enable for very long chats or many tool calls |
Disable KV Offload | Keep KV cache on CPU | Disabled | Enable if GPU memory is limited |
KV Cache K Type | Memory precision for keys | f16 | Change to q8_0 or q4_0 if running out of memory |
KV Cache V Type | Memory precision for values | f16 | Change to q8_0 or q4_0 if running out of memory |
KV Cache Defragmentation | Threshold for cache cleanup | 0.1 | Lower values defragment more often |
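For reference, the same memory-related switches look roughly like this through the llama-cpp-python bindings (a sketch with a placeholder model path; Jan configures the equivalent llama.cpp options through its UI).

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder
    n_ctx=8192,          # context size: the main driver of KV cache memory
    flash_attn=True,     # "Flash Attention"
    use_mmap=True,       # memory-map the model file; False ~ "Disable mmap"
    use_mlock=False,     # "MLock": pin the model in RAM when True
    offload_kqv=True,    # keep the KV cache on the GPU; False ~ "Disable KV Offload"
)
```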
KV Cache Types Explained
- f16: Full 16-bit precision, uses more memory but highest quality
- q8_0: 8-bit quantized, balanced memory usage and quality
- q4_0: 4-bit quantized, uses least memory, slight quality loss
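To see why the cache type matters, here is a back-of-the-envelope calculation for a hypothetical 8B-class model with 32 layers, 8 KV heads, and a head dimension of 128 (approximate figures that ignore the small per-block overhead of the quantized formats):

```python
# Rough KV cache size estimate for an assumed model shape.
n_layers, n_kv_heads, head_dim = 32, 8, 128          # assumed model dimensions
bytes_per_value = {"f16": 2, "q8_0": 1, "q4_0": 0.5}  # approximate bytes per stored value

n_ctx = 8192  # context length in tokens
for cache_type, b in bytes_per_value.items():
    # Keys and values each store n_layers * n_kv_heads * head_dim numbers per token.
    total = 2 * n_layers * n_kv_heads * head_dim * b * n_ctx
    print(f"{cache_type}: ~{total / 2**30:.2f} GiB for {n_ctx} tokens")
# f16: ~1.00 GiB, q8_0: ~0.50 GiB, q4_0: ~0.25 GiB
```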
Advanced Settings
These settings are for fine-tuning model behavior and advanced use cases:
Text Generation Control
Setting | What It Does | Default Value | When to Change |
---|---|---|---|
Max Tokens to Predict | Maximum tokens to generate | -1 (infinite) | Set a limit to prevent runaway generation |
Custom Jinja Chat Template | Override model's chat format | Empty | Only if model needs special formatting |
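As an illustration of what the token cap does, here is the equivalent setting at the API level, again via the llama-cpp-python bindings with a placeholder model path (a sketch, not Jan's own interface).

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    max_tokens=128,  # hard cap on generated tokens; no explicit cap means "infinite"
)
print(out["choices"][0]["message"]["content"])
```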
RoPE (Rotary Position Embedding) Settings
Setting | What It Does | Default Value | When to Change |
---|---|---|---|
RoPE Scaling Method | Context extension method | None | For models that support extended context |
RoPE Scale Factor | Context scaling multiplier | 1 | Increase for longer contexts |
RoPE Frequency Base | Base frequency for RoPE | 0 (auto) | Usually loaded from model |
RoPE Frequency Scale Factor | Frequency scaling factor | 1 | Advanced tuning only |
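These values normally come from the model's GGUF metadata. Overriding them through, for example, the llama-cpp-python bindings would look roughly like this (a sketch with a placeholder model path; 0 means "use the model's own value"):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder
    rope_freq_base=0.0,   # 0 = read the base frequency from the model
    rope_freq_scale=0.0,  # 0 = read the frequency scale from the model
    # A RoPE scaling method (e.g. linear or YaRN) only makes sense for models
    # that support extended context; leave it unset otherwise.
)
```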
Mirostat Sampling
Setting | What It Does | Default Value | When to Change |
---|---|---|---|
Mirostat Mode | Alternative sampling method | Disabled | Try V1 or V2 for more consistent output |
Mirostat Learning Rate | How fast it adapts | 0.1 | Lower for more stable output |
Mirostat Target Entropy | Target entropy (how surprising the output can be) | 5 | Higher for more variety |
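For a sense of how these three settings fit together, here is a hedged sketch using the llama-cpp-python bindings (placeholder model path; the parameter names are the bindings' equivalents of the rows above).

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder
out = llm.create_completion(
    "Write a haiku about running models locally.",
    mirostat_mode=2,    # 0 = off, 1 = Mirostat v1, 2 = Mirostat v2
    mirostat_eta=0.1,   # learning rate: lower adapts more slowly and steadily
    mirostat_tau=5.0,   # target entropy: higher allows more variety
    max_tokens=64,
)
print(out["choices"][0]["text"])
```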
Output Constraints
Setting | What It Does | Default Value | When to Change |
---|---|---|---|
Grammar File | Constrain output format | Empty | For structured output (JSON, code, etc.) |
JSON Schema File | Enforce JSON structure | Empty | When you need specific JSON formats |
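Grammar files use llama.cpp's GBNF format. The sketch below shows a deliberately tiny grammar applied through the llama-cpp-python bindings (placeholder model path; Jan's Grammar File setting points at a file containing the same kind of GBNF text).

```python
from llama_cpp import Llama, LlamaGrammar

# A tiny GBNF grammar that only allows the words "yes" or "no".
yes_no = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(model_path="./model.gguf")  # placeholder
out = llm.create_completion(
    "Can llama.cpp run models on a CPU? Answer yes or no: ",
    grammar=yes_no,   # constrains sampling to strings the grammar accepts
    max_tokens=4,
)
print(out["choices"][0]["text"])
```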
Troubleshooting Common Issues
Models won't load:
- Try a different backend (switch from CUDA to CPU or vice versa)
- Check if you have enough RAM/VRAM
- Update to latest engine version
Very slow performance:
- Make sure you're using GPU acceleration (CUDA/Metal/Vulkan backend)
- Increase GPU Layers in model settings
- Close other memory-intensive programs
Out of memory errors:
- Reduce Context Size in model settings
- Switch KV Cache Type to q8_0 or q4_0
- Try a smaller model variant
Random crashes:
- Switch to a more stable backend (try avx instead of avx2)
- Disable overclocking if you have it enabled
- Update graphics drivers
Quick Setup Guide
For most users:
- Use the default backend that Jan installs
- Enable Auto Update Engine for automatic compatibility
- Leave all performance settings at defaults
- Only adjust if you experience problems
If you have an NVIDIA graphics card:
- Select the appropriate CUDA backend from the dropdown (e.g., `avx2-cuda-12-0`)
- Make sure GPU Layers is set high in model settings
- Keep Flash Attention enabled
- Set Main GPU Index if you have multiple GPUs
If models are too slow:
- Check you're using GPU acceleration (CUDA/Metal/Vulkan backend)
- Enable Continuous Batching
- Increase Batch Size and uBatch Size
- Close other applications using memory
If running out of memory:
- Enable Auto-Unload Old Models
- Change KV Cache K/V Type to q8_0 or q4_0
- Reduce Context Size in model settings
- Enable MLock only if the model still fits comfortably in RAM (it prevents swapping)
- Try a smaller model
For advanced users:
- Experiment with Mirostat sampling for more consistent outputs
- Use Grammar/JSON Schema files for structured generation
- Adjust RoPE settings for models with extended context support
- Fine-tune thread counts based on your CPU
Most users can run Jan successfully without changing any of these settings. The defaults are chosen to work well on typical hardware.