Local AI Engine (llama.cpp)
What is llama.cpp?
llama.cpp is the engine that runs AI models locally on your computer. Think of it as the software that takes an AI model file and makes it actually work on your hardware - whether that's your CPU, graphics card, or Apple's M-series chips.
Originally created by Georgi Gerganov, llama.cpp is designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections.
Why This Matters
- Privacy: Your conversations never leave your computer
- Cost: No monthly subscription fees or API costs
- Speed: No internet required once models are downloaded
- Control: Choose exactly which models to run and how they behave
Accessing Engine Settings
Find the llama.cpp settings at Settings > Llama.cpp under Model Providers:

Model Management
The Models section at the top of the Llama.cpp settings page lets you manage all your local GGUF models.
Download from Hub
Browse and download models directly from the Hub tab in the left sidebar. Downloaded models will appear here automatically.
Import Local Files
Click Import to link a GGUF model file already on your computer. This is useful for models downloaded via your browser from Hugging Face, or models shared with other apps — Jan links to the file in place without copying it.

Delete a Model
Click the trash icon next to any model to remove it. This deletes the model file from Jan's data folder (linked files leave the original intact).
These are advanced settings. You typically only need to adjust them if models aren't working properly or you want to optimize performance for your specific hardware.
Engine Management
| Feature | What It Does | When You Need It |
|---|---|---|
| Engine Version | Shows which version of llama.cpp you're running | Check compatibility with newer models |
| Check Updates | Downloads newer engine versions | When new models require updated engine |
| Backend Selection | Choose the version optimized for your hardware | After installing new graphics cards or when performance is poor |
| Auto Update Engine | Automatically updates llama.cpp to latest version | Enable for automatic compatibility with new models |
| Auto-Unload Old Models | Unloads unused models to free memory | Enable when running multiple models or low on memory |
Hardware Backends
Jan offers different backend versions optimized for your specific hardware. Think of these as different "drivers" - each one is tuned for particular processors or graphics cards.
Using the wrong backend can make models run slowly or fail to load. Pick the one that matches your hardware.
NVIDIA Graphics Cards (CUDA)
Choose based on your CUDA version (check NVIDIA Control Panel):
CUDA 12.0 (recommended):
- win-avx2-cuda-cu12.0-x64 (most common)
- win-avx512-cuda-cu12.0-x64 (newer Intel/AMD CPUs)
- win-avx-cuda-cu12.0-x64 (older CPUs)
- win-cuda-12-common_cpus-x64 (broadest CPU compatibility)
CUDA 11.7:
- win-avx2-cuda-cu11.7-x64 (most common)
- win-avx512-cuda-cu11.7-x64 (newer Intel/AMD CPUs)
- win-avx-cuda-cu11.7-x64 (older CPUs)
- win-cuda-11-common_cpus-x64 (broadest CPU compatibility)
CUDA 13:
- win-cuda-13-common_cpus-x64
CPU Only
- win-avx2-x64 (most modern CPUs, recommended)
- win-avx512-x64 (newer Intel/AMD CPUs with AVX-512)
- win-avx-x64 (older CPUs)
- win-noavx-x64 (very old CPUs)
- win-common_cpus-x64 (broadest compatibility)
Other Graphics Cards (Vulkan — AMD, Intel Arc)
- win-vulkan-x64 (recommended for GPU acceleration)
- win-vulkan-common_cpus-x64 (Vulkan with broader CPU support)
Quick Test: Start with win-avx2-cuda-cu12.0-x64 for NVIDIA, win-vulkan-x64 for AMD/Intel Arc, or win-avx2-x64 for CPU-only.
Performance Settings
These control how efficiently models run:
| Setting | What It Does | Recommended Value | Impact |
|---|---|---|---|
| Continuous Batching | Process multiple requests at once | Enabled | Faster when using multiple tools or having multiple conversations |
| Threads | Number of threads for generation | -1 (auto) | -1 uses all logical cores; adjust only for specific needs |
| Threads (Batch) | Threads for batch and prompt processing | -1 (auto) | Usually same as Threads setting |
| Batch Size | Logical maximum batch size | 2048 | Higher allows more parallel processing |
| uBatch Size | Physical maximum batch size | 512 | Controls memory usage during batching |
| GPU Split Mode | How to distribute model across GPUs | Layer | Layer mode is most common for multi-GPU setups |
| Main GPU Index | Primary GPU for processing | 0 | Change if you want to use a different GPU |
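To make the -1 (auto) thread default concrete, here is a minimal sketch of how such a setting resolves to an actual thread count. The helper name is ours, not Jan's; the point is that -1 maps to the machine's logical core count:

```python
import os

def resolve_threads(setting: int) -> int:
    """Map a Threads-style setting to an actual thread count.

    -1 (auto) resolves to all logical cores, mirroring the behavior
    described for the Threads setting; any positive value is used
    as-is. (Illustrative helper, not Jan's actual code.)
    """
    if setting == -1:
        return os.cpu_count() or 1  # all logical cores
    if setting < 1:
        raise ValueError("Threads must be -1 (auto) or a positive integer")
    return setting

print(resolve_threads(-1))  # e.g. 16 on an 8-core/16-thread CPU
print(resolve_threads(4))   # explicit override
```

Lowering the count below the core total can help when you want to keep CPU headroom for other applications while a model is generating.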
Memory Settings
These control how models use your computer's memory:
| Setting | What It Does | Recommended Value | When to Change |
|---|---|---|---|
| Flash Attention | More efficient memory usage | Enabled | Leave enabled unless you have problems |
| Disable mmap | Don't memory-map model files | Disabled | Enable if experiencing crashes or pageouts |
| MLock | Keep model in RAM, prevent swapping | Disabled | Enable if you have enough RAM and want consistent performance |
| Context Shift | Handle very long conversations | Disabled | Enable for very long chats or multiple tool calls |
| Disable KV Offload | Keep KV cache on CPU | Disabled | Enable if GPU memory is limited |
| KV Cache K Type | Memory precision for keys | f16 | Change to q8_0 or q4_0 if running out of memory |
| KV Cache V Type | Memory precision for values | f16 | Change to q8_0 or q4_0 if running out of memory |
| KV Cache Defragmentation | Threshold for cache cleanup | 0.1 | Lower values defragment more often |
KV Cache Types Explained
- f16: Full 16-bit precision, uses more memory but highest quality
- q8_0: 8-bit quantized, balanced memory usage and quality
- q4_0: 4-bit quantized, uses least memory, slight quality loss
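To see why the cache type matters, you can estimate the KV cache footprint as 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. The sketch below uses illustrative dimensions for an 8B-class model (32 layers, 8 KV heads, head dim 128); the per-element sizes approximate llama.cpp's block formats, including the quantization scales:

```python
# Approximate bytes per cached element, including per-block scales:
# f16 = 2 bytes; q8_0 = 34 bytes per 32 elements; q4_0 = 18 bytes per 32.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, cache_type: str) -> float:
    """Rough KV cache size in GiB (illustrative, not exact)."""
    elements = 2 * layers * kv_heads * head_dim * context  # keys + values
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024**3

# Example: 32K context on an assumed 8B-class model layout.
for t in ("f16", "q8_0", "q4_0"):
    print(t, round(kv_cache_gib(32, 8, 128, 32768, t), 2), "GiB")
```

At these assumed dimensions, f16 works out to about 4 GiB at 32K context, so dropping to q8_0 or q4_0 roughly halves or quarters the cache memory.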
Advanced Settings
These settings are for fine-tuning model behavior and advanced use cases:
Text Generation Control
| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Max Tokens to Predict | Maximum tokens to generate | -1 (infinite) | Set a limit to prevent runaway generation |
| Custom Jinja Chat Template | Override model's chat format | Empty | Only if model needs special formatting |
RoPE (Rotary Position Embedding) Settings
| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| RoPE Scaling Method | Context extension method | None | For models that support extended context |
| RoPE Scale Factor | Context scaling multiplier | 1 | Increase for longer contexts |
| RoPE Frequency Base | Base frequency for RoPE | 0 (auto) | Usually loaded from model |
| RoPE Frequency Scale Factor | Frequency scaling factor | 1 | Advanced tuning only |
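The scale factor is easiest to understand as a context multiplier: with linear RoPE scaling, position indices are divided by the factor, so a model trained on N tokens can attend over roughly N × factor positions. A small arithmetic sketch (output quality at extended lengths still depends on the model):

```python
def effective_context(trained_ctx: int, rope_scale: float) -> int:
    # Linear RoPE scaling divides position indices by the scale
    # factor, so the usable context grows by roughly that factor.
    # (Illustrative arithmetic only.)
    return int(trained_ctx * rope_scale)

print(effective_context(4096, 1.0))  # 4096 (no scaling)
print(effective_context(4096, 4.0))  # 16384
```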
Mirostat Sampling
| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Mirostat Mode | Alternative sampling method | Disabled | Try V1 or V2 for more consistent output |
| Mirostat Learning Rate | How fast it adapts | 0.1 | Lower for more stable output |
| Mirostat Target Entropy | Target perplexity | 5 | Higher for more variety |
Output Constraints
| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Grammar File | Constrain output format | Empty | For structured output (JSON, code, etc.) |
| JSON Schema File | Enforce JSON structure | Empty | When you need specific JSON formats |
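As an example of when a JSON schema is useful, here is a sketch of a chat request that constrains the model's output to a schema. The field names follow the OpenAI-style response_format convention used by OpenAI-compatible local servers; the model id is a placeholder, and whether this exact shape applies depends on your server version:

```python
import json

# Hypothetical schema: force replies into {"city": ..., "temperature_c": ...}.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

# Request body in the OpenAI-style structured-output convention.
request_body = {
    "model": "local-model",  # placeholder model id
    "messages": [{"role": "user", "content": "Weather in Oslo as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "weather", "schema": schema},
    },
}

print(json.dumps(request_body, indent=2))
```

With a constraint like this, the engine rejects tokens that would break the schema during sampling, so the reply parses as valid JSON every time.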
Troubleshooting Common Issues
Models won't load:
- Try a different backend (switch from CUDA to CPU or vice versa)
- Check if you have enough RAM/VRAM
- Update to latest engine version
Very slow performance:
- Make sure you're using GPU acceleration (CUDA/Metal/Vulkan backend)
- Increase GPU Layers in model settings
- Close other memory-intensive programs
Out of memory errors:
- Reduce Context Size in model settings
- Switch KV Cache Type to q8_0 or q4_0
- Try a smaller model variant
Random crashes:
- Switch to a more stable backend (try avx instead of avx2)
- Disable overclocking if you have it enabled
- Update graphics drivers
Quick Setup Guide
For most users:
- Use the default backend that Jan installs
- Enable Auto Update Engine for automatic compatibility
- Leave all performance settings at defaults
- Only adjust if you experience problems
If you have an NVIDIA graphics card:
- Select the appropriate CUDA backend from the dropdown (e.g., win-avx2-cuda-cu12.0-x64)
- Make sure GPU Layers is set high in model settings
- Keep Flash Attention enabled
- Set Main GPU Index if you have multiple GPUs
If models are too slow:
- Check you're using GPU acceleration (CUDA/Metal/Vulkan backend)
- Enable Continuous Batching
- Increase Batch Size and uBatch Size
- Close other applications using memory
If running out of memory:
- Enable Auto-Unload Old Models
- Change KV Cache K/V Type to q8_0 or q4_0
- Reduce Context Size in model settings
- Enable MLock if you have sufficient RAM
- Try a smaller model
For advanced users:
- Experiment with Mirostat sampling for more consistent outputs
- Use Grammar/JSON Schema files for structured generation
- Adjust RoPE settings for models with extended context support
- Fine-tune thread counts based on your CPU
Most users can run Jan successfully without changing any of these settings. The defaults are chosen to work well on typical hardware.