
Local AI Engine (llama.cpp)

What is llama.cpp?

llama.cpp is the engine that runs AI models locally on your computer. Think of it as the software that takes an AI model file and makes it actually work on your hardware - whether that's your CPU, graphics card, or Apple's M-series chips.

Originally created by Georgi Gerganov, llama.cpp is designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections.

Why This Matters

  • Privacy: Your conversations never leave your computer
  • Cost: No monthly subscription fees or API costs
  • Speed: No internet required once models are downloaded
  • Control: Choose exactly which models to run and how they behave

Accessing Engine Settings

Find the llama.cpp settings at Settings > Local Engine > llama.cpp.

These are advanced settings. You typically only need to adjust them if models aren't working properly or you want to optimize performance for your specific hardware.

Engine Management

| Feature | What It Does | When You Need It |
|---|---|---|
| Engine Version | Shows which version of llama.cpp you're running | Check compatibility with newer models |
| Check Updates | Downloads newer engine versions | When new models require an updated engine |
| Backend Selection | Choose the version optimized for your hardware | After installing new graphics cards or when performance is poor |
| Auto Update Engine | Automatically updates llama.cpp to the latest version | Enable for automatic compatibility with new models |
| Auto-Unload Old Models | Unloads unused models to free memory | Enable when running multiple models or low on memory |

Hardware Backends

Jan offers different backend versions optimized for your specific hardware. Think of these as different "drivers" - each one is tuned for particular processors or graphics cards.

⚠️ Using the wrong backend can make models run slowly or fail to load. Pick the one that matches your hardware.

NVIDIA Graphics Cards (Recommended for Speed)

Choose based on your CUDA version (check the NVIDIA Control Panel, or run nvidia-smi in a terminal):

For CUDA 12.0:

  • llama.cpp-avx2-cuda-12-0 (most common)
  • llama.cpp-avx512-cuda-12-0 (newer Intel/AMD CPUs)

For CUDA 11.7:

  • llama.cpp-avx2-cuda-11-7 (most common)
  • llama.cpp-avx512-cuda-11-7 (newer Intel/AMD CPUs)

CPU Only (No Graphics Card Acceleration)

  • llama.cpp-avx2 (most modern CPUs)
  • llama.cpp-avx512 (newer Intel/AMD CPUs)
  • llama.cpp-avx (older CPUs)
  • llama.cpp-noavx (very old CPUs)

Other Graphics Cards

  • llama.cpp-vulkan (AMD, Intel Arc, some others)

Quick Test: Start with avx2-cuda-12-0 if you have an NVIDIA card, or avx2 for CPU-only. If it doesn't work, try the avx variant.
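
If you want to confirm what your machine supports before picking a backend, the sketch below is a small helper written for this guide (not something Jan ships): it is Python, Linux-only, reads the CPU feature flags from /proc/cpuinfo, and asks nvidia-smi for the CUDA version so you know which backend name to look for in the dropdown.

```python
# Hypothetical helper for this guide (not part of Jan): report the SIMD level
# and CUDA version on this machine, to match against the backend names above.
# Linux-only, since it reads /proc/cpuinfo.
import shutil
import subprocess

def cpu_simd_level() -> str:
    """Return the best SIMD level found in the CPU flags: avx512, avx2, avx, or noavx."""
    try:
        with open("/proc/cpuinfo") as f:
            tokens = set(f.read().split())
    except OSError:
        return "unknown"
    if "avx512f" in tokens:
        return "avx512"
    if "avx2" in tokens:
        return "avx2"
    if "avx" in tokens:
        return "avx"
    return "noavx"

def cuda_version():
    """Return the CUDA version reported by nvidia-smi, or None if no NVIDIA GPU is found."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    marker = "CUDA Version:"
    return out.split(marker, 1)[1].split()[0] if marker in out else None

if __name__ == "__main__":
    print("CPU SIMD level:", cpu_simd_level())
    print("CUDA version:  ", cuda_version() or "none detected")
    # e.g. avx2 + CUDA 12.x -> llama.cpp-avx2-cuda-12-0
    #      avx2 + no GPU    -> llama.cpp-avx2
```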

Performance Settings

These control how efficiently models run:

| Setting | What It Does | Recommended Value | Impact |
|---|---|---|---|
| Continuous Batching | Process multiple requests at once | Enabled | Faster when using multiple tools or having multiple conversations |
| Threads | Number of threads for generation | -1 (auto) | -1 uses all logical cores, adjust for specific needs |
| Threads (Batch) | Threads for batch and prompt processing | -1 (auto) | Usually same as Threads setting |
| Batch Size | Logical maximum batch size | 2048 | Higher allows more parallel processing |
| uBatch Size | Physical maximum batch size | 512 | Controls memory usage during batching |
| GPU Split Mode | How to distribute model across GPUs | Layer | Layer mode is most common for multi-GPU setups |
| Main GPU Index | Primary GPU for processing | 0 | Change if you want to use a different GPU |
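
As a quick illustration of what the -1 (auto) thread setting resolves to, the snippet below prints the logical core count. Using the physical core count instead is a common manual tweak to leave headroom for other applications, not an official Jan recommendation; psutil is a third-party package used only for that second check.

```python
import os
print("logical cores (what -1/auto uses):", os.cpu_count())

# Optional: physical core count, a common manual choice if you want to leave
# headroom for other applications (requires the third-party psutil package).
import psutil
print("physical cores:", psutil.cpu_count(logical=False))
```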

Memory Settings

These control how models use your computer's memory:

| Setting | What It Does | Recommended Value | When to Change |
|---|---|---|---|
| Flash Attention | More efficient memory usage | Enabled | Leave enabled unless you have problems |
| Disable mmap | Don't memory-map model files | Disabled | Enable if experiencing crashes or pageouts |
| MLock | Keep model in RAM, prevent swapping | Disabled | Enable if you have enough RAM and want consistent performance |
| Context Shift | Handle very long conversations | Disabled | Enable for very long chats or multiple tool calls |
| Disable KV Offload | Keep KV cache on CPU | Disabled | Enable if GPU memory is limited |
| KV Cache K Type | Memory precision for keys | f16 | Change to q8_0 or q4_0 if running out of memory |
| KV Cache V Type | Memory precision for values | f16 | Change to q8_0 or q4_0 if running out of memory |
| KV Cache Defragmentation | Threshold for cache cleanup | 0.1 | Lower values defragment more often |

KV Cache Types Explained

  • f16: Full 16-bit precision, uses more memory but highest quality
  • q8_0: 8-bit quantized, balanced memory usage and quality
  • q4_0: 4-bit quantized, uses least memory, slight quality loss
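
To see why the cache type matters, here is a rough back-of-the-envelope estimate of KV cache size. The per-element byte counts are approximations of the f16/q8_0/q4_0 layouts, and the model dimensions in the example are purely illustrative (roughly an 8B-class model), not tied to any particular download.

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each holding
# context_length * n_kv_heads * head_dim elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}  # approximate

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_length, cache_type="f16"):
    elements = 2 * n_layers * context_length * n_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024**3

# Illustrative dimensions: 32 layers, 8 KV heads of size 128, 8192-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    print(cache_type, round(kv_cache_gib(32, 8, 128, 8192, cache_type), 2), "GiB")
```

Halving the context size, or dropping from f16 to q8_0, each roughly halves this number, which is why those are the first two knobs to try when you run out of memory.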

Advanced Settings

These settings are for fine-tuning model behavior and advanced use cases:

Text Generation Control

| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Max Tokens to Predict | Maximum tokens to generate | -1 (infinite) | Set a limit to prevent runaway generation |
| Custom Jinja Chat Template | Override model's chat format | Empty | Only if model needs special formatting |
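
If you do need to override the chat format, the template is written in Jinja. The toy example below uses made-up <|role|> tags and renders them with the third-party jinja2 package just to preview the result; a real override must use the exact special tokens documented for your specific model, so treat this purely as an illustration of the syntax.

```python
import jinja2  # third-party, used here only to preview the rendered prompt

# Toy template with made-up tags -- real models need their own special tokens.
TEMPLATE = (
    "{% for m in messages %}"
    "<|{{ m['role'] }}|>\n{{ m['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(jinja2.Template(TEMPLATE).render(messages=messages))
```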

RoPE (Rotary Position Embedding) Settings

| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| RoPE Scaling Method | Context extension method | None | For models that support extended context |
| RoPE Scale Factor | Context scaling multiplier | 1 | Increase for longer contexts |
| RoPE Frequency Base | Base frequency for RoPE | 0 (auto) | Usually loaded from model |
| RoPE Frequency Scale Factor | Frequency scaling factor | 1 | Advanced tuning only |

Mirostat Sampling

| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Mirostat Mode | Alternative sampling method | Disabled | Try V1 or V2 for more consistent output |
| Mirostat Learning Rate | How fast it adapts | 0.1 | Lower for more stable output |
| Mirostat Target Entropy | Target perplexity | 5 | Higher for more variety |
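
To make the two knobs above concrete, here is a conceptual sketch of the Mirostat v2 feedback loop, written for this guide rather than taken from llama.cpp's code: tokens that are more "surprising" than a moving threshold are dropped, and after each token the threshold is nudged toward the target entropy at a speed set by the learning rate.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau, eta):
    """One token of the Mirostat v2 feedback loop (conceptual sketch).

    probs: token probabilities from the model
    mu:    current surprise threshold, carried across steps (often started at 2 * tau)
    tau:   target surprise ("Mirostat Target Entropy")
    eta:   learning rate ("Mirostat Learning Rate")
    """
    # 1. Drop tokens whose surprise (-log2 p) exceeds the current threshold.
    allowed = [p for p in probs if -math.log2(p) <= mu] or [max(probs)]
    # 2. Sample from what is left (weights proportional to the surviving probabilities).
    p = random.choices(allowed, weights=allowed)[0]
    # 3. Nudge the threshold toward the target: overly surprising picks lower it,
    #    overly predictable picks raise it.
    mu -= eta * (-math.log2(p) - tau)
    return p, mu
```

A lower learning rate moves the threshold more slowly (more stable output), and a higher target entropy lets more surprising tokens through (more variety), matching the table above.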

Output Constraints

| Setting | What It Does | Default Value | When to Change |
|---|---|---|---|
| Grammar File | Constrain output format | Empty | For structured output (JSON, code, etc.) |
| JSON Schema File | Enforce JSON structure | Empty | When you need specific JSON formats |
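
Both settings point at files on disk. The sketch below writes a minimal example of each so you can see the formats: a GBNF grammar (llama.cpp's grammar syntax) that only allows "yes" or "no", and a small JSON Schema. The file names and the schema contents are arbitrary examples for this guide, not anything Jan expects.

```python
import json
from pathlib import Path

# GBNF grammar (llama.cpp's grammar format): restrict output to "yes" or "no".
Path("yes_no.gbnf").write_text('root ::= "yes" | "no"\n')

# JSON Schema: force an object like {"answer": "...", "confidence": 0.0-1.0}.
schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer", "confidence"],
}
Path("answer_schema.json").write_text(json.dumps(schema, indent=2))
```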

Troubleshooting Common Issues

Models won't load:

  • Try a different backend (switch from CUDA to CPU or vice versa)
  • Check if you have enough RAM/VRAM
  • Update to latest engine version

Very slow performance:

  • Make sure you're using GPU acceleration (CUDA/Metal/Vulkan backend)
  • Increase GPU Layers in model settings
  • Close other memory-intensive programs

Out of memory errors:

  • Reduce Context Size in model settings
  • Switch KV Cache Type to q8_0 or q4_0
  • Try a smaller model variant

Random crashes:

  • Switch to a more stable backend (try avx instead of avx2)
  • Disable overclocking if you have it enabled
  • Update graphics drivers

Quick Setup Guide

For most users:

  1. Use the default backend that Jan installs
  2. Enable Auto Update Engine for automatic compatibility
  3. Leave all performance settings at defaults
  4. Only adjust if you experience problems

If you have an NVIDIA graphics card:

  1. Select the appropriate CUDA backend from the dropdown (e.g., avx2-cuda-12-0)
  2. Make sure GPU Layers is set high in model settings
  3. Keep Flash Attention enabled
  4. Set Main GPU Index if you have multiple GPUs

If models are too slow:

  1. Check you're using GPU acceleration (CUDA/Metal/Vulkan backend)
  2. Enable Continuous Batching
  3. Increase Batch Size and uBatch Size
  4. Close other applications using memory

If running out of memory:

  1. Enable Auto-Unload Old Models
  2. Change KV Cache K/V Type to q8_0 or q4_0
  3. Reduce Context Size in model settings
  4. Enable MLock if you have sufficient RAM
  5. Try a smaller model

For advanced users:

  1. Experiment with Mirostat sampling for more consistent outputs
  2. Use Grammar/JSON Schema files for structured generation
  3. Adjust RoPE settings for models with extended context support
  4. Fine-tune thread counts based on your CPU

Most users can run Jan successfully without changing any of these settings. The defaults are chosen to work well on typical hardware.