Releases: mudler/LocalAI
v4.4.2
What's Changed
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to ac4cddeb0dbd778f650bf568f6f08344a06abe3a by @localai-bot in #10239
- chore: ⬆️ Update CrispStrobe/CrispASR to 4b27392ffd0991a857594652cbb8b57e585bcd7b by @localai-bot in #10241
- fix(vllm): parse tool_call function arguments before applying the chat template by @pos-ei-don in #10256
- fix(cuda): install cuda-nvrtc-dev alongside the other CUDA dev packages by @pos-ei-don in #10257
Full Changelog: v4.4.1...v4.4.2
v4.4.1
What's Changed
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10245
- chore: ⬆️ Update antirez/ds4 to 8384adf0f9fa0f3bb342dd925372de778b95b263 by @localai-bot in #10242
- fix(vllm): restore compatibility with vLLM >= 0.22 (get_tokenizer moved to vllm.tokenizers) by @pos-ei-don in #10252
- feat(realtime): stream the LLM / TTS / transcription pipeline stages by @localai-bot in #10176
- docs: fix broken relref to realtime page by @localai-bot in #10255
New Contributors
- @pos-ei-don made their first contribution in #10252
Full Changelog: v4.4.0...v4.4.1
v4.4.0
🎉 LocalAI 4.4.0 Release! 🚀
LocalAI 4.4.0 is out!
This is a big, multimodal-and-distributed release. Two brand-new audio backends land - parakeet.cpp (NVIDIA NeMo Parakeet ASR) and CrispASR (a multi-architecture ASR and TTS engine) - alongside native object detection + segmentation (rfdetr-cpp), video understanding in llama-cpp, and LTX-2 video generation in stablediffusion-ggml. Distributed mode grows up: prefix-cache-aware routing is on by default, and file transfers become resumable. There's a new intelligent middleware layer for request routing, PII filtering and cloud-model proxying, a security hardening pass that closes a credential-leak class across every outbound HTTP client, an interactive local-ai chat CLI, RAG source citations for agents, and a long run of reasoning / tool-call streaming fixes.
📌 TL;DR
| Area | Summary |
|---|---|
| 🎙️ Two new ASR backends | parakeet-cpp (NeMo FastConformer TDT/CTC/RNNT, streaming, word/segment timestamps) and crispasr (many ASR architectures + TTS in one binary). |
| 🧭 Intelligent Middleware | Capability-based model routing, PII detection/redaction, cloud-model proxies + a MITM proxy for subscription-auth Claude Code / Codex. |
| 🛰️ Distributed v4 | Prefix-cache-aware routing (on by default), NATS JWT auth + TLS/mTLS, worker registration-token enforcement, resumable HTTP file transfers, boot-time model prefetch, ds4 layer-split inference. |
| 🎥 Video, both ways | Video input (understanding) in llama-cpp via mtmd, and video generation via LTX-2 in stablediffusion-ggml. |
| 👁️ Detection + Segmentation | New native rfdetr-cpp backend (RF-DETR), 32 prebuilt GGUFs, bbox + per-detection PNG masks. |
| 🔐 Outbound HTTP hardening | pkg/httpclient refuses cross-host credential-leaking redirects across every outbound client (GHSA-3mj3-57v2-4636). |
| 🗣️ TTS per-request control | instructions + a generic params map plumbed end to end (Qwen3-TTS VoiceDesign / CustomVoice, Chatterbox). |
| 💻 local-ai chat | Interactive terminal chat against a running server, with /models, /model, /clear. |
| 📚 RAG citations | Agent answers now append a clickable Sources: block from the Knowledge Base. |
| 🧠 Models | Gemma 4 QAT family + QAT-matched MTP speculative-decoding bundles, Ideogram4, LTX-2.3 22B GGUFs. |
🚀 New Features & Major Enhancements
🎙️ Audio Gets Serious: Two New ASR Backends
This release doubles down on speech-to-text with two independent, cgo-less Go backends (purego, CGO_ENABLED=0), each shipping a full CI matrix, gallery importer and docs.
parakeet-cpp - NVIDIA NeMo Parakeet (#10084). Wraps parakeet.cpp, a C++/ggml port of NeMo Parakeet (FastConformer TDT/CTC/RNNT/hybrid) that matches the upstream PyTorch models on CPU. Text transcription, OpenAI-compatible word timestamps, and cache-aware streaming (16 kHz PCM chunks, <EOU>/<EOB> utterance boundaries). GGUFs for all 10 Parakeet models × 5 quants ship in mudler/parakeet-cpp-gguf. Follow-ups in this cycle made it production-grade:
- Dynamic batching (#10112) - concurrent transcription requests are batched for throughput.
- Real, NeMo-faithful segment timestamps (#10207) - words are grouped into segments exactly like NeMo's get_segment_offsets (sentence-punctuation boundaries by default, opt-in segment_gap_threshold silence splitting in encoder frames). Streaming FinalResult segments now carry start/end when the library exposes the ABI v4 JSON entry points.
- nemotron-3.5-asr multilingual streaming (#10199) + per-request language selection.
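The segment-grouping rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the parakeet.cpp or NeMo implementation: the `Word` and `Segment` types and the `group_segments` helper are hypothetical, and the gap threshold is expressed in seconds here for readability, whereas the release notes say the real option works in encoder frames.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: these three characters count as sentence-ending punctuation.
SENTENCE_END = (".", "?", "!")


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    text: str
    start: float
    end: float


def group_segments(words: List[Word],
                   gap_threshold: Optional[float] = None) -> List[Segment]:
    """Group words into segments.

    A segment closes after a word that ends a sentence, or (only when
    gap_threshold is set) when the silence before the next word is at
    least gap_threshold seconds. The last word always closes a segment.
    """
    segments: List[Segment] = []
    current: List[Word] = []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        ends_sentence = word.text.endswith(SENTENCE_END)
        long_gap = (
            gap_threshold is not None
            and not is_last
            and words[i + 1].start - word.end >= gap_threshold
        )
        if ends_sentence or long_gap or is_last:
            text = " ".join(w.text for w in current)
            segments.append(Segment(text, current[0].start, current[-1].end))
            current = []
    return segments
```

With the default arguments only punctuation splits segments; passing `gap_threshold` adds the opt-in silence split on top.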
crispasr - many architectures + TTS in one backend (#10099). Wraps CrispASR (a whisper.cpp/ggml fork, MIT) through its session C-ABI. One backend serves ASR or TTS depending on the loaded model, with the architecture auto-detected from the GGUF (or forced via backend:). The gallery gains 36 -crispasr entries (32 ASR + 4 TTS):
- ASR (e2e-verified across Whisper / Parakeet / Moonshine): parakeet, canary, cohere, qwen3, voxtral, granite, fastconformer-ctc, wav2vec2, hubert, data2vec, glm-asr, kyutai-stt, firered-asr, moonshine, mimo-asr, and more.
- TTS (all four e2e-verified to valid 24 kHz mono WAV): vibevoice, chatterbox, qwen3-tts CustomVoice, orpheus - via backend: / codec: / speaker: / voice: model options.
🧭 Intelligent Middleware: Routing, PII Filtering & Cloud Proxies
A new middleware layer (#9802) analyzes, routes, filters and transforms chat requests before they hit a model.
- Capability-based routing. Requests are classified (e.g. via an ArchRouter-style model) and scored across the capabilities they may require, then routed to the smallest model that satisfies them - easy requests go to small specialized models, hard or uncertain ones to larger general-purpose models. Classified embeddings are reused via cosine similarity so similar requests skip re-classification.
- PII filtering. Private information is detected per-pattern and can be redacted, rerouted, or blocked, with a streaming PII filter that preserves a buffered-emit invariant on /v1/chat/completions, Anthropic /v1/messages, and /v1/completions. A per-model PII pattern editor lives in the model config UI.
- Cloud model proxies + MITM. Cloud models and a MITM proxy can take part in routing/filtering - send easy requests to local models and hard ones to the cloud, and use Claude Code / Codex subscriptions (OAuth) through the PII filter via the MITM proxy (subject to provider ToS). Emits proxy_connect + proxy_traffic audit events and restores its listener from runtime_settings.json on restart.
Usage stats are recorded end to end and surfaced in REST, the UI, and MCP. Outbound clients used by this path were also the trigger for the security pass below.
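The "classified embeddings are reused via cosine similarity" idea from the routing bullet can be illustrated with a small cache. This is a minimal sketch under stated assumptions, not LocalAI's middleware code: the `ClassificationCache` class, the 0.95 default threshold, and the string labels are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ClassificationCache:
    """Remember (embedding, label) pairs so similar requests skip the classifier."""

    def __init__(self, threshold: float = 0.95) -> None:
        # Assumption: similarity at or above this value counts as "same kind of request".
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the label of the most similar cached entry, or None on a miss."""
        best_label: Optional[str] = None
        best_score = self.threshold
        for cached, label in self.entries:
            score = cosine_similarity(embedding, cached)
            if score >= best_score:
                best_label, best_score = label, score
        return best_label

    def store(self, embedding: Sequence[float], label: str) -> None:
        self.entries.append((list(embedding), label))
```

On a miss the caller would run the classifier and `store` the result; on a hit the cached label is used directly. A linear scan is used here only to keep the example short.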
🛰️ Distributed Mode v4
Distributed mode keeps maturing across routing, security and resilience.
Prefix-cache-aware routing, on by default (#10071). Routing now biases toward the replica that already holds the relevant KV/prefix cache, as a load-guarded hint that never routes worse than today's round-robin. A generic prefix tree (pkg/radixtree) maps cumulative prompt-prefix hashes to nodes; core/services/nodes/prefixcache turns the rendered prompt into a deterministic xxhash chain and makes a filter-then-score decision (narrow to load-eligible replicas, then prefer the longest-prefix match), feeding a preferredNodeID into the existing atomic SELECT ... FOR UPDATE pick. Observations sync across frontends over NATS. Round-robin is the floor; disable with --distributed-prefix-cache=false.
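The filter-then-score decision can be sketched as below. This is an illustration only and makes several assumptions: it uses SHA-256 from the standard library in place of xxhash, a plain dict in place of the radix tree, fixed-size byte blocks as the unit of the hash chain, and a hypothetical `max_load` cutoff for "load-eligible". None of these names come from LocalAI.

```python
import hashlib
from typing import Dict, List, Optional


def prefix_hash_chain(prompt: str, block_size: int = 16) -> List[str]:
    """Cumulative hashes: entry i covers the first (i + 1) full blocks of the prompt."""
    data = prompt.encode("utf-8")
    usable = len(data) - len(data) % block_size  # ignore a trailing partial block
    hasher = hashlib.sha256()
    chain: List[str] = []
    for offset in range(0, usable, block_size):
        hasher.update(data[offset:offset + block_size])
        chain.append(hasher.copy().hexdigest())
    return chain


def pick_node(chain: List[str],
              seen: Dict[str, str],
              load: Dict[str, int],
              max_load: int) -> Optional[str]:
    """Return a preferred node ID, or None to fall back to round-robin.

    seen maps a prefix hash to the node that last served that prefix.
    Step 1 (filter): keep only nodes whose in-flight count is below max_load.
    Step 2 (score): walk the chain from the longest prefix to the shortest
    and return the first node that is still eligible.
    """
    eligible = {node for node, in_flight in load.items() if in_flight < max_load}
    for digest in reversed(chain):
        node = seen.get(digest)
        if node in eligible:
            return node
    return None
```

Returning `None` when nothing matches mirrors the "round-robin is the floor" property: the hint only ever narrows the choice, it never blocks a request.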
NATS JWT auth + TLS/mTLS (#10159). Previously anyone with access to the NATS port could publish backend-install messages or agent jobs (an SSRF / accidental-exposure risk). This adds JWT authentication and TLS/mTLS options, with workers acquiring and auto-refreshing their NATS credentials. Complemented by worker file-transfer registration-token enforcement (#10183).
Resumable file transfers (#10109). Large model GGUFs over flaky/throttled links no longer restart from byte 0. The worker's PUT /v1/files/<key> honors Content-Range (308/416 resume semantics, X-Content-SHA256 binding, final-hash verification) and the master-side stager HEAD-probes for the last accepted offset and resumes, switching to an outer time budget (LOCALAI_FILE_TRANSFER_BUDGET, default 1h) with exponential backoff.
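As a rough illustration of the resume flow, the sketch below plans `Content-Range` chunks from a given offset and models a toy receiver. It is not the worker's actual handler: `plan_chunks` and `UploadState` are hypothetical, the mapping of 308 to "more data expected" and 416 to "chunk does not start at the accepted offset" is an assumption about how the status codes are used, and the 422 returned on a hash mismatch is invented for the example.

```python
import hashlib
from typing import Iterator, Tuple


def content_range(start: int, end_inclusive: int, total: int) -> str:
    """Format an HTTP Content-Range header value for a byte range."""
    return f"bytes {start}-{end_inclusive}/{total}"


def plan_chunks(total: int, resume_from: int,
                chunk_size: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end_inclusive, header) for each chunk still to be sent."""
    offset = resume_from
    while offset < total:
        end = min(offset + chunk_size, total) - 1
        yield offset, end, content_range(offset, end, total)
        offset = end + 1


class UploadState:
    """Toy in-memory receiver that only accepts the next contiguous chunk."""

    def __init__(self, total: int, expected_sha256: str) -> None:
        self.total = total
        self.expected_sha256 = expected_sha256
        self.buf = bytearray()

    def put(self, start: int, data: bytes) -> int:
        if start != len(self.buf):
            return 416  # assumption: sender must restart from the accepted offset
        self.buf.extend(data)
        if len(self.buf) < self.total:
            return 308  # assumption: incomplete, more chunks expected
        ok = hashlib.sha256(self.buf).hexdigest() == self.expected_sha256
        return 200 if ok else 422  # 422 is a made-up choice for this sketch
```

A sender that loses the connection would ask the receiver for its last accepted offset (the release notes describe a HEAD probe for this) and call `plan_chunks` again with that offset instead of 0.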
ds4 layer-split distributed inference (#10098). Manual layer-split inference for the ds4 backend: a coordinator owns layers 0:K and listens; workers dial in and own higher ranges, each loading only its slice of the GGUF (a new dependency-free ds4-worker binary, driven via local-ai worker ds4-distributed). Fully back-compatible when ds4_role is absent.
Operational glue. Boot-time gallery prefetch via LOCALAI_PREFETCH_MODELS (#10108); a gated X-LocalAI-Node response header for attribution (#9976); plus fixes: self-heal stale "model not loaded" routing (#10181), stage directory-based models to remote nodes (#10175), in-flight tracking for non-LLM methods - VAD, diarize, voice (#10238), reconciler survives frontend restarts (#9981), cross-replica OpCache sync (#9983), and the reinstall/upgrade UI no longer sticks on "reinstalling" (#10214).
🎥 Video, Both Directions
Video input / understanding in llama-cpp (#10216). Video-capable multimodal models (e.g. SmolVLM2-Video) can now be sent a video in a chat request, mirroring the existing image and audio paths. Tracks the upstream mtmd video landing (ggml-org/llama.cpp#24269); grpc-server.cpp forwards request->videos() into the mtmd files vector on both the template and non-template paths, and the React chat UI accepts video/*, renders an inline <video controls> player, and emits video_url content parts. allow_video is auto-gated by whether the loaded mmproj supports it. ffmpeg/ffprobe (already in the runtime image) extract frames.
Video generation via LTX-2 (#9980). stablediffusion-ggml wires audio_vae_path and embeddings_connectors_path through to the upstream LTX-2 fields, with a new gallery/ltx-ggml.yaml template (T2V / I2V / FLF2V recipes) and six LTX-2.3 22B GGUF gallery entries (dev + distilled, UD-Q4_K_M / Q4_K_M / Q8_0), each bundling the text encoder + video VAE + audio VAE + embeddings connectors. Follow-up fixes wi...
v4.3.6
What's Changed
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to 22d66b567eef11cf2e9832f04db64ee0323a0fd0 by @localai-bot in #10080
- security(http): refuse redirects on outbound clients via hardened pkg/httpclient by @richiejp in #10087
- feat(parakeet-cpp): add NVIDIA NeMo Parakeet ASR backend (parakeet.cpp) by @localai-bot in #10084
- chore: ⬆️ Update antirez/ds4 to e16ead1e29c81a67bbb64e5b001117679cf9ce6e by @localai-bot in #10076
- chore: ⬆️ Update mudler/parakeet.cpp to 30a307553f1965ceb38a1a922069a71e7dd67bf3 by @localai-bot in #10092
Full Changelog: v4.3.5...v4.3.6
v4.3.5
What's Changed
Bug fixes 🐛
- fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) by @localai-bot in #10057
- fix(openai): stop streaming tool-call double-emission when autoparser is active by @bozhouDev in #10055
- fix(application): stop backend processes synchronously on shutdown by @richiejp in #10058
- fix(functions): validate auto-detected XML tool-call names — robust glm-4.5/Hermes guard (#9722, supersedes #9940) by @localai-bot in #10059
- fix(model): track intentional stops, stop misreading clean shutdowns as crashes by @richiejp in #10060
Exciting New Features 🎉
- feat(reasoning): honor per-request reasoning_effort on chat completions by @localai-bot in #10082
Other Changes
- chore: ⬆️ Update mudler/rf-detr.cpp to ecf64d7f7f20d73ebd906a983f398ed287256320 by @localai-bot in #10035
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10046
- chore: ⬆️ Update antirez/ds4 to 22393e770ea8eb7501d8718d6f66c6374004e03f by @localai-bot in #10047
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 0e4ee04488159b81d95a9ffcd983a077fd5dcb77 by @localai-bot in #10048
- chore: ⬆️ Update ggml-org/llama.cpp to 751ebd17a58a8a513994509214373bb9e6a3d66c by @localai-bot in #10049
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 6eff055a0cc0e427a6849cfcb5de531b4b82d667 by @localai-bot in #10050
- chore: ⬆️ Update ggml-org/whisper.cpp to c932729a304f7d9eb5354afa38624cfa86a780cf by @localai-bot in #10051
- test(react-ui): cover models gallery empty-state reset flow by @Oceankj in #10019
- test(utils): cover path verification, sanitization, and unique naming by @TLoE419 in #9978
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10061
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8960c5ba5ee9db30ba838304373aa4dbec9f7cbd by @localai-bot in #10077
- chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.22.0 by @localai-bot in #10079
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #10081
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #10074
- chore: ⬆️ Update mudler/rf-detr.cpp to 65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7 by @localai-bot in #10075
- chore: ⬆️ Update ggml-org/whisper.cpp to f24588a272ae8e23280d9c220536437164e6ed28 by @localai-bot in #10078
New Contributors
- @bozhouDev made their first contribution in #10055
- @Oceankj made their first contribution in #10019
- @TLoE419 made their first contribution in #9978
Full Changelog: v4.3.4...v4.3.5
v4.3.4
What's Changed
Other Changes
- fix(turboquant): guard upstream-only grpc-server fields for fork by @localai-bot in #10043
Full Changelog: v4.3.3...v4.3.4
v4.3.3
What's Changed
Other Changes
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3bf7e836c2c5a895e8d12d3eb7e398ae7ab2f9ce by @localai-bot in #10037
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #10038
- chore: ⬆️ Update ggml-org/llama.cpp to aa50b2c2ae91326d5aad956ceeb015d1d48e626b by @localai-bot in #10034
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 29ab511fc75f89fbab148665eab1a8e10a139a72 by @localai-bot in #10033
- chore: ⬆️ Update ggml-org/whisper.cpp to 6dcdd6536456158667747f724d6bd3a2ceaa8d88 by @localai-bot in #10032
- chore: ⬆️ Update antirez/ds4 to 072bc0feb187be5f374c08b16d0045e1ad7bc9bc by @localai-bot in #10036
- fix(openresponses): populate Content and accept bare {role,content} items (#10039) by @Anai-Guo in #10040
- perf(react-ui): code-split bundle, speed up coverage suite by @richiejp in #10042
Full Changelog: v4.3.2...v4.3.3
v4.3.2
What's Changed
👒 Dependencies
- chore(deps): bump github.com/nats-io/nats.go from 1.50.0 to 1.52.0 by @dependabot[bot] in #10003
- chore(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.19.15 to 1.19.17 by @dependabot[bot] in #10008
- chore(deps): bump actions/stale from 10.2.0 to 10.3.0 by @dependabot[bot] in #10002
- chore(deps): bump sentence-transformers from 5.5.0 to 5.5.1 in /backend/python/transformers by @dependabot[bot] in #10007
- chore(deps): update transformers requirement from >=5.8.1 to >=5.9.0 in /backend/python/transformers by @dependabot[bot] in #10005
- chore(deps): bump protobuf from 6.33.5 to 7.35.0 in /backend/python/transformers by @dependabot[bot] in #10004
Other Changes
- feat(middleware): Model routing, PII filtering, Cloud model proxies by @richiejp in #9802
- fix(intel): VRAM detection by @richiejp in #9944
- feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) by @localai-bot in #9976
- fix(distributed): persist per-model load info so reconciler survives frontend restart by @localai-bot in #9981
- feat(stablediffusion-ggml): LTX-2 support + LTX-2.3 GGUF gallery entries by @localai-bot in #9980
- fix(distributed): sync gallery OpCache + caches across frontend replicas by @localai-bot in #9983
- fix(gallery/ltx-2.3): add diffusion_model flag to all variants by @mudler in #9986
- fix(gallery/ltx-2.3): add vae_decode_only:false for i2v / flf2v by @mudler in #9987
- fix(reasoning): stop leaking into content when autoparser is in pure-content mode by @localai-bot in #9991
- fix(stablediffusion-ggml): mux LTX-2 audio into output MP4 by @localai-bot in #9990
- feat(swagger): update swagger by @localai-bot in #9992
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9993
- fix(streaming/tools): stop healing-marker stubs from gating off content by @localai-bot in #9999
- chore: ⬆️ Update antirez/ds4 to ad0209f6a4b067574d2b4afe896c08c177156b31 by @localai-bot in #9996
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to b4e1d916c5ec7e75ea3c124dd090425a99fc613f by @localai-bot in #9995
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 1ceb5bd9df7784bcdf67dd9ed8bf0198b542ebc9 by @localai-bot in #9994
- chore: ⬆️ Update ggml-org/whisper.cpp to e0fd1f6787a5bd4a4957dd97c5b64df882ee7b0c by @localai-bot in #9997
- fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk by @localai-bot in #10000
- chore: ⬆️ Update ggml-org/llama.cpp to 35c9b1f39ebe5a7bb83986d64415a079218be78d by @localai-bot in #9998
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10011
- fix(dockerignore): exclude local-only artifacts from build context by @richiejp in #10015
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #10016
- test: add Go + React UI coverage gates and fill test gaps by @richiejp in #9989
- fix(qwen-asr): enable timestamp output when forced_aligner is configured by @fqscfqj in #10013
- fix(nemo): extract Hypothesis.text for TDT/RNNT ASR models by @fqscfqj in #10012
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to d2da6da05c73aeb658a3d1751f386c24e6963856 by @localai-bot in #10020
- chore: ⬆️ Update ggml-org/whisper.cpp to 27101c01dcac1676e2b6422256233cd0f1f9ae28 by @localai-bot in #10021
- chore: ⬆️ Update ggml-org/llama.cpp to 0d18aaa9d1a8af3df9abccd828e22eeaac7f840b by @localai-bot in #10022
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 92dc7268fc4ffb0c0cc0bd52dfcefea91326e797 by @localai-bot in #10023
- chore: ⬆️ Update antirez/ds4 to e8e8779b261c10f36ad6270ba732c8f0be5b62e3 by @localai-bot in #10024
- UI: add 'Fits in my GPU' filter on Install Models by @siddimore in #10017
- fix(react-ui): share single /api/operations poller across consumers by @localai-bot in #10029
- feat(backend): rfdetr-cpp native object detection + segmentation backend by @localai-bot in #10028
- fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle by @localai-bot in #10030
- fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e by @localai-bot in #10031
New Contributors
Full Changelog: v4.3.1...v4.3.2
v4.3.1
What's Changed
Other Changes
- Fix kokoros backend build break from Backend trait drift by @Copilot in #9972
- chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 by @localai-bot in #9975
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 by @localai-bot in #9973
- chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e by @localai-bot in #9974
Full Changelog: v4.3.0...v4.3.1
v4.3.0
🎉 LocalAI 4.3.0 Release! 🚀
LocalAI 4.3.0 is out!
This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets several rounds of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And, for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.
📌 TL;DR
| Feature | Summary |
|---|---|
| 🔐 Signed Backends | Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, not_before revocation, opt-in strict mode. |
| ⚡ Prompt Cache by Default | llama-cpp server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds. |
| 📊 Usage per API Key | New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history. |
| 🛰️ Distributed v3 | Per-request replica routing, cached probeHealth, async per-node installs with streaming progress, unified backend-logs entry point. |
| 🩺 Traces UI Stays Snappy | LOCALAI_TRACING_MAX_BODY_BYTES caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings. |
| 🧊 Nix Flake | Dockerless setup for NixOS users via flake.nix + dev shell. |
| 🦾 Jetson Thor Restored | vllm / sglang / vllm-omni L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix). |
🚀 New Features & Major Enhancements
🔐 Signed Backends with Keyless Cosign
LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.
The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:
    verification:
      issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
      identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
      not_before: "2026-05-22T00:00:00Z"

- TUF trusted root cached process-wide, so N backends from one gallery do 1 fetch, not N.
- not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
- Digest pinning closes the TOCTOU window between verify and pull.
- Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.
Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
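To make the policy semantics concrete, here is a minimal sketch of how the three policy fields could gate a signature once its certificate has already been cryptographically verified. It is not the pkg/oci/cosignverify code: the `Policy` dataclass and `signature_accepted` helper are hypothetical, and comparing `not_before` against a single `signed_at` timestamp is an assumption about which time the real verifier uses.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Policy:
    issuer_regex: str
    identity_regex: str
    not_before: datetime


def signature_accepted(policy: Policy, issuer: str, identity: str,
                       signed_at: datetime) -> bool:
    """Apply the policy to an already-verified signature.

    Both regexes must match, and the signature must not predate the
    not_before cutoff. Advancing not_before therefore rejects every
    older signature, which is the revocation lever described above.
    """
    if re.search(policy.issuer_regex, issuer) is None:
        return False
    if re.search(policy.identity_regex, identity) is None:
        return False
    return signed_at >= policy.not_before


# The regexes below are the ones from the verification: block above,
# written as Python raw strings.
POLICY = Policy(
    issuer_regex=r"^https://token\.actions\.githubusercontent\.com$",
    identity_regex=(r"^https://github\.com/mudler/LocalAI/"
                    r"\.github/workflows/backend_merge\.yml@.*$"),
    not_before=datetime(2026, 5, 22, tzinfo=timezone.utc),
)
```

The sketch deliberately leaves out certificate chain validation, transparency-log checks, and digest pinning, which the real consumer performs before any policy match matters.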
🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).
⚡ Prompt Cache: On by Default
llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds with no specific configuration on your side.
Two changes, one default flip each:
- kv_unified=true by default in grpc-server.cpp. The previous false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
- prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.
You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).
📊 Per-API-Key Usage Tracking
Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".
- usage_records gained Source (apikey / web / legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
- Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
- New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id / api_key_id filters, 200-key truncation).
- React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
- Admin view (follow-up in #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.
🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).
🛰️ Distributed Mode v3
Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.
Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:
dgx-spark1 loaded in_flight=6
nvidia-thor1 loaded in_flight=0 (← idle, never gets traffic)
Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
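The ordering rule quoted above (in_flight ASC, last_used ASC, available_vram DESC) can be expressed as a single sort key. The sketch below is a Python illustration of that rule only; the `Replica` dataclass, its field types, and the `pick_best_replica` name are assumptions and do not reproduce LocalAI's Go `PickBestReplica`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    node_id: str
    in_flight: int       # requests currently being served
    last_used: float     # assumption: Unix timestamp of the last request
    available_vram: int  # assumption: free VRAM in bytes


def pick_best_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the replica with the fewest in-flight requests.

    Ties are broken by least recently used, then by most available VRAM
    (negated so that a larger value sorts first).
    """
    if not replicas:
        return None
    return min(
        replicas,
        key=lambda r: (r.in_flight, r.last_used, -r.available_vram),
    )
```

Applied to the reproducer above, a replica with in_flight=0 always wins over one with in_flight=6, provided the function is evaluated per request rather than once at load time.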
Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.
Resilient backend installs with streaming progress (#9958). Two phases:
- Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
- Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
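One plausible reading of "debounced (~250ms)" is a rate limit with a trailing flush, so that at most one progress event is published per interval and the final state is never lost. The sketch below illustrates that reading only: the `ProgressDebouncer` class is hypothetical, the caller supplies timestamps explicitly, and the worker's real behavior may differ.

```python
from typing import Any, Optional


class ProgressDebouncer:
    """Let at most one event through per interval; keep the latest suppressed one."""

    def __init__(self, interval: float = 0.25) -> None:
        self.interval = interval
        self.last_emit: Optional[float] = None
        self.pending: Optional[Any] = None

    def offer(self, now: float, event: Any) -> Optional[Any]:
        """Return the event if it should be published now, otherwise None."""
        if self.last_emit is None or now - self.last_emit >= self.interval:
            self.last_emit = now
            self.pending = None  # anything older is superseded by this event
            return event
        self.pending = event  # remember only the most recent suppressed event
        return None

    def flush(self) -> Optional[Any]:
        """Return the last suppressed event, if any (call when the install ends)."""
        event, self.pending = self.pending, None
        return event
```

Calling `flush` once the download finishes ensures the subscriber sees the final byte count even if it arrived inside a suppressed window.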
Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.
🔗 P...
