fix(distributed): self-heal stale 'model not loaded' routing by localai-bot · Pull Request #10181 · mudler/LocalAI

localai-bot · 2026-06-04T22:33:52Z

Problem

In distributed mode a model can become unreachable with <backend>: model not loaded and stay broken until the controller is restarted.

Sequence:

A model is loaded on a worker; the registry records it as loaded on that node.
The worker evicts the model (autonomous LRU eviction, an out-of-band unload, memory pressure) but the backend process survives.
The registry still lists the model as loaded there.
On the next request, the router's cached-node check (SmartRouter.probeHealth) only verifies the process is alive — not that the model is loaded — so it routes to that node.
The inference call returns model not loaded. The stale registry row is never cleared, so every subsequent request keeps routing back to the same evicted node.

A controller restart "fixes" it only because it rebuilds the registry from scratch.

Observed live with a realtime pipeline: parakeet-cpp: model not loaded returned instantly with no staging/load attempt on the worker, persisting until the controller was restarted (after which the model re-staged and transcribed fine).

Fix

InFlightTrackingClient now self-heals: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row via RemoveNodeModel, so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned to the caller unchanged — only the registry is corrected.

Applied to every tracked inference method (Predict, PredictStream, Embeddings, TTS, TTSStream, AudioTranscription, AudioTranscriptionStream, image/video/sound/detect/rerank).
InFlightTracker gains RemoveNodeModel (already implemented by NodeRegistry).

This makes the failure transient and self-correcting (one failed request, then automatic recovery) instead of permanent-until-restart.

Tests

TDD — new specs in inflight_test.go: a model not loaded error drops the replica (for unary and streamed calls); an unrelated error and a success do not. Full core/services/nodes suite passes.

In distributed mode the registry can list a model as loaded on a node while the worker has evicted it (autonomous LRU eviction, an out-of-band unload, etc.) yet the backend process survives. The router's cached-node check only verifies the process is alive (probeHealth), so it routes there and inference fails with "<backend>: model not loaded" — and stays broken until the controller restarts and rebuilds its registry. InFlightTrackingClient now reconciles this: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row (RemoveNodeModel) so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned unchanged; only the registry is corrected. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Replace the controller-side error-string match with a shared, code-aware helper. Go error types don't survive the gRPC boundary, so the signal is carried as a status code (FailedPrecondition): - pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor + IsModelNotLoaded(err) checker (status-code first, message fallback for backends not yet migrated). - InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded. - Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy, rfdetr-cpp) to the typed constructor. Acting on a false positive is harmless (the model is just reloaded). Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler force-pushed the fix/distributed-stale-model-reload branch from 0a1fbfa to b2c89e9 Compare June 4, 2026 22:43

mudler added 2 commits June 4, 2026 23:00

mudler force-pushed the fix/distributed-stale-model-reload branch from b2c89e9 to bc33d4f Compare June 4, 2026 23:00

mudler merged commit 858257e into master Jun 5, 2026
68 checks passed

mudler deleted the fix/distributed-stale-model-reload branch June 5, 2026 07:01

localai-bot added the bug Something isn't working label Jun 10, 2026

BrewTestBot mentioned this pull request Jun 10, 2026

localai 4.4.0 Homebrew/homebrew-core#287347

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(distributed): self-heal stale 'model not loaded' routing#10181

fix(distributed): self-heal stale 'model not loaded' routing#10181
mudler merged 2 commits into
masterfrom
fix/distributed-stale-model-reload

localai-bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

localai-bot commented Jun 4, 2026

Problem

Fix

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants