fix(distributed): self-heal stale 'model not loaded' routing#10181

Merged
mudler merged 2 commits into master from fix/distributed-stale-model-reload
Jun 5, 2026

@localai-bot

Collaborator

Problem

In distributed mode, a model can become unreachable, failing with "<backend>: model not loaded", and stay broken until the controller is restarted.

Sequence:

  1. A model is loaded on a worker; the registry records it as loaded on that node.
  2. The worker evicts the model (autonomous LRU eviction, an out-of-band unload, memory pressure) but the backend process survives.
  3. The registry still lists the model as loaded there.
  4. On the next request, the router's cached-node check (SmartRouter.probeHealth) only verifies the process is alive — not that the model is loaded — so it routes to that node.
  5. The inference call returns model not loaded. The stale registry row is never cleared, so every subsequent request keeps routing back to the same evicted node.

A controller restart "fixes" it only because it rebuilds the registry from scratch.
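The sequence above can be reproduced in a small simulation. This is only a sketch under stated assumptions: the `worker` and `registry` types, `route`, and `infer` are hypothetical stand-ins, not the real SmartRouter or NodeRegistry code. Only the liveness-only behaviour of `probeHealth` and the error text come from the description above.

```go
package main

import "fmt"

// worker is a hypothetical stand-in for a backend process on a node.
type worker struct {
	alive  bool            // the process is running
	loaded map[string]bool // models actually resident on the worker
}

// registry is a hypothetical controller-side view: model name -> node name.
type registry map[string]string

// probeHealth mirrors the liveness-only check described above: it reports
// whether the process is alive, not whether the model is still loaded.
func probeHealth(w *worker) bool { return w != nil && w.alive }

// route returns the cached node for a model if that node passes the probe.
func route(reg registry, workers map[string]*worker, model string) (string, bool) {
	node, ok := reg[model]
	if !ok || !probeHealth(workers[node]) {
		return "", false
	}
	return node, true
}

// infer fails when the model was evicted, even though the process is alive.
func infer(w *worker, backend, model string) error {
	if !w.loaded[model] {
		return fmt.Errorf("%s: model not loaded", backend)
	}
	return nil
}

func main() {
	w := &worker{alive: true, loaded: map[string]bool{"m": true}}
	workers := map[string]*worker{"node-1": w}
	reg := registry{"m": "node-1"} // step 1: registry records the load

	delete(w.loaded, "m") // step 2: eviction, the process survives

	// Steps 3-5: the registry row is never cleared, so every request is
	// routed to the same node and fails the same way.
	for i := 1; i <= 3; i++ {
		node, _ := route(reg, workers, "m")
		fmt.Printf("request %d -> %s: %v\n", i, node, infer(workers[node], "parakeet-cpp", "m"))
	}
}
```

Nothing in the loop ever mutates `reg`, which is why only rebuilding the registry (a controller restart) breaks the cycle.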

Observed live with a realtime pipeline: the error "parakeet-cpp: model not loaded" was returned instantly, with no staging or load attempt on the worker, and persisted until the controller was restarted (after which the model re-staged and transcribed fine).

Fix

InFlightTrackingClient now self-heals: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row via RemoveNodeModel, so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned to the caller unchanged — only the registry is corrected.

  • Applied to every tracked inference method (Predict, PredictStream, Embeddings, TTS, TTSStream, AudioTranscription, AudioTranscriptionStream, image/video/sound/detect/rerank).
  • InFlightTracker gains RemoveNodeModel (already implemented by NodeRegistry).

This makes the failure transient and self-correcting (one failed request, then automatic recovery) instead of permanent-until-restart.
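A minimal sketch of the self-healing wrapper, to make the control flow concrete. The names `RemoveNodeModel` and `reconcile` appear in this PR; everything else is an assumption for illustration, including the `RemoveNodeModel(nodeID, model)` signature, the `trackingClient` and `fakeRegistry` types, the callback-style `Predict`, and the simplified string match.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// replicaRemover is a hypothetical one-method view of the tracker.
// The signature is assumed, not taken from the real code.
type replicaRemover interface {
	RemoveNodeModel(nodeID, model string) error
}

// fakeRegistry records removals so the effect is observable.
type fakeRegistry struct{ removed []string }

func (r *fakeRegistry) RemoveNodeModel(nodeID, model string) error {
	r.removed = append(r.removed, nodeID+"/"+model)
	return nil
}

// trackingClient is a simplified stand-in for InFlightTrackingClient.
type trackingClient struct {
	nodeID, model string
	tracker       replicaRemover
}

// isModelNotLoaded is a simplified, message-based check.
func isModelNotLoaded(err error) bool {
	return err != nil && strings.Contains(err.Error(), "model not loaded")
}

// reconcile drops the stale replica row on a model-not-loaded error and
// returns the original error untouched; only the registry changes.
func (c *trackingClient) reconcile(err error) error {
	if isModelNotLoaded(err) {
		_ = c.tracker.RemoveNodeModel(c.nodeID, c.model) // best effort
	}
	return err
}

// Predict wraps one tracked call; other tracked methods would follow the
// same pattern.
func (c *trackingClient) Predict(call func() (string, error)) (string, error) {
	out, err := call()
	return out, c.reconcile(err)
}

var (
	errNotLoaded = errors.New("parakeet-cpp: model not loaded")
	errUnrelated = errors.New("context deadline exceeded")
)

func main() {
	reg := &fakeRegistry{}
	c := &trackingClient{nodeID: "node-1", model: "m", tracker: reg}
	calls := []func() (string, error){
		func() (string, error) { return "", errNotLoaded }, // drops the replica
		func() (string, error) { return "", errUnrelated }, // leaves it alone
		func() (string, error) { return "ok", nil },        // leaves it alone
	}
	for _, call := range calls {
		_, err := c.Predict(call)
		fmt.Printf("err=%v removed=%v\n", err, reg.removed)
	}
}
```

Routing the check through a single `reconcile` helper keeps each tracked method a one-line change, which is one plausible way to cover the full method list above.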

Tests

Written test-first (TDD). New specs in inflight_test.go cover: a model-not-loaded error drops the replica, for both unary and streamed calls; an unrelated error and a successful call do not. The full core/services/nodes suite passes.

@mudler mudler force-pushed the fix/distributed-stale-model-reload branch from 0a1fbfa to b2c89e9 Compare June 4, 2026 22:43
mudler added 2 commits June 4, 2026 23:00
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.

InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):

- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
  IsModelNotLoaded(err) checker (status-code first, message fallback for
  backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
  rfdetr-cpp) to the typed constructor.

Acting on a false positive is harmless (the model is just reloaded).

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
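The code-first, message-fallback check described in the second commit can be sketched without the gRPC dependency. This is an illustration only: `code`, `failedPrecondition`, and `wireError` are local stand-ins for a gRPC status code and status error, and treating the code alone as a match is an assumption. Only the names `ModelNotLoaded` and `IsModelNotLoaded` and the choice of FailedPrecondition come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// code is a local stand-in for a gRPC status code.
type code int

const (
	unknown code = iota
	failedPrecondition
)

// wireError is a local stand-in for a gRPC status error: only the code and
// the message cross the process boundary, not the Go error type.
type wireError struct {
	Code code
	Msg  string
}

func (e *wireError) Error() string { return e.Msg }

const notLoadedMsg = "model not loaded"

// ModelNotLoaded builds the error a migrated backend would return.
func ModelNotLoaded(backend string) error {
	return &wireError{Code: failedPrecondition, Msg: backend + ": " + notLoadedMsg}
}

// IsModelNotLoaded checks the status code first, then falls back to the
// message for backends that still return a plain error string.
func IsModelNotLoaded(err error) bool {
	if err == nil {
		return false
	}
	var we *wireError
	if errors.As(err, &we) && we.Code == failedPrecondition {
		return true
	}
	return strings.Contains(err.Error(), notLoadedMsg)
}

func main() {
	migrated := ModelNotLoaded("parakeet-cpp")
	legacy := &wireError{Code: unknown, Msg: "some-backend: model not loaded"}
	unrelated := &wireError{Code: unknown, Msg: "deadline exceeded"}

	fmt.Println(IsModelNotLoaded(migrated))  // matched by code
	fmt.Println(IsModelNotLoaded(legacy))    // matched by message fallback
	fmt.Println(IsModelNotLoaded(unrelated)) // not matched
}
```

In this sketch any FailedPrecondition error counts as a match, so a false positive is possible; as the commit message notes, the cost of acting on one is only an extra model reload.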
@mudler mudler force-pushed the fix/distributed-stale-model-reload branch from b2c89e9 to bc33d4f Compare June 4, 2026 23:00
@mudler mudler merged commit 858257e into master Jun 5, 2026
68 checks passed
@mudler mudler deleted the fix/distributed-stale-model-reload branch June 5, 2026 07:01
@localai-bot localai-bot added the bug Something isn't working label Jun 10, 2026