fix(distributed): self-heal stale 'model not loaded' routing#10181
Merged
In distributed mode the registry can list a model as loaded on a node while the worker has evicted it (autonomous LRU eviction, an out-of-band unload, etc.) yet the backend process survives. The router's cached-node check only verifies the process is alive (probeHealth), so it routes there and inference fails with "<backend>: model not loaded", and stays broken until the controller restarts and rebuilds its registry.

InFlightTrackingClient now reconciles this: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row (RemoveNodeModel) so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned unchanged; only the registry is corrected.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Replace the controller-side error-string match with a shared, code-aware helper. Go error types don't survive the gRPC boundary, so the signal is carried as a status code (FailedPrecondition):

- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor + IsModelNotLoaded(err) checker (status-code first, message fallback for backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy, rfdetr-cpp) to the typed constructor.

Acting on a false positive is harmless (the model is just reloaded).

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
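The "status-code first, message fallback" check described in the commit above can be sketched roughly as follows. This is a dependency-free illustration, not the code in pkg/grpc/grpcerrors: the Code and StatusError types are hypothetical stand-ins for the gRPC status machinery (a real helper would presumably build on google.golang.org/grpc/status and codes.FailedPrecondition), and treating the code alone as sufficient is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Code is a hypothetical stand-in for a gRPC status code.
type Code int

const (
	Unknown Code = iota
	FailedPrecondition
)

// StatusError is a hypothetical stand-in for a gRPC status error:
// a code plus a message, which is what survives the wire.
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string { return e.Msg }

const modelNotLoadedMsg = "model not loaded"

// ModelNotLoaded builds the typed error a migrated backend would return.
func ModelNotLoaded(backend string) error {
	return &StatusError{Code: FailedPrecondition, Msg: backend + ": " + modelNotLoadedMsg}
}

// IsModelNotLoaded checks the status code first, then falls back to a
// message match for backends that still return a plain error string.
func IsModelNotLoaded(err error) bool {
	if err == nil {
		return false
	}
	var se *StatusError
	if errors.As(err, &se) && se.Code == FailedPrecondition {
		return true
	}
	return strings.Contains(err.Error(), modelNotLoadedMsg)
}

func main() {
	typed := ModelNotLoaded("parakeet-cpp")
	legacy := errors.New("rfdetr-cpp: model not loaded") // not yet migrated
	other := errors.New("connection refused")
	fmt.Println(IsModelNotLoaded(typed), IsModelNotLoaded(legacy), IsModelNotLoaded(other))
	// prints: true true false
}
```

The fallback keeps unmigrated backends working during the transition; since a false positive only triggers a reload, a loose message match is a tolerable trade-off.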
Problem
In distributed mode a model can become unreachable with "<backend>: model not loaded" and stay broken until the controller is restarted. Sequence:
1. The registry lists a model as loaded on a node.
2. The worker evicts the model (autonomous LRU eviction, an out-of-band unload, etc.), but the backend process survives.
3. The router's cached-node check (SmartRouter.probeHealth) only verifies the process is alive, not that the model is loaded, so it routes to that node.
4. Inference fails with "model not loaded". The stale registry row is never cleared, so every subsequent request keeps routing back to the same evicted node.

A controller restart "fixes" it only because it rebuilds the registry from scratch.
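The failure loop above can be reproduced with a toy model. The sketch below is illustrative only: the worker and router types, their fields, and the helper names are hypothetical and do not mirror the real SmartRouter; the only behavior taken from the description is that the health probe checks process liveness and nothing else.

```go
package main

import (
	"errors"
	"fmt"
)

// worker is a hypothetical node: the process can be alive while the
// model has already been evicted.
type worker struct {
	alive  bool
	loaded map[string]bool
}

func (w *worker) infer(model string) error {
	if !w.loaded[model] {
		return errors.New("backend: model not loaded")
	}
	return nil
}

// router holds the controller-side view: model -> node believed to hold it.
type router struct {
	registry map[string]string
	nodes    map[string]*worker
}

// probeHealth is liveness-only: it never asks whether the model is loaded.
func (r *router) probeHealth(node string) bool { return r.nodes[node].alive }

// route returns the cached node whenever the liveness probe passes.
func (r *router) route(model string) (string, bool) {
	node, ok := r.registry[model]
	if !ok || !r.probeHealth(node) {
		return "", false
	}
	return node, true
}

// failedRequests sends n requests and counts how many fail.
func failedRequests(r *router, model string, n int) int {
	failures := 0
	for i := 0; i < n; i++ {
		node, ok := r.route(model)
		if !ok || r.nodes[node].infer(model) != nil {
			failures++
		}
	}
	return failures
}

// newStaleRouter builds the broken state: the registry says node-1 holds
// the model, the process is alive, but the model was evicted.
func newStaleRouter() *router {
	return &router{
		registry: map[string]string{"parakeet": "node-1"},
		nodes:    map[string]*worker{"node-1": {alive: true, loaded: map[string]bool{}}},
	}
}

func main() {
	// Nothing ever clears the stale row, so every request fails.
	fmt.Println(failedRequests(newStaleRouter(), "parakeet", 5)) // prints: 5
}
```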
Observed live with a realtime pipeline: "parakeet-cpp: model not loaded" returned instantly with no staging/load attempt on the worker, persisting until the controller was restarted (after which the model re-staged and transcribed fine).
Fix
InFlightTrackingClient now self-heals: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row via RemoveNodeModel, so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned to the caller unchanged; only the registry is corrected.
- Applies to the tracked inference calls (Predict, PredictStream, Embeddings, TTS, TTSStream, AudioTranscription, AudioTranscriptionStream, image/video/sound/detect/rerank).
- InFlightTracker gains RemoveNodeModel (already implemented by NodeRegistry).

This makes the failure transient and self-correcting (one failed request, then automatic recovery) instead of permanent-until-restart.
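A minimal sketch of the reconcile step, under stated assumptions: the RemoveNodeModel signature, the trackingClient type, the fakeRegistry, and the message-only isModelNotLoaded check are all hypothetical simplifications, and Predict here stands in for any of the tracked calls. What it shows is the contract from the description: drop the replica row on a model-not-loaded error and hand the original error back untouched.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// InFlightTracker is the slice of the registry the wrapper needs.
// Only the method name comes from the PR; the signature is assumed.
type InFlightTracker interface {
	RemoveNodeModel(nodeID, model string) error
}

// fakeRegistry is a hypothetical in-memory registry keyed by "node/model".
type fakeRegistry struct{ rows map[string]bool }

func (r *fakeRegistry) RemoveNodeModel(nodeID, model string) error {
	delete(r.rows, nodeID+"/"+model)
	return nil
}

// Example errors used by main below.
var (
	errEvicted = errors.New("parakeet-cpp: model not loaded")
	errTimeout = errors.New("deadline exceeded")
)

// isModelNotLoaded is a simplified stand-in for the shared checker
// (message match only).
func isModelNotLoaded(err error) bool {
	return err != nil && strings.Contains(err.Error(), "model not loaded")
}

// trackingClient wraps backend calls for one (node, model) replica.
type trackingClient struct {
	tracker InFlightTracker
	nodeID  string
	model   string
}

// reconcile drops the stale replica row on a model-not-loaded error and
// always returns the error it was given, unchanged.
func (c *trackingClient) reconcile(err error) error {
	if isModelNotLoaded(err) {
		// Best effort: a failed removal must not mask the original error.
		_ = c.tracker.RemoveNodeModel(c.nodeID, c.model)
	}
	return err
}

// Predict stands in for any tracked inference call.
func (c *trackingClient) Predict(call func() (string, error)) (string, error) {
	out, err := call()
	return out, c.reconcile(err)
}

func main() {
	reg := &fakeRegistry{rows: map[string]bool{"node-1/parakeet": true}}
	c := &trackingClient{tracker: reg, nodeID: "node-1", model: "parakeet"}

	_, err := c.Predict(func() (string, error) { return "", errEvicted })
	fmt.Println(err == errEvicted, reg.rows["node-1/parakeet"]) // prints: true false
}
```

Keeping the correction as a side effect means callers see exactly the error the backend produced, while the next routing decision no longer finds the evicted replica.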
Tests
TDD, with new specs in inflight_test.go: a "model not loaded" error drops the replica (for unary and streamed calls); an unrelated error and a success do not. Full core/services/nodes suite passes.