Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
Abstract
Output-entropy profiles computed from final-layer next-token probabilities serve as a scalable signal for monitoring LLM performance and prioritizing data acquisition under domain shifts.
Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift; and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (derived from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance-level correctness, and averaging its predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). The estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
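The pipeline in the abstract can be sketched end to end on toy data. This is a minimal illustration, not the authors' implementation: the paper uses eleven trace statistics whose exact definitions are not given here, so the five features below are illustrative stand-ins, and the synthetic traces simply assume that correct responses tend to have lower-entropy decoding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_entropy(topk_logprobs):
    """Entropy (nats) of the next-token distribution, approximated
    from the top-k logprobs returned at decoding time."""
    p = np.exp(np.asarray(topk_logprobs, dtype=float))
    p = p / p.sum()  # renormalize the truncated top-k distribution
    return float(-(p * np.log(p)).sum())

def profile_features(trace):
    """Summarize a per-token entropy trace with simple statistics.
    (Stand-ins for the paper's eleven statistics.)"""
    t = np.asarray(trace)
    return np.array([t.mean(), t.std(), t.min(), t.max(), t[-1] - t[0]])

# Toy traces: sharper logits (large scale) -> lower per-token entropy.
rng = np.random.default_rng(0)
def fake_trace(n_tokens, scale):
    return [token_entropy(rng.normal(0.0, scale, size=5))
            for _ in range(n_tokens)]

# Assumed setup: 50 "correct" (peaked) and 50 "incorrect" (flat) responses.
X = np.array([profile_features(fake_trace(20, s))
              for s in [5.0] * 50 + [0.5] * 50])
y = np.array([1] * 50 + [0] * 50)

# Lightweight classifier over trace features -> instance correctness.
clf = LogisticRegression().fit(X, y)

# Domain-level accuracy estimate: mean predicted correctness probability.
domain_estimate = clf.predict_proba(X)[:, 1].mean()
print(f"estimated domain accuracy: {domain_estimate:.2f}")
```

In a real deployment, the traces would come from logged top-k logprobs of production responses, and the domain-level average would be computed per traffic slice to flag domains whose estimated accuracy is drifting downward.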
Community
A first exploration of a lightweight, inference-time method for monitoring LLM accuracy under domain drift using output-entropy traces derived from next-token probabilities. This approach demonstrates promising results for slice-level accuracy estimation across STEM reasoning benchmarks, suggesting that entropy-based signals could serve as a practical tool for real-time model monitoring in production. It offers potential utility for both continuous performance tracking and prioritizing data acquisition in dynamic environments.
The following papers were recommended by the Semantic Scholar API
- Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning (2025)
- Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling (2026)
- The Illusion of Insight in Reasoning Models (2026)
- AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor (2026)
- DiFR: Inference Verification Despite Nondeterminism (2025)
- LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference (2026)
- SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation (2026)