Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
Abstract
Output-entropy profiles computed from final-layer next-token probabilities serve as a scalable signal for monitoring LLM performance and prioritizing data acquisition under domain shifts.
Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift; and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (derived from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance-level correctness, and averaging its predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). The estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
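The pipeline in the abstract can be sketched end to end on toy data. This is a minimal illustration, not the authors' implementation: the paper uses eleven trace statistics whose exact definitions are not given here, so the five features below are illustrative stand-ins, and the synthetic traces simply assume that correct responses tend to have lower-entropy decoding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_entropy(topk_logprobs):
    """Entropy (nats) of the next-token distribution, approximated
    from the top-k logprobs returned at decoding time."""
    p = np.exp(np.asarray(topk_logprobs, dtype=float))
    p = p / p.sum()  # renormalize the truncated top-k distribution
    return float(-(p * np.log(p)).sum())

def profile_features(trace):
    """Summarize a per-token entropy trace with simple statistics.
    (Stand-ins for the paper's eleven statistics.)"""
    t = np.asarray(trace)
    return np.array([t.mean(), t.std(), t.min(), t.max(), t[-1] - t[0]])

# Toy traces: sharper logits (large scale) -> lower per-token entropy.
rng = np.random.default_rng(0)
def fake_trace(n_tokens, scale):
    return [token_entropy(rng.normal(0.0, scale, size=5))
            for _ in range(n_tokens)]

# Assumed setup: 50 "correct" (peaked) and 50 "incorrect" (flat) responses.
X = np.array([profile_features(fake_trace(20, s))
              for s in [5.0] * 50 + [0.5] * 50])
y = np.array([1] * 50 + [0] * 50)

# Lightweight classifier over trace features -> instance correctness.
clf = LogisticRegression().fit(X, y)

# Domain-level accuracy estimate: mean predicted correctness probability.
domain_estimate = clf.predict_proba(X)[:, 1].mean()
print(f"estimated domain accuracy: {domain_estimate:.2f}")
```

In a real deployment, the traces would come from logged top-k logprobs of production responses, and the domain-level average would be computed per traffic slice to flag domains whose estimated accuracy is drifting downward.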
Community
A first exploration of a lightweight, inference-time method for monitoring LLM accuracy under domain drift using output-entropy traces derived from next-token probabilities. This approach demonstrates promising results for slice-level accuracy estimation across STEM reasoning benchmarks, suggesting that entropy-based signals could serve as a practical tool for real-time model monitoring in production. It offers potential utility for both continuous performance tracking and prioritizing data acquisition in dynamic environments.
The following papers were recommended by the Semantic Scholar API
- Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning (2025)
- Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling (2026)
- The Illusion of Insight in Reasoning Models (2026)
- AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor (2026)
- DiFR: Inference Verification Despite Nondeterminism (2025)
- LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference (2026)
- SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation (2026)