ginipick


AI & ML interests

None yet

Recent Activity

reacted to SeaWolf-AI's post with 👍 16 days ago

FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction

We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether AI knows it is wrong.

Dataset: [FINAL-Bench/Metacognitive](https://huggingface.co/datasets/FINAL-Bench/Metacognitive) | 100 Tasks | 15 Domains | 8 TICOS Types | Apache 2.0
Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/metacognitive

Core Innovation

Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly to the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.

Three Findings Across 9 SOTA Models

We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:

1. ER Dominance. 94.8% of MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, which is the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).

```python
from datasets import load_dataset

dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
```

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs

FINAL Bench is the first tool to tell apart what AI truly knows from what it merely pretends to know.
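The difficulty effect in the post is reported as a Pearson correlation, but the post does not say which two quantities were correlated. As a minimal sketch, assuming the pairing is per-task baseline score against MetaCog gain (so a negative r means lower-scoring, harder tasks gain more), the statistic could be computed as below. The function name `pearson_r` and the toy values are hypothetical and are not FINAL Bench data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values for illustration only (NOT FINAL Bench results):
# baseline score per task (lower = harder) and the gain from metacognition.
baseline_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
metacog_gains = [0.30, 0.22, 0.20, 0.08, 0.03]

r = pearson_r(baseline_scores, metacog_gains)
print(f"r = {r:.3f}")
```

With real per-task scores in place of the toy lists, a strongly negative r would correspond to the pattern the post describes. The p-value would need a separate significance test, which this sketch does not cover.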

updated a Space about 2 months ago

ginipick/retane

published a Space about 2 months ago

ginipick/retane

ginipick's models (11)

ginipick/Qwen-Image-Edit-Rapid-AIO

Text-to-Image • Updated Nov 2, 2025 • 1

ginipick/GLM-4.6

Text Generation • 357B • Updated Nov 2, 2025 • 2

ginipick/neutts-air

Text-to-Speech • 0.7B • Updated Nov 2, 2025 • 9 • 1

ginipick/MiniMax-M2

Text Generation • 229B • Updated Nov 2, 2025

ginipick/PaddleOCR-VL

Image-Text-to-Text • 1.0B • Updated Nov 2, 2025 • 4

ginipick/DeepSeek-OCR

Image-Text-to-Text • 3B • Updated Nov 2, 2025 • 2

ginipick/Gemma-3-R1984-4B

Image-Text-to-Text • 4B • Updated Apr 22, 2025 • 2 • 8

ginipick/QwQ-32B-NF4

Text Generation • 33B • Updated Mar 21, 2025 • 2 • 4

ginipick/wan-lora-cat

Text-to-Video • Updated Mar 16, 2025

ginipick/c-bag

Updated Mar 13, 2025

ginipick/flux-lora-eric-cat

Text-to-Image • Updated Dec 2, 2024 • 28 • 80