Emu3.5: Native Multimodal Models are World Learners
Paper • 2510.26583 • Published • 108

RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
Paper • 2510.20479 • Published • 11

Paper • 2510.18212 • Published • 34

Video-As-Prompt: Unified Semantic Control for Video Generation
Paper • 2510.20888 • Published • 45

Reasoning with Sampling: Your Base Model is Smarter Than You Think
Paper • 2510.14901 • Published • 47

DeepAgent: A General Reasoning Agent with Scalable Toolsets
Paper • 2510.21618 • Published • 99

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Paper • 2510.23603 • Published • 22

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
Paper • 2510.23451 • Published • 26

ACG: Action Coherence Guidance for Flow-based VLA models
Paper • 2510.22201 • Published • 36

Rethinking Visual Intelligence: Insights from Video Pretraining
Paper • 2510.24448 • Published • 5

Latent Chain-of-Thought for Visual Reasoning
Paper • 2510.23925 • Published • 9

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Paper • 2510.17439 • Published • 26

RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Paper • 2510.23763 • Published • 53

The Principles of Diffusion Models
Paper • 2510.21890 • Published • 60

Reasoning-Aware GRPO using Process Mining
Paper • 2510.25065 • Published • 42

Scaling Latent Reasoning via Looped Language Models
Paper • 2510.25741 • Published • 221

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
Paper • 2510.23473 • Published • 84

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Paper • 2510.26802 • Published • 33

Exploring Conditions for Diffusion models in Robotic Control
Paper • 2510.15510 • Published • 39