arxiv:2512.14052

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Published on Dec 16 · Submitted by Jinyang Wu on Dec 18
Abstract

HyperVL, an efficient multimodal large language model for on-device inference, uses image tiling, a Visual Resolution Compressor, and Dual Consistency Learning to reduce memory usage, latency, and power consumption while maintaining performance.

AI-generated summary

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
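The image-tiling idea can be pictured with a short sketch: encode a high-resolution image one fixed-size tile at a time, so peak activation memory is bounded by a single tile rather than the full input. The snippet below is a minimal, hypothetical PyTorch illustration; `vit_encode` and the 448-pixel tile size are assumptions for illustration, not the paper's actual components.

```python
# Minimal sketch of tiled encoding: peak memory is bounded by one tile,
# independent of the input resolution. Names and sizes are illustrative.
import torch
import torch.nn.functional as F

TILE = 448  # hypothetical tile side length in pixels

def encode_tiled(image: torch.Tensor, vit_encode) -> torch.Tensor:
    """image: (3, H, W) tensor; vit_encode: maps a (1, 3, TILE, TILE) tile
    to a (1, num_tokens, dim) feature tensor."""
    _, H, W = image.shape
    features = []
    for top in range(0, H, TILE):
        for left in range(0, W, TILE):
            tile = image[:, top:top + TILE, left:left + TILE]
            # Pad edge tiles up to the full tile size.
            pad_h, pad_w = TILE - tile.shape[1], TILE - tile.shape[2]
            tile = F.pad(tile, (0, pad_w, 0, pad_h))
            # Encoding one tile at a time keeps peak activation memory
            # roughly constant regardless of H and W.
            features.append(vit_encode(tile.unsqueeze(0)))
    return torch.cat(features, dim=1)  # (1, total_tokens, dim)
```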

Community

Paper submitter

🚀 [New Paper] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Current multimodal large language models (MLLMs) possess strong perceptual and reasoning capabilities, but their high computational and memory requirements make them difficult to deploy directly on edge devices. HyperVL aims to tackle this challenge by introducing an efficient multimodal large language model tailored for on-device inference.

✨ The Core Intuition:

HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques:

1๏ธโƒฃ Visual Resolution Compressor (VRC): Adaptively predicts optimal encoding resolutions to eliminate redundant computation.

2๏ธโƒฃ Dual Consistency Learning (DCL): Aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM.

📈 Highlights:

  • State-of-the-Art Performance: HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks.
  • Resource Efficient: It significantly reduces latency and power consumption on real mobile devices, demonstrating a 6.8x reduction in peak memory overhead.
  • Quantization Robustness: The model demonstrates exceptional robustness to low-bit precision under W4A16 quantization, with negligible performance drops (a generic W4A16 sketch follows this list).
  • Broad Applications: HyperVL shows strong generalization for on-device tasks such as UI understanding and parsing, intent recommendation, and image-text creation.
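For context on what W4A16 means here: weights are stored as 4-bit integers with higher-precision per-group scales, while activations stay in 16-bit. The sketch below shows a generic group-wise symmetric scheme, not the paper's quantizer; the group size of 128 and the on-the-fly dequantization are assumptions.

```python
# Generic W4A16 sketch: 4-bit weights with per-group fp16 scales,
# fp16 activations. Group size 128 is a common choice, assumed here.
import torch

GROUP = 128

def quantize_w4(weight: torch.Tensor):
    """weight: (out, in) matrix whose element count is divisible by GROUP.
    Returns int4 codes (stored in int8) plus per-group fp16 scales."""
    w = weight.float().reshape(-1, GROUP)
    # Symmetric int4 range is [-8, 7]; scale each group by its max magnitude.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale.half()

def dequant_matmul(x, q, scale, out_features, in_features):
    # "A16": the activation x stays in 16-bit; weights are dequantized
    # on the fly before the matmul (kernels fuse this step in practice).
    w = (q.float() * scale.float()).reshape(out_features, in_features)
    return x @ w.to(x.dtype).t()
```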

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/hypervl-an-efficient-and-dynamic-multimodal-large-language-model-for-edge-devices-846-98deea02

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

