MultiModal
MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper
• 2401.13601
• Published
• 48
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
Paper
• 2402.13144
• Published
• 100
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
Paper
• 2402.13251
• Published
• 14
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper
• 2403.00522
• Published
• 46
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Paper
• 2403.04692
• Published
• 40
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
Paper
• 2403.01779
• Published
• 30
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model
Paper
• 2403.05034
• Published
• 21
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
Paper
• 2403.05121
• Published
• 23
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper
• 2403.01422
• Published
• 30
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance
Paper
• 2401.16465
• Published
• 12
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
Paper
• 2405.17405
• Published
• 16
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
Paper
• 2405.15757
• Published
• 15
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
• 2405.20204
• Published
• 37
ColPali: Efficient Document Retrieval with Vision Language Models
Paper
• 2407.01449
• Published
• 51
Honeybee: Locality-enhanced Projector for Multimodal LLM
Paper
• 2312.06742
• Published
• 13