AI Papers: Video, World Models & Multimodal LLMs (Nov 2025)
Hey everyone! Check out the freshest AI research papers as of November 5, 2025. This roundup focuses on video understanding, world models, multimodal learning, multimodal LLMs, and video foundation models. For an even better experience, be sure to visit the GitHub page. Let's dive in!
Video Understanding
Video understanding continues to be a hot topic, pushing the boundaries of what machines can comprehend from visual data. These recent papers explore diverse aspects, from comprehensive evaluation to agentic search and anomaly detection. Understanding video content is crucial for various applications, including surveillance, entertainment, and autonomous systems.
One notable paper, "VidText: Towards Comprehensive Evaluation for Video Text Understanding," focuses on improving the evaluation methods for video text understanding. Evaluating how well AI systems can read and understand text within videos is essential for applications like automated transcription and content analysis. The accuracy of these systems directly impacts their utility in real-world scenarios.
Another fascinating area is explored in "Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding." This paper, accepted to NeurIPS 2025, introduces an agentic approach to searching and understanding long-form videos. Imagine AI agents that can intelligently navigate through lengthy video content, using tools to extract relevant information and answer complex questions. This could revolutionize how we interact with and learn from video data.
"Aligning Effective Tokens with Video Anomaly in Large Language Models" delves into the realm of anomaly detection. This research investigates how to align tokens with video anomalies in large language models. The ability to identify unusual or unexpected events in video streams has critical implications for security and safety applications.
For those interested in retrieval tasks, "LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts" presents a new benchmark for long video retrieval. This benchmark aims to evaluate how well systems can retrieve relevant video segments based on multimodal queries. Multimodal queries can include text, audio, and visual elements, making the retrieval task more challenging and realistic.
"VideoExplorer: Think With Videos For Agentic Long-Video Understanding" introduces a system that enables agents to "think" with videos for long-video understanding. This research aims to create AI systems that can reason about video content and perform complex tasks based on their understanding. Agentic video understanding has the potential to transform various fields, from education to entertainment.
"Symmetric Entropy-Constrained Video Coding for Machines" explores efficient video coding techniques tailored for machines. This paper, submitted to the IEEE Transactions, focuses on developing video compression methods that are optimized for machine consumption. Efficient video coding is crucial for reducing storage and bandwidth requirements, enabling faster and more scalable video processing.
"VRoPE: Rotary Position Embedding for Video Large Language Models" introduces a novel position embedding technique for video large language models. This paper, presented at EMNLP 2025, aims to improve the ability of LLMs to process and understand video sequences. Rotary position embedding helps the model understand the temporal relationships between different frames in a video.
"FOCUS: Efficient Keyframe Selection for Long Video Understanding" presents a method for efficient keyframe selection in long videos. Selecting the most important frames from a video can significantly reduce the computational cost of video analysis. This technique is particularly useful for processing long videos where analyzing every frame is impractical.
"AVA: Towards Agentic Video Analytics with Vision Language Models" explores the use of vision-language models for agentic video analytics. This paper, accepted to NDSI 2026, aims to create AI agents that can analyze video content and perform tasks based on their understanding. Agentic video analytics has the potential to automate various tasks, such as surveillance, traffic monitoring, and retail analytics.
"Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders" focuses on improving temporal understanding in video-LLMs. This paper, accepted to NeurIPS 2025, introduces a stacked temporal attention mechanism that enhances the model's ability to capture temporal relationships in video sequences. Temporal understanding is crucial for tasks such as action recognition and video summarization.
"StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA" introduces a new dataset for temporal dynamics and multimodal chain-of-thought reasoning in streaming video question answering. This dataset aims to challenge AI systems to reason about video content in real-time, using multiple modalities and chain-of-thought reasoning. Streaming video question answering is a challenging task that requires the integration of multiple AI techniques.
"Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models" provides a deep dive into video reasoning with large multimodal models. This research explores the effectiveness of post-training techniques for improving the performance of video-LMMs. Post-training involves fine-tuning a pre-trained model on a specific task or dataset.
"VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations" presents a method for boosting video temporal grounding using curriculum reinforcement learning. Video temporal grounding involves identifying the start and end times of specific events in a video. This technique could improve the accuracy of video search and retrieval systems.
"Evaluation of Vision-LLMs in Surveillance Video" evaluates the performance of vision-LLMs in surveillance video. This paper, accepted as a poster in the NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, assesses the capabilities of these models in real-world surveillance scenarios. Surveillance video analysis presents unique challenges due to the complex and dynamic nature of the environment.
"VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding" focuses on evaluating and mitigating multi-modal hallucinations in synthetic video understanding. Multi-modal hallucinations refer to the tendency of AI systems to generate incorrect or nonsensical outputs when processing multiple modalities, such as video and text. This research aims to improve the reliability of AI systems in multi-modal scenarios.
World Model
World models are gaining traction as a way to enable AI systems to understand and interact with their environments more effectively. These models allow agents to predict the consequences of their actions and plan accordingly. Recent papers showcase advancements in various aspects of world modeling, including causal reasoning, spatial reasoning, and geometric consistency.
"Mapping Overlaps in Benchmarks through Perplexity in the Wild" explores how to map overlaps in benchmarks using perplexity in the wild. This research aims to improve the evaluation of AI models by accounting for the correlations between different benchmarks. Understanding benchmark overlaps can help researchers develop more robust and generalizable models.
"CausalARC: Abstract Reasoning with Causal World Models" introduces a causal approach to abstract reasoning. This peer-reviewed workshop paper explores how causal world models can be used to solve abstract reasoning tasks. Causal reasoning is essential for understanding the relationships between different events and predicting their consequences.
"MindJourney: Test-Time Scaling with World Models for Spatial Reasoning" presents a method for test-time scaling with world models for spatial reasoning. This project, with a dedicated project page at https://umass-embodied-agi.github.io/MindJourney, aims to improve the ability of AI systems to reason about spatial relationships in unseen environments. Spatial reasoning is crucial for tasks such as navigation and object manipulation.
"World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training" explores the use of world models as virtual environments for VLA post-training. This research aims to improve the efficiency of VLA training by using world models to simulate realistic environments. Virtual environments can be used to train AI systems in a safe and cost-effective manner.
"Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model" introduces a dual-stream diffusion model for world-model augmented vision-language-action models. This 20-page paper with 10 figures, focuses on improving the integration of world models with vision-language-action models. Diffusion models have shown promising results in various generative tasks.
"FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction" presents a method for geometry-consistent world modeling via unified video and 3D prediction. This research aims to create world models that are consistent with both video and 3D data. Geometric consistency is essential for creating realistic and accurate world models.
"Jasmine: A Simple, Performant and Scalable JAX-based World Modeling Codebase" introduces a new JAX-based world modeling codebase. This project, with a blog post at https://pdoom.org/jasmine.html, aims to provide a simple, performant, and scalable platform for world modeling research. JAX is a popular framework for numerical computation and machine learning.
"StateSpaceDiffuser: Bringing Long Context to Diffusion World Models" explores how to bring long context to diffusion world models. This research aims to improve the ability of diffusion models to capture long-range dependencies in complex environments. Long context is crucial for understanding the dynamics of complex systems.
"SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting" presents a method for pose-free 4D generation via auto-regressive video inpainting. This 26-page paper with 21 figures and 3 tables, with a dedicated project page at https://see-4d.github.io/, focuses on generating 4D models from video data. 4D models capture the geometry and motion of objects over time.
"Clone Deterministic 3D Worlds with Geometrically-Regularized World Models" explores how to clone deterministic 3D worlds with geometrically-regularized world models. This research aims to create world models that can accurately reproduce the behavior of real-world environments. Geometric regularization helps to ensure that the generated models are realistic and stable.
"Bridge and Bound: A Logic-Based Framework for Abstracting (Preliminary Report)" presents a logic-based framework for abstracting world models. This research aims to develop a formal framework for reasoning about world models and their properties. Logic-based frameworks can provide a rigorous foundation for AI research.
"PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control" introduces a unified diffusion model for robot pose estimation and video-to-action control. This research aims to integrate robot perception and control using a single diffusion model. The paper acknowledges that the experimental setup and metrics lack rigor, affecting the fairness of the comparisons.
"Emu3.5: Native Multimodal Models are World Learners" introduces a new multimodal model that learns about the world. This project, with a dedicated project page at https://emu.world, aims to create AI systems that can understand and interact with the world in a more natural way. Multimodal models can process information from multiple sources, such as images, text, and audio.
"Co-Evolving Latent Action World Models" explores the concept of co-evolving latent action world models. This research aims to develop world models that can adapt to the changing behavior of agents in the environment. Co-evolution can lead to more robust and adaptable AI systems.
"Model Provenance Testing for Large Language Models" focuses on model provenance testing for large language models. This research aims to develop methods for verifying the origin and integrity of LLMs. Model provenance is crucial for ensuring the trustworthiness and reliability of AI systems.
Multimodal
Multimodal learning is becoming increasingly important as AI systems are expected to process and integrate information from multiple sources. These papers cover a range of topics, from efficient multimodal models to risk-adaptive steering and spatial reasoning. The ability to effectively combine different modalities is essential for creating AI systems that can understand and interact with the world in a more comprehensive way.
"ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers" introduces an efficient multimodal large language model. Published as a conference paper at ICCV 2025, with a dedicated project page at https://github.com/icip-cas/ShortV, this research aims to reduce the computational cost of multimodal LLMs by freezing visual tokens in ineffective layers. Freezing layers can significantly reduce the memory and computational requirements of deep learning models.
"Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks" focuses on task-oriented multimodal token transmission in resource-constrained multiuser networks. This research aims to optimize the transmission of multimodal data in environments with limited resources. Resource-constrained environments pose unique challenges for AI systems.
"DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI" presents a deep learning based approach to functionally consistent tractography fiber clustering using multimodal diffusion MRI and functional MRI. This 14-page paper, focuses on improving the accuracy of brain imaging analysis. Multimodal MRI provides complementary information about brain structure and function.
"Risk-adaptive Activation Steering for Safe Multimodal Large Language Models" explores risk-adaptive activation steering for safe multimodal large language models. This research aims to develop methods for preventing LLMs from generating harmful or inappropriate content. Risk-adaptive steering can help to ensure that AI systems are used in a responsible and ethical manner.
"AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning" introduces adaptive intra-network modulation for balanced multimodal learning. This 13-page paper with 7 figures, focuses on improving the performance of multimodal learning by adaptively modulating the flow of information within the network. Balanced multimodal learning aims to prevent one modality from dominating the learning process.
"Learning to Steer: Input-dependent Steering for Multimodal LLMs" presents a method for input-dependent steering for multimodal LLMs. This paper, presented at NeurIPS 2025, aims to improve the ability of LLMs to control their behavior based on the input they receive. Input-dependent steering can help to ensure that AI systems are responsive and adaptable.
"Spatial Knowledge Graph-Guided Multimodal Synthesis" explores spatial knowledge graph-guided multimodal synthesis. Published in IEEE/ACM Transactions on Audio, Speech and Language Processing, this research aims to improve the quality of multimodal synthesis by incorporating spatial knowledge. Spatial knowledge graphs can provide valuable information about the relationships between different objects and locations.
"Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models" introduces decoupling contrastive decoding for robust hallucination mitigation in multimodal large language models. This 17-page paper with 4 figures, focuses on reducing the tendency of LLMs to generate incorrect or nonsensical outputs. Contrastive decoding is a technique that aims to improve the coherence and accuracy of generated text.
"Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios" focuses on eliciting and enhancing multimodal reasoning in medical scenarios. This research aims to improve the ability of AI systems to reason about medical data from multiple sources. Medical reasoning requires the integration of various types of information, such as images, text, and patient history.
"Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input" presents a benchmark for evaluating the robustness of multimodal large language models to dynamic resolution input. The authors have discovered a significant error in the paper subsequent to submission, and are withdrawing the manuscript for substantial correction.
"Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks" provides a survey and benchmarks for multimodal spatial reasoning in the large model era. This research aims to provide a comprehensive overview of the current state of the art in multimodal spatial reasoning. Spatial reasoning is crucial for tasks such as navigation and object manipulation.
"LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts" presents a new benchmark for long video retrieval in multimodal contexts. This benchmark aims to evaluate how well systems can retrieve relevant video segments based on multimodal queries. Multimodal queries can include text, audio, and visual elements, making the retrieval task more challenging and realistic.
"FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs" introduces functionally equivalent sampling for trust assessment of multimodal LLMs. Accepted in the Findings of EMNLP, 2025, this research aims to develop methods for assessing the trustworthiness of multimodal LLMs. Trust assessment is crucial for ensuring that AI systems are used in a responsible and ethical manner.
"3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark" presents a medical multimodal multi-agent dialogue benchmark. Presented at EMNLP 25 (main), this benchmark aims to challenge AI systems to engage in realistic medical dialogues with multiple agents. Multi-agent dialogue requires the ability to coordinate and communicate with other agents.
"CGM-Led Multimodal Tracking with Chatbot Support: An Autoethnography in Sub-Health" explores CGM-led multimodal tracking with chatbot support. Presented at the International Conference on Human-Engaged Computing (ICHEC 2025), Singapore, this research aims to develop a system for tracking and managing sub-health conditions using multimodal data and chatbot support. Multimodal tracking can provide a more comprehensive view of a person's health status.
Multimodal LLM
Multimodal LLMs are a rapidly evolving area of AI research, combining the power of large language models with the ability to process and understand information from multiple modalities. These papers showcase advancements in various aspects of multimodal LLMs, including steering, trust assessment, and programmatic control.
"Learning to Steer: Input-dependent Steering for Multimodal LLMs" presents a method for input-dependent steering for multimodal LLMs. This paper, presented at NeurIPS 2025, aims to improve the ability of LLMs to control their behavior based on the input they receive. Input-dependent steering can help to ensure that AI systems are responsive and adaptable.
"FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs" introduces functionally equivalent sampling for trust assessment of multimodal LLMs. Accepted in the Findings of EMNLP, 2025, this research aims to develop methods for assessing the trustworthiness of multimodal LLMs. Trust assessment is crucial for ensuring that AI systems are used in a responsible and ethical manner.
"Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies" explores multimodal LLM-assisted evolutionary search for programmatic control policies. This research aims to leverage the power of multimodal LLMs to discover effective control policies for complex systems. Evolutionary search is a powerful optimization technique that can be used to find solutions to difficult problems.
"Synergistic Tensor and Pipeline Parallelism" focuses on synergistic tensor and pipeline parallelism for training large models. This research aims to improve the efficiency of training large AI models by combining different parallelization techniques. Parallelism is essential for scaling AI models to larger datasets and more complex architectures.
"SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding" introduces SafePLUG, a system that empowers multimodal LLMs with pixel-level insight and temporal grounding for traffic accident understanding. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG. Traffic accident understanding is a challenging task that requires the integration of multiple sources of information.
"All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles" provides an overview of object detection techniques for autonomous vehicles. This research explores the use of pixels, points, and prompts to improve the accuracy and robustness of object detection systems. Autonomous vehicles require reliable object detection systems to navigate safely.
"Omni-Mol: Multitask Molecular Model for Any-to-any Modalities" introduces Omni-Mol, a multitask molecular model for any-to-any modalities. This 44-page paper with 9 figures and 13 tables, accepted by NeurIPS 2025, focuses on developing a versatile model for molecular modeling. Molecular modeling is used in various fields, such as drug discovery and materials science.
"Revealing Multimodal Causality with Large Language Models" explores how to reveal multimodal causality with large language models. Accepted at NeurIPS 2025, this research aims to leverage the power of LLMs to understand causal relationships between different events in multimodal data. Causality is essential for understanding the underlying mechanisms of complex systems.
"NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables" explores the long-context capability of large language models towards long-structured tables. Accepted by NeurIPS 2025, this research aims to evaluate the ability of LLMs to process and understand long tables. Long-context capability is crucial for tasks that require processing large amounts of information.
"From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes" presents a holistic benchmark for multi-level visual grounding in 3D scenes. Update v3 of the NeurIPS 2025 Datasets and Benchmarks paper (v2), including additional evaluations of state-of-the-art multimodal large language models. Project page: https://anywhere-3d.github.io/. Visual grounding is the task of identifying and localizing objects in an image or video based on a textual description.
"Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier" focuses on emotion-coherent reasoning for multimodal LLMs. This 16-page paper with 11 figures, explores how to improve the ability of LLMs to reason about emotions in multimodal data. Emotion-coherent reasoning is crucial for tasks such as sentiment analysis and emotion recognition.
"FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment" introduces FairJudge, a system for MLLM judging for social attributes and prompt image alignment. This research aims to develop methods for evaluating the fairness and bias of multimodal LLMs. Fairness and bias are important considerations for ensuring that AI systems are used in a responsible and ethical manner.
"LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models" explores layerwise ultra-low bit quantization for multimodal large language models. This research aims to reduce the memory and computational requirements of multimodal LLMs by quantizing their weights. Quantization is a technique that reduces the precision of numerical values.
"EasyUUV: An LLM-Enhanced Universal and Lightweight Sim-to-Real Reinforcement Learning Framework for UUV Attitude Control" introduces EasyUUV, an LLM-enhanced universal and lightweight sim-to-real reinforcement learning framework for UUV attitude control. This 8-page paper with 15 figures, focuses on developing a reinforcement learning framework for controlling underwater vehicles. Reinforcement learning is a powerful technique for training AI systems to perform complex tasks.
"Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning" focuses on evaluating multimodal LLMs on tool-enabled image perception, transformation, and reasoning. This research aims to assess the ability of LLMs to use external tools to process and understand images. Tool-enabled perception can significantly enhance the capabilities of AI systems.
Video Foundation Model
Video foundation models are large-scale models trained on massive amounts of video data, enabling them to perform a wide range of video-related tasks. These papers showcase advancements in various aspects of video foundation models, including geometry-consistent world modeling, data-efficient curation, and video generation.
"FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction" presents a method for geometry-consistent world modeling via unified video and 3D prediction. This research aims to create world models that are consistent with both video and 3D data. Geometric consistency is essential for creating realistic and accurate world models.
"Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model" explores LLM-based curation for a data-efficient audio-video foundation model. This 5-page paper with 5 figures and 2 tables, accepted at EUSIPCO 2025, aims to improve the efficiency of training audio-video foundation models by using LLMs to curate the training data. Data curation is the process of selecting and preparing data for use in AI training.
"GenLit: Reformulating Single-Image Relighting as Video Generation" reformulates single-image relighting as video generation. This research aims to improve the quality of image relighting by leveraging video generation techniques. Image relighting is the task of changing the lighting conditions in an image.
"Breakdance Video classification in the age of Generative AI" focuses on breakdance video classification in the age of generative AI. This 11-page paper, explores the challenges and opportunities of classifying breakdance videos in the context of generative AI. Video classification is the task of assigning a category or label to a video.
"Advances in 4D Representation: Geometry, Motion, and Interaction" provides an overview of advances in 4D representation. This 21-page paper with a dedicated project page at https://mingrui-zhao.github.io/4DRep-GMI/, explores the use of 4D models to represent the geometry, motion, and interaction of objects over time. 4D models can capture the dynamic behavior of objects in a way that 3D models cannot.
"TTOM: Test-Time Optimization and Memorization for Compositional Video Generation" introduces TTOM, a method for test-time optimization and memorization for compositional video generation. With a dedicated project page at https://ttom-t2v.github.io/, this research aims to improve the quality of video generation by optimizing the model at test time. Video generation is the task of creating new videos from scratch or from existing content.
"Inferring Dynamic Physical Properties from Video Foundation Models" explores how to infer dynamic physical properties from video foundation models. This research aims to develop methods for extracting information about the physical properties of objects from video data. Physical properties include mass, friction, and elasticity.
"Can World Models Benefit VLMs for World Dynamics?" explores whether world models can benefit VLMs for world dynamics. With a dedicated project page at https://dyva-worldlm.github.io, this research aims to investigate the potential of using world models to improve the performance of VLMs in understanding world dynamics. World dynamics refers to the way that objects and events interact with each other in the world.
"Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation" introduces Uni3C, a method for unifying precisely 3D-enhanced camera and human motion controls for video generation. Accepted by Siggraph Asian 2025, with a dedicated project page at https://github.com/ewrfcas/Uni3C, this research aims to improve the control and realism of video generation by incorporating 3D information about camera and human motion.
"Simplifying Traffic Anomaly Detection with Video Foundation Models" explores how to simplify traffic anomaly detection with video foundation models. Accepted at ICCVW 2025, with code available at https://github.com/tue-mps/simple-tad, this research aims to improve the accuracy and efficiency of traffic anomaly detection by using video foundation models. Traffic anomaly detection is the task of identifying unusual or unexpected events in traffic video.
That's all for this update, folks! Stay tuned for more AI research news.