NeurIPS 2025
| 1 | How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation | Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Bangalath, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Seungwhan Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Kraehenbuehl, Piotr Dollar, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer | [2506.04088](https://huggingface.co/papers/2506.04088) | 110 | 11 | [link](https://foremost-beechnut-8ed.notion.site/WebThinker-Empowering-Large-Reasoning-Models-with-Deep-Research-Capability-d13158a27d924a4b9df7f9ab94066b64) | [link](https://github.com/Intelli-Chip-Lab/enhanced-self-distillation-framework-for-snn) | [ChenDY/NAG_wan2-1-fast](https://huggingface.co/spaces/ChenDY/NAG_wan2-1-fast) <br> [ChenDY/NAG_FLUX.1-Kontext-Dev](https://huggingface.co/spaces/ChenDY/NAG_FLUX.1-Kontext-Dev) <br> [ChenDY/NAG_FLUX.1-dev](https://huggingface.co/spaces/ChenDY/NAG_FLUX.1-dev) | [nathanrchn/zip2zip-test](https://huggingface.co/nathanrchn/zip2zip-test) <br> [Saibo-creator/zip2zip-evqn-7000](https://huggingface.co/Saibo-creator/zip2zip-evqn-7000) <br> [Saibo-creator/zip2zip-evqn-7000-new](https://huggingface.co/Saibo-creator/zip2zip-evqn-7000-new) <br> [Saibo-creator/zip2zip-Phi-3.5-mini-instruct-v0.1](https://huggingface.co/Saibo-creator/zip2zip-Phi-3.5-mini-instruct-v0.1) <br> [Saibo-creator/zip2zip-Llama-3.2-3B-Instruct-v0.1](https://huggingface.co/Saibo-creator/zip2zip-Llama-3.2-3B-Instruct-v0.1) <br> [Saibo-creator/zip2zip-Llama-3.2-1B-Instruct-v0.1](https://huggingface.co/Saibo-creator/zip2zip-Llama-3.2-1B-Instruct-v0.1) <br> [Saibo-creator/zip2zip-Llama-3.1-8B-Instruct-v0.1](https://huggingface.co/Saibo-creator/zip2zip-Llama-3.1-8B-Instruct-v0.1) <br> [epfl-dlab/zip2zip-Llama-3.1-8B-Instruct-v0.1](https://huggingface.co/epfl-dlab/zip2zip-Llama-3.1-8B-Instruct-v0.1) <br> [epfl-dlab/zip2zip-Llama-3.2-1B-Instruct-v0.1](https://huggingface.co/epfl-dlab/zip2zip-Llama-3.2-1B-Instruct-v0.1) <br> [epfl-dlab/zip2zip-Llama-3.2-3B-Instruct-v0.1](https://huggingface.co/epfl-dlab/zip2zip-Llama-3.2-3B-Instruct-v0.1) <br> [epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1](https://huggingface.co/epfl-dlab/zip2zip-Phi-3.5-mini-instruct-v0.1) <br> [epfl-dlab/zip2zip-Phi-3-medium-instruct-v0.1](https://huggingface.co/epfl-dlab/zip2zip-Phi-3-medium-instruct-v0.1) | [WaltonFuture/MMR1-direct-synthesizing](https://huggingface.co/datasets/WaltonFuture/MMR1-direct-synthesizing) <br> [WaltonFuture/geometry3k-in-context-synthesizing](https://huggingface.co/datasets/WaltonFuture/geometry3k-in-context-synthesizing) <br> [WaltonFuture/geometry3k-direct-synthesizing](https://huggingface.co/datasets/WaltonFuture/geometry3k-direct-synthesizing) <br> [WaltonFuture/GeoQA-8K-in-context-synthesizing](https://huggingface.co/datasets/WaltonFuture/GeoQA-8K-in-context-synthesizing) <br> [WaltonFuture/GeoQA-8K-direct-synthesizing](https://huggingface.co/datasets/WaltonFuture/GeoQA-8K-direct-synthesizing) <br> [WaltonFuture/MMR1-in-context-synthesizing](https://huggingface.co/datasets/WaltonFuture/MMR1-in-context-synthesizing) | 11/17 ✅ |

A central question in sensory neuroscience is not only how much, but also what information neurons transmit about the world. While Shannon's information theory provides a principled framework to quantify the amount of information neurons encode about all stimuli, it does not reveal which stimuli contribute most, or which stimulus features are encoded. As a concrete example, neurons in the early visual cortex are known to be sensitive to stimuli in a small region of space (their receptive field). However, it is not clear how such simple intuitions carry over to more complex scenarios, e.g. with large, noisy, and non-linear populations of neurons and high-dimensional stimuli.

Several measures of neural sensitivity have been proposed. For example, the Fisher information quantifies the sensitivity of neural responses to infinitesimal stimulus perturbations. However, because the Fisher information is not a valid decomposition of the mutual information, it cannot say how different stimuli contribute to the total encoded information. On the other hand, previous works have proposed stimulus-dependent decompositions of mutual information, which define a function $I(x)$ such that $I(R; X) = \mathbb{E}[I(x)]$. However, this decomposition is inherently ill-posed: infinitely many functions $I(x)$ satisfy the constraint, with no principled way to select among them. Further, different decompositions behave in qualitatively different ways, making it hard to interpret what they are telling us.
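This ill-posedness is easy to see in a toy example. The sketch below is a textbook illustration, not the decomposition proposed here: it compares two classic pointwise decompositions for a small discrete channel, the "specific surprise" $I_1(x) = D_{KL}(p(r \mid x) \,\|\, p(r))$ and the "specific information" $I_2(x) = H(R) - H(R \mid X = x)$. Both average to $I(R; X)$, yet they disagree stimulus by stimulus, and $I_2(x)$ can even be negative.

```python
import numpy as np

def pointwise_decompositions(p_x, p_r_given_x):
    """Return two pointwise decompositions I1(x), I2(x) of I(R;X), in nats.

    p_x: shape (n_x,) stimulus prior; p_r_given_x: shape (n_x, n_r) likelihoods.
    """
    p_r = p_x @ p_r_given_x  # marginal response distribution
    # "Specific surprise": KL(p(r|x) || p(r)) -- non-negative by construction.
    i1 = np.sum(p_r_given_x * np.log(p_r_given_x / p_r), axis=1)
    # "Specific information": H(R) - H(R|X=x) -- can be negative.
    h_r = -np.sum(p_r * np.log(p_r))
    h_r_given_x = -np.sum(p_r_given_x * np.log(p_r_given_x), axis=1)
    i2 = h_r - h_r_given_x
    return i1, i2

# Toy channel: x=0 drives a reliable response, x=1 an uninformative one.
p_x = np.array([0.5, 0.5])
p_r_given_x = np.array([[0.9, 0.1],
                        [0.5, 0.5]])
i1, i2 = pointwise_decompositions(p_x, p_r_given_x)
# Both expectations p_x @ i1 and p_x @ i2 equal I(R;X) (about 0.10 nats here),
# but the two attributions differ pointwise, and i2[1] is negative.
print(i1, i2, p_x @ i1, p_x @ i2)
```

Both candidates satisfy the averaging constraint exactly, so the constraint alone cannot choose between them; that is precisely the gap the axioms below are meant to close.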
Finally, most proposed decompositions are computationally intractable for the high-dimensional stimuli and non-linear encoding models relevant to neuroscience.

To resolve these limitations, we propose a set of axioms that any stimulus-specific and feature-specific information decomposition should satisfy in order to serve as a meaningful and interpretable measure of neural sensitivity. These axioms formalize intuitive desiderata: the information assigned to each stimulus, and to each stimulus feature, should be non-negative and additive with respect to repeated measurements. We also require the decomposition to respect a form of locality: changes in how a neuron responds to a stimulus $x$ should not affect the information attributed to a distant stimulus $x'$. Finally, the attribution must be insensitive to irrelevant features, which do not contribute to the total information. Together, these constraints ensure that the decomposition is both interpretable and theoretically grounded.

We show that existing decompositions violate one or more of these axioms, limiting their interpretability and their use as information-theoretic measures of neural sensitivity. We then introduce a novel decomposition that satisfies all of our axioms. It generalizes the Fisher information by capturing neural sensitivity to both infinitesimal and finite stimulus perturbations. Moreover, it supports further decomposition across individual stimulus features (e.g., image pixels), enabling fine-grained analysis of neural representations.

Beyond satisfying our theoretical axioms, our decomposition is computationally tractable for large neural populations and high-dimensional naturalistic stimuli, through the use of diffusion models. We demonstrate the power of our method by quantifying the information encoded by a model of visual neurons about individual images and pixels.
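For intuition about the Fisher baseline being generalized, consider a toy Poisson-spiking neuron with a Gaussian tuning curve (a standard textbook model, not the encoding model used in this work). Its Fisher information $J(x) = f'(x)^2 / f(x)$ vanishes at the tuning-curve peak and is maximal on the flanks, at $x = \mu \pm \sqrt{2}\,w$: it measures sensitivity to infinitesimal perturbations only, and assigns no special role to the preferred stimulus.

```python
import numpy as np

def gaussian_tuning(x, r_max=10.0, mu=0.0, w=1.0):
    """Mean firing rate of a model neuron with a Gaussian tuning curve."""
    return r_max * np.exp(-((x - mu) ** 2) / (2 * w ** 2))

def fisher_poisson(x, r_max=10.0, mu=0.0, w=1.0):
    """Fisher information J(x) = f'(x)^2 / f(x) for Poisson spiking."""
    f = gaussian_tuning(x, r_max, mu, w)
    f_prime = f * (-(x - mu) / w ** 2)
    return f_prime ** 2 / f

x = np.linspace(-4, 4, 2001)
J = fisher_poisson(x)
# J is (numerically) zero at the preferred stimulus mu = 0 and
# peaks on the flanks, at x = mu +/- sqrt(2) * w.
print(x[np.argmax(J)])
```

This is exactly the kind of purely local sensitivity statement that a finite-perturbation, stimulus-specific decomposition goes beyond.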
Our approach uncovers aspects of the neural code that are not picked up by standard methods, such as the Fisher information, and opens the door to similar analyses in higher-order sensory areas and artificial neural networks.
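As a rough illustration of what a pixel-level analysis can look like, the sketch below computes a generic finite-difference sensitivity map for a hypothetical rectified-linear model neuron with a localized receptive field. This is only a stand-in for feature-level attribution in general, not the information decomposition proposed here; all names and the model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def neuron_response(image, rf):
    """Toy rectified-linear visual neuron with receptive field `rf` (assumed model)."""
    return max(0.0, float(np.sum(rf * image)))

def pixel_sensitivity(image, rf, eps=1e-3):
    """Finite-difference sensitivity of the response to each pixel."""
    base = neuron_response(image, rf)
    sens = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        perturbed = image.copy()
        perturbed[idx] += eps
        sens[idx] = (neuron_response(perturbed, rf) - base) / eps
    return sens

# A localized (Gaussian-blob) receptive field centered at pixel (8, 8):
# pixels outside it contribute essentially nothing to the response.
size = 16
yy, xx = np.mgrid[0:size, 0:size]
rf = np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / 8.0)
image = rng.random((size, size))
sens = pixel_sensitivity(image, rf)
```

For this linear-above-threshold model the map simply recovers the receptive field; the point of an information-theoretic, feature-specific decomposition is to deliver an analogous map with units of information, valid for noisy, non-linear populations.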