A.S.E: A Repository-Level Benchmark for Evaluating the Security of AI-Generated Code

Abstract

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks often lack relevance to real-world AI programming scenarios, making them inadequate for assessing the practical security risks associated with AI-generated code in production environments. To address this gap, we introduce A.S.E (AI Code Generation Security Evaluation), a repository-level evaluation benchmark designed to closely mirror real-world AI programming tasks, offering a comprehensive and reliable framework for assessing the security of AI-generated code. Our evaluation of leading LLMs on A.S.E reveals several key findings. In particular, current LLMs still struggle with secure coding. The complexity in repository-level scenarios presents challenges for LLMs that typically perform well on snippet-level tasks. Moreover, a larger reasoning budget does not necessarily lead to better code generation. These observations offer valuable insights into the current state of AI code generation, assisting developers in selecting the most appropriate models for practical tasks, while laying the foundation for refining LLMs to generate secure and efficient code in real-world applications.


Drivelology: Challenging LLMs to Interpret Nonsense with Depth

Abstract

We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth”, utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs’ pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.


The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Abstract

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
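
The formal contrast at the center of the survey can be written schematically (standard definitions, with our own symbols rather than the survey's notation): conventional LLM-RL degenerates to a single decision over a prompt, while Agentic RL unfolds over a temporally extended, partially observable process.

```latex
% Conventional LLM-RL: a degenerate single-step MDP.
% One decision: emit a full response y for prompt x, then the episode ends.
\text{LLM-RL:}\quad (\mathcal{S}=\{x\},\ \mathcal{A}=\mathcal{Y},\ r(x,y)),\qquad T = 1
% Agentic RL: a temporally extended POMDP with states s, actions a,
% observations o, transition kernel P, observation kernel \Omega, discount \gamma.
\text{Agentic RL:}\quad (\mathcal{S},\ \mathcal{A},\ \mathcal{O},\ P(s'\mid s,a),\ \Omega(o\mid s),\ r,\ \gamma),\qquad T \gg 1
```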


A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

  • Title: A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

  • Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou

  • Date: 2025-08-28

  • arXiv page: https://arxiv.org/abs/2508.21148

  • Paper link: https://arxiv.org/pdf/2508.21148

  • GitHub repository: https://github.com/open-sciencelab/Awesome-Scientific-Datasets-and-LLMs

Abstract

Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands – heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.


UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

  • Title: UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  • Authors: Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
  • Date: 2025-09-02
  • arXiv page: https://arxiv.org/abs/2509.02544
  • Paper link: https://arxiv.org/pdf/2509.02544
  • Project page: https://seed-tars.com/showcase/ui-tars-2/

Abstract

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2’s potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.


R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforcement Learning

Abstract

Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model’s accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
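
The bi-mode rollout idea admits a compact sketch. The following is our reading, not the paper's exact objective: both modes answer the same query, and a GRPO-style advantage is normalized over the combined group, so the model can learn when the extra thinking actually pays off.

```python
import numpy as np

def bpo_advantages(rewards_think, rewards_nothink):
    """GRPO-style advantages over a combined bi-mode group (sketch).

    rewards_think / rewards_nothink: per-sample rewards for responses the
    policy was forced to generate in thinking and non-thinking mode for the
    SAME query. (Illustrative; the paper's exact BPO objective may differ.)
    """
    group = np.concatenate([rewards_think, rewards_nothink])
    mean, std = group.mean(), group.std() + 1e-6
    adv = (group - mean) / std  # z-normalize within the mixed group
    return adv[: len(rewards_think)], adv[len(rewards_think):]

# Toy usage: thinking helps on this query, so its advantages come out higher.
a_think, a_nothink = bpo_advantages(np.array([1.0, 1.0]), np.array([0.0, 1.0]))
```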


From Editor to Dense Geometry Estimator

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor’s native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT’s global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page is available at https://amap-ml.github.io/FE2E/.
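
The precision fix can be illustrated in isolation. A toy sketch of logarithmic quantization (our own illustration; the paper's exact scheme is not specified in the abstract): encoding depth in log space before discretizing preserves relative rather than absolute precision, which is what low-precision formats like BFloat16 lose at large magnitudes.

```python
import numpy as np

def log_quantize(depth, d_min=0.1, d_max=100.0, n_bins=65536):
    """Quantize depth on a log scale (illustrative, not FE2E's exact scheme)."""
    depth = np.asarray(depth, dtype=np.float64)
    t = (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(np.clip(t, 0.0, 1.0) * (n_bins - 1)).astype(np.int64)

def log_dequantize(idx, d_min=0.1, d_max=100.0, n_bins=65536):
    """Invert the mapping back to metric depth."""
    t = np.asarray(idx, dtype=np.float64) / (n_bins - 1)
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))
```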


SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Abstract

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
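
The filtering rule at the heart of the method fits in a few lines. A hedged sketch, with detection heuristics of our own choosing (the abstract does not specify them):

```python
import re

CODE_RE = re.compile(r"`{3}")          # a fenced code block was emitted
ANSWER_RE = re.compile(r"\\boxed\{")   # a final boxed answer was emitted

def is_void_turn(turn_text: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer.
    (These detectors are illustrative, not SimpleTIR's exact ones.)"""
    return not (CODE_RE.search(turn_text) or ANSWER_RE.search(turn_text))

def filter_void_trajectories(trajectories):
    """Drop whole trajectories containing any void turn before the policy
    update, blocking the high-magnitude gradients they tend to cause."""
    return [traj for traj in trajectories
            if not any(is_void_turn(turn) for turn in traj["turns"])]
```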


LLaVA-Critic-R1: Your Critic Model Is Secretly a Strong Policy Model

Abstract

In vision-language modeling, critic models are typically trained to evaluate outputs – assigning scalar scores or pairwise preferences – rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model – matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
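
The data reorganization admits a compact illustration. A hypothetical sketch (field names and prompt template are ours): each preference-labeled critic record becomes a prompt whose label is a checkable answer, giving a binary verifiable reward for RL on the base generative model.

```python
def to_verifiable_example(query, response_a, response_b, preferred):
    """Turn a preference-labeled critic record into a verifiable RL example.
    `preferred` is 'A' or 'B'; the label becomes the checkable answer."""
    prompt = (f"Question: {query}\n\nResponse A: {response_a}\n\n"
              f"Response B: {response_b}\n\n"
              f"Which response is better? Answer 'A' or 'B'.")
    return {"prompt": prompt, "answer": preferred}

def verifiable_reward(model_judgment: str, example) -> float:
    """Binary reward: 1 if the critic picks the labeled winner, else 0."""
    return 1.0 if model_judgment.strip().upper().startswith(example["answer"]) else 0.0
```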


EO-Robotics: Interleaved Vision-Text-Action Pretraining for General Robot Control

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.


Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

Abstract

Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.


Towards a Unified View of Large Language Model Post-Training

Abstract

Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
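
The four interchangeable parts suggest a schematic form along these lines (our rendering; the paper's precise definitions may differ):

```latex
\nabla_\theta J(\theta)\;\approx\;
\mathbb{E}_{(x,y)\sim\mathcal{D}}\!\left[
\underbrace{M(x,y)}_{\text{stabilization mask}}\cdot
\underbrace{\tfrac{1}{\pi_{\mathrm{ref}}(y\mid x)}}_{\text{reference denominator}}\cdot
\underbrace{\hat{A}(x,y)}_{\text{advantage estimate}}\cdot
\underbrace{\nabla_\theta\,\pi_\theta(y\mid x)}_{\text{likelihood gradient}}
\right]
```

Different choices recover familiar updates: for example, with M ≡ 1, Â ≡ 1, and π_ref set to the (detached) current policy, the bracket collapses to ∇_θ log π_θ(y|x), an SFT-style gradient on demonstration data, while reward-based Â and rollout data give RL-style updates.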


VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving a near 2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
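
The asynchronous-rollout claim is easy to picture with a toy asyncio sketch (API names are ours, not VerlTool's): each trajectory awaits its own tool calls independently, so one slow tool no longer holds up the whole batch the way a synchronous lock-step loop does.

```python
import asyncio

async def run_trajectory(policy, tools, prompt, max_turns=8):
    """Roll out one multi-turn tool-use trajectory independently."""
    history = [prompt]
    for _ in range(max_turns):
        action = await policy.generate(history)   # model proposes text or a tool call
        if action.tool is None:                   # final answer: this rollout is done
            break
        result = await tools[action.tool].execute(action.args)  # non-blocking tool I/O
        history.append(result)
    return history

async def rollout_batch(policy, tools, prompts):
    """All trajectories progress concurrently instead of in lock-step."""
    return await asyncio.gather(*(run_trajectory(policy, tools, p) for p in prompts))
```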


Open Data Synthesis for Deep Research

Abstract

Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research-tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via reject sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On a challenging benchmark BrowseComp-Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets in this repository: https://github.com/VectorSpaceLab/InfoSeek.


Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.


Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

Abstract

Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: v_reason = θ_GRPO − θ_SFT. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector’s strong contribution to the model’s reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
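
The extraction and transfer steps follow directly from the formula above. A minimal PyTorch sketch (the model handles are placeholders; per the abstract, the two source checkpoints must be identically initialized and architecture-compatible):

```python
import torch

def extract_reasoning_vector(grpo_state, sft_state):
    """v_reason = theta_GRPO - theta_SFT, computed parameter-wise."""
    return {k: grpo_state[k] - sft_state[k] for k in grpo_state}

def add_reasoning_vector(target_state, v_reason, alpha=1.0):
    """Add (or, with alpha=-1.0, subtract) the reasoning vector."""
    return {k: target_state[k] + alpha * v_reason[k].to(target_state[k].dtype)
            for k in target_state}

# Usage sketch (hypothetical model handles):
# v = extract_reasoning_vector(grpo_model.state_dict(), sft_model.state_dict())
# patched = add_reasoning_vector(instruct_model.state_dict(), v)
# instruct_model.load_state_dict(patched)
```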


DeepResearch Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks

  • Title: DeepResearch Arena: The First Exam of LLMs’ Research Abilities via Seminar-Grounded Tasks
  • Authors: Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou
  • Date: 2025-09-01
  • arXiv page: https://arxiv.org/abs/2509.01396
  • Paper link: https://arxiv.org/pdf/2509.01396

Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.


ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

Abstract

Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short-videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that positional encoding strategy contributes to alleviating SAH, and further adopt DPO strategy to enhance the model’s ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.


POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Abstract

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model’s conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
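
Stage two is an annotate-filter-retrain loop. In outline (function names and the filter examples are ours, not the paper's):

```python
def self_improve(model, real_docs, train_fn, filters, rounds=3):
    """Iterative adaptation to real documents (schematic outline).

    filters: callables that verify a predicted annotation, e.g. format
    validity or round-trip consistency checks (our examples; the paper
    applies its own suite of filtering strategies).
    """
    for _ in range(rounds):
        labeled = [(doc, model.annotate(doc)) for doc in real_docs]  # pseudo-label
        verified = [(d, y) for d, y in labeled
                    if all(f(d, y) for f in filters)]                # keep clean pairs
        model = train_fn(model, verified)                            # retrain on them
    return model
```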


Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.


Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

Abstract

The Transformer architecture, underpinned by the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, its core computational primitive scales quadratically with sequence length (O(N^2)), creating a significant bottleneck for processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling that exhibits linear complexity (O(N)) with respect to sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution to efficiently capture local, position-dependent context, and a parallel associative memory retrieval mechanism to model global, content-based patterns. These pathways are dynamically fused using a gating mechanism, allowing the model to flexibly combine local and global information for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets, establishing it as a promising and efficient alternative for sequence modeling.
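
The two pathways and the gate are concrete enough to sketch. A minimal PyTorch rendering of our reading of the abstract (the fixed memory-bank size and other hyperparameters are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMBlock(nn.Module):
    """Sketch of a gated associative memory block (our reading of the abstract).

    Both pathways are O(N) in sequence length: the causal convolution is
    local, and attending over a fixed bank of M learned memory slots costs
    O(N * M) rather than O(N^2)."""

    def __init__(self, d_model, kernel_size=4, n_slots=64):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.mem_keys = nn.Parameter(torch.randn(n_slots, d_model))
        self.mem_vals = nn.Parameter(torch.randn(n_slots, d_model))
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        # Local pathway: left-padded conv so position t only sees <= t.
        h = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        local = self.conv(h).transpose(1, 2)
        # Global pathway: content-based retrieval from the memory bank.
        attn = F.softmax(x @ self.mem_keys.t(), dim=-1)   # (batch, seq, slots)
        mem = attn @ self.mem_vals                        # (batch, seq, d_model)
        # Dynamic fusion: a sigmoid gate mixes the two pathways per token.
        g = torch.sigmoid(self.gate(torch.cat([local, mem], dim=-1)))
        return g * local + (1 - g) * mem
```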


Baichuan-M2: Scaling Medical Capability with Large Verifier System

  • Title: Baichuan-M2: Scaling Medical Capability with Large Verifier System
  • Authors: Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang
  • Date: 2025-09-02
  • arXiv page: https://arxiv.org/abs/2509.02208
  • Paper link: https://arxiv.org/pdf/2509.02208

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifier, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a score previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.


Kwai Keye-VL 1.5 Technical Report

  • Title: Kwai Keye-VL 1.5 Technical Report
  • Authors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
  • Date: 2025-09-01
  • arXiv page: https://arxiv.org/abs/2509.01563
  • Paper link: https://arxiv.org/pdf/2509.01563

Abstract

In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
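
The Slow-Fast allocation can be pictured with a toy router. A sketch under our own assumptions (cosine similarity on raw pixels and a fixed threshold; the paper's actual criterion and resolutions are not given in the abstract):

```python
import torch
import torch.nn.functional as F

def route_frames(frames, sim_threshold=0.9):
    """Slow-Fast routing sketch for a video clip.

    frames: (T, C, H, W). Frames that differ strongly from their predecessor
    go to the Slow pathway (full resolution); near-duplicates go to the Fast
    pathway (downsampled), trading resolution for temporal coverage."""
    flat = F.normalize(frames.flatten(1), dim=1)       # (T, C*H*W)
    sim = (flat[1:] * flat[:-1]).sum(dim=1)            # cosine sim to previous frame
    slow_mask = torch.cat([torch.tensor([True]), sim < sim_threshold])  # 1st is slow
    slow = frames[slow_mask]                           # keep full resolution
    fast = F.interpolate(frames[~slow_mask], scale_factor=0.5, mode="bilinear")
    return slow, fast
```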


OpenVision 2: A Generative Pretrained Visual Encoder for Multimodal Learning

Abstract

This paper provides a simplification on OpenVision’s architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model’s performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
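
The simplified objective is easy to state in code. A schematic of a caption-only training step (module interfaces are placeholders, not OpenVision 2's API): with the text encoder and contrastive loss removed, the visual encoder is trained purely through next-token prediction of the caption.

```python
import torch.nn.functional as F

def captioning_loss(vision_encoder, text_decoder, images, caption_ids):
    """Caption-only pretraining step (sketch; signatures are assumptions).

    No text encoder, no contrastive term: the gradient reaching the visual
    encoder comes only from autoregressive caption prediction."""
    visual_tokens = vision_encoder(images)                      # (B, N, D)
    logits = text_decoder(caption_ids[:, :-1], visual_tokens)   # teacher forcing
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids[:, 1:].reshape(-1))
```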


PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Abstract

Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to rollout in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
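
The anchor idea reduces to a one-line advantage. A schematic sketch (our reading of the abstract; the paper's exact combination with group statistics may differ): the reference model's rollout is scored once in advance, and policy samples are judged against that anchor rather than purely against each other.

```python
import numpy as np

def pvpo_advantages(policy_rewards, reference_reward):
    """Advantage against a pre-computed reference anchor (schematic).

    reference_reward: reward of a rollout generated in advance by a fixed
    reference model for the same query; using it as the baseline avoids
    relying solely on intra-group comparisons and their sampling cost."""
    return np.asarray(policy_rewards, dtype=np.float64) - reference_reward

# Usage sketch (hypothetical handles):
# anchor = reward_fn(reference_model.rollout(query))
# adv = pvpo_advantages([reward_fn(s) for s in policy_samples], anchor)
```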
