HuggingFace Daily AI Paper Digest - Episode List

2025.11.13 | Genshin Impact data forged into a 7B generalist AI; training-free trajectories become an instant video remote control

This episode covers the following 9 papers:
[00:19] 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[00:54] 🎬 Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
[01:31] ⚡ TiDAR: Think in Diffusion, Talk in Autoregression
[02:15] 🔄 LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
[02:51] 🤖 WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
[03:33] 🖥 WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
[04:19] 🎯 Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
[04:55] 🤖 Agentic Refactoring: An Empirical Study of AI Coding Agents
[05:31] 🛡 Stemming Hallucination in Language Models Using a Licensing Oracle
[Follow us] You can also find us on the following platforms for more information beyond the podcast. Xiaohongshu: AI速递. View this episode's transcript on Xiaoyuzhou.

6 minutes

2025.11.12 | A 1.5B small model overtakes a 671B large model; multi-agent quality checks for chatbots

This episode covers the following 9 papers:
[00:24] 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[00:59] 🤝 Adaptive Multi-Agent Response Refinement in Conversational Systems
[01:30] 🧩 Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
[02:17] ⚡ KLASS: KL-Guided Fast Inference in Masked Diffusion Models
[02:53] 🖥 Grounding Computer Use Agents on Human Demonstrations
[03:37] 🎥 VideoSSR: Video Self-Supervised Reinforcement Learning
[04:19] 🚪 The Path Not Taken: RLVR Provably Learns Off the Principals
[05:14] 🔗 BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives
[05:56] 🤹 Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

6 minutes

2025.11.11 | Small context windows with frequent summaries set a new bar for deep research; casting a wide net first, then tackling hard problems, unlocks competitive coding

This episode covers the following 13 papers:
[00:25] 🧩 IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
[01:16] 🏆 DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
[02:03] 🔬 The Station: An Open-World Environment for AI-Driven Discovery
[02:43] 🚀 RedOne 2.0: Rethinking Domain-specific LLM Post-Training in Social Networking Services
[03:15] 🧠 SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
[03:53] 🧭 Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
[04:30] 🔍 Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
[05:10] 🎬 MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
[05:50] 🎨 MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
[06:57] 🔄 RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
[07:36] 🤖 Robot Learning from a Physical World Model
[08:21] 🛠 NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling
[08:52] 🚀 SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

9 minutes

2025.11.10 | DeepEyesV2, a small model that writes code while looking at images; data alone gives AI stereoscopic vision

This episode covers the following 7 papers:
[00:21] 🧠 DeepEyesV2: Toward Agentic Multimodal Model
[01:13] 🧭 Visual Spatial Tuning
[01:54] 🦹 Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
[02:27] 🧠 Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
[03:13] 🪡 Jailbreaking in the Haystack
[03:48] 🎯 CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
[04:23] 🏃 Dense Motion Captioning

5 minutes

[Weekend Special] Hottest AI papers of the 2nd week of November | Video generation as reasoning; SVG sketches become code

This episode covers the following 5 papers:
[00:31] TOP1(🔥137) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[02:43] TOP2(🔥95) | 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[05:12] TOP3(🔥90) | 🚀 Diffusion Language Models are Super Data Learners
[07:18] TOP4(🔥88) | 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[09:24] TOP5(🔥79) | 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

12 minutes

2025.11.07 | A new paradigm for video reasoning; interacting with images to aid thinking

This episode covers the following 12 papers:
[00:21] 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[00:58] 🧠 V-Thinker: Interactive Thinking with Images
[01:39] 🧠 Scaling Agent Learning via Experience Synthesis
[02:23] 🧠 Cambrian-S: Towards Spatial Supersensing in Video
[03:06] 🖥 GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
[03:51] 📄 NVIDIA Nemotron Nano V2 VL (an efficient vision-language model for document and long-video understanding)
[04:28] 🎟 The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
[05:12] 🕵 Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
[05:48] ⚽ Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots
[06:18] 🔍 Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
[06:53] 🎧 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
[07:32] 🚀 RDMA Point-to-Point Communication for LLM Systems

8 minutes

2025.11.06 | Diffusion models save on data; lip-synced audio-video generation

This episode covers the following 9 papers:
[00:17] 🚀 Diffusion Language Models are Super Data Learners
[01:06] 🎬 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
[01:42] 🧩 LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
[02:25] 📊 Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
[03:15] 📊 TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
[03:46] 🦾 Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
[04:30] 🧠 MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
[05:06] 📈 LiveTradeBench: Seeking Real-World Alpha with Large Language Models
[05:55] 🔍 Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

7 minutes

2025.11.05 | Vector sketches as a coding test; drawing before thinking to fill in vision

This episode covers the following 15 papers:
[00:21] 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[01:12] 🧠 When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
[01:48] ⚖ When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
[02:36] 🪙 Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
[03:11] 🧠 Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
[03:49] 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[04:33] 🎨 LTD-Bench: Evaluating Large Language Models by Letting Them Draw
[05:15] 🤖 TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
[06:01] 🗜 Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
[06:46] 🏆 CodeClash: Benchmarking Goal-Oriented Software Engineering
[07:29] 🎭 VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
[08:03] 🧠 BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring
[08:42] 📊 ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
[09:45] 📊 TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
[10:17] 🤖 iFlyBot-VLA Technical Report (a new framework for large-scale vision-language-action models)

11 minutes

2025.11.04 | Ultra-sparse MoE activates a trillion parameters; vision models beat GNNs by looking at graphs

This episode covers the following 15 papers:
[00:23] 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
[01:03] 👁 The Underappreciated Power of Vision Models for Graph Structural Understanding
[01:38] 💡 UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
[02:37] 🕸 Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
[03:11] 🤖 PHUMA: Physically-Grounded Humanoid Locomotion Dataset
[03:48] 🔭 ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
[04:30] 🧠 UniREditBench: A Unified Reasoning-based Image Editing Benchmark
[05:23] 🔄 ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
[06:04] 🌍 Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
[06:44] 🌍 World Simulation with Video Foundation Models for Physical AI
[07:20] 🧠 TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
[08:03] 🧭 NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
[08:45] 📏 Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
[09:23] 🧭 Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
[10:07] 🐱 LongCat-Flash-Omni Technical Report (a 560-billion-parameter open-source omni-modal model for real-time audio-visual interaction)

11 minutes

2025.11.03 | OS-Sentinel guards mobile operation safety in real time; ThinkMorph lets small models draw while they think

This episode covers the following 15 papers:
[00:21] 🛡 OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
[01:13] 🧠 ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
[01:49] ⚔ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
[02:38] 🤖 $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
[03:26] 🚀 Continuous Autoregressive Language Models
[03:54] 🧭 Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
[04:37] 🎯 HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
[05:15] 🎯 Defeating the Training-Inference Mismatch via FP16
[05:52] 🪜 Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
[06:28] 🧭 Revisiting Multimodal Positional Encoding in Vision-Language Models
[07:09] ⚡ Higher-order Linear Attention
[07:55] 🌐 Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
[08:36] 🔬 The Denario project: Deep knowledge AI agents for scientific discovery
[09:14] 🎯 Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
[09:51] 🏙 Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

11 minutes

[Month-End Special] Hottest AI papers of October | The Dragon Hatchling (BDH) is sparse and interpretable; a 7M-parameter mini recursive model crushes large models

This episode covers the following 10 papers:
[00:30] TOP1(🔥522) | 🐣 The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
[02:31] TOP2(🔥462) | 🧠 Less is More: Recursive Reasoning with Tiny Networks
[04:48] TOP3(🔥255) | 🌱 Agent Learning via Early Experience
[07:04] TOP4(🔥182) | 🔄 Scaling Latent Reasoning via Looped Language Models
[09:11] TOP5(🔥170) | 🔥 MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
[11:18] TOP6(🔥169) | 🚀 QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
[13:10] TOP7(🔥167) | 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
[15:38] TOP8(🔥160) | 🧠 Diffusion Transformers with Representation Autoencoders
[17:59] TOP9(🔥144) | 🧠 A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
[20:09] TOP10(🔥142) | 🎯 Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

22 minutes