Episode list: HuggingFace Daily AI Paper Digest (HuggingFace 每日AI论文速递) - EarsOnMe

2025.11.14 | UniVA, a four-in-one open-source video generalist; Depth Anything 3 handles 3D with a single ViT

The 4 papers in this episode:
[00:24] 🎬 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
[00:59] 🌐 Depth Anything 3: Recovering the Visual Space from Any Views
[01:50] 🔍 AlphaResearch: Accelerating New Algorithm Discovery with Language Models
[02:21] 🔍 MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
Follow us: you can also find us on the following platform for more information beyond the podcast itself. Xiaohongshu: AI速递

3 min

2025.11.13 | Genshin Impact data forged into a 7B generalist AI; training-free trajectories become an instant video remote control

The 9 papers in this episode:
[00:19] 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[00:54] 🎬 Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
[01:31] ⚡ TiDAR: Think in Diffusion, Talk in Autoregression
[02:15] 🔄 LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
[02:51] 🤖 WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
[03:33] 🖥 WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
[04:19] 🎯 Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
[04:55] 🤖 Agentic Refactoring: An Empirical Study of AI Coding Agents
[05:31] 🛡 Stemming Hallucination in Language Models Using a Licensing Oracle

6 min

2025.11.12 | A 1.5B small model overtakes a 671B large model; multi-agent quality control for chatbots

The 9 papers in this episode:
[00:24] 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[00:59] 🤝 Adaptive Multi-Agent Response Refinement in Conversational Systems
[01:30] 🧩 Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
[02:17] ⚡ KLASS: KL-Guided Fast Inference in Masked Diffusion Models
[02:53] 🖥 Grounding Computer Use Agents on Human Demonstrations
[03:37] 🎥 VideoSSR: Video Self-Supervised Reinforcement Learning
[04:19] 🚪 The Path Not Taken: RLVR Provably Learns Off the Principals
[05:14] 🔗 BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives
[05:56] 🤹 Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

6 min

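2025.11.11 | Small windows with frequent summaries set a new mark for deep research; casting a wide net first, then tackling hard problems, unlocks competitive coding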

The 13 papers in this episode:
[00:25] 🧩 IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
[01:16] 🏆 DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
[02:03] 🔬 The Station: An Open-World Environment for AI-Driven Discovery
[02:43] 🚀 RedOne 2.0: Rethinking Domain-specific LLM Post-Training in Social Networking Services
[03:15] 🧠 SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
[03:53] 🧭 Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs
[04:30] 🔍 Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
[05:10] 🎬 MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
[05:50] 🎨 MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
[06:57] 🔄 RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
[07:36] 🤖 Robot Learning from a Physical World Model
[08:21] 🛠 NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling
[08:52] 🚀 SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

9 min

2025.11.10 | DeepEyesV2, a small model that writes code while looking at images; data alone gives AI spatial vision

The 7 papers in this episode:
[00:21] 🧠 DeepEyesV2: Toward Agentic Multimodal Model
[01:13] 🧭 Visual Spatial Tuning
[01:54] 🦹 Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
[02:27] 🧠 Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
[03:13] 🪡 Jailbreaking in the Haystack
[03:48] 🎯 CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
[04:23] 🏃 Dense Motion Captioning

5 min

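[Weekend Special] Hottest AI papers of the 4th week of October | Internal probability + voting with tail pruning: RPC saves samples and improves accuracy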

The 5 papers in this episode:
[00:29] TOP1(🔥135) | 🧠 A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
[03:02] TOP2(🔥104) | 🚀 Efficient Long-context Language Model Training by Core Attention Disaggregation
[05:29] TOP3(🔥100) | 🧠 LightMem: Lightweight and Efficient Memory-Augmented Generation
[07:33] TOP4(🔥90) | 🧠 Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
[10:18] TOP5(🔥79) | 🤖 DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

13 min

2025.10.27 | DeepAgent: single-pass reasoning + ToolPO; a Video-As-Prompt DiT controls a hundred kinds of semantics in seconds

The 15 papers in this episode:
[00:27] 🧠 DeepAgent: A General Reasoning Agent with Scalable Toolsets
[01:01] 🎬 Video-As-Prompt: Unified Semantic Control for Video Generation
[01:35] 🔧 From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
[02:14] 🧩 Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
[02:51] 🧠 A Definition of AGI
[03:23] 🧩 Sparser Block-Sparse Attention via Token Permutation
[04:14] 🧭 UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
[04:57] 🧠 Reasoning with Sampling: Your Base Model is Smarter Than You Think
[05:30] 🧠 RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
[06:08] 📐 Visual Diffusion Models are Geometric Solvers
[06:56] 🌍 WorldGrow: Generating Infinite 3D World
[07:35] 🎬 RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
[08:14] 🔗 Model Merging with Functional Dual Anchors
[08:49] 🧭 Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
[09:34] 📊 Document Understanding, Measurement, and Manipulation Using Category Theory

10 min

[Weekend Special] Hottest AI papers of the 2nd week of November | Video generation as reasoning; SVG sketches become code

The 5 papers in this episode:
[00:31] TOP1(🔥137) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[02:43] TOP2(🔥95) | 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[05:12] TOP3(🔥90) | 🚀 Diffusion Language Models are Super Data Learners
[07:18] TOP4(🔥88) | 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[09:24] TOP5(🔥79) | 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

12 min

2025.10.24 | AdaSPEC picks 40% of tokens for a 20% speedup; AutoPage generates interactive web pages for 10 cents

The 15 papers in this episode:
[00:23] 🎯 AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
[00:57] 🤖 Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
[01:35] 🔍 Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
[02:06] 🎬 HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
[02:52] 🌀 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
[03:33] 💎 Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
[04:06] ⚖ The Massive Legal Embedding Benchmark (MLEB)
[04:48] 🔍 DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
[05:33] 🕵 Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
[06:12] 🤖 Search Self-play: Pushing the Frontier of Agent Capability without Supervision
[06:56] 🎭 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations
[07:42] 🖼 LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
[08:10] 🎧 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
[08:51] 🖼 ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
[09:39] 🧩 Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets

10 min

2025.11.07 | A new paradigm for video reasoning; interacting with images to aid thinking

The 12 papers in this episode:
[00:21] 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[00:58] 🧠 V-Thinker: Interactive Thinking with Images
[01:39] 🧠 Scaling Agent Learning via Experience Synthesis
[02:23] 🧠 Cambrian-S: Towards Spatial Supersensing in Video
[03:06] 🖥 GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
[03:51] 📄 NVIDIA Nemotron Nano V2 VL
[04:28] 🎟 The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
[05:12] 🕵 Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
[05:48] ⚽ Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots
[06:18] 🔍 Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
[06:53] 🎧 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
[07:32] 🚀 RDMA Point-to-Point Communication for LLM Systems

8 min

2025.11.06 | Diffusion models save data; lip-synced audio-video generation

The 9 papers in this episode:
[00:17] 🚀 Diffusion Language Models are Super Data Learners
[01:06] 🎬 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
[01:42] 🧩 LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
[02:25] 📊 Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
[03:15] 📊 TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
[03:46] 🦾 Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
[04:30] 🧠 MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
[05:06] 📈 LiveTradeBench: Seeking Real-World Alpha with Large Language Models
[05:55] 🔍 Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

7 min
