HuggingFace Daily AI Papers - Episode List

2025.12.30 | ERC couples routers and experts; LiveTalk real-time video dialogue

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:24] 🔗 Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
[01:07] 🎬 LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
[01:55] 🌍 Yume-1.5: A Text-Controlled Interactive World Generation Model
[02:30] 🔍 SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
[02:59] 🔮 Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
[03:40] 🎯 SpotEdit: Selective Region Editing in Diffusion Transformers
[04:23] 🚀 Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
[05:09] 🔍 GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
[05:56] 🤖 Act2Goal: From World Model To General Goal-conditioned Policy
[06:31] ⚡ Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
[06:59] 🌐 Web World Models
[07:34] 🚀 DiRL: An Efficient Post-Training Framework for Diffusion Language Models
[08:19] 🎬 Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
[09:02] 🧠 Training AI Co-Scientists Using Rubric Rewards
[09:39] 🧩 Monadic Context Engineering

[Follow us] You can also find us on the following platforms for more content beyond the podcast. Xiaohongshu: AI速递

10 min

45

4 months ago

2025.12.29 | Bird's-eye retrieval boosts small models; 4D diffusion inserts realistic objects in one click

HuggingFace Daily AI Papers

This episode covers the following 13 papers:

[00:27] 🧠 Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
[01:07] 🎬 InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
[01:46] 🤖 MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
[02:22] 👁 UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
[03:04] 🎨 ProEdit: Inversion-based Editing From Prompts Done Right
[03:58] ⏱ TimeBill: Time-Budgeted Inference for Large Language Models
[04:37] 🧠 See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
[05:16] 🌦 Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
[05:48] 🧠 SVBench: Evaluation of Video Generation Models on Social Reasoning
[06:27] 🔍 InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
[07:15] 🎨 SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
[08:11] 🤖 SWE-RM: Execution-free Feedback For Software Engineering Agents
[08:48] ⚡ A 58-Addition, Rank-23 Scheme for General 3x3 Matrix Multiplication

9 min

88

4 months ago

2025.12.25 | 4D dynamic understanding refreshes VLMs; 200x faster HD video generation on a single GPU

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:20] 🧠 Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
[01:11] ⚡ TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
[01:52] 🧭 T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
[02:38] 🎬 DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
[03:21] 🔍 Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
[04:07] 🎬 HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
[04:52] 🚀 Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
[05:38] 🔍 TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
[06:12] 🚀 NVIDIA Nemotron 3: Efficient and Open Intelligence
[06:57] 🎬 Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
[07:27] 🎬 Streaming Video Instruction Tuning
[08:02] 🧠 Multi-hop Reasoning via Early Knowledge Alignment
[08:43] 📊 SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
[09:24] 🏆 LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

10 min

82

4 months ago

2025.12.24 | Semantic blueprints speed up video generation; layer-by-layer analysis distills strong policies

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:19] 🎬 SemanticGen: Video Generation in Semantic Space
[01:01] 🔍 Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
[01:48] 🧠 SpatialTree: How Spatial Abilities Branch Out in MLLMs
[02:23] 🤖 LongVideoAgent: Multi-Agent Reasoning with Long Videos
[03:06] 🧠 MemEvolve: Meta-Evolution of Agent Memory Systems
[03:46] 🔍 Step-DeepResearch Technical Report
[04:22] 🎧 SAM Audio: Segment Anything in Audio
[05:00] 🚀 INTELLECT-3: Technical Report
[05:30] 🔍 FaithLens: Detecting and Explaining Faithfulness Hallucination
[06:07] 🧠 Reinforcement Learning for Self-Improving Agent with Skill Library
[06:53] 📊 QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
[07:38] 🔊 Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
[08:18] 🧠 Active Intelligence in Video Avatars via Closed-loop World Modeling
[08:55] 🔬 Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
[09:32] ⚠ Toxicity Ahead: Forecasting Conversational Derailment on GitHub

10 min

99+

4 months ago

2025.12.23 | Data factories boost efficiency; the Prism Hypothesis unifies representations

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:22] ⚙ DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
[01:04] 🔍 The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
[01:50] 🎬 Region-Constraint In-Context Generation for Instructional Video Editing
[02:33] 🎥 Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
[03:08] 🔍 QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
[03:58] 🤔 Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
[04:35] 🧭 LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
[05:13] 🎬 WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
[06:08] 🔍 UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
[06:45] 🧬 GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
[07:22] 🎨 Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
[07:56] ⚡ LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
[08:38] 📱 MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
[09:20] ⚖ Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
[10:00] 🎬 StoryMem: Multi-shot Long Video Storytelling with Memory

10 min

99+

4 months ago

2025.12.22 | PhysBrain teaches AI hands-on skills from egocentric video; LLMs remain far from scientist-level AI

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:24] 🧠 PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
[01:05] 🔬 Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
[01:34] 🧠 When Reasoning Meets Its Laws
[02:16] 🧠 Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
[03:02] 🧠 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
[03:51] 🎨 Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
[04:30] ⚖ Are We on the Right Way to Assessing LLM-as-a-Judge?
[05:05] 📡 RadarGen: Automotive Radar Point Cloud Generation from Cameras
[05:54] 🔬 Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
[06:41] 🎬 HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
[07:26] 🔍 GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
[08:06] ⚙ SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
[08:39] 🧠 Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
[09:14] ⚡ StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
[09:48] 🤖 An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

11 min

99+

4 months ago

2025.12.19 | Kling-Omni unifies video generation; LLaDA2.0 scales diffusion LMs to 100B

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:26] 🎬 Kling-Omni Technical Report
[01:02] 🚀 LLaDA2.0: Scaling Up Diffusion Language Models to 100B
[01:41] 🔮 Next-Embedding Prediction Makes Strong Vision Learners
[02:27] 👓 StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
[02:58] 🎬 Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
[03:34] 🔭 Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
[04:11] 📸 Generative Refocusing: Flexible Defocus Control from a Single Image
[04:56] 🤖 Adaptation of Agentic AI
[05:36] ⚗ Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
[06:12] 🛡 DeContext as Defense: Safe Image Editing in Diffusion Transformers
[06:58] 🧭 N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
[07:49] 🎨 The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
[08:30] 🔧 AdaTooler-V: Adaptive Tool-Use for Images and Videos
[09:19] 🤔 Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

10 min

99+

4 months ago

2025.12.18 | Calibrated step rewards cut costs; diffusion drafts verified autoregressively speed up decoding

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:25] 🤖 Step-GUI Technical Report
[00:59] ⚡ DEER: Draft with Diffusion, Verify with Autoregressive Models
[01:31] ⚡ Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
[02:10] 🚀 HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
[02:48] 🎬 IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
[03:30] 🔍 Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
[04:03] 🧠 Universal Reasoning Model
[04:45] 🔍 Robust and Calibrated Detection of Authentic Multimedia Content
[05:33] 🧭 Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
[06:14] 🌍 FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
[06:54] 📊 MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
[07:47] 🔄 DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
[08:24] 🧠 SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
[09:02] 🎬 End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

10 min

99+

4 months ago
