HuggingFace Daily AI Paper Digest (HuggingFace 每日AI论文速递) - Episode List

[Weekend Special] Hottest AI Papers of Week 1 of April | FIPO breaks the reasoning-length bottleneck; CARLA-Air unifies air and ground simulation

[Sponsor] Listen to AI每周谈 (AI Weekly Talk) on your commute; each week it recaps the previous week's biggest AI news. Link: 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd
[Contents] The 5 papers in this episode:
[00:40] TOP1(🔥309) | 🧠 FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
[02:58] TOP2(🔥302) | 🚁 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
[05:23] TOP3(🔥170) | 🛡 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
[07:56] TOP4(🔥151) | 🎬 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
[10:17] TOP5(🔥147) | 🧠 Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
[Follow us] You can also find us on the following platform for more beyond the podcast itself: Xiaohongshu (小红书): AI速递

2026.04.03 | DataFlex makes data work like Lego; latent space becomes AI's new map

[Contents] The 15 papers in this episode:
[00:41] 🔄 DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
[01:48] 🧠 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
[02:45] 🧠 SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
[03:22] 🎮 Generative World Renderer
[04:09] 👁 EgoSim: Egocentric World Simulator for Embodied Interaction Generation
[05:24] 🧠 LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
[06:06] 🧠 Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
[06:47] 🚗 UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
[07:35] 🎯 Steerable Visual Representations
[08:12] 🎬 VOID: Video Object and Interaction Deletion
[09:06] 🤖 Investigating Autonomous Agent Contributions in the Wild: Activity Patterns and Code Change over Time
[09:47] 🚀 ASI-Evolve: AI Accelerates AI
[10:50] 🎭 Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
[11:36] 🤖 GPA: Learning GUI Process Automation from Demonstrations
[12:24] 🔍 VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

2026.04.02 | ClawKeeper guards agent safety with three layers; terminal agents with lightweight APIs come out on top

[Contents] The 15 papers in this episode:
[00:27] 🛡 ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers
[01:20] 💻 Terminal Agents Suffice for Enterprise Automation
[02:03] 📊 MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
[02:54] 🧠 ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
[03:40] 🔬 Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
[04:26] 📊 QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
[05:12] 🧠 Reasoning Shift: How Context Silently Shortens LLM Reasoning
[05:59] 📊 HippoCamp: Benchmarking Contextual Agents on Personal Computers
[06:52] 🧠 PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
[07:34] ⚡ Universal YOCO for Efficient Depth Scaling
[08:12] 🔄 Brevity Constraints Reverse Performance Hierarchies in Language Models
[08:48] 🧠 GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
[09:25] 📝 Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
[10:11] 🚀 Embarrassingly Simple Self-Distillation Improves Code Generation
[10:54] 🤖 Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

2026.04.01 | FIPO uses KL to elicit deep reasoning; LongCat unifies multimodal tokens

[Contents] The 15 papers in this episode:
[00:30] 🧠 FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
[01:12] 🧩 LongCat-Next: Lexicalizing Modalities as Discrete Tokens
[01:48] 🚁 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
[02:31] 🧬 Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells
[03:33] 🤖 GEMS: Agent-Native Multimodal Generation with Memory and Skills
[04:12] 🎬 VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
[05:04] 🤖 Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
[05:45] 🔬 daVinci-LLM: Towards the Science of Pretraining
[06:19] 🎬 CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
[07:10] 🔍 MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
[07:58] 🧬 FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
[08:46] 🏙 Extend3D: Town-Scale 3D Generation
[09:28] 💭 Think Anywhere in Code Generation
[10:18] ⚙ OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
[11:03] 🎨 VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing

2026.03.31 | TAPS gains efficiency through task alignment; autonomous AI research writes medical papers

[Contents] The 15 papers in this episode:
[00:30] 🚀 TAPS: Task Aware Proposal Distributions for Speculative Sampling
[01:11] 🔬 Towards a Medical AI Scientist
[02:03] 🔍 Gen-Searcher: Reinforcing Agentic Search for Image Generation
[02:43] ⚠ Emergent Social Intelligence Risks in Generative Multi-Agent Systems
[03:22] ⚙ EpochX: Building the Infrastructure for an Emergent Agent Civilization
[04:01] 📊 GEditBench v2: A Human-Aligned Benchmark for General Image Editing
[05:00] 🧠 On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
[05:56] 🔬 PRBench: End-to-end Paper Reproduction in Physics Research
[06:37] 🧠 Make Geometry Matter for Spatial Reasoning
[07:28] 🖼 ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
[08:18] 🎨 On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
[09:11] 🧠 MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
[09:55] ⚡ Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
[10:55] 🎯 ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
[12:07] 🔍 Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

2026.03.30 | ShotStream streams multi-shot generation; PackForcing trains on short videos to produce long ones

[Contents] The 10 papers in this episode:
[00:28] 🎬 ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
[01:07] 🎬 PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
[01:54] 🧠 Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
[02:43] 📊 RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
[03:53] 🚗 LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
[04:42] 🧠 Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
[05:25] 🛠 Natural-Language Agent Harnesses
[06:10] 🎤 Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
[06:59] 🔬 MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies
[07:46] 🚀 Diffutron: A Masked Diffusion Language Model for Turkish Language

[Weekend Special] Hottest AI Papers of Week 5 of March | Diffusion OCR as inverse rendering; a major interaction test for world models

[Contents] The 5 papers in this episode:
[00:49] TOP1(🔥124) | 🔍 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
[03:11] TOP2(🔥122) | 🧪 Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
[05:47] TOP3(🔥114) | 🚀 Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
[07:54] TOP4(🔥104) | 🎬 Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
[10:09] TOP5(🔥104) | 🔗 HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

2026.03.27 | PixelSmile fine-tunes facial expressions; a trillion-scale model takes on all of science

[Contents] The 15 papers in this episode:
[00:35] 😊 PixelSmile: Toward Fine-Grained Facial Expression Editing
[01:27] 🚀 Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
[02:10] 🖼 RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models
[02:52] 🖼 MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
[03:42] ⚙ Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
[04:25] 🗣 Voxtral TTS (an expressive multilingual text-to-speech model built on a hybrid architecture)
[05:03] 📉 SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
[05:49] 🧠 MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
[06:39] 🎬 AVControl: Efficient Framework for Training Audio-Visual Controls
[07:23] 🎨 Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
[08:10] 🔍 MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
[09:12] 🔍 Representation Alignment for Just Image Transformers is not Easier than You Think
[10:06] ⚡ S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
[10:46] 📊 FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
[11:35] 🔬 BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment

2026.03.26 | CUA-Suite collects 6 million frames of operation video; EVA's three-stage training cuts tokens by 70%

[Contents] The 15 papers in this episode:
[00:27] 🎬 CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
[01:24] 🎬 EVA: Efficient Reinforcement Learning for End-to-End Video Agent
[02:05] 🛡 T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search
[02:50] 🤖 UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
[03:33] 🤔 Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
[04:20] 🎮 GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
[05:13] 🧠 When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
[06:11] 🤖 CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
[07:13] 🌀 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
[07:54] 🎬 OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
[08:38] 🚗 Toward Physically Consistent Driving Video World Models under Challenging Trajectories
[09:18] 📊 Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments
[10:10] 🧠 Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
[10:53] 🤖 StreamingClaw Technical Report
[11:30] 🔍 LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

2026.03.25 | Diffusion OCR denoises in parallel; the WildWorld action dataset tests AI consistency

[Contents] The 15 papers in this episode:
[00:29] 🔍 MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
[01:18] 🎮 WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
[02:10] ⚡ SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
[02:59] 🎥 PEARL: Personalized Streaming Video Understanding Model
[03:46] 🔍 DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
[04:30] 📊 From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents
[05:13] 🤖 SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
[05:52] 🧠 UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
[06:45] 🎬 RealMaster: Lifting Rendered Scenes into Photorealistic Video
[07:32] 🤖 2Xplat: Two Experts Are Better Than One Generalist
[08:15] 🔍 Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
[09:03] 👁 Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
[09:57] 🎯 VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
[10:48] 🧠 ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
[11:40] 🤖 AgentSLR: Automating Systematic Literature Reviews in Epidemiology with Agentic AI

2026.03.24 | Interaction evaluation exposes world-model weak spots; a single-stream architecture for ultra-fast generation

[Contents] The 15 papers in this episode:
[00:32] 🧪 Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
[01:13] 🚀 Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
[01:55] 🧠 LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
[02:42] 🔍 VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
[03:30] 🧠 SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
[04:10] 🎯 F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
[05:03] 🎬 Manifold-Aware Exploration for Reinforcement Learning in Video Generation
[05:56] ⚖ mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT
[06:46] 🧠 Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
[07:35] 🔄 Repurposing Geometric Foundation Models for Multi-view Diffusion
[08:21] 🤖 RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
[09:15] 🔍 OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
[10:02] 💭 BubbleRAG: Evidence-Driven Retrieval-Augmented Generation for Black-Box Knowledge Graphs
[10:54] ⚖ SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
[11:43] 🧭 On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation