HuggingFace 每日AI论文速递 - Episode List

2026.03.23 | Multi-hop synthesis boosts reasoning; forward-process RL speeds up video

HuggingFace 每日AI论文速递

[Sponsor] Listen to AI每周谈 on your commute: a weekly recap of the past week's major AI news. Link: 🔗https://www.xiaoyuzhoufm.com/podcast/688a34636f5a275f1cba40fd

[Contents] The 15 papers in this episode:
[00:31] 🔗 HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
[01:28] 🎬 Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
[02:06] 🛰 TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
[02:56] 🔍 ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
[03:45] 🎬 LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
[04:50] 🏠 FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
[05:35] 🧠 The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
[06:20] 🎯 A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
[07:02] 🔍 How Well Does Generative Recommendation Generalize?
[07:48] 🌍 WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
[08:24] ⚡ BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
[09:05] 🚀 Hyperagents (self-editing, metacognitive, self-improving agents)
[09:54] 🎬 HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
[10:37] 🎬 EgoForge: Goal-Directed Egocentric World Simulator
[11:50] 🎬 Versatile Editing of Video Content, Actions, and Dynamics without Training

[Follow us] You can also find us on the following platforms for more than just the podcast. Xiaohongshu: AI速递

13 min · 99+ · 1 month ago

2026.03.20 | Generative models unlock 3D spatial understanding; SAMA's zero-shot instruction editing matches Kling

[Contents] The 15 papers in this episode:
[00:29] 🧠 Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
[01:09] 🎬 SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
[01:45] ⚡ FASTER: Rethinking Real-Time Flow VLAs
[02:30] 🎬 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
[03:31] 🤖 Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
[04:21] 🤖 MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
[05:13] 🧩 Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
[05:47] 📊 LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
[06:42] 🧠 Memento-Skills: Let Agents Design Agents
[07:18] 🌍 F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
[08:00] 🧠 Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
[08:54] 🧠 Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
[09:45] 🎬 EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
[10:58] 🔧 VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
[11:39] 🗣 MOSS-TTS Technical Report

12 min · 99+ · 1 month ago

2026.03.19 | Chains of events rehearse video futures; online-evolving models keep pace

[Contents] The 15 papers in this episode:
[00:30] 🔮 Video-CoE: Reinforcing Video Event Prediction via Chain of Events
[01:13] 🧬 MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
[02:01] 🧠 MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
[02:55] ⚖ Alignment Makes Language Models Normative, Not Descriptive
[03:42] 🧠 Complementary Reinforcement Learning
[04:33] 🤖 Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
[05:24] 🤖 GigaWorld-Policy: An Efficient Action-Centered World--Action Model
[06:07] 🎬 Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
[06:54] 🤖 When AI Navigates the Fog of War (a temporal case study of the early stages of the 2026 Middle East conflict)
[07:49] 🧩 LoST: Level of Semantics Tokenization for 3D Shapes
[08:21] 🧠 BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
[09:09] 🧠 ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
[09:47] 🤖 Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting
[10:46] 🎥 Stereo World Model: Camera-Guided Stereo Video Generation
[11:32] 🧠 AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

12 min · 99+ · 1 month ago

2026.03.18 | Verify-and-refine agents break new ground; an industrial code model passes on the first try

[Contents] The 15 papers in this episode:
[00:29] 🤖 MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
[01:10] 🏭 InCoder-32B: Code Foundation Model for Industrial Scenarios
[02:08] 🧠 Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
[02:50] 🤖 Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
[03:28] 🧠 Demystifing Video Reasoning
[04:26] 🎮 WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
[05:26] 🧠 TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
[06:12] 🤔 Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
[07:02] 🔄 Online Experiential Learning for Language Models
[07:54] 📊 FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
[08:47] 🚀 Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
[09:30] 🧭 WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
[10:20] 🔍 AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
[11:03] 🎨 SegviGen: Repurposing 3D Generative Model for Part Segmentation
[11:59] 🗣 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

13 min · 99+ · 1 month ago

2026.03.17 | AI learns scientific taste; open-source data breaks the search monopoly

[Contents] The 15 papers in this episode:
[00:29] 🧠 AI Can Learn Scientific Taste
[01:13] 🔍 OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
[02:06] 🏢 EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
[03:00] 🌆 Grounding World Simulation Models in a Real-World Metropolis
[03:53] 🤖 HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
[04:39] 🧠 Attention Residuals
[05:38] 🧠 Mixture-of-Depths Attention
[06:44] 🧠 Effective Distillation to Hybrid xLSTM Architectures
[07:23] 🔍 Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
[08:14] 🎬 ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
[08:54] 🚀 POLCA: Stochastic Generative Optimization with LLM
[10:00] 🤖 Safe and Scalable Web Agent Learning via Recreated Websites
[10:45] 🔍 Make it SING: Analyzing Semantic Invariants in Classifiers
[11:28] ⏱ TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
[12:30] 🎬 WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

13 min · 99+ · 1 month ago

2026.03.16 | LMEB fills the blind spot in long-memory evaluation; Cheers decouples semantics from detail for unified multimodality

[Contents] The 15 papers in this episode:
[00:28] 🧠 LMEB: Long-horizon Memory Embedding Benchmark
[01:12] 🔄 Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
[01:59] 🐳 daVinci-Env: Open SWE Environment Synthesis at Scale
[02:46] 🔍 Can Vision-Language Models Solve the Shell Game?
[03:26] ⚡ OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
[04:14] 🎯 Visual-ERM: Reward Modeling for Visual Equivalence
[05:11] 🔍 MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
[06:18] 🌉 V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration
[07:05] 🔍 Multimodal OCR: Parse Anything from Documents
[07:49] 🧠 Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
[08:22] ⚠ HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
[09:13] 🔍 From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
[09:59] ⚡ HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration
[11:04] 🧠 Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
[11:54] 🎬 VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

13 min · 99+ · 1 month ago

2026.03.13 | A 2B model with streaming spatial memory punches above its weight; AI's brute-force page flipping loses to human strategy

[Contents] The 15 papers in this episode:
[00:32] 🧠 Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
[01:17] 🤔 Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
[02:11] ⚡ IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
[02:54] 🎬 Video-Based Reward Modeling for Computer-Use Agents
[03:55] 🎬 DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
[04:46] 🎯 Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
[05:40] 🎬 DVD: Deterministic Video Depth Estimation with Generative Priors
[06:29] 🖼 WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
[07:29] 🎬 ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
[08:24] 🧠 GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
[09:08] 🎬 EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
[09:55] ⚡ One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
[10:46] 🤖 OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
[11:29] 🧠 EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
[12:37] 🧠 XSkill: Continual Learning from Experience and Skills in Multimodal Agents

13 min · 99+ · 1 month ago

2026.03.12 | Agents that train as you chat; GPUs solve hundred-million-scale K-means in seconds

[Contents] The 15 papers in this episode:
[00:29] 🤖 OpenClaw-RL: Train Any Agent Simply by Talking
[01:17] ⚡ Flash-KMeans: Fast and Memory-Efficient Exact K-Means
[02:01] 👁 MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
[02:43] 🧠 In-Context Reinforcement Learning for Tool Use in Large Language Models
[03:19] 🧠 ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning
[04:10] 📊 Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams
[05:00] 🧠 RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
[05:50] 🔬 CodePercept: Code-Grounded Visual STEM Perception for MLLMs
[06:44] 🎯 Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models
[07:31] 🧠 LLM2Vec-Gen: Generative Embeddings from Large Language Models
[08:22] ⚖ $V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts
[09:05] ⚡ Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
[09:47] 🧠 Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
[10:39] 💬 RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
[11:14] 🧠 Hindsight Credit Assignment for Long-Horizon LLM Agents

12 min · 99+ · 1 month ago

2026.03.11 | Geometry-reinforced 3D editing; masked diffusion goes multimodal

[Contents] The 15 papers in this episode:
[00:32] 🎨 Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
[01:11] 🔄 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
[02:06] 🧠 Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
[02:55] 🚀 MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
[03:41] 🧠 InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
[04:34] 🏸 Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
[05:15] 🔍 Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
[06:01] 🗣 Fish Audio S2 Technical Report
[06:48] 🎧 Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
[07:45] 📱 MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
[08:48] 🔍 VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
[09:34] 🗣 Do What I Say: A Spoken Prompt Dataset for Instruction-Following
[10:20] 🎬 Streaming Autoregressive Video Generation via Diagonal Distillation
[11:08] 🧪 Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications
[11:58] ⚖ Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

13 min · 99+ · 1 month ago

2026.03.10 | Scanning long stories for consistency bugs; 3D spatial intelligence annotation with zero human labeling

[Contents] The 15 papers in this episode:
[00:32] 📖 Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
[01:16] 🧠 Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
[02:17] 📈 How Far Can Unsupervised RLVR Scale LLM Training?
[03:11] 📊 Believe Your Model: Distribution-Guided Confidence Calibration
[04:12] 🧠 LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
[05:07] 🎨 CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
[05:51] 💻 CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
[06:30] 🎬 HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
[07:36] 📊 $OneMillion-Bench: How Far are Language Agents from Human Experts?
[08:24] ⚡ NLE: Non-autoregressive LLM-based ASR by Transcript Editing
[09:17] 🧠 Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
[10:03] 🚀 TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
[11:02] 📈 Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training
[11:40] 🤖 Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces
[12:36] 🔍 PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

13 min · 99+ · 1 month ago
