This episode covers the following 5 papers:
[00:51] TOP1(🔥161) | ⚡ ROOT: Robust Orthogonalized Optimizer for Neural Network Training
[02:35] TOP2(🔥141) | 🧠 General Agentic Memory Via Deep Research
[04:33] TOP3(🔥110) | 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[06:56] TOP4(🔥88) | 🎯 SAM 3: Segment Anything with Concepts
[09:12] TOP5(🔥88) | 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
This episode covers the following 6 papers:
[00:19] 🎬 Video Generation Models Are Good Latent Reward Models
[01:07] 🎨 Canvas-to-Image: Compositional Image Generation with Multimodal Controls
[01:49] 🎨 MIRA: Multimodal Iterative Reasoning Agent for Image Editing
[02:30] 📊 Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
[03:12] 🧠 What does it mean to understand language?
[03:47] 🧠 Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
This episode covers the following 15 papers:
[00:22] 🔍 Multimodal Evaluation of Russian-language Architectures
[01:15] 🧠 Latent Collaboration in Multi-Agent Systems
[01:47] 🌍 Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
[02:18] 🎭 Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
[03:10] 📄 NVIDIA Nemotron Parse 1.1
[03:46] 🧠 Monet: Reasoning in Latent Visual Space Beyond Images and Language
[04:25] ⚡ Terminal Velocity Matching
[05:03] 📊 Revisiting Generalization Across Difficulty Levels: It's Not So Easy
[05:42] 🤖 MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
[06:25] ⚡ Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
[06:59] 🎮 UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
[07:47] 🧩 SPHINX: A Synthetic Environment for Visual Perception and Reasoning
[08:33] ⚡ Block Cascading: Training Free Acceleration of Block-Causal Video Models
[09:12] 🏙 RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
[09:58] 📊 I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation
This episode covers the following 15 papers:
[00:17] 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[00:57] 🔬 MedSAM3: Delving into Segment Anything with Medical Concepts
[01:34] 🔍 Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
[02:03] 🎨 iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
[02:38] 🕺 SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
[03:18] 🔍 Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
[04:04] 🤖 GigaWorld-0: World Models as Data Engine to Empower Embodied AI
[04:44] 🎯 Soft Adaptive Policy Optimization
[05:14] 🎬 UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
[05:55] 🎯 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
[06:51] 🎨 OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
[07:41] 🎬 ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
[08:13] 🖼 VQ-VA World: Towards High-Quality Visual Question-Visual Answering
[09:06] 🔍 HunyuanOCR Technical Report
[09:48] 🏙 MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
This episode covers the following 15 papers:
[00:25] 🧠 General Agentic Memory Via Deep Research
[00:52] 🧪 AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
[01:24] 🤖 Computer-Use Agents as Judges for Generative User Interface
[01:55] 🎨 DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
[02:24] 🎨 UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
[03:10] 🔍 DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
[03:46] 🎬 In-Video Instructions: Visual Signals as Generative Control
[04:24] 📊 Budget-Aware Tool-Use Enables Effective Agent Scaling
[05:12] 🎬 Plan-X: Instruct Video Generation via Semantic Planning
[05:54] 🧪 M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
[06:25] 🤖 Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
[07:24] 🎬 HunyuanVideo 1.5 Technical Report
[07:56] 🧠 Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
[08:36] 🧠 MIST: Mutual Information Via Supervised Training
[09:07] 🎨 Controllable Layer Decomposition for Reversible Multi-Layer Image Generation
This episode covers the following 15 papers:
[00:21] 🧠 OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
[01:04] 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
[01:41] 🎯 SAM 3: Segment Anything with Concepts
[02:31] 📊 Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
[03:09] 🧠 O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
[03:43] 🦜 Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
[04:26] 🧠 RynnVLA-002: A Unified Vision-Language-Action and World Model
[05:19] 🧠 VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
[05:51] 🌍 WorldGen: From Text to Traversable and Interactive 3D Worlds
[06:34] 🎨 Loomis Painter: Reconstructing the Painting Process
[07:06] 🔮 Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
[07:48] 🎨 InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
[08:21] 🔬 OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
[09:07] 🧬 MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
[09:41] 🔍 Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
This episode covers the following 5 papers:
[00:41] TOP1(🔥171) | 🎨 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
[02:02] TOP2(🔥150) | 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[04:31] TOP3(🔥127) | 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[06:43] TOP4(🔥126) | 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[08:09] TOP5(🔥104) | 🧠 VIDEOP2R: Video Understanding from Perception to Reasoning
This episode covers the following 15 papers:
[00:22] 📊 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
[01:06] 🧠 Step-Audio-R1 Technical Report
[01:48] 🧭 Scaling Spatial Intelligence with Multimodal Foundation Models
[02:18] 🎬 First Frame Is the Place to Go for Video Content Customization
[02:49] 🎬 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
[03:29] 🔮 SAM 3D: 3Dfy Anything in Images
[04:03] 🚀 MiMo-Embodied: X-Embodied Foundation Model Technical Report
[04:38] 🧠 Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
[05:10] 🏆 TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
[05:53] 🌀 Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
[06:26] 🚀 SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
[07:09] 🎬 TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
[07:46] 🔬 SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
[08:23] 🎨 NaTex: Seamless Texture Generation as Latent Color Diffusion
[08:58] 📐 PartUV: Part-Based UV Unwrapping of 3D Meshes
This episode covers the following 4 papers:
[00:23] 🎬 Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
[01:17] 🔄 VisPlay: Self-Evolving Vision-Language Models from Images
[01:54] 📚 ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
[02:45] 🦴 MHR: Momentum Human Rig
This episode covers the following 11 papers:
[00:23] 🧠 Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
[01:03] 🕵 MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
[01:49] 🎞 REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
[03:02] 🧪 ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
[03:43] 🔍 Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
[04:16] 🤖 Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
[05:02] 🤖 Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
[05:32] ⚖ Mitigating Label Length Bias in Large Language Models
[06:14] 🧠 Agent READMEs: An Empirical Study of Context Files for Agentic Coding
[06:49] 🎧 Proactive Hearing Assistants that Isolate Egocentric Conversations
[07:20] 🎯 Error-Driven Scene Editing for 3D Grounding in Large Language Models
This episode covers the following 14 papers:
[00:17] 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[00:56] 🌐 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
[01:42] 🧩 Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
[02:22] 🧠 TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
[03:08] 🚀 GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning
[03:49] 🧩 PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
[04:28] 🌌 UFO$^3$: Weaving the Digital Agent Galaxy
[04:59] 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[05:38] 🌍 OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
[06:19] 🔄 Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
[06:51] 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[07:36] 🎯 Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models
[08:19] 🧠 WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance
[09:10] 🧬 Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
This episode covers the following 13 papers:
[00:24] 🧹 DoPE: Denoising Rotary Position Embedding
[00:58] 🧪 AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery
[01:44] 🖼 UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
[02:20] 🚀 Virtual Width Networks
[02:56] ⚡ LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
[03:32] 🌐 Simulating the Visual World with Artificial Intelligence: A Roadmap
[04:12] 📐 GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
[05:00] 🧏 HI-TransPA: Hearing Impairments Translation Personal Assistant
[05:35] 🚀 MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
[06:38] 🎭 EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
[07:18] 🧭 SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
[07:55] 📊 Workload Schedulers -- Genesis, Algorithms and Differences
[08:51] 🚗 CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
[Follow us] You can also find us on the following platform for more content beyond the podcast. Xiaohongshu: AI速递