Episode list: HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest) - EarsOnMe

[Month-End Special] July's Hottest AI Papers | GSPO stabilizes training; sequence-level clipping reduces variance; context engineering survey, dynamically assembling information flows

The 10 papers in this episode:
[00:30] TOP1(🔥257) | 🚀 Group Sequence Policy Optimization
[02:21] TOP2(🔥227) | 🧮 A Survey of Context Engineering for Large Language Models
[03:33] TOP3(🔥207) | 🧠 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
[05:02] TOP4(🔥151) | 🎬 Scaling RL to Long Videos
[06:57] TOP5(🔥144) | 🧠 MemOS: A Memory OS for AI System
[08:47] TOP6(🔥126) | 🎬 Kwai Keye-VL Technical Report
[10:41] TOP7(🔥126) | 🎯 GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
[12:38] TOP8(🔥121) | 🤖 Agentic Reinforced Policy Optimization
[14:21] TOP9(🔥120) | 🧮 MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
[15:53] TOP10(🔥118) | ⚡ $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
[Follow us] You can also find us on the following platform for more information beyond the podcast content.
Xiaohongshu: AI速递

18 minutes

[Weekend Special] Hottest AI Papers of Week 1 of August | ARPO saves budget with high-entropy branching; HunyuanWorld generates editable 3D scenes from a single sentence

The 5 papers in this episode:
[00:32] TOP1(🔥114) | 🤖 Agentic Reinforced Policy Optimization
[02:17] TOP2(🔥94) | 🌍 HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
[05:04] TOP3(🔥76) | 🏆 Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
[07:08] TOP4(🔥73) | 💻 ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
[09:09] TOP5(🔥70) | 🧬 A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

11 minutes

2025.08.01 | Seed-Prover integrates LLMs to solve IMO math problems; Phi-Ground improves GUI perception accuracy.

The 15 papers in this episode:
[00:22] 🏆 Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
[01:04] 🎯 Phi-Ground Tech Report: Advancing Perception in GUI Grounding
[01:30] 🤔 C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
[02:07] 🚀 RecGPT Technical Report
[02:36] 🤖 villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
[03:14] 🤖 Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
[04:07] ⚖ Persona Vectors: Monitoring and Controlling Character Traits in Language Models
[04:41] 🚀 iLRM: An Iterative Large 3D Reconstruction Model
[05:32] ✅ TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
[06:02] 💡 On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
[06:29] 🤝 NeRF Is a Valuable Assistant for 3D Gaussian Splatting
[07:05] 🌾 AgroBench: Vision-Language Model Benchmark in Agriculture
[07:36] 🎨 Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification
[08:15] 🔎 Enhanced Arabic Text Retrieval with Attentive Relevance Scoring
[08:45] 🌊 Flow Equivariant Recurrent Neural Networks

9 minutes

79

2025.07.31 | ScreenCoder automates UI-to-code conversion; Falcon-H1's hybrid architecture improves long-sequence efficiency.

The 9 papers in this episode:
[00:22] 💻 ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
[01:02] 🚀 Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
[01:33] 💥 BANG: Dividing 3D Assets via Generative Exploded Dynamics
[02:17] 🧠 VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
[02:51] 🚁 Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
[03:34] 🧩 Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
[04:04] 🚀 Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning
[04:56] 🛠 Repair-R1: Better Test Before Repair
[05:33] 🌍 MetaCLIP 2: A Worldwide Scaling Recipe

6 minutes

81

2025.07.30 | HunyuanWorld generates immersive 3D worlds from words or pixels; X-Omni uses reinforcement learning to improve image generation quality.

The 8 papers in this episode:
[00:23] 🌍 HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
[00:56] ✨ X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
[01:59] 🚀 CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
[02:43] ✨ MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
[03:32] 🐾 AnimalClue: Recognizing Animals by their Traces
[04:04] 🏃 MOVE: Motion-Guided Few-Shot Video Object Segmentation
[04:31] 🤥 MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
[04:59] 🐘 Evaluating Deep Learning Models for African Wildlife Image Classification: From DenseNet to Vision Transformers

6 minutes

94

2025.07.29 | ARPO improves LLM tool-interaction performance; ARC-Hunyuan-Video-7B focuses on short-video understanding.

The 15 papers in this episode:
[00:23] 🤖 Agentic Reinforced Policy Optimization
[00:55] 🧠 ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
[01:35] 🚀 Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
[02:03] 🌐 Reconstructing 4D Spatial Intelligence: A Survey
[02:55] 💡 SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
[03:35] 🚀 A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
[04:17] ⚖ Geometric-Mean Policy Optimization
[04:59] 🎯 Region-based Cluster Discrimination for Visual Representation Learning
[05:38] ✨ GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
[06:18] 🚀 UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities
[06:47] ⚡ Met$^2$Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
[07:18] ✨ ForCenNet: Foreground-Centric Network for Document Image Rectification
[07:52] 🎨 ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
[08:43] 🏆 Music Arena: Live Evaluation for Text-to-Music
[09:13] 🎶 JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

10 minutes

90

2025.07.28 | GPTQ shown to be Babai's algorithm, safeguarding accuracy; TTD-DR uses a diffusion model to generate high-quality research reports.

The 5 papers in this episode:
[00:25] 💡 The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
[00:52] ✨ Deep Researcher with Test-Time Diffusion
[01:40] 🔧 Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
[02:12] 🚗 PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
[03:07] 🤖 Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

4 minutes

73

[Weekend Special] Hottest AI Papers of Week 4 of July | GUI-G2: Gaussian rewards improve GUI grounding; MiroMind-M1: an open-source LLM for mathematical reasoning

The 5 papers in this episode:
[00:36] TOP1(🔥118) | 🎯 GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding
[02:14] TOP2(🔥108) | 🧮 MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
[05:19] TOP3(🔥96) | ♾ Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
[08:51] TOP4(🔥85) | ⚡ $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
[11:59] TOP5(🔥73) | ⛓ The Invisible Leash: Why RLVR May Not Escape Its Origin

15 minutes

2025.07.25 | GSPO addresses training collapse in large models; MUR improves LLM reasoning efficiency.

The 15 papers in this episode:
[00:24] 🚀 Group Sequence Policy Optimization
[00:53] 🧠 MUR: Momentum Uncertainty guided Reasoning for Large Language Models
[01:30] 🧠 LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
[02:09] 🎬 Captain Cinema: Towards Short Movie Generation
[02:58] 📈 TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
[03:36] 🌍 EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion
[04:23] 💡 Hierarchical Budget Policy Optimization for Adaptive Reasoning
[04:48] 🔄 DriftMoE: A Mixture of Experts Approach to Handle Concept Drifts
[05:17] 🚀 Technical Report of TeleChat2, TeleChat2.5 and T1
[06:00] 📈 DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
[06:31] ✨ A New Pair of GloVes
[07:10] 🚀 GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface
[07:38] ⚡ TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
[08:22] ⚕ SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging
[08:52] 🧩 Discovering and using Spelke segments

9 minutes

92

2025.07.24 | MLLMs' visual perception still falls short; the Yume model can generate interactive virtual worlds.

The 9 papers in this episode:
[00:23] 👁 Pixels, Patterns, but No Poetry: To See The World like Humans
[00:56] 🌌 Yume: An Interactive World Generation Model
[01:29] ✨ DesignLab: Designing Slides Through Iterative Detection and Correction
[02:14] 🧠 Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
[02:59] ✅ Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
[03:35] 🔍 RAVine: Reality-Aligned Evaluation for Agentic Search
[04:13] ⚡ Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
[04:59] ✨ Elevating 3D Models: High-Quality Texture and Geometry Refinement from a Low-Quality Model
[05:31] 🔍 Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed

6 minutes

2025.07.23 | The TIM model breaks through LLM context limits; Step-Audio 2 improves multimodal spoken dialogue.

The 15 papers in this episode:
[00:24] ♾ Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
[01:05] 🔊 Step-Audio 2 Technical Report
[01:41] 🚀 MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
[02:23] ⚡ Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
[03:17] 🧠 Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
[03:56] 🧩 Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
[04:36] 🤔 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
[05:03] 🤖 Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
[05:56] ✨ HOComp: Interaction-Aware Human-Object Composition
[06:54] 🧐 RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback
[07:36] 🚀 Task-Specific Zero-shot Quantization-Aware Training for Object Detection
[08:06] 🔍 SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search
[08:35] ⚠ Does More Inference-Time Compute Really Help Robustness?
[09:16] 🧭 Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
[10:02] 🧠 ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting

11 minutes