Episode list: HuggingFace Daily AI Paper Express (HuggingFace 每日AI论文速递) - EarsOnMe

[Weekend Special] Hottest AI papers of January, week 3 | DeepSeek-R1 uses reinforcement learning to improve LLM reasoning; evolutionary search improves complex task solving.

The 5 papers in this episode:
[00:37] TOP1 (🔥167) | 🧠 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[02:59] TOP2 (🔥95) | 🧠 Evolving Deeper LLM Thinking
[05:07] TOP3 (🔥73) | 🤔 Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
[07:15] TOP4 (🔥73) | 🎥 MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
[09:29] TOP5 (🔥64) | 👁 VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
[Follow us] You can also find us on the following platform for more information beyond the podcast: Xiaohongshu: AI速递

12 minutes

2025.01.24 | SRMT improves multi-agent coordination; VideoReward improves video generation quality.

The 15 papers in this episode:
[00:26] 🧠 SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
[01:05] 🎥 Improving Video Generation with Human Feedback
[01:40] ⚡ Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
[02:20] 🖼 Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
[02:55] 🖼 IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
[03:32] 📚 Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
[04:14] 🎥 DiffuEraser: A Diffusion Model for Video Inpainting
[04:50] 🎥 Temporal Preference Optimization for Long-Form Video Understanding
[05:29] 🎨 One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
[06:07] 🎥 EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
[06:42] 🧠 Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
[07:17] 🧠 Debate Helps Weak-to-Strong Generalization
[07:53] 🤔 Evolution and The Knightian Blindspot of Machine Learning
[08:30] 🧪 Hallucinations Can Improve Large Language Models in Drug Discovery
[09:10] 🌀 GSTAR: Gaussian Surface Tracking and Reconstruction

10 minutes

2025.01.23 | DeepSeek-R1 uses reinforcement learning to improve reasoning; a multi-agent framework automates virtual filmmaking.

The 9 papers in this episode:
[00:24] 🧠 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[01:07] 🎬 FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
[01:48] 🔄 Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
[02:25] 👁 VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
[03:03] 🚀 Kimi k1.5: Scaling Reinforcement Learning with LLMs
[03:40] 🧠 Autonomy-of-Experts Models
[04:18] 🏆 Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
[05:01] ✂ O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
[05:34] 🤖 IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

6 minutes

2025.01.22 | Agent-R improves language models' real-time error correction; MMVU evaluates expert-level multi-discipline video understanding.

The 16 papers in this episode:
[00:24] 🤔 Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
[00:59] 🎥 MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
[01:35] ⚖ Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
[02:17] 🤖 UI-TARS: Pioneering Automated GUI Interaction with Native Agents
[02:55] 🤖 Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
[03:31] 🎨 TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
[04:14] 🏆 InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
[04:57] 🎥 Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
[05:39] 🤖 Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
[06:18] 🧠 Reasoning Language Models: A Blueprint
[06:58] 🎨 Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
[07:40] 🧠 Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
[08:21] 🎥 EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
[08:55] 🎥 Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
[09:32] 🌍 GPS as a Control Signal for Image Generation
[10:11] ⚠ MSTS: A Multimodal Safety Test Suite for Vision-Language Models

11 minutes

2025.01.21 | GameFactory enables diverse game generation; VideoWorld learns complex knowledge from video.

The 2 papers in this episode:
[00:27] 🎮 GameFactory: Creating New Games with Generative Interactive Videos
[01:00] 🎥 VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

1 minute

2025.01.20 | Evolved thinking improves LLM reasoning; PaSa improves academic search efficiency.

The 9 papers in this episode:
[00:28] 🧠 Evolving Deeper LLM Thinking
[01:04] 🔍 PaSa: An LLM Agent for Comprehensive Academic Paper Search
[01:41] 🎨 Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
[02:18] 🤔 Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
[02:53] 🌍 Bridging Language Barriers in Healthcare: A Study on Arabic LLMs
[03:28] 🎬 X-Dyna: Expressive Dynamic Human Image Animation
[04:04] 🎙 HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
[04:43] 🔍 ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
[05:23] 🎭 GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor

6 minutes

[Weekend Special] Hottest AI papers of January, week 2 | MiniMax-01 extends long-context processing; process reward models (PRMs) for mathematical reasoning improve process supervision.

The 5 papers in this episode:
[00:35] TOP1 (🔥258) | ⚡ MiniMax-01: Scaling Foundation Models with Lightning Attention
[02:52] TOP2 (🔥77) | 📊 The Lessons of Developing Process Reward Models in Mathematical Reasoning
[05:06] TOP3 (🔥66) | 🧠 Tensor Product Attention Is All You Need
[06:49] TOP4 (🔥64) | 🧠 Enabling Scalable Oversight via Self-Evolving Critic
[08:58] TOP5 (🔥61) | 🎥 VideoRAG: Retrieval-Augmented Generation over Video Corpus

11 minutes

2025.01.17 | OmniThink improves the depth and novelty of machine writing; inference-time scaling for diffusion models improves generation quality.

The 12 papers in this episode:
[00:26] 🧠 OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
[01:06] 🔍 Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
[01:37] 🩺 Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
[02:09] 🎨 SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
[02:48] 🤖 FAST: Efficient Action Tokenization for Vision-Language-Action Models
[03:23] 🔍 Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
[04:01] 🧠 Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
[04:35] 🧹 The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
[05:15] 🤖 RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
[05:54] 🎨 AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
[06:36] 🎨 CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
[07:18] 🎥 Do generative video models learn physical principles from watching videos?

8 minutes

2025.01.16 | MMDocIR advances the standardization of multimodal retrieval; CityDreamer4D introduces a new 4D city generation model.

The 9 papers in this episode:
[00:25] 📊 MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
[01:06] 🏙 CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
[01:49] 🎥 RepVideo: Rethinking Cross-Layer Representation for Video Generation
[02:30] 📚 Towards Best Practices for Open Datasets for LLM Training
[03:11] 🎵 XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
[03:46] 🔒 Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
[04:23] 🔍 Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
[05:03] 🎨 Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
[05:39] 🎥 Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

6 minutes

2025.01.15 | MiniMax-01 scales foundation models to long contexts; padding tokens affect image generation in T2I models.

The 15 papers in this episode:
[00:23] ⚡ MiniMax-01: Scaling Foundation Models with Lightning Attention
[01:04] 🖼 Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
[01:44] 🎨 MangaNinja: Line Art Colorization with Precise Reference Following
[02:21] 🧬 A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
[02:57] 🎥 Diffusion Adversarial Post-Training for One-Step Video Generation
[03:35] 🎲 PokerBench: Training Large Language Models to become Professional Poker Players
[04:11] 🎨 FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
[04:52] 🎨 Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
[05:30] 🔍 Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
[06:07] 🔍 Enhancing Automated Interpretability with Output-Centric Feature Descriptions
[06:49] 📚 OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
[07:27] 📹 Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
[08:04] 🤔 HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
[08:43] 🤖 Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
[09:23] 🚫 AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

10 minutes

2025.01.14 | Better mathematical reasoning, lower memory overhead

The 11 papers in this episode:
[00:24] 📊 The Lessons of Developing Process Reward Models in Mathematical Reasoning
[01:10] 🧠 Tensor Product Attention Is All You Need
[01:53] 🤖 Transformer²: Self-adaptive LLMs
[02:34] 🎥 VideoAuteur: Towards Long Narrative Video Generation
[03:22] 🌐 WebWalker: Benchmarking LLMs in Web Traversal
[04:08] 🩺 O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
[04:50] 🗣 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
[05:41] 🔧 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
[06:25] 🩺 BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
[07:15] 🧪 ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
[07:51] 🌐 UnCommon Objects in 3D

9 minutes