HuggingFace Daily AI Paper Digest - Episode List

2025.01.20 | Mind evolution boosts LLM reasoning; PaSa makes academic search more efficient.

The 9 papers in this episode:
[00:28] 🧠 Evolving Deeper LLM Thinking
[01:04] 🔍 PaSa: An LLM Agent for Comprehensive Academic Paper Search
[01:41] 🎨 Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
[02:18] 🤔 Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
[02:53] 🌍 Bridging Language Barriers in Healthcare: A Study on Arabic LLMs
[03:28] 🎬 X-Dyna: Expressive Dynamic Human Image Animation
[04:04] 🎙 HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
[04:43] 🔍 ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
[05:23] 🎭 GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor

[Follow us] You can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

6 min

[Weekend Special] Hottest AI papers of January, week 2 | MiniMax-01 scales long-context processing; process reward models for mathematical reasoning improve process supervision.

The 5 papers in this episode:
[00:35] TOP1(🔥258) | ⚡ MiniMax-01: Scaling Foundation Models with Lightning Attention
[02:52] TOP2(🔥77) | 📊 The Lessons of Developing Process Reward Models in Mathematical Reasoning
[05:06] TOP3(🔥66) | 🧠 Tensor Product Attention Is All You Need
[06:49] TOP4(🔥64) | 🧠 Enabling Scalable Oversight via Self-Evolving Critic
[08:58] TOP5(🔥61) | 🎥 VideoRAG: Retrieval-Augmented Generation over Video Corpus

11 min

2025.01.17 | OmniThink improves the depth and novelty of machine writing; inference-time scaling for diffusion models improves generation quality.

The 12 papers in this episode:
[00:26] 🧠 OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
[01:06] 🔍 Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
[01:37] 🩺 Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
[02:09] 🎨 SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
[02:48] 🤖 FAST: Efficient Action Tokenization for Vision-Language-Action Models
[03:23] 🔍 Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
[04:01] 🧠 Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
[04:35] 🧹 The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
[05:15] 🤖 RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
[05:54] 🎨 AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
[06:36] 🎨 CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
[07:18] 🎥 Do generative video models learn physical principles from watching videos?

8 min

2025.01.16 | MMDocIR pushes multimodal retrieval toward standardized benchmarking; CityDreamer4D introduces a new 4D city generation model.

The 9 papers in this episode:
[00:25] 📊 MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
[01:06] 🏙 CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
[01:49] 🎥 RepVideo: Rethinking Cross-Layer Representation for Video Generation
[02:30] 📚 Towards Best Practices for Open Datasets for LLM Training
[03:11] 🎵 XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
[03:46] 🔒 Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
[04:23] 🔍 Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
[05:03] 🎨 Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
[05:39] 🎥 Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

6 min

2025.01.15 | MiniMax-01 scales foundation models to handle long contexts; padding tokens affect image generation in T2I models.

The 15 papers in this episode:
[00:23] ⚡ MiniMax-01: Scaling Foundation Models with Lightning Attention
[01:04] 🖼 Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
[01:44] 🎨 MangaNinja: Line Art Colorization with Precise Reference Following
[02:21] 🧬 A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
[02:57] 🎥 Diffusion Adversarial Post-Training for One-Step Video Generation
[03:35] 🎲 PokerBench: Training Large Language Models to become Professional Poker Players
[04:11] 🎨 FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
[04:52] 🎨 Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
[05:30] 🔍 Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
[06:07] 🔍 Enhancing Automated Interpretability with Output-Centric Feature Descriptions
[06:49] 📚 OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
[07:27] 📹 Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
[08:04] 🤔 HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
[08:43] 🤖 Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
[09:23] 🚫 AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

10 min

2025.01.14 | Better mathematical reasoning, lower memory overhead

The 11 papers in this episode:
[00:24] 📊 The Lessons of Developing Process Reward Models in Mathematical Reasoning
[01:10] 🧠 Tensor Product Attention Is All You Need
[01:53] 🤖 $\text{Transformer}^2$: Self-adaptive LLMs
[02:34] 🎥 VideoAuteur: Towards Long Narrative Video Generation
[03:22] 🌐 WebWalker: Benchmarking LLMs in Web Traversal
[04:08] 🩺 O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
[04:50] 🗣 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
[05:41] 🔧 SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
[06:25] 🩺 BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
[07:15] 🧪 ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
[07:51] 🌐 UnCommon Objects in 3D

9 min

2025.01.13 | OmniManip works toward general robotic manipulation; VideoRAG improves retrieval-augmented generation over video.

The 10 papers in this episode:
[00:24] 🤖 OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
[01:02] 🎥 VideoRAG: Retrieval-Augmented Generation over Video Corpus
[01:38] 🎥 OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
[02:26] 🧠 LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
[03:01] 🧠 Enabling Scalable Oversight via Self-Evolving Critic
[03:34] 🎥 ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
[04:09] 🎥 Multi-subject Open-set Personalization in Video Generation
[04:47] 🔍 ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
[05:23] 🤖 Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
[06:00] 🦠 Infecting Generative AI With Viruses

7 min

[Weekend Special] Hottest AI papers of January, week 1 | Small models outperform large ones; REINFORCE++ simplifies alignment.

The 5 papers in this episode:
[00:39] TOP1(🔥173) | 🧠 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
[03:03] TOP2(🔥71) | 🚀 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
[05:17] TOP3(🔥63) | 🧠 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
[07:35] TOP4(🔥57) | 🔬 Agent Laboratory: Using LLM Agents as Research Assistants
[09:41] TOP5(🔥52) | 🌍 Cosmos World Foundation Model Platform for Physical AI

12 min

2025.01.10 Daily AI Papers | Simplified GAN training improves performance; autoregressive pre-training from videos proves notably competitive.

The 7 papers in this episode:
[00:23] 🧠 The GAN is dead; long live the GAN! A Modern GAN Baseline
[01:02] 🎥 An Empirical Study of Autoregressive Pre-training from Videos
[01:49] 🚗 Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
[02:32] 🔍 On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
[03:14] 🌍 Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
[03:50] 📜 Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
[04:26] 🔒 Entropy-Guided Attention for Private LLMs

5 min

2025.01.09 Daily AI Papers | Self-evolving small models surpass GPT-3; multimodal models improve mathematical reasoning.

The 11 papers in this episode:
[00:25] 🧠 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
[01:06] 🧠 URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
[01:45] 🧠 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
[02:25] 🔬 Agent Laboratory: Using LLM Agents as Research Assistants
[03:02] 🔬 LLM4SR: A Survey on Large Language Models for Scientific Research
[03:44] 🔍 GeAR: Generation Augmented Retrieval
[04:22] 🤖 InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
[05:02] 🐦 Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
[05:41] 🖼 SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
[06:17] 🧠 DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
[06:55] 🌳 EpiCoder: Encompassing Diversity and Complexity in Code Generation

7 min

2025.01.08 Daily AI Papers | REINFORCE++ makes large-model alignment more efficient; MotionBench improves video motion understanding.

The 11 papers in this episode:
[00:24] 🚀 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
[01:00] 🎥 MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
[01:40] 🔍 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[02:21] 🌍 Cosmos World Foundation Model Platform for Physical AI
[03:01] 🔍 LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
[03:40] 🎥 Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
[04:22] 🎥 MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
[05:05] 📊 PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
[05:42] 🎭 MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
[06:17] 🎥 Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
[06:52] 🐬 Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

7 min