Episode list: HuggingFace Daily AI Paper Digest (HuggingFace 每日AI论文速递) - EarsOnMe

3 months ago

2025.07.07 | GPT-4o performs well on semantic tasks; latent-space emulation achieves high accuracy.

The 4 papers in this episode:
[00:27] 🖼 How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
[01:09] 🌌 Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation
[01:45] 🇮🇳 Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
[02:25] ✍ LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
[Follow us] You can also find us on the following platform for more than the podcast content. Xiaohongshu: AI速递

3 min

3 months ago

[Weekend Special] Hottest AI papers of the 1st week of July | Multimodal reasoning models improve; short-video understanding leads.

The 5 papers in this episode:
[00:35] TOP1(🔥165) | 🧠 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
[02:53] TOP2(🔥108) | 🎬 Kwai Keye-VL Technical Report
[05:17] TOP3(🔥67) | 🎨 LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
[07:40] TOP4(🔥67) | 🧭 WebSailor: Navigating Super-human Reasoning for Web Agent
[10:00] TOP5(🔥58) | 🎨 BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

12 min

[Month-End Special] Hottest AI papers of June | LLMs improve performance through self-reflection; MiniMax-M1 scales test-time compute efficiently.

The 10 papers in this episode:
[00:37] TOP1(🔥258) | 💡 Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
[02:51] TOP2(🔥249) | 💡 MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
[05:24] TOP3(🔥240) | 🤖 Reinforcement Pre-Training
[07:54] TOP4(🔥165) | 🧠 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
[09:53] TOP5(🔥134) | 🕰 Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
[12:24] TOP6(🔥132) | 🧠 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
[14:50] TOP7(🔥126) | 🧠 Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
[16:36] TOP8(🔥116) | 🧲 Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
[18:34] TOP9(🔥108) | 🤖 SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[21:05] TOP10(🔥107) | 🩺 Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

24 min

2025.07.04 | WebSailor improves LLM reasoning; LangScene-X improves 3D scene reconstruction.

The 15 papers in this episode:
[00:22] 🧭 WebSailor: Navigating Super-human Reasoning for Web Agent
[00:59] 🖼 LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
[01:44] 🧬 IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction
[02:35] 👂 Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
[03:17] 🤝 Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
[04:00] 🖼 Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
[04:38] 🧠 Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving
[05:12] 🧠 Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search
[05:47] 💡 Fast and Simplex: 2-Simplicial Attention in Triton
[06:33] 🧐 Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
[07:16] 🧩 Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models
[08:12] 🤖 Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
[08:51] 💡 Energy-Based Transformers are Scalable Learners and Thinkers
[09:33] ⚙ AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training
[10:16] 🚀 ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

11 min


2025.07.03 | Multimodal model improves short-video understanding; animation generation keeps colors consistent.

The 9 papers in this episode:
[00:21] 🎬 Kwai Keye-VL Technical Report
[01:02] 🎨 LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
[01:50] 👁 Depth Anything at Any Condition
[02:28] 🤖 A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
[03:11] 🪄 FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
[03:51] 🖼 Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
[04:33] 🎬 STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
[05:14] 📊 MARVIS: Modality Adaptive Reasoning over VISualizations
[05:51] 🗣 JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

6 min

2025.07.02 | Multimodal reasoning improves; bidirectional embeddings optimized

The 12 papers in this episode:
[00:23] 💡 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
[01:00] 🖼 MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
[01:35] 🔬 SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
[02:19] 🤔 Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
[02:59] 🎬 Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
[03:37] 🤖 DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
[04:19] 🧠 HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
[04:53] 🧠 Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
[05:30] 💡 Data Efficacy for Language Model Training
[06:05] 🎬 FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
[06:40] 🖼 IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
[07:28] 🛡 Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

8 min

2025.07.01 | Multimodal generation leads; video diffusion efficiency improves

The 15 papers in this episode:
[00:21] 🖼 Ovis-U1 Technical Report
[00:58] 🎬 VMoBA: Mixture-of-Block Attention for Video Diffusion Models
[01:36] ✍ Calligrapher: Freestyle Text Image Customization
[02:21] 🖼 Listener-Rewarded Thinking in VLMs for Image Preferences
[03:04] 🧠 SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
[03:46] 📸 Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
[04:29] 🧬 Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
[05:09] 🤔 Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
[05:58] 💾 MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
[06:38] 🚀 SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
[07:23] 🏙 UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
[08:01] 🧠 MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
[08:38] 🧰 Teaching a Language Model to Speak the Language of Tools
[09:16] ✂ VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
[10:01] 🤖 RoboScape: Physics-informed Embodied World Model

11 min

2025.06.30 | 3D visual editing; video token compression

The 14 papers in this episode:
[00:26] 🎨 BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
[00:59] ✂ LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
[01:42] 🖼 XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
[02:24] 🎬 ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
[03:05] 🖼 From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios
[03:44] 🖼 MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
[04:24] 🧮 Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
[05:06] 🗺 Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
[05:52] 🤖 Ark: An Open-source Python-based Framework for Robot Learning
[06:36] 🎨 Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls
[07:20] 🏎 The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
[08:01] 🧠 Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training
[08:45] 🧮 Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
[09:39] 👁 RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models

10 min


[Weekend Special] Hottest AI papers of the 5th week of June | Drag-and-drop LLMs improve efficiency; Light of Normals achieves high-precision recovery.

The 5 papers in this episode:
[00:42] TOP1(🔥107) | 🧲 Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
[02:39] TOP2(🔥80) | 💡 Light of Normals: Unified Feature Representation for Universal Photometric Stereo
[04:59] TOP3(🔥79) | 🖼 Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
[07:07] TOP4(🔥66) | 🎨 OmniGen2: Exploration to Advanced Multimodal Generation
[09:18] TOP5(🔥59) | 🖼 ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

12 min

2025.06.27 | Reinforcement learning improves search efficiency; memory augmentation generates realistic driving scenes.

The 15 papers in this episode:
[00:25] 🔍 MMSearch-R1: Incentivizing LMMs to Search
[00:59] 🚗 MADrive: Memory-Augmented Driving Scene Modeling
[01:43] 🤖 WorldVLA: Towards Autoregressive Action World Model
[02:23] 💡 Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
[03:14] 🤖 Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
[04:00] 🚗 SAM4D: Segment Anything in Camera and LiDAR Streams
[04:40] 🎨 FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
[05:16] 🤖 Whole-Body Conditioned Egocentric Video Prediction
[05:53] 🧠 Arch-Router: Aligning LLM Routing with Human Preferences
[06:35] 🎨 FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
[07:12] 🌐 DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
[07:55] 🧬 An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
[08:35] 🤖 HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges
[09:18] 🦘 Learning to Skip the Middle Layers of Transformers
[09:57] 🎵 MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

10 min

2025.06.26 | High-quality multimodal models; 4-bit quantization improves performance

The 14 papers in this episode:
[00:23] 🖼 ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
[01:05] 🛡 Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
[01:49] 🎨 Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models
[02:30] 🧠 OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
[03:13] 🤖 DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
[03:49] 🦾 RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
[04:33] 🧪 Use Property-Based Testing to Bridge LLM Code Generation and Validation
[05:18] 🌍 When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
[05:56] 🖼 HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
[06:39] 🤖 ReCode: Updating Code API Knowledge with Reinforcement Learning
[07:15] 💬 Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
[07:59] 🔬 Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
[08:47] 🤖 MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
[09:28] 📉 The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs

10 min