Episode list: HuggingFace Daily AI Paper Express (HuggingFace 每日AI论文速递) - EarsOnMe

This episode covers the following 5 papers:
[00:35] TOP1(🔥139) | 🤖 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
[01:52] TOP2(🔥133) | 🔒 A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
[02:57] TOP3(🔥127) | 🤖 A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
[04:15] TOP4(🔥103) | 🧠 R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
[05:11] TOP5(🔥101) | 🤔 Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

[Follow us] You can also find us on the following platform for more information beyond the podcast content. Xiaohongshu: AI速递

3 days ago

2025.09.05 | LLMs are weak at semantic understanding; image editing models improve geometry estimation

This episode covers the following 13 papers:
[00:22] 🤔 Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
[00:47] 📐 From Editor to Dense Geometry Estimator
[01:08] 🧠 Towards a Unified View of Large Language Model Post-Training
[01:39] 🔄 Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
[02:05] 🔬 DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
[02:26] 🚀 Transition Models: Rethinking the Generative Learning Objective
[02:54] 🔍 NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings
[03:24] ⚡ Few-step Flow for 3D Generation via Marginal-Data Transport Distillation
[03:53] 🎬 Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
[04:19] 🎭 Durian: Dual Reference-guided Portrait Animation with Attribute Transfer
[04:47] 📐 Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
[05:24] 🧠 Delta Activations: A Representation for Finetuned Large Language Models
[06:01] ⚠ False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

6 min

86

4 days ago

2025.09.04 | Efficient robot task planning; improved data reasoning capability

This episode covers the following 5 papers:
[00:24] 🤖 Robix: A Unified Model for Robot Interaction, Reasoning and Planning
[00:54] 🔍 Open Data Synthesis For Deep Research
[01:30] 🧠 LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
[02:00] 🧩 MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
[02:32] 🧠 Planning with Reasoning using Vision Language World Model

4 min

90

5 days ago

2025.09.03 | Agentic RL improves LLM autonomy; SimpleTIR tackles multi-turn tool-integrated reasoning

This episode covers the following 15 papers:
[00:19] 🤖 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
[00:40] 🚀 SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
[01:12] 🤖 UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
[01:41] 🎥 ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
[02:12] 🔄 LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
[02:43] 🔧 VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
[03:11] 📄 POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
[03:33] 🩺 Baichuan-M2: Scaling Medical Capability with Large Verifier System
[03:57] 🎥 Kwai Keye-VL 1.5 Technical Report
[04:20] 🤖 Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
[04:45] 🧠 Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
[05:11] 🔄 Jointly Reinforcing Diversity and Quality in Language Model Generations
[05:42] 🚀 DCPO: Dynamic Clipping Policy Optimization
[06:04] 🚀 OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
[06:27] 🎬 GenCompositor: Generative Video Compositing with Diffusion Transformer

6 days ago

2025.09.02 | PVPO optimizes reasoning performance; T2R-bench exposes model weaknesses

This episode covers the following 6 papers:
[00:23] 🧠 PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
[00:49] 📊 T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
[01:18] 🔍 No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes
[01:44] 📊 UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
[02:11] 🧠 From reactive to cognitive: brain-inspired spatial intelligence for embodied agents
[02:36] 🔄 How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench

3 min

66

2025.09.01 | R-4B optimizes thinking efficiency; EO-1 improves robot control

This episode covers the following 15 papers:
[00:24] 🧠 R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
[00:59] 🤖 EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
[01:29] 🔒 A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
[01:57] 🎥 Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
[02:26] 🗣 TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
[02:58] 🤖 A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
[03:28] 🤖 UItron: Foundational GUI Agent with Advanced Perception and Planning
[03:50] 🎮 Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
[04:20] 🔄 TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training
[04:45] 💻 Efficient Code Embeddings from Code Generation Models
[05:10] ⏸ Morae: Proactively Pausing UI Agents for User Choices
[05:37] 🔍 AHELM: A Holistic Evaluation of Audio-Language Models
[06:05] 🤖 HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation
[06:34] 🔄 Model-Task Alignment Drives Distinct RL Outcomes
[07:08] 👁 Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

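[Month-End Special] Hottest AI papers of August | Scientific AI models narrow the performance gap; image models tackle text rendering and editing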

This episode covers the following 10 papers:
[00:30] TOP1(🔥242) | 🧪 Intern-S1: A Scientific Multimodal Foundation Model
[01:36] TOP2(🔥239) | 🎨 Qwen-Image Technical Report
[02:46] TOP3(🔥227) | 🤔 Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
[04:14] TOP4(🔥220) | 🚀 DINOv3
[05:25] TOP5(🔥168) | 🚀 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
[06:25] TOP6(🔥166) | ✨ On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
[07:29] TOP7(🔥164) | 🚀 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
[08:45] TOP8(🔥156) | 🤖 VeriGUI: Verifiable Long-Chain GUI Dataset
[09:53] TOP9(🔥142) | 📚 We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
[11:26] TOP10(🔥139) | 🚀 NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

13 min

[Weekend Special] Hottest AI papers of week 5 of August | Multimodal models gain efficiency; self-play strategy increases diversity

This episode covers the following 5 papers:
[00:36] TOP1(🔥161) | 🚀 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
[01:25] TOP2(🔥114) | 📈 Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
[02:23] TOP3(🔥108) | 🚀 AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
[03:51] TOP4(🔥94) | 🗣 VibeVoice Technical Report
[05:17] TOP5(🔥78) | 🔍 Beyond Transcription: Mechanistic Interpretability in ASR

6 min

2025.08.29 | Stable text-to-image generation; efficient mathematical reasoning

This episode covers the following 15 papers:
[00:24] ⚖ Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
[00:57] 🧠 rStar2-Agent: Agentic Reasoning Technical Report
[01:28] 🎨 USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
[01:56] 🚀 AWorld: Orchestrating the Training Recipe for Agentic AI
[02:26] 🎯 TCIA: A Task-Centric Instruction Augmentation Method for Instruction Finetuning
[02:54] 🧠 Mixture of Contexts for Long Video Generation
[03:17] 🧠 CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
[03:51] 🔍 MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
[04:23] 🎨 OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
[04:54] 🛡 Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
[05:21] 🧠 Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
[05:56] 💃 Dress&Dance: Dress up and Dance as You Like It - Technical Preview
[06:18] 🎯 OnGoal: Tracking and Visualizing Conversational Goals in Multi-Turn Dialogue with Large Language Models
[06:42] 📷 Multi-View 3D Point Tracking
[07:10] 🎭 FakeParts: a New Family of AI-Generated DeepFakes

8 min

93

2025.08.28 | Reasoning decomposition reduces hallucination; interpretability of encoded information

This episode covers the following 14 papers:
[00:25] 🧠 Self-Rewarding Vision-Language Model via Reasoning Decomposition
[00:49] 🔍 Beyond Transcription: Mechanistic Interpretability in ASR
[01:22] 🤖 Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
[01:52] 🧠 CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
[02:19] 🤖 MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation
[02:51] 🔮 Predicting the Order of Upcoming Tokens Improves Language Modeling
[03:20] 💓 Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation
[03:52] ⚡ Diffusion Language Models Know the Answer Before Decoding
[04:16] 👁 Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
[04:38] 🎧 AudioStory: Generating Long-Form Narrative Audio with Large Language Models
[05:01] 🧠 StepWiser: Stepwise Generative Judges for Wiser Reasoning
[05:25] 🔄 Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
[05:53] 💃 MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
[06:18] 📊 DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

95

2025.08.27 | Physics evaluation exposes model shortcomings; tree-based optimization raises efficiency and cuts cost

This episode covers the following 15 papers:
[00:23] 🔬 CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
[00:57] 🌳 TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
[01:21] 🗣 VibeVoice Technical Report
[01:45] 🔨 VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
[02:13] 💡 Spacer: Towards Engineered Scientific Inspiration
[02:45] 🧠 OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
[03:10] 🧠 UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
[03:36] ⚡ Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
[04:04] 🎥 Autoregressive Universal Video Segmentation Model
[04:30] 🎬 Wan-S2V: Audio-Driven Cinematic Video Generation
[04:56] 🎬 CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
[05:22] 🔷 FastMesh: Efficient Artistic Mesh Generation via Component Decoupling
[05:45] 📊 ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
[06:13] 🧠 ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
[06:42] 🧠 MovieCORE: COgnitive REasoning in Movies

62