Episode list: HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest) - EarsOnMe

2025.06.10 | Reinforcement learning improves language models; a medical multimodal model strengthens reasoning.

This episode covers the following 15 papers:
[00:21] 🤖 Reinforcement Pre-Training
[01:01] 🩺 Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
[01:42] 📱 MiniCPM4: Ultra-Efficient LLMs on End Devices
[02:30] 🛡 Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
[03:07] 🖼 OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
[03:49] 🏠 SpatialLM: Training Large Language Models for Structured Indoor Modeling
[04:35] 🤖 Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
[05:14] 🖼 Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
[06:02] 🖼 Image Reconstruction as a Tool for Feature Analysis
[06:41] 🧪 GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
[07:22] 📉 Through the Valley: Path to Effective Long CoT Training for Small Language Models
[08:04] 🤖 BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
[08:42] 🧠 Pre-trained Large Language Models Learn Hidden Markov Models In-context
[09:25] 🤔 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
[10:04] 🧠 CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models
[Follow us] You can also find us on the following platforms for more beyond the podcast itself. Xiaohongshu: AI速递

11 min

99+

1 month ago

2025.06.09 | Evergreen question classification improves QA systems; multimodal fusion improves audio captioning.

This episode covers the following 15 papers:
[00:24] 🕰 Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
[01:04] 🎧 FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
[01:46] 🤔 Is Extending Modality The Right Path Towards Omni-Modality?
[02:23] 🎤 Audio-Aware Large Language Models as Judges for Speaking Styles
[03:00] 🧠 Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs
[03:36] 🖼 STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
[04:17] 🧠 MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
[04:56] 🧩 PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers
[05:33] 🤝 Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
[06:18] 🤖 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model
[07:00] 🚀 Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
[07:45] 🧪 CodeContests+: High-Quality Test Case Generation for Competitive Programming
[08:35] 🤖 Splatting Physical Scenes: End-to-End Real-to-Sim from Imperfect Robot Data
[09:13] 🤖 HASHIRU: Hierarchical Agent System for Hybrid Intelligent Resource Utilization
[09:55] 🧠 Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

11 min

85

1 month ago

[Weekend Special] Hottest AI papers of June, week 2 | LLM self-reflection improves performance; high-entropy tokens optimize reasoning.

This episode covers the following 5 papers:
[00:47] TOP1(🔥169) | 💡 Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
[02:55] TOP2(🔥130) | 🧠 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
[05:06] TOP3(🔥115) | 🧠 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
[07:27] TOP4(🔥89) | 🧠 AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
[09:46] TOP5(🔥75) | 🤖 SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

12 min

99+

1 month ago

2025.06.06 | An intelligent assistant speeds up ComfyUI development; one-step video restoration improves efficiency.

This episode covers the following 15 papers:
[00:24] 🤖 ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
[00:59] 🎬 SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
[01:39] 🤖 RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
[02:26] 🚄 Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts
[03:08] 🧠 Video World Models with Long-term Spatial Memory
[03:46] 🌐 Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights
[04:32] ⚛ VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
[05:17] 📚 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
[05:55] 🔢 AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
[06:38] 🌌 Aligning Latent Spaces with Flow Priors
[07:22] 📚 The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
[08:15] 🧠 Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
[09:06] 🧠 StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
[09:48] 🚀 Inference-Time Hyper-Scaling with KV Cache Compression
[10:30] 👁 SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

11 min

99+

1 month ago

2025.06.05 | A compact, powerful vision model; multi-stage training strengthens reasoning.

This episode covers the following 15 papers:
[00:21] 🤖 MiMo-VL Technical Report
[01:14] 💡 Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
[01:57] 🤖 AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
[02:42] 🔄 CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
[03:20] 🔬 A Controllable Examination for Long-Context Language Models
[04:14] ✍ SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
[04:55] 🤔 MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
[05:37] 🔎 Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
[06:17] 🌐 Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation
[07:04] 💡 IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation
[07:49] 🎨 Image Editing As Programs with Diffusion Models
[08:27] 🎯 $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models
[09:04] 📊 VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
[09:48] 💡 Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
[10:28] 🎬 LayerFlow: A Unified Model for Layer-aware Video Generation

11 min

99+

1 month ago

2025.06.04 | Reinforcement learning improves LLM performance; UniWorld unifies visual understanding and generation.

This episode covers the following 15 papers:
[00:23] 💡 Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
[01:09] 🖼 UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
[01:53] 🧪 CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
[02:37] 🤖 VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
[03:15] 🧠 SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
[04:01] 🧠 OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
[04:47] 🤖 Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
[05:24] 👀 MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
[06:10] 🤖 GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
[06:48] 🎬 Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers
[07:27] 🧩 DINGO: Constrained Inference for Diffusion LLMs
[08:10] 🎬 AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
[08:47] 🤖 Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
[09:35] 🤖 Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
[10:21] 🖼 Native-Resolution Image Synthesis

11 min

99+

1 month ago

2025.06.03 | High-entropy tokens improve LLM reasoning; Reasoning Gym improves reinforcement learning environments.

This episode covers the following 15 papers:
[00:22] 🧠 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
[01:05] 🧠 REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
[01:46] 🤖 SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[02:31] 🚀 Taming LLMs by Scaling Learning Rates with Gradient Grouping
[03:19] 🧩 Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
[04:06] 🎬 Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models
[04:43] 🤖 ARIA: Training Language Agents with Intention-Driven Reward Aggregation
[05:27] 🤖 LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
[06:02] 🤖 ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding
[06:41] 🤖 Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
[07:15] 🚀 AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
[07:56] 🌍 EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models
[08:35] 🤔 SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
[09:14] 🤖 MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
[09:48] 🤖 Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

10 min

99+

1 month ago

[Month-End Special] Hottest AI papers of May | A small language model excels at translation; a survey of how multimodal reasoning models have developed.

This episode covers the following 10 papers:
[00:40] TOP1(🔥209) | 🌐 Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
[03:07] TOP2(🔥172) | 🧠 Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
[05:19] TOP3(🔥171) | 🤖 Qwen3 Technical Report
[07:49] TOP4(🔥168) | 🚀 Absolute Zero: Reinforced Self-play Reasoning with Zero Data
[09:39] TOP5(🔥141) | 💡 Seed1.5-VL Technical Report
[12:15] TOP6(🔥140) | 🗜 Shifting AI Efficiency From Model-Centric to Data-Centric Compression
[14:08] TOP7(🔥126) | 💡 Emerging Properties in Unified Multimodal Pretraining
[16:30] TOP8(🔥121) | 🗣 MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
[19:21] TOP9(🔥116) | 💡 Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
[21:49] TOP10(🔥111) | 🔗 Chain-of-Model Learning for Language Model

24 min

99+

1 month ago

2025.06.02 | Prolonged RL improves reasoning; slow and fast thinking optimizes reasoning.

This episode covers the following 15 papers:
[00:23] 🧠 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
[01:01] 🧠 AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
[01:42] 🤔 Time Blindness: Why Video-Language Models Can't See What Humans Can?
[02:32] 🖼 Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
[03:13] 📊 Large Language Models for Data Synthesis
[03:59] 🖼 ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
[04:39] 🧪 HardTests: Synthesizing High-Quality Test Cases for LLM Coding
[05:21] 🤖 Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
[05:59] 🤔 Vision Language Models are Biased
[06:41] 🦾 CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
[07:20] 🚀 CLaSp: In-Context Layer Skip for Self-Speculative Decoding
[08:03] 📐 UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation
[08:44] 🤔 MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
[09:28] ✍ EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
[10:11] 🎧 Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

11 min

99+

1 month ago

[Weekend Special] Hottest AI papers of June, week 1 | A small model excels at translation; data-centric compression improves AI efficiency.

This episode covers the following 5 papers:
[00:43] TOP1(🔥205) | 🌐 Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
[03:10] TOP2(🔥139) | 🗜 Shifting AI Efficiency From Model-Centric to Data-Centric Compression
[04:55] TOP3(🔥106) | 📊 TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
[07:01] TOP4(🔥100) | 🤖 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
[09:30] TOP5(🔥97) | 🧪 ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

11 min

45

1 month ago

2025.05.30 | Inference-time scaling improves table reasoning; multimodal models' feedback on videos still needs improvement.

This episode covers the following 15 papers:
[00:22] 📊 Table-R1: Inference-Time Scaling for Table Reasoning
[01:02] 🤖 VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
[01:45] 🧠 Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
[02:25] 🧠 The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
[03:11] 🤖 ZeroGUI: Automating Online GUI Learning at Zero Human Cost
[03:45] 🤔 VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
[04:39] 🧬 Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
[05:15] 🤔 Are Reasoning Models More Prone to Hallucination?
[05:51] 🤖 cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
[06:29] 🎨 D-AR: Diffusion via Autoregressive Models
[07:16] 📸 AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views
[07:53] 🛠 SWE-bench Goes Live! (SWE-bench-Live: a continuously updated issue-resolution benchmark)
[08:36] 💡 Multi-Domain Explainability of Preferences
[09:16] 🤖 UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
[10:01] 🗣 FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

11 min

59

1 month ago

2025.05.29 | An entropy mechanism improves model performance; token routing optimizes inference efficiency.

This episode covers the following 15 papers:
[00:22] 🤖 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
[00:56] 🛣 R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
[01:40] 🧠 Skywork Open Reasoner 1 Technical Report
[02:20] 🔍 Sherlock: Self-Correcting Reasoning in Vision-Language Models
[02:55] 🤖 Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
[03:35] 🤖 SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
[04:25] 🚀 SageAttention2++: A More Efficient Implementation of SageAttention2
[05:12] 🧠 Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
[05:59] 🎬 Fostering Video Reasoning via Next-Event Prediction
[06:42] 💡 RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
[07:25] 🔬 DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
[08:16] 🖼 Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
[08:58] 🧩 Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
[09:38] 🚚 SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem
[10:26] 🌐 Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

11 min

52

1 month ago