Episode list: HuggingFace Daily AI Paper Digest (HuggingFace 每日AI论文速递) - EarsOnMe

2025.08.26 | Improving model inference efficiency; strengthening semantic alignment in generation

This episode covers the following 15 papers:
[00:24] 🚀 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
[00:52] 🧠 Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
[01:19] 🎨 MV-RAG: Retrieval Augmented Multiview Diffusion
[01:45] 🧠 T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
[02:10] 🤔 Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
[02:41] 🚀 Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
[03:04] 🎨 PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs
[03:25] 🤔 UQ: Assessing Language Models on Unsolved Questions
[03:54] 📚 MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment
[04:25] 🗺 Explain Before You Answer: A Survey on Compositional Visual Reasoning
[04:47] 📊 ST-Raptor: LLM-Powered Semi-Structured Table Question Answering
[05:15] 🔍 SpotEdit: Evaluating Visually-Guided Image Editing Methods
[05:39] 📖 German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
[06:06] 📉 Limitations of Normalization in Attention Mechanism
[06:33] 🌐 MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting

[Follow us] You can also find us on the following platform for more information beyond the podcast: Xiaohongshu: AI速递

7 min

99+

2 weeks ago

2025.08.25 | Efficient agent learning without fine-tuning; long-horizon exploration with quadruped robots

This episode covers the following 15 papers:
[00:23] 🚀 AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
[00:48] 🐕 ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
[01:24] 📈 Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
[01:51] 🗑 CRISP: Persistent Concept Unlearning via Sparse Autoencoders
[02:21] 🔍 Selective Contrastive Learning for Weakly Supervised Affordance Grounding
[02:49] 🏆 AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
[03:19] 👁 EgoTwin: Dreaming Body and View in First Person
[03:46] 🤔 Do What? Teaching Vision-Language-Action Models to Reject the Impossible
[04:14] 🩺 End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
[04:40] ⚡ TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference
[05:06] 🤖 AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
[05:37] 🔄 RotaTouille: Rotation Equivariant Deep Learning for Contours
[06:04] 🤔 InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
[06:28] 🚀 CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
[06:54] ✏ Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing

8 min

88

2 weeks ago

[Weekend Special] Hottest AI papers of week 4 of August | New breakthrough in vision models; scientific multimodality takes the lead

This episode covers the following 5 papers:
[00:39] TOP1 (🔥172) | 🚀 DINOv3 (a new milestone for vision foundation models)
[01:39] TOP2 (🔥170) | 🧪 Intern-S1: A Scientific Multimodal Foundation Model
[03:08] TOP3 (🔥100) | 🤖 Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
[04:18] TOP4 (🔥98) | ✨ Ovis2.5 Technical Report
[05:12] TOP5 (🔥86) | 🧠 SSRL: Self-Search Reinforcement Learning

8 min

99+

2 weeks ago

2025.08.22 | Scientific multimodal model narrows the gap; GUI automation tackles challenges

This episode covers the following 15 papers:
[00:22] 🧪 Intern-S1: A Scientific Multimodal Foundation Model
[00:46] 🤖 Mobile-Agent-v3: Foundamental Agents for GUI Automation
[01:10] ✅ Deep Think with Confidence
[01:31] 🤔 LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
[02:01] 🎬 Waver: Wave Your Way to Lifelike Video Generation
[02:25] 🏞 SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
[02:56] 📚 A Survey on Large Language Model Benchmarks
[03:20] 🤸 ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
[03:46] 🎨 Visual Autoregressive Modeling for Instruction-Guided Image Editing
[04:15] 🤖 aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists
[04:40] 🗺 "Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries
[05:12] 🔍 When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
[05:44] 💰 Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
[06:08] ⚡ Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
[06:37] 🫂 INTIMA: A Benchmark for Human-AI Companionship Behavior

7 min

99+

2 weeks ago

2025.08.21 | Cognitive diagnosis of financial LLMs; DuPO optimizes self-verification

This episode covers the following 15 papers:
[00:22] 🧠 From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models
[00:49] ✅ DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
[01:17] 🔮 FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
[01:44] 🏗 MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
[02:14] 🪄 Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
[02:40] 🤖 From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
[03:06] ⚙ Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
[03:37] 🛠 MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
[04:12] ⚡ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
[04:45] 🤖 RynnEC: Bringing MLLMs into Embodied World
[05:12] ⚖ On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
[05:41] 🧐 ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
[06:08] ⚡ Leuvenshtein: Efficient FHE-based Edit Distance Computation with Single Bootstrap per Cell
[06:40] 📏 Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
[07:06] 🤔 mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

8 min

99+

2 weeks ago

2025.08.20 | Chain-of-Agents improves efficiency; optimized 3D reconstruction from long videos

This episode covers the following 15 papers:
[00:23] 🤖 Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
[00:52] 🎥 LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
[01:13] 🛠 Prompt Orchestration Markup Language
[01:33] 🎨 MultiRef: Controllable Image Generation with Multiple Visual References
[02:00] 🤖 Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
[02:29] 🦾 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
[02:59] ✅ Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation
[03:22] 🎨 Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
[03:45] 🪄 OmniTry: Virtual Try-On Anything without Masks
[04:08] ⏰ A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
[04:32] 👂 Advances in Speech Separation: Techniques, Challenges, and Future Trends
[05:04] 😥 Leveraging Large Language Models for Predictive Analysis of Human Misery
[05:27] ⏳ TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
[05:58] 🗺 CAMAR: Continuous Actions Multi-Agent Routing
[06:25] 🔒 Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends

7 min

99+

3 weeks ago

2025.08.19 | Ovis2.5 advances multimodality; ComoRAG improves long-narrative reasoning

This episode covers the following 15 papers:
[00:20] ✨ Ovis2.5 Technical Report
[00:51] 🧠 ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
[01:14] 🎥 4DNeX: Feed-Forward 4D Generative Modeling Made Easy
[01:38] ✨ Next Visual Granularity Generation
[01:57] ⚡ Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
[02:30] 🤔 Has GPT-5 Achieved Spatial Intelligence? An Empirical Study
[03:00] 🎮 HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
[03:26] ❗ When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
[03:56] 🎮 Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
[04:21] 💡 Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models
[04:47] 🌐 G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration
[05:15] ✨ S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models
[05:49] 👂 Representing Speech Through Autoregressive Prediction of Cochlear Tokens
[06:09] 💡 Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
[06:40] 🎬 Precise Action-to-Video Generation Through Visual Action Prompts

8 min

99+

3 weeks ago

2025.08.18 | Thinking beyond images; self-search reinforcement

This episode covers the following 13 papers:
[00:19] 💡 Thyme: Think Beyond Images
[00:48] 🧠 SSRL: Self-Search Reinforcement Learning
[01:16] 🚀 DINOv3 (a new milestone for vision foundation models)
[01:42] 🔍 PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing
[02:13] 🚀 XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
[02:40] 🚀 BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
[03:09] 🎨 StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation
[03:35] 🌌 TexVerse: A Universe of 3D Objects with High-Resolution Textures
[03:59] 🗣 FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
[04:32] 💡 X-Node: Self-Explanation is All We Need
[04:57] ⚙ Controlling Multimodal LLMs via Reward-guided Decoding
[05:21] ✨ SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
[05:52] 🌍 MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

6 min

99+

3 weeks ago

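[Weekend Special] Hottest AI papers of week 3 of August | GLM-4.5 unifies agentic, reasoning, and coding abilities; We-Math boosts visual mathematical reasoning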

This episode covers the following 5 papers:
[00:32] TOP1 (🔥139) | 🚀 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
[01:44] TOP2 (🔥121) | 📚 We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
[02:46] TOP3 (🔥110) | 🧠 ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
[03:48] TOP4 (🔥109) | 🤖 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
[05:00] TOP5 (🔥107) | 🚀 NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

6 min

99+

3 weeks ago

2025.08.15 | Math reasoning handbook boosts model capability; image generation model with continuous tokens

This episode covers the following 12 papers:
[00:23] 📚 We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning
[00:50] 🚀 NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
[01:17] 🎨 ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
[01:43] 🤔 PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
[02:14] 🚀 UI-Venus Technical Report: Building High-performance UI Agents with RFT
[02:42] 🚀 STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
[03:11] ⚖ Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
[03:37] 🤔 HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
[04:08] 📚 A Survey on Diffusion Language Models
[04:39] 💡 From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
[05:03] 📸 Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
[05:30] ⚖ When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing

6 min

56

3 weeks ago

2025.08.14 | Molecular reasoning framework improves performance; lightweight, efficient identity control for video

This episode covers the following 15 papers:
[00:17] 🧪 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
[00:38] ✨ Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
[01:06] 🎥 Story2Board: A Training-Free Approach for Expressive Storyboard Generation
[01:32] 🛡 AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
[01:59] ⚡ Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
[02:21] 🪄 Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
[02:51] 🧠 Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
[03:21] 🤝 Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
[03:48] 🚧 MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
[04:12] 💡 Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
[04:32] 👻 IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
[04:59] 💡 Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
[05:21] 💻 VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
[05:47] ✨ GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
[06:13] ✨ CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

7 min

99+

3 weeks ago

2025.08.13 | Multimodal AI breakthrough; 3D world generation

This episode covers the following 15 papers:
[00:22] 🤖 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
[00:45] 🌎 Matrix-3D: Omnidirectional Explorable 3D World Generation
[01:17] 🚀 Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
[01:43] 🕺 CharacterShot: Controllable and Consistent 4D Character Animation
[02:05] ⏳ Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
[02:29] 🔍 HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
[02:55] 🧊 VertexRegen: Mesh Generation with Continuous Level of Detail
[03:16] 🎯 Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
[03:43] ⏱ Train Long, Think Short: Curriculum Learning for Efficient Reasoning
[04:05] 🎓 Aryabhata: An exam-focused language model for JEE Math
[04:30] 🖼 UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
[04:52] 🧠 Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
[05:20] 👋 Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors
[05:45] 📈 Adversarial Video Promotion Against Text-to-Video Retrieval
[06:10] 🎬 Cut2Next: Generating Next Shot via In-Context Tuning

6 min

99+

4 weeks ago