节目列表: HuggingFace 每日AI论文速递 - EarsOnMe

2025.10.24 | AdaSPEC挑40% token提速两成；AutoPage 10美分生成交互网页

本期的 15 篇论文如下： [00:23] 🎯 AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders（AdaSPEC：面向高效推测解码的选择性知识蒸馏） [00:57] 🤖 Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1（低成本人机协作论文一键成页：低于0.1美元） [01:35] 🔍 Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence（Open-o3视频：显式时空证据支撑的开放域视频推理） [02:06] 🎬 HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives（HoloCine：端到端生成多镜头长时电影级叙事视频） [02:52] 🌀 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall（绕过离散扩散采样墙的确定性捷径） [03:33] 💎 Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values（每个问题都有它的价值：显式人类价值驱动的强化学习） [04:06] ⚖ The Massive Legal Embedding Benchmark (MLEB)（大规模法律嵌入评测基准（MLEB）） [04:48] 🔍 DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion（DyPE：面向超高分辨率扩散模型的动态位置外推方法） [05:33] 🕵 Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence（柯南：像侦探一样在多尺度视觉证据上渐进式推理） [06:12] 🤖 Search Self-play: Pushing the Frontier of Agent Capability without Supervision（搜索自博弈：无需监督即可拓展智能体能力边界） [06:56] 🎭 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations（探究大音频语言模型在说话人情绪变化下的安全漏洞） [07:42] 🖼 LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas（LayerComposer：基于空间感知分层画布的交互式个性化文生图） [08:10] 🎧 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models（SAKE：面向大型音频-语言模型听觉属性知识编辑的探索） [08:51] 🖼 ARGenSeg: Image Segmentation with Autoregressive Image Generation Model（ARGenSeg：基于自回归图像生成的图像分割） [09:39] 🧩 Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets（Seed3D 1.0：从单张图像生成高保真、可仿真的3D资产）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

55

1周前

2025.10.23 | 线性注意力显存降十倍；动态裁剪PPO稳提分

HuggingFace 每日AI论文速递

本期的 15 篇论文如下： [00:19] 🧠 Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning（每一种注意力都重要：面向长上下文推理的高效混合架构） [00:59] ⚖ BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping（BAPO：通过自适应裁剪的平衡策略优化稳定LLM离策略强化学习） [01:40] 🧠 LoongRL:Reinforcement Learning for Advanced Reasoning over Long Contexts（LoongRL：面向长文本高级推理的强化学习方法） [02:18] 🌍 GigaBrain-0: A World Model-Powered Vision-Language-Action Model（GigaBrain-0：基于世界模型的通才视觉-语言-动作大模型） [02:49] 🔄 Language Models are Injective and Hence Invertible（语言模型是单射的，因此可逆） [03:25] 📹 VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos（VideoAgentTrek：利用无标注视频预训练计算机操作智能体） [04:01] 📲 DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents（DaMo：面向手机智能体的多模态大模型微调数据配比优化器） [04:55] 🚀 Unified Reinforcement and Imitation Learning for Vision-Language Models（统一强化与模仿学习的视觉-语言模型） [05:28] 🖼 Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing（Pico-Banana-400K：面向文本引导图像编辑的大规模高质量数据集） [06:17] 📊 FinSight: Towards Real-World Financial Deep Research（FinSight：迈向真实场景的金融深度研究） [07:06] 🧠 Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues（他们是恋人还是朋友？评估大语言模型在英韩对话中的社会推理能力） [07:43] 🌍 OmniNWM: Omniscient Driving Navigation World Models（OmniNWM：全景驾驶导航全知世界模型） [08:28] 🕳 Attention Sinks in Diffusion Language Models（扩散语言模型中的注意力沉陷现象） [09:04] 📄 olmOCR 2: Unit Test Rewards for Document OCR（olmOCR 2：基于单元测试奖励的文档OCR系统） [09:42] 🧠 KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints（KORE：通过知识导向增强与约束为大模型持续注入知识）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

69

1周前

2025.10.22 | LightMem压缩记忆千倍提速12倍；闭环世界模型微调8万数据反超巨兽

HuggingFace 每日AI论文速递

本期的 14 篇论文如下： [00:19] 🧠 LightMem: Lightweight and Efficient Memory-Augmented Generation（LightMem：轻量高效的记忆增强生成框架） [00:55] 🌀 World-in-World: World Models in a Closed-Loop World（世界中的世界：闭环环境下的世界模型） [01:44] 🖼 UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation（UniGenBench++：面向文本到图像生成的统一语义评测基准） [02:29] 🧪 Chem-R: Learning to Reason as a Chemist（Chem-R：像化学家一样学习推理） [03:10] 🎬 MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation（MoGA：面向端到端长视频生成的分组混合注意力机制） [03:52] 🔍 Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs（任意区域皆可掌握：面向多模态大模型的精准上下文像素级理解） [04:49] 🎬 IF-VidCap: Can Video Caption Models Follow Instructions?（IF-VidCap：视频字幕模型能听懂指令吗？） [05:35] 🚀 Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model（万亿参数思维模型的强化学习扩展之路） [06:21] 🎬 MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues（MT-Video-Bench：面向多轮对话评估多模态大模型视频理解能力的综合基准） [07:12] 🧠 ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning（ssToken：面向大模型微调的自调制语义感知Token筛选方法） [07:43] 🎬 MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models（MUG-V 10B：面向大视频生成模型的高效训练流水线） [08:18] 🎯 ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder（ProCLIP：基于大语言模型嵌入器的渐进式视觉-语言对齐方法） [09:29] 🎬 UltraGen: High-Resolution Video Generation with Hierarchical Attention（UltraGen：基于分层注意力的原生高分辨率视频生成） [10:15] 🔄 DSI-Bench: A Benchmark for Dynamic Spatial Intelligence（DSI-Bench：动态空间智能评测基准）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

99+

2周前

2025.10.21 | 模型不懂光影折射；小模型也能写报告

HuggingFace 每日AI论文速递

本期的 13 篇论文如下： [00:21] 🪞 PICABench: How Far Are We from Physically Realistic Image Editing?（PICABench：我们离物理真实的图像编辑还有多远？） [01:04] 🤖 DeepAnalyze: Agentic Large Language Models for Autonomous Data Science（DeepAnalyze：面向自主数据科学的智能体大模型） [01:50] 🗜 Glyph: Scaling Context Windows via Visual-Text Compression（Glyph：通过视觉-文本压缩扩展上下文窗口长度） [02:23] 🔍 Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation（面向通用检索增强生成的混合模态检索研究） [03:10] 🔗 When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling（何时集成：定位Token级位置实现稳定高效的大模型集成） [04:09] 🎯 Annotation-Efficient Universal Honesty Alignment（注释高效型通用诚实对齐） [04:49] 🖌 Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback（Uniworld-V2：借助扩散负感知微调与MLLM隐式反馈强化图像编辑） [05:46] 👁 RL makes MLLMs see better than SFT（强化学习让多模态大模型看得比监督微调更清楚） [06:33] 🚀 Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling（视觉自回归模型在推理时扩展上击败扩散模型） [07:09] 🎨 ConsistEdit: Highly Consistent and Precise Training-free Visual Editing（ConsistEdit：面向MM-DiT的高一致免训练视觉编辑） [07:56] 🔄 Deep Self-Evolving Reasoning（深度自演化推理） [08:22] 🧠 Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI（超越流水线：模型原生智能体AI范式转移综述） [09:07] 🔮 Chronos-2: From Univariate to Universal Forecasting（Chronos-2：从单变量到通用预测）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

99+

2周前

2025.10.20 | RPC剪枝提速保准；OmniVinci小数据跨模态称王

HuggingFace 每日AI论文速递

本期的 15 篇论文如下： [00:20] 🧠 A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning（大模型推理中内部概率与自洽性桥接的理论研究） [01:04] 🌐 OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM（OmniVinci：面向全模态理解大模型的架构与数据增强） [01:44] 🎬 Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset（用百万级合成数据集放大指令式视频编辑） [02:28] ✂ NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks（NANO3D：无需训练与掩码的高效3D编辑新方法） [03:05] 🛰 Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery（Skyfall-GS：仅凭卫星影像合成沉浸式3D城市场景） [03:41] ⚠ Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs（情境学习中的突发错位：狭窄示例可让大模型广泛失准） [04:18] 🧬 Latent Diffusion Model without Variational Autoencoder（无需变分自编码器的潜在扩散模型） [04:52] 📸 LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal（LightsOut：基于扩散的延展补全提升镜头眩光去除） [05:30] 🧠 MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning（MorphoBench：随模型推理能力自适应难度的评测基准） [06:14] 🧠 A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning（A²FM：面向工具感知混合推理的自适应智能体基础模型） [06:56] 🗣 Language Models Model Language（语言模型即语言本身） [07:36] 🖼 BLIP3o-NEXT: Next Frontier of Native Image Generation（BLIP3o-NEXT：原生图像生成的下一个前沿） [08:30] 🌐 Paper2Web: Let's Make Your Paper Alive!（Paper2Web：让你的论文“活”起来！） [09:12] 🔬 Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition（面向科学发现的基础模型：从范式增强到范式跃迁） [09:55] 🔍 Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents（探索以进化：通过主动在线探索扩展深度研究智能体的聚合逻辑）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

94

2周前

【周末特辑】10月第3周最火AI论文 | 量化噪声变探索，单卡跑RL；冻结编码器放语义，DiT生成新纪录

HuggingFace 每日AI论文速递

本期的 5 篇论文如下： [00:40] TOP1(🔥154) | 🚀 QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs（QeRL：超越效率——面向大语言模型的量化增强强化学习） [02:19] TOP2(🔥138) | 🧠 Diffusion Transformers with Representation Autoencoders（基于表示自编码器的扩散Transformer） [04:54] TOP3(🔥134) | 🎯 Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model（空间强迫：面向视觉-语言-动作模型的隐式空间表征对齐） [07:55] TOP4(🔥125) | 🖥 D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI（D2E：利用桌面数据规模化视觉-动作预训练以迁移至具身智能） [10:30] TOP5(🔥110) | 📷 Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation（基于相机的统一多模态理解与生成模型）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

13分钟

99+

2周前

2025.10.17 | AI眼镜预判式服务；视频生成补想象力

HuggingFace 每日AI论文速递

本期的 11 篇论文如下： [00:25] 👓 AI for Service: Proactive Assistance with AI Glasses（AI服务：AI眼镜的主动式协助） [01:06] 🎬 ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints（ImagerySearch：面向超越语义依赖约束的自适应测试时搜索视频生成） [01:43] 🎯 LaSeR: Reinforcement Learning with Last-Token Self-Rewarding（LaSeR：基于末词元自奖励的强化学习） [02:33] 🧩 TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar（TokDrift：当大模型用子词而代码用语法时） [03:35] 🧠 Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents（基于信息增益的策略优化：一种简单有效的多轮LLM智能体训练方法） [04:04] ⚡ Attention Is All You Need for KV Cache in Diffusion LLMs（扩散式大语言模型只需注意力即可搞定KV缓存） [04:45] 🤥 When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA（当模型撒谎时我们反而学到东西：用PsiloQA实现跨语言细粒度幻觉检测） [05:33] 📄 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model（PaddleOCR-VL：以9亿参数超轻量多模态模型刷新多语言文档解析性能） [06:13] 🧠 VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning（VR-Thinker：通过“边看边想”推理提升视频奖励模型） [06:52] 📐 MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning（MathCanvas：面向多模态数学推理的内生视觉思维链） [07:39] 🧠 COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes（COIG-Writer：高质量中文创意写作数据集，附带思维过程）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

8分钟

99+

2周前

2025.10.16 | UniMoE一统语音音乐；注意力图点亮大模型推理

HuggingFace 每日AI论文速递

本期的 15 篇论文如下： [00:21] 🎧 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE（UniMoE-Audio：基于动态容量MoE的统一语音与音乐生成模型） [00:57] 🔍 Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization（注意力照亮大模型推理：预规划-锚定节奏实现细粒度策略优化） [01:38] ⚡ FlashWorld: High-quality 3D Scene Generation within Seconds（FlashWorld：秒级高质量3D场景生成） [02:06] 🐝 Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs（Bee：高质量语料与全栈套件解锁完全开源多模态大模型） [02:37] 🗣 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue（InteractiveOmni：面向音视频多轮对话的统一全模态模型） [03:24] 🌍 PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning（PhysMaster：通过强化学习掌握视频生成的物理表征） [04:00] 🧪 LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models（LIBERO-Plus：视觉-语言-动作模型鲁棒性深度剖析） [04:43] 🚗 CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving（CVD-STORM：面向自动驾驶的跨视角视频扩散时空重建模型） [05:21] 🔍 Generative Universal Verifier as Multimodal Meta-Reasoner（生成式通用验证器：多模态元推理的反思引擎） [06:07] ⚖ ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs（ParallelBench：探明扩散式大模型并行解码的取舍） [06:43] 🎞 Trace Anything: Representing Any Video in 4D via Trajectory Fields（任意视频4D轨迹场表示：一次前馈即可还原每像素连续时空路径） [07:27] 🌍 Reasoning in Space via Grounding in the World（基于世界锚定的空间推理） [07:54] 🧠 The Role of Computing Resources in Publishing Foundation Model Research（计算资源在基础模型研究发表中的角色） [08:28] ⚖ UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning（UniME-V2：用多模态大模型当裁判，打造通用多模态表征） [09:05] 🤖 InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy（InternVLA-M1：面向通用机器人策略的空间引导视觉-语言-动作框架）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

99+

2周前

2025.10.15 | 像素级自监督ViT刷新生成基准；多智能体评测网文翻译新标尺

HuggingFace 每日AI论文速递

本期的 14 篇论文如下： [00:20] 🖼 Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training（通过自监督预训练推进端到端像素空间生成建模） [00:53] 📚 DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation（DITING：面向网络小说翻译评测的多智能体基准框架） [01:41] 🌐 Scaling Language-Centric Omnimodal Representation Learning（以语言为中心的跨模态表征扩展学习） [02:29] 🎯 Detect Anything via Next Point Prediction（通过下一点预测检测万物） [03:02] ⚡ FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution（FlashVSR：迈向实时扩散式流媒体视频超分辨率） [03:40] 🎯 Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models（时间对齐引导：扩散模型中的流形采样） [04:16] 🧠 Dr.LLM: Dynamic Layer Routing in LLMs（Dr.LLM：大模型中的动态层级路由） [05:03] 🎯 Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model（空间强迫：面向视觉-语言-动作模型的隐式空间表征对齐） [05:50] 🤖 ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning（ERA：借助具身先验学习与在线强化学习将视觉-语言模型转化为具身智能体） [06:35] 🤖 Robot Learning: A Tutorial（机器人学习教程：从强化学习到多任务通用模型） [07:27] 🔄 SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models（SRUM：面向统一多模态模型的细粒度自奖励机制） [08:01] 🧠 Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models（面向扩散大语言模型的边界引导策略优化：内存高效的强化学习） [09:06] 🖼 UniFusion: Vision-Language Model as Unified Encoder in Image Generation（UniFusion：将视觉-语言模型统一作为图像生成的编码器） [09:43] 🧠 Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks（记忆即行动：面向长程智能体任务的自主上下文策展）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

95

3周前

2025.10.14 | 量化误差变奖励，单卡训32B；面向多模态大模型的音视频评测基准

HuggingFace 每日AI论文速递

本期的 15 篇论文如下： [00:23] 🚀 QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs（QeRL：超越效率——面向大语言模型的量化增强强化学习） [01:22] 🧠 Diffusion Transformers with Representation Autoencoders（基于表示自编码器的扩散Transformer） [02:12] 🎬 OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs（OmniVideoBench：面向全向多模态大模型的音视频协同理解评测基准） [02:41] 🔄 Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States（潜变量精化解码：通过精化信念状态增强基于扩散的语言模型） [03:18] 🌊 RLFR: Extending Reinforcement Learning for LLMs with Flow Environment（RLFR：基于潜流环境扩展大模型强化学习） [04:11] 🔍 Spotlight on Token Perception for Multimodal Reinforcement Learning（多模态强化学习中token感知的光束聚焦） [04:50] 🎬 AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration（AVoCaDO：面向时序编排的音视频联合字幕生成器） [05:25] 🌐 DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training（DiT360：混合训练视角与全景数据的高保真全景图像生成） [05:56] 🧠 Demystifying Reinforcement Learning in Agentic Reasoning（揭开强化学习在智能体推理中的神秘面纱） [06:51] 🧮 Making Mathematical Reasoning Adaptive（让数学推理具备自适应性） [07:26] 🛡 Building a Foundational Guardrail for General Agentic Systems via Synthetic Data（面向通用智能体的基础护栏：基于合成数据的预执行安全框架） [08:05] 🧠 ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems（ACADREASON：用学术研究问题探索推理模型的极限） [08:43] 🎨 InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models（InternSVG：用多模态大模型统一搞定SVG理解、编辑与生成） [09:23] 🧾 FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs（FinAuditing：面向LLM评估的财务分类多文档基准） [10:09] 🧠 GIR-Bench: Versatile Benchmark for Generating Images with Reasoning（GIR-Bench：面向推理图像生成的多功能基准）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

99+

3周前

2025.10.13 | 桌面交互预训练解锁机器人潜能；统一模型赋予相机空间想象力

HuggingFace 每日AI论文速递

本期的 14 篇论文如下： [00:20] 🖥 D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI（D2E：利用桌面数据规模化视觉-动作预训练以迁移至具身智能） [01:13] 📷 Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation（基于相机的统一多模态理解与生成模型） [01:56] 🎨 TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling（TAG：抑制幻觉的扩散采样切向放大引导） [02:31] 🧠 Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs（多模态提示优化：为何不为多模态大模型释放全模态潜能） [03:05] 🚀 AutoPR: Let's Automate Your Academic Promotion!（AutoPR：让学术晋升一键自动化！） [03:39] 🧭 R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?（R-HORIZON：你的大推理模型在广度与深度上究竟能走多远？） [04:14] 🚀 Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels（Webscale-RL：把强化学习数据扩展到预训练体量的自动化流水线） [04:56] 🛰 SpaceVista: All-Scale Visual Spatial Reasoning from mm to km（SpaceVista：毫米到千米全尺度视觉空间推理） [05:37] 🎥 StreamingVLM: Real-Time Understanding for Infinite Video Streams（StreamingVLM：面向无限视频流的实时理解框架） [06:19] 🌐 KORMo: Korean Open Reasoning Model for Everyone（KORMo：人人可用的韩语开放推理模型） [06:42] ♻ Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting（别浪费错误：通过置信度加权利用负RL组） [07:25] 🧠 Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization（从推理到学习的桥梁：以复杂度分布外泛化揭穿幻觉） [08:16] ⚡ DISCO: Diversifying Sample Condensation for Efficient Model Evaluation（DISCO：以模型分歧为导向的样本浓缩加速评测） [08:56] 🚗 Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction（面向开放词汇占用预测的各向异性采样渐进高斯Transformer）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

99+

3周前

【周末特辑】10月第2周最火AI论文 | 递归小模型刷爆推理榜；未来经验点亮零奖励学习

HuggingFace 每日AI论文速递

本期的 5 篇论文如下： [00:33] TOP1(🔥300) | 🧠 Less is More: Recursive Reasoning with Tiny Networks（小而精：用微型网络递归推理） [02:16] TOP2(🔥164) | 🌱 Agent Learning via Early Experience（基于早期经验的主体学习） [04:15] TOP3(🔥105) | 🧠 Apriel-1.5-15b-Thinker（Apriel-1.5-15B-Thinker：以小博大实现前沿多模态推理的15B开源模型） [06:17] TOP4(🔥97) | 🧠 MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization（MM-HELIX：以整体平台与自适应混合策略优化激发多模态长链反思推理） [08:45] TOP5(🔥88) | 🎬 Paper2Video: Automatic Video Generation from Scientific Papers（论文自动生成学术演讲视频）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

99+

3周前