Episode list: HuggingFace 每日AI论文速递 (HuggingFace Daily AI Papers Express) - EarsOnMe

[Weekend Special] Hottest AI papers of the 5th week of December | Improving medical reasoning, automating GUI trajectory construction.

This episode covers the following 5 papers:
[00:35] TOP1 (🔥83) | 🧠 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
[02:49] TOP2 (🔥65) | 🤖 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[04:50] TOP3 (🔥63) | 🎨 1.58-bit FLUX (described in the show notes as the first method to successfully quantize a state-of-the-art text-to-image model)
[07:00] TOP4 (🔥60) | 🔍 Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
[09:02] TOP5 (🔥53) | 📚 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
[Follow us] You can also find us on the following platforms for more information beyond the podcast content. Xiaohongshu: AI速递

11 min

2025.01.03 Daily AI Papers | A multimodal textbook improves vision-language model performance; VideoAnydoor achieves high-fidelity video object insertion

This episode covers the following 17 papers:
[00:24] 📚 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
[01:02] 🎥 VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
[01:39] 🎥 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
[02:13] 🏆 CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
[02:52] 🎨 Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[03:29] 🤖 ProgCo: Program Helps Self-Correction of Large Language Models
[04:03] 🗺 MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
[04:41] 🤖 A3: Android Agent Arena for Mobile GUI Agents
[05:21] 🧪 Dynamic Scaling of Unit Tests for Code Reward Modeling
[05:57] 🛡 MLLM-as-a-Judge for Image Safety without Human Labeling
[06:40] 🎥 LTX-Video: Realtime Video Latent Diffusion
[07:15] 🗺 MapQaTor: A System for Efficient Annotation of Map Query Datasets
[07:51] 🔍 Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
[08:29] 🎥 SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
[09:13] 🤖 SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
[09:50] 🧠 Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
[10:27] 📊 Population Aware Diffusion for Time Series Generation

11 min

2025.01.02 Daily AI Papers | Automating GUI agent trajectory construction, optimizing language models for reasoning tasks.

This episode covers the following 2 papers:
[00:26] 🤖 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[01:10] 🧠 Xmodel-2 Technical Report

2 min

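2024.12.31 Daily AI Papers | Explanatory instructions improve generalization on vision tasks; multimodal models improve generalization in medical imaging.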

This episode covers the following 10 papers:
[00:25] 🔍 Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
[01:13] 🧠 On the Compositional Generalization of Multimodal LLMs for Medical Imaging
[02:02] ⚙ Efficiently Serving LLM Reasoning Programs with Certaindex
[02:44] 🎨 Edicho: Consistent Image Editing in the Wild
[03:22] 🎵 TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
[04:04] 🎥 Bringing Objects to Life: 4D generation from 3D objects
[04:47] 🧠 Facilitating large language model Russian adaptation with Learned Embedding Propagation
[05:25] 🤖 HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
[06:12] 🤖 Training Software Engineering Agents and Verifiers with SWE-Gym
[06:52] 🧠 OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

7 min

2024.12.30 Daily AI Papers | HuatuoGPT-o1 improves medical reasoning; Orient Anything accurately estimates object orientation.

This episode covers the following 8 papers:
[00:30] 🧠 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
[01:16] 🧭 Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
[02:03] 🔍 Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
[02:50] 🧬 The Superposition of Diffusion Models Using the Itô Density Estimator
[03:33] 🎨 From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
[04:16] 🛡 Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
[04:56] 📊 SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
[05:47] 🎥 VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

6 min

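[Weekend Special] Hottest AI papers of the 4th week of December | Robust fine-tuning improves LLM resistance to noise; parallel generation speeds up visual models.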

This episode covers the following 5 papers:
[00:37] TOP1 (🔥78) | 🛡 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
[02:57] TOP2 (🔥47) | ⚡ Parallelized Autoregressive Visual Generation
[05:16] TOP3 (🔥38) | 🔄 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
[07:23] TOP4 (🔥37) | 🧠 Diving into Self-Evolving Training for Multimodal Reasoning
[09:53] TOP5 (🔥33) | 🧠 Offline Reinforcement Learning for LLM Multi-Step Reasoning

12 min

2024.12.27 Daily AI Papers | YuLan-Mini improves data efficiency; Gist Tokens optimize context compression.

This episode covers the following 4 papers:
[00:26] 🧠 YuLan-Mini: An Open Data-efficient Language Model
[01:05] 🔍 A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
[01:49] 🤖 Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
[02:36] 🔍 MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

3 min


2024.12.26 Daily AI Papers | Token budgets optimize reasoning; Video-Panda improves video processing efficiency.

This episode covers the following 4 papers:
[00:27] 💡 Token-Budget-Aware LLM Reasoning
[01:07] 🎥 Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
[01:49] 🧠 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
[02:44] 🧬 PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

3 min


2024.12.25 Daily AI Papers | Improving 3D scene understanding, filling in missing depth information.

This episode covers the following 9 papers:
[00:26] 🧠 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
[01:11] 🖼 DepthLab: From Partial to Complete
[01:54] 📊 Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
[02:35] 🎥 DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
[03:26] 🤔 In Case You Missed It: ARC 'Challenge' Is Not That Challenging
[04:02] 🧠 ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
[04:41] 🧩 PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
[05:20] 🧠 SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
[06:02] 🧠 Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

7 min

2024.12.24 Daily AI Papers | Balancing exploration and exploitation, better handling of noisy data.

This episode covers the following 16 papers:
[00:24] 🔄 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
[01:04] 🛡 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
[01:43] 🧠 Diving into Self-Evolving Training for Multimodal Reasoning
[02:29] ⚡ Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
[03:12] 🎥 Large Motion Video Autoencoding with Cross-modal Video VAE
[03:56] 🧠 Deliberation in Latent Space via Differentiable Cache Augmentation
[04:41] 📚 Revisiting In-Context Learning with Long Context Language Models
[05:25] 🧠 Outcome-Refining Process Supervision for Code Generation
[06:11] 🧠 DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
[06:48] 📚 LearnLM: Improving Gemini for Learning
[07:33] ⚠ Agent-SafetyBench: Evaluating the Safety of LLM Agents
[08:15] 🧠 OpenAI o1 System Card
[09:03] 🧠 NILE: Internal Consistency Alignment in Large Language Models
[09:45] 🤖 OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
[10:26] 🗣 Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
[10:59] 🌙 PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

12 min


2024.12.23 Daily AI Papers | Accelerating visual generation, optimizing multi-step reasoning

This episode covers the following 10 papers:
[00:22] ⚡ Parallelized Autoregressive Visual Generation
[01:05] 🧠 Offline Reinforcement Learning for LLM Multi-Step Reasoning
[01:43] 🔑 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
[02:30] 🚀 CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
[03:14] 🎥 Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
[04:01] 🧠 MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
[04:37] 🌍 LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
[05:23] 🎥 Sequence Matters: Harnessing Video Models in 3D Super-Resolution
[06:21] 🇳🇱 Fietje: An open, efficient LLM for Dutch
[07:14] 👤 IDOL: Instant Photorealistic 3D Human Creation from a Single Image

8 min