Episode list: HuggingFace Daily AI Paper Express (HuggingFace 每日AI论文速递) - EarsOnMe

[Weekend Special] Hottest AI Papers of December, Week 4 | Robust fine-tuning improves LLM noise resistance; parallel generation speeds up visual models.

The 5 papers in this episode:
[00:37] TOP1(🔥78) | 🛡 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
[02:57] TOP2(🔥47) | ⚡ Parallelized Autoregressive Visual Generation
[05:16] TOP3(🔥38) | 🔄 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
[07:23] TOP4(🔥37) | 🧠 Diving into Self-Evolving Training for Multimodal Reasoning
[09:53] TOP5(🔥33) | 🧠 Offline Reinforcement Learning for LLM Multi-Step Reasoning
[Follow us] You can also find us on the following platform for more information beyond the podcast. Xiaohongshu: AI速递

12 minutes

2024.12.27 Daily AI Papers | YuLan-Mini improves data efficiency; gist tokens optimize context compression.

The 4 papers in this episode:
[00:26] 🧠 YuLan-Mini: An Open Data-efficient Language Model
[01:05] 🔍 A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
[01:49] 🤖 Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
[02:36] 🔍 MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

3 minutes

82

2024.12.26 Daily AI Papers | Token budgets optimize reasoning; Video-Panda improves video processing efficiency.

The 4 papers in this episode:
[00:27] 💡 Token-Budget-Aware LLM Reasoning
[01:07] 🎥 Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
[01:49] 🧠 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
[02:44] 🧬 PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

3 minutes

77

2024.12.25 Daily AI Papers | Improving 3D scene understanding; filling in missing depth information.

The 9 papers in this episode:
[00:26] 🧠 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
[01:11] 🖼 DepthLab: From Partial to Complete
[01:54] 📊 Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
[02:35] 🎥 DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
[03:26] 🤔 In Case You Missed It: ARC 'Challenge' Is Not That Challenging
[04:02] 🧠 ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
[04:41] 🧩 PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
[05:20] 🧠 SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
[06:02] 🧠 Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

7 minutes

2024.12.24 Daily AI Papers | Balancing exploration and exploitation; better handling of noisy data.

The 16 papers in this episode:
[00:24] 🔄 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
[01:04] 🛡 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
[01:43] 🧠 Diving into Self-Evolving Training for Multimodal Reasoning
[02:29] ⚡ Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
[03:12] 🎥 Large Motion Video Autoencoding with Cross-modal Video VAE
[03:56] 🧠 Deliberation in Latent Space via Differentiable Cache Augmentation
[04:41] 📚 Revisiting In-Context Learning with Long Context Language Models
[05:25] 🧠 Outcome-Refining Process Supervision for Code Generation
[06:11] 🧠 DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
[06:48] 📚 LearnLM: Improving Gemini for Learning
[07:33] ⚠ Agent-SafetyBench: Evaluating the Safety of LLM Agents
[08:15] 🧠 OpenAI o1 System Card
[09:03] 🧠 NILE: Internal Consistency Alignment in Large Language Models
[09:45] 🤖 OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
[10:26] 🗣 Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
[10:59] 🌙 PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

12 minutes

95

2024.12.23 Daily AI Papers | Accelerating visual generation; optimizing multi-step reasoning.

The 10 papers in this episode:
[00:22] ⚡ Parallelized Autoregressive Visual Generation
[01:05] 🧠 Offline Reinforcement Learning for LLM Multi-Step Reasoning
[01:43] 🔑 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
[02:30] 🚀 CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
[03:14] 🎥 Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
[04:01] 🧠 MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
[04:37] 🌍 LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
[05:23] 🎥 Sequence Matters: Harnessing Video Models in 3D Super-Resolution
[06:21] 🇳🇱 Fietje: An open, efficient LLM for Dutch
[07:14] 👤 IDOL: Instant Photorealistic 3D Human Creation from a Single Image

8 minutes

[Weekend Special] Hottest AI Papers of December, Week 3 | Qwen2.5 improves LLM performance; Apollo optimizes video understanding.

The 5 papers in this episode:
[00:40] TOP1(🔥252) | 🤖 Qwen2.5 Technical Report
[02:31] TOP2(🔥127) | 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[04:30] TOP3(🔥86) | 🚀 Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
[06:59] TOP4(🔥82) | 🌍 GenEx: Generating an Explorable World
[08:58] TOP5(🔥79) | 🧠 Are Your LLMs Capable of Stable Reasoning?

11 minutes

2024.12.20 Daily AI Papers | Data augmentation improves LLM performance; breakthroughs in multimodal reasoning frameworks.

The 14 papers in this episode:
[00:22] 🤖 Qwen2.5 Technical Report
[01:00] 🧠 Progressive Multimodal Reasoning via Active Retrieval
[01:39] 🌐 MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
[02:26] 🧠 LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
[03:15] 📊 How to Synthesize Text Data without Model Collapse?
[03:56] 🌊 Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
[04:37] 🎥 LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
[05:20] 🖼 Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
[06:05] 🌐 DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
[06:46] 🧠 AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
[07:33] 🧠 Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
[08:14] 🖼 UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
[08:54] 🧪 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
[09:36] 🕺 Move-in-2D: 2D-Conditioned Human Motion Generation

10 minutes

78

2024.12.19 Daily AI Papers | AI agents show limited task performance; animation production gets more efficient.

The 18 papers in this episode:
[00:24] 🤖 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
[01:06] 🎥 AniDoc: Animation Creation Made Easier
[01:44] 👗 FashionComposer: Compositional Fashion Image Generation
[02:28] 🤖 Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[03:05] 🌐 Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
[03:42] 🔄 Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
[04:26] 🤖 GUI Agents: A Survey
[05:12] 🌍 AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
[05:51] 📊 RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
[06:40] 🧠 LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
[07:30] 🤖 Learning from Massive Human Videos for Universal Humanoid Pose Control
[08:05] 🤖 ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
[08:49] 🎥 VidTok: A Versatile and Open-Source Video Tokenizer
[09:28] 🧠 Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:13] 🔄 CAD-Recode: Reverse Engineering CAD Code from Point Clouds
[10:54] 🤖 AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
[11:39] 🤖 Alignment faking in large language models
[12:19] ⚡ FastVLM: Efficient Vision Encoding for Vision Language Models

13 minutes

2024.12.18 Daily AI Papers | Reasoning ability needs improvement; multimodal models need optimization.

The 8 papers in this episode:
[00:24] 🧠 Are Your LLMs Capable of Stable Reasoning?
[01:06] 📊 Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
[01:52] 📊 OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
[02:33] 🧠 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
[03:16] 🤖 Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
[04:00] 📊 VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
[04:39] 🤔 When to Speak, When to Abstain: Contrastive Decoding with Abstention
[05:18] 🎥 MIVE: New Design and Benchmark for Multi-Instance Video Editing

6 minutes

2024.12.17 Daily AI Papers | Improving retrieval-generation efficiency; optimizing visual generation evaluation.

The 18 papers in this episode:
[00:23] 🧠 RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[01:05] ⚡ Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
[01:45] 🎨 BrushEdit: All-In-One Image Inpainting and Editing
[02:27] 🎨 ColorFlow: Retrieval-Augmented Image Sequence Colorization
[03:10] 🧩 Byte Latent Transformer: Patches Scale Better Than Tokens
[03:56] 🧠 Causal Diffusion Transformers for Generative Modeling
[04:33] 🤖 Smaller Language Models Are Better Instruction Evolvers
[05:16] 🌟 IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
[06:02] 🌳 SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
[06:47] 🌌 Wonderland: Navigating 3D Scenes from a Single Image
[07:32] 🔬 GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
[08:18] ⚡ SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
[09:06] 🧠 Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
[09:46] 👩 StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
[10:35] 🌐 MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
[11:19] 🎵 Whisper-GPT: A Hybrid Representation Audio Large Language Model
[12:10] 🤖 TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
[13:01] 🔒 Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning

14 minutes