Episode list: HuggingFace Daily AI Papers Digest - EarsOnMe

2024.11.25 Daily AI Papers | Style-friendly SNR sampler improves image generation; TÜLU 3 open-source model pulls ahead on performance.

This episode covers 14 papers:
[00:26] 🎨 Style-Friendly SNR Sampler for Style-Driven Generation
[01:08] 🚀 TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
[01:53] 🌐 OminiControl: Minimal and Universal Control for Diffusion Transformer
[02:31] 🛡 A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
[03:08] 🧠 Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
[03:49] 🎥 VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
[04:29] 🎮 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
[05:13] 🎥 Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
[05:56] 👴 MyTimeMachine: Personalized Facial Age Transformation
[06:34] 🎥 Novel View Extrapolation with Video Diffusion Priors
[07:10] 🎥 VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
[07:54] ☁ Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images
[08:31] 🤖 One to rule them all: natural language to bind communication, perception and action
[09:15] 🤖 WildLMa: Long Horizon Loco-Manipulation in the Wild

[Follow us] You can also find us on the following platform for more than what is in the podcast. Xiaohongshu: AI速递

10 min

[Weekend Special] Hottest AI papers, week 4 of November | LLaVA-o1 improves multimodal reasoning; Genex optimizes embodied AI planning.

This episode covers 5 papers:
[00:41] TOP1(🔥93) | 🧠 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
[02:41] TOP2(🔥55) | 🌍 Generative World Explorer
[05:00] TOP3(🔥44) | 🧠 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
[07:11] TOP4(🔥41) | 📚 RedPajama: an Open Dataset for Training Large Language Models
[09:20] TOP5(🔥41) | ⚡ SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

11 min

2024.11.22 Daily AI Papers | Mixed preference optimization improves reasoning; a new approach to multimodal autoregressive pre-training.

This episode covers 14 papers:
[00:26] 🧠 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
[01:12] 🌐 Multimodal Autoregressive Pre-training of Large Vision Encoders
[01:55] 🧠 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
[02:40] 🧠 Hymba: A Hybrid-head Architecture for Small Language Models
[03:22] 🚀 Ultra-Sparse Memory Network
[03:58] 📚 OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
[04:47] 🧠 Natural Language Reinforcement Learning
[05:26] 🧠 Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
[06:08] 🤖 Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
[06:46] 🌊 Stable Flow: Vital Layers for Training-Free Image Editing
[07:25] 🌐 UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
[08:03] 🚗 MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
[08:44] 🧠 Patience Is The Key to Large Language Model Reasoning
[09:18] 🌐 Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

10 min

2024.11.21 Daily AI Papers | 4-bit attention delivers a significant speedup; a comprehensive benchmark for video generation.

This episode covers 8 papers:
[00:28] ⚡ SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
[01:10] 📹 VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
[01:51] 🎮 VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
[02:33] 🎯 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
[03:10] 🌐 Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
[03:52] 🔄 When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
[04:34] 🎨 Stylecodes: Encoding Stylistic Information For Image Generation
[05:11] 🩺 ORID: Organ-Regional Information Driven Framework for Radiology Report Generation

6 min

2024.11.20 Daily AI Papers | Faster image generation; a new dataset for language models

This episode covers 7 papers:
[00:33] ⚡ Continuous Speculative Decoding for Autoregressive Image Generation
[01:14] 📚 RedPajama: an Open Dataset for Training Large Language Models
[01:58] 🤖 Soft Robotic Dynamic In-Hand Pen Spinning
[02:39] 🚀 ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
[03:13] 🔒 Building Trust: Foundations of Security, Safety and Transparency in AI
[03:46] 🔍 SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
[04:24] 📊 Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages

5 min

2024.11.19 Daily AI Papers | Efficient deployment on mobile devices; virtual exploration for embodied AI

This episode covers 16 papers:
[00:25] 📱 BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
[01:06] 🌍 Generative World Explorer
[01:43] 🔍 Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
[02:24] 🎥 AnimateAnything: Consistent and Controllable Animation for Video Generation
[03:08] 🧠 Top-$nσ$: Not All Logits Are You Need
[03:55] 🧠 Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
[04:40] ⚡ SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
[05:19] 📚 Drowning in Documents: Consequences of Scaling Reranker Inference
[06:00] 🩺 Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
[06:37] 📱 SlimLM: An Efficient Small Language Model for On-Device Document Assistance
[07:19] 🎥 VeGaS: Video Gaussian Splatting
[07:50] 🔄 Adaptive Decoding via Latent Preference Optimization
[08:27] 🎥 StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
[09:11] 🇩🇪 LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
[09:43] 👕 FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
[10:18] 📜 Evaluating the role of `Constitutions' for learning from AI feedback

11 min

2024.11.18 Daily AI Papers | Better reasoning for vision-language models; finer-grained control for image generation

This episode covers 6 papers:
[00:28] 🧠 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
[01:14] 🎨 Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
[01:51] 🌐 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
[02:25] 🌅 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
[03:00] 📖 Number it: Temporal Grounding Videos like Flipping Manga
[03:45] 🌍 Xmodel-1.5: An 1B-scale Multilingual LLM

4 min

[Weekend Special] Hottest AI papers, week 3 of November | Add-it improves object insertion in images; LLMs self-improve on long-context reasoning.

This episode covers 5 papers:
[00:44] TOP1(🔥54) | 🖼 Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
[02:31] TOP2(🔥44) | 🤖 Large Language Models Can Self-Improve in Long-context Reasoning
[04:15] TOP3(🔥43) | 🌐 LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
[06:12] TOP4(🔥42) | 🎨 OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
[08:01] TOP5(🔥42) | 📚 M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

10 min

2024.11.15 Daily AI Papers | Efficient image editing; 3D mesh generation

This episode covers 7 papers:
[00:27] ✨ MagicQuill: An Intelligent Interactive Image Editing System
[01:15] 🌐 LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
[01:50] 💾 Cut Your Losses in Large-Vocabulary Language Models
[02:22] 🏥 ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
[03:02] 🤖 Hermes: A Large Language Model Framework on the Journey to Autonomous Networks
[03:36] 🎥 Sharingan: Extract User Action Sequence from Desktop Recordings
[04:21] 🤔 Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples

5 min

2024.11.14 Daily AI Papers | LLMs show marked self-improvement; the new EgoVid-5M dataset.

This episode covers 7 papers:
[00:26] 🤖 Large Language Models Can Self-Improve in Long-context Reasoning
[01:09] 🎥 EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
[01:58] 🔍 Direct Preference Optimization Using Sparse Feature-Level Constraints
[02:37] 🇫🇷 CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
[03:18] 🧠 Can sparse autoencoders be used to decompose and interpret steering vectors?
[03:58] 🎵 PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
[04:39] 🎥 Motion Control for Enhanced Complex Action Video Generation

5 min

2024.11.13 Daily AI Papers | A new framework for 3D object segmentation; a multimodal understanding and generation model

This episode covers 6 papers:
[00:28] 🔍 SAMPart3D: Segment Any Part in 3D Objects
[01:06] 🌐 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
[01:42] 🤔 Stronger Models are NOT Stronger Teachers for Instruction Tuning
[02:21] 🌐 Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
[03:02] 📚 BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
[03:55] 🔍 Hardware and Software Platform Inference

4 min