HuggingFace Daily AI Paper Digest - Episode List

2024.10.15 Daily AI Papers | MMIE advances LVLMs, LOKI evaluates synthetic data detection.

The 15 papers in this episode:
[00:24] 🌐 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
[01:06] 🤖 LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
[02:01] 🔍 Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
[02:36] 📊 MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
[03:12] 🎥 Animate-X: Universal Character Image Animation with Enhanced Motion Representation
[04:02] 📚 Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
[04:44] 📚 LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
[05:29] 🎥 Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
[06:09] ⏳ TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
[06:58] 🌊 Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
[07:40] 📊 Rethinking Data Selection at Scale: Random Selection is Almost All You Need
[08:26] 🌲 Tree of Problems: Improving structured problem solving with compositionality
[09:13] 📺 TVBench: Redesigning Video-Language Evaluation
[09:54] 🤖 Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
[10:29] 📚 LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

[Follow us] You can also find us on the following platform for more than the podcast itself. Xiaohongshu: AI速递. Episode transcripts are available on Xiaoyuzhou.

11 minutes

2024.10.14 Daily AI Papers | Multimodal model Baichuan-Omni open-sourced, Meissonic improves text-to-image efficiency

The 16 papers in this episode:
[00:25] 🌐 Baichuan-Omni Technical Report
[00:59] 🖼 Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
[01:41] 🔧 From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
[02:17] 🎨 EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
[02:53] 🧠 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
[03:34] 📏 PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
[04:11] 🌐 Semantic Score Distillation Sampling for Compositional Text-to-3D Generation
[04:47] 🧠 SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
[05:29] 🔄 Mechanistic Permutability: Match Features Across Layers
[06:07] 🤖 Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
[06:45] ⚡ KV Prediction for Improved Time to First Token
[07:30] 🌐 ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
[08:13] 🚨 MiRAGeNews: Multimodal Realistic AI-Generated News Detection
[08:52] 🤖 DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
[09:30] 📈 I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
[10:12] 🧠 Mentor-KD: Making Small Language Models Better Multi-step Reasoners

11 minutes

[Weekend Special] Hottest AI papers of the 2nd week of October | Differential Transformer improves text processing, L-Mul algorithm reduces energy consumption.

The 5 papers in this episode:
[00:37] TOP1(🔥128) | 🔍 Differential Transformer
[02:38] TOP2(🔥125) | ⚡ Addition is All You Need for Energy-efficient Language Models
[04:13] TOP3(🔥84) | 🌐 Aria: An Open Multimodal Native Mixture-of-Experts Model
[06:18] TOP4(🔥73) | 🤖 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
[08:25] TOP5(🔥63) | 👤 Personalized Visual Instruction Tuning

10 minutes

2024.10.11 Daily AI Papers | Mathematical code improves reasoning, prefix quantization speeds up models

The 21 papers in this episode:
[00:25] 🧮 MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
[01:09] 🚀 PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
[01:59] 🤖 MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
[02:33] 🎨 DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
[03:03] 🔄 Benchmarking Agentic Workflow Generation
[03:44] 🤖 Agent S: An Open Agentic Framework that Uses Computers Like a Human
[04:23] 🔄 Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
[04:55] 🤖 Intriguing Properties of Large Language and Vision Models
[05:35] 🎥 Progressive Autoregressive Video Diffusion Models
[06:26] 🌲 Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
[07:10] 🌐 Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
[07:50] 🤖 GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
[08:36] 🧩 SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
[09:15] 🔄 Emergent properties with repeated examples
[09:57] 🤖 Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
[10:40] 🎲 Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
[11:14] 🌐 Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
[11:58] 🧬 LPZero: Language Model Zero-cost Proxy Search from Zero
[12:41] 🌐 MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting
[13:15] 🔍 Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
[13:51] 🖼 DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

14 minutes

[Month-End Special] Hottest AI papers of September | Reinforcement learning improves language models, code intelligence models perform strongly.

The 10 papers in this episode:
[00:40] TOP1(🔥129) | 🤖 Training Language Models to Self-Correct via Reinforcement Learning
[02:41] TOP2(🔥121) | 🚀 Qwen2.5-Coder Technical Report
[04:44] TOP3(🔥96) | 🌐 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
[06:30] TOP4(🔥95) | 🖼 Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
[08:23] TOP5(🔥86) | 🧠 Attention Heads of Large Language Models: A Survey
[10:17] TOP6(🔥85) | 🎥 Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
[11:56] TOP7(🔥81) | 🌐 OmniGen: Unified Image Generation
[13:51] TOP8(🔥81) | 🧠 Emu3: Next-Token Prediction is All You Need
[15:45] TOP9(🔥78) | 📄 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
[17:59] TOP10(🔥77) | 🧠 OLMoE: Open Mixture-of-Experts Language Models

20 minutes

2024.10.10 Daily AI Papers | LLMs perform differently in economic games, personalized visual instruction tuning improves AI interaction.

The 43 papers in this episode:
[00:23] 🤖 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
[01:09] 👤 Personalized Visual Instruction Tuning
[01:48] 🌍 Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
[02:35] 🖼 IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
[03:17] 🔍 Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
[03:54] 🌐 Aria: An Open Multimodal Native Mixture-of-Experts Model
[04:29] 🌐 Pixtral 12B
[05:09] 🎥 Pyramidal Flow Matching for Efficient Video Generative Modeling
[05:49] 🔗 Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
[06:29] 🎥 MM-Ego: Towards Building Egocentric Multimodal LLMs
[07:07] 🔄 One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
[07:51] 📖 Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
[08:33] 🚀 Self-Boosting Large Language Models with Synthetic Preference Data
[09:13] 🚀 Falcon Mamba: The First Competitive Attention-free 7B Language Model
[09:53] 🎨 TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
[10:24] ⏳ Temporal Reasoning Transfer from Text to Video
[10:54] 🎥 TRACE: Temporal Grounding Video LLM via Causal Event Modeling
[11:30] 📊 Data Selection via Optimal Control for Language Models
[12:07] 🤖 Response Tuning: Aligning Large Language Models without Instruction
[12:49] 🤖 CursorCore: Assist Programming through Aligning Anything
[13:36] 🎥 ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
[14:16] 🗣 Mixed-Session Conversation with Egocentric Memory
[14:57] 🎮 ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
[15:41] 🔓 AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
[16:26] 🎥 T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
[17:00] 📖 Collective Critics for Creative Story Generation
[17:36] 🎵 Diversity-Rewarded CFG Distillation
[18:16] 🧠 Retrieval-Augmented Decision Transformer: External Memory for In-context RL
[18:57] 🎙 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
[19:32] 🎹 FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
[20:20] 🧠 Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
[21:01] 🧬 Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
[21:38] 🎥 BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way
[22:21] 🚨 Multimodal Situational Safety
[22:56] 💥 Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
[23:38] 🛠 Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach
[24:18] 🌐 Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control
[24:55] 🤖 TinyEmo: Scaling down Emotional Reasoning via Metric Projection
[25:29] 🧠 MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
[26:08] 🎭 TextToon: Real-Time Text Toonify Head Avatar from Single Video
[26:49] 🤖 Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
[27:28] 📊 MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
[28:03] 🧠 Does Spatial Cognition Emerge in Frontier Models?

29 minutes

2024.10.09 Daily AI Papers | Evaluating long-context generation, instruction diversity affects generalization

The 9 papers in this episode:
[00:28] 📚 LongGenBench: Long-context Generation Benchmark
[01:11] 🌐 Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization
[01:50] 📊 RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
[02:35] 🌟 A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
[03:25] 🎥 Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
[04:00] 🎨 ControlAR: Controllable Image Generation with Autoregressive Models
[04:45] 🔍 Hyper-multi-step: The Truth Behind Difficult Long-context Tasks
[05:21] 🤖 MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
[06:03] 📊 EBES: Easy Benchmarking for Event Sequences

7 minutes

2024.10.08 Daily AI Papers | Differential Transformer optimizes attention, LLM hallucination research reveals error patterns.

The 21 papers in this episode:
[00:26] 🔍 Differential Transformer
[01:04] 🧠 LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
[01:50] 📹 VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
[02:28] 📈 FAN: Fourier Analysis Networks
[03:05] 🏥 Named Clinical Entity Recognition Benchmark
[03:37] 🔬 ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
[04:19] 🎶 UniMuMo: Unified Text, Music and Motion Generation
[04:55] 🔍 TLDR: Token-Level Detective Reward Model for Large Vision Language Models
[05:35] 🎵 Presto! Distilling Steps and Layers for Accelerating Music Generation
[06:08] 🖥 Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
[06:49] 🖼 OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
[07:29] 🌀 MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
[08:09] 🧠 LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
[08:50] 📊 MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs
[09:39] 📊 GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
[10:34] 🤖 Autonomous Character-Scene Interaction Synthesis from Text Instruction
[11:12] 🧩 TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
[12:00] 🤖 Grounding Language in Multi-Perspective Referential Communication
[12:48] 🎯 SePPO: Semi-Policy Preference Optimization for Diffusion Alignment
[13:25] 🧩 What Matters for Model Merging at Scale?
[14:02] 📊 SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

15 minutes

2024.10.07 Daily AI Papers | New energy-saving algorithm for efficient language models, vision-language model reasoning still needs improvement.

The 12 papers in this episode:
[00:25] ⚡ Addition is All You Need for Energy-efficient Language Models
[01:03] 🧠 NL-Eye: Abductive NLI for Images
[01:40] 🔍 Selective Attention Improves Transformer
[02:17] ⚡ Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
[02:48] 🤖 Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
[03:27] 🩺 A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
[04:12] 🎨 RoCoTex: A Robust Method for Consistent Texture Synthesis with Diffusion Models
[04:59] 🧠 Erasing Conceptual Knowledge from Language Models
[05:37] 📈 MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction
[06:16] 🤖 CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction
[06:54] 🌳 NRGBoost: Energy-Based Generative Boosted Trees
[07:37] 🤖 GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

8 minutes

[Weekend Special] Hottest AI papers of the 1st week of October | Emu3 performs strongly, the Law of the Weakest Link constrains LLMs.

The 5 papers in this episode:
[00:47] TOP1(🔥73) | 🧠 Emu3: Next-Token Prediction is All You Need
[02:42] TOP2(🔥48) | 🔗 Law of the Weakest Link: Cross Capabilities of Large Language Models
[04:26] TOP3(🔥45) | 🌐 MIO: A Foundation Model on Multimodal Tokens
[06:26] TOP4(🔥44) | 🌐 Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
[08:27] TOP5(🔥43) | 🧠 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

10 minutes

2024.10.04 Daily AI Papers | Caption type affects model performance, a breakthrough in long video generation.

The 19 papers in this episode:
[00:24] 🔄 Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
[01:04] 🎥 Loong: Generating Minute-level Long Videos with Autoregressive Language Models
[01:39] 🎥 Video Instruction Tuning With Synthetic Data
[02:18] 🧐 LLaVA-Critic: Learning to Evaluate Multimodal Models
[02:56] 🔍 Contrastive Localized Language-Image Pre-Training
[03:31] 🌱 VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
[04:07] 🌟 Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
[04:51] 🔗 Large Language Models as Markov Chains
[05:26] 🧠 CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
[06:03] 🔄 Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models
[06:51] 🔄 Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
[07:36] ⚡ SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
[08:14] 🌐 MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis
[08:54] 📚 L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
[09:38] 🩺 MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
[10:24] 🎥 Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
[11:01] 🗣 Distilling an End-to-End Voice Assistant Without Instruction Training Data
[11:46] ♟ Learning the Latent Rules of a Game from Data: A Chess Story
[12:29] 🎵 Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

13 minutes
