Episode list: HuggingFace 每日AI论文速递 (Daily AI Paper Express) - EarsOnMe

2024.12.20 Daily AI Papers | Data augmentation boosts LLM performance; new advances in multimodal reasoning frameworks

This episode covers the following 14 papers:
[00:22] 🤖 Qwen2.5 Technical Report
[01:00] 🧠 Progressive Multimodal Reasoning via Active Retrieval
[01:39] 🌐 MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
[02:26] 🧠 LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
[03:15] 📊 How to Synthesize Text Data without Model Collapse?
[03:56] 🌊 Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
[04:37] 🎥 LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
[05:20] 🖼 Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
[06:05] 🌐 DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
[06:46] 🧠 AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
[07:33] 🧠 Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
[08:14] 🖼 UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
[08:54] 🧪 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
[09:36] 🕺 Move-in-2D: 2D-Conditioned Human Motion Generation
[Follow us] You can also find us on the following platform for more beyond the podcast. Xiaohongshu: AI速递

10 minutes

2024.12.19 Daily AI Papers | AI agents show limited task performance; animation production gets more efficient

This episode covers the following 18 papers:
[00:24] 🤖 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
[01:06] 🎥 AniDoc: Animation Creation Made Easier
[01:44] 👗 FashionComposer: Compositional Fashion Image Generation
[02:28] 🤖 Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[03:05] 🌐 Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
[03:42] 🔄 Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
[04:26] 🤖 GUI Agents: A Survey
[05:12] 🌍 AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
[05:51] 📊 RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
[06:40] 🧠 LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
[07:30] 🤖 Learning from Massive Human Videos for Universal Humanoid Pose Control
[08:05] 🤖 ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
[08:49] 🎥 VidTok: A Versatile and Open-Source Video Tokenizer
[09:28] 🧠 Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:13] 🔄 CAD-Recode: Reverse Engineering CAD Code from Point Clouds
[10:54] 🤖 AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
[11:39] 🤖 Alignment faking in large language models
[12:19] ⚡ FastVLM: Efficient Vision Encoding for Vision Language Models

13 minutes

2024.12.18 Daily AI Papers | Reasoning ability needs improvement; multimodal models need optimization

This episode covers the following 8 papers:
[00:24] 🧠 Are Your LLMs Capable of Stable Reasoning?
[01:06] 📊 Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
[01:52] 📊 OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
[02:33] 🧠 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
[03:16] 🤖 Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
[04:00] 📊 VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
[04:39] 🤔 When to Speak, When to Abstain: Contrastive Decoding with Abstention
[05:18] 🎥 MIVE: New Design and Benchmark for Multi-Instance Video Editing

6 minutes

2024.12.17 Daily AI Papers | Improving the efficiency of retrieval and generation; optimizing visual generation evaluation

This episode covers the following 18 papers:
[00:23] 🧠 RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[01:05] ⚡ Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
[01:45] 🎨 BrushEdit: All-In-One Image Inpainting and Editing
[02:27] 🎨 ColorFlow: Retrieval-Augmented Image Sequence Colorization
[03:10] 🧩 Byte Latent Transformer: Patches Scale Better Than Tokens
[03:56] 🧠 Causal Diffusion Transformers for Generative Modeling
[04:33] 🤖 Smaller Language Models Are Better Instruction Evolvers
[05:16] 🌟 IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
[06:02] 🌳 SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
[06:47] 🌌 Wonderland: Navigating 3D Scenes from a Single Image
[07:32] 🔬 GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
[08:18] ⚡ SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
[09:06] 🧠 Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
[09:46] 👩 StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
[10:35] 🌐 MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
[11:19] 🎵 Whisper-GPT: A Hybrid Representation Audio Large Language Model
[12:10] 🤖 TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
[13:01] 🔒 Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning

14 minutes

2024.12.16 Daily AI Papers | A new breakthrough in video understanding; AI explores 3D environments

This episode covers the following 14 papers:
[00:23] 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[01:11] 🌍 GenEx: Generating an Explorable World
[01:50] 🌐 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
[02:37] 🩺 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
[03:21] 🤖 Large Action Models: From Inception to Implementation
[04:09] 🎥 InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
[04:56] 🌟 FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
[05:42] 🎯 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
[06:21] 🔥 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
[07:09] 🎵 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
[07:56] 🎨 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
[08:44] 📊 SCBench: A KV Cache-Centric Analysis of Long-Context Methods
[09:27] 🧠 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
[10:05] 🩺 Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

11 minutes

[Weekend Special] Hottest AI papers of the 2nd week of December | Scaling strategies boost model performance; multimodal systems improve long-term interaction

This episode covers the following 5 papers:
[00:43] TOP1(🔥95) | 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[03:01] TOP2(🔥65) | 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[05:09] TOP3(🔥64) | 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[07:29] TOP4(🔥61) | 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[09:46] TOP5(🔥53) | 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning

12 minutes

2024.12.13 Daily AI Papers | Multimodal systems improve long-term interaction; Phi-4 improves STEM question answering

This episode covers the following 23 papers:
[00:23] 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[01:03] 🧠 Phi-4 Technical Report
[01:43] 🧠 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
[02:27] 🌐 Multimodal Latent Language Modeling with Next-Token Diffusion
[03:10] 🌐 EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
[03:57] 🌐 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
[04:43] 🌟 Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
[05:24] 📱 SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
[06:02] 🔬 PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
[06:49] 📊 Learned Compression for Compressed Learning
[07:32] 🎙 Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
[08:20] 📊 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
[09:08] 👀 Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
[10:02] 🧠 JuStRank: Benchmarking LLM Judges for System Ranking
[10:43] 🧠 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
[11:34] 📚 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
[12:16] 🔗 Word Sense Linking: Disambiguating Outside the Sandbox
[12:58] 🌐 FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
[13:42] 🎥 DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
[14:26] 🖼 LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
[15:21] 🧭 SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[16:05] 🌟 Arbitrary-steps Image Super-resolution via Diffusion Inversion
[16:46] 📚 Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

18 minutes

2024.12.12 Daily AI Papers | Breakthrough in multi-view video generation; better models for complex scenes

This episode covers the following 14 papers:
[00:23] 🎥 SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
[01:07] 🌐 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
[01:51] 🌐 POINTS1.5: Building a Vision-Language Model towards Real World Applications
[02:28] 🎨 Learning Flow Fields in Attention for Controllable Person Image Generation
[03:11] 🎥 StyleMaster: Stylize Your Video with Artistic Generation and Translation
[04:00] 🔍 Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
[04:46] 🎥 StreamChat: Chatting with Streaming Video
[05:28] 🧠 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
[06:12] 🏃 Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
[07:01] 🧠 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
[07:40] 🖼 FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
[08:17] 🎨 StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
[09:03] 🌍 MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
[09:50] 🚀 Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

11 minutes

2024.12.11 Daily AI Papers | Improved evaluation of code models; breakthroughs in video generation

This episode covers the following 23 papers:
[00:25] 🧑 Evaluating and Aligning CodeLLMs on Human Preference
[01:19] 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[01:59] 🎨 DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
[02:39] 🔒 Hidden in the Noise: Two-Stage Robust Watermarking for Images
[03:19] 🎥 UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
[04:04] 📄 OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
[04:50] 🎨 FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
[05:32] 🎥 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
[06:09] 🧠 Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
[06:55] 🧠 Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
[07:41] 🎥 Video Motion Transfer with Diffusion Transformers
[08:23] 🚀 EMOv2: Pushing 5M Vision Model Frontier
[09:02] 🛡 Granite Guardian
[09:44] 🌟 ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
[10:30] 🎥 ObjCtrl-2.5D: Training-free Object Control with Camera Poses
[11:21] 🚀 LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
[12:12] 📱 MoViE: Mobile Diffusion for Video Editing
[12:46] 🧬 Chimera: Improving Generalist Model with Domain-Specific Experts
[13:28] 🌐 Fully Open Source Moxin-7B Technical Report
[14:09] 📱 Mobile Video Diffusion
[14:45] 🤖 Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
[15:24] 🤖 Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
[16:15] 🔒 A New Federated Learning Framework Against Gradient Inversion Attacks

17 minutes

2024.12.10 Daily AI Papers | Identifying errors in mathematical reasoning; evaluating memory in reinforcement learning

This episode covers the following 9 papers:
[00:23] 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
[01:13] 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[01:58] 🧠 Training Large Language Models to Reason in a Continuous Latent Space
[02:38] 🌐 Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
[03:22] 🎥 Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
[04:09] 🎥 You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
[04:53] 🌍 Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space
[05:31] 🌐 Robust Multi-bit Text Watermark with LLM-based Paraphrasers
[06:15] 🤖 CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

7 minutes

2024.12.09 Daily AI Papers | Boosting multimodal model performance; improving text-to-video generation quality

This episode covers the following 11 papers:
[00:27] 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[00:58] 🎥 LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
[01:41] 🧠 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
[02:24] 🤖 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
[03:26] 🤖 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
[04:10] 🚀 APOLLO: SGD-like Memory, AdamW-level Performance
[04:49] ⚡ SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
[05:26] 🎥 GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
[06:07] ⏱ Mind the Time: Temporally-Controlled Multi-Event Video Generation
[06:42] 🏠 2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction
[07:20] 🗣 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling

8 minutes