节目列表: HuggingFace 每日AI论文速递 - EarsOnMe

2024.11.08 每日AI论文 | 开放编码器提升代码生成，ReCapture优化视频轨迹

本期的 14 篇论文如下： [00:25] 🔧 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models（开放编码器：顶级代码大语言模型的开放食谱） [01:03] 🎥 ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning（ReCapture：使用掩码视频微调生成用户提供视频的生成性摄像机控制） [01:46] ⚡ BitNet a4.8: 4-bit Activations for 1-bit LLMs（BitNet a4.8：1位大语言模型的4位激活） [02:25] 🎥 DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion（DimensionX：从单张图像生成可控视频扩散的3D和4D场景） [03:04] 🤖 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models（混合变压器：多模态基础模型的稀疏与可扩展架构） [03:39] 🧠 Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model（灭霸：通过融入心灵技能增强对话代理的大型语言模型） [04:21] 🎥 TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation（TIP-I2V：百万级真实文本与图像提示数据集用于图像到视频生成） [05:05] 🤖 DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation（DynaMem：开放世界移动操作的在线动态时空语义记忆） [05:40] 🧵 Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?（针穿线：LLMs能否在近百万规模的文本中追踪线索？） [06:22] 👀 GazeGen: Gaze-Driven User Interaction for Visual Content Generation（GazeGen：基于注视驱动的用户交互视觉内容生成） [07:03] 🌐 RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval（RetrieveGPT：融合提示与数学模型以增强代码混合信息检索） [07:49] 🎥 SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation（SG-I2V：图像到视频生成中的自引导轨迹控制） [08:29] 🎥 VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos（视频GLaMM：一种用于视频中像素级视觉定位的大型多模态模型） [09:03] ⚡ SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models（SVDQuant：通过低秩成分吸收异常值的4比特扩散模型）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

97

8个月前

2024.11.07 每日AI论文 | 数据污染影响模型评估，结构化推理提升LLMs性能

本期的 4 篇论文如下： [00:28] 🔍 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination（文本与图像均泄露！多模态大语言模型数据污染的系统分析） [01:07] 🤖 Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level（大型语言模型协调结构化推理达到Kaggle大师级别） [01:53] 🧠 Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models（多项式组合激活函数：释放大型语言模型的动态） [02:28] 🔄 Self-Consistency Preference Optimization（自一致性偏好优化）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

3分钟

82

8个月前

2024.11.06 每日AI论文 | HTML提升RAG性能，分子图助手优化多模态任务

本期的 11 篇论文如下： [00:30] 📄 HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems（HtmlRAG：在RAG系统中，HTML比纯文本更适合建模检索知识） [01:12] 🧬 LLaMo: Large Language Model-based Molecular Graph Assistant（基于大型语言模型的分子图助手） [01:52] 🤖 DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution（DeeR-VLA：动态推理多模态大语言模型以实现高效机器人执行） [02:28] 🤖 Sample-Efficient Alignment for LLMs（LLM的高效对齐方法） [03:01] 🚦 Controlling Language and Diffusion Models by Transporting Activations（通过传输激活控制语言和扩散模型） [03:49] 🌟 DreamPolish: Domain Score Distillation With Progressive Geometry Generation（梦幻抛光：基于渐进几何生成的领域分数蒸馏） [04:32] 🦓 Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge（斑马-羊驼：一种用于普及罕见病知识的上下文感知大型语言模型） [05:12] 👕 GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details（GarVerseLOD：利用多层次细节数据集从单张自然图像中进行高保真3D服装重建） [05:46] 🔍 Correlation of Object Detection Performance with Visual Saliency and Depth Estimation（目标检测性能与视觉显著性和深度估计的相关性） [06:28] 🔄 Adaptive Length Image Tokenization via Recurrent Allocation（通过递归分配实现自适应长度图像标记化） [07:01] 🧠 Inference Optimal VLMs Need Only One Visual Token but Larger Models（推断最优的视觉语言模型仅需一个视觉标记但需要更大的模型）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

8分钟

99

8个月前

2024.11.05 每日AI论文 | AndroidLab提升代理性能，WebRL优化网络任务表现。

本期的 17 篇论文如下： [00:26] 🤖 AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents（AndroidLab：Android自主代理的训练与系统基准测试） [01:15] 🌐 WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning（WebRL：通过自进化在线课程强化学习训练LLM网络代理） [01:55] 🌐 Training-free Regional Prompting for Diffusion Transformers（无需训练的扩散变换器区域提示） [02:36] 🌍 Survey of Cultural Awareness in Language Models: Text and Beyond（语言模型中的文化意识调查：文本与超越） [03:15] 🤖 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent（混元-大：腾讯开源的520亿激活参数模型） [03:52] 📊 DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models（DynaMath：评估视觉语言模型数学推理鲁棒性的动态视觉基准） [04:29] 🎥 How Far is Video Generation from World Model: A Physical Law Perspective（视频生成与世界模型有多远：物理定律视角） [05:08] ⚡ Adaptive Caching for Faster Video Generation with Diffusion Transformers（基于扩散变换器的自适应缓存加速视频生成） [05:48] 🦖 DynaSaur: Large Language Agents Beyond Predefined Actions（DynaSaur：超越预定义动作的大型语言模型代理） [06:26] 🎥 GenXD: Generating Any 3D and 4D Scenes（GenXD：生成任意3D和4D场景） [07:01] 📊 Sparsing Law: Towards Large Language Models with Greater Activation Sparsity（稀疏化定律：迈向更大激活稀疏性的大语言模型） [07:45] 📚 LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models（LIBMoE：大型语言模型中混合专家的综合基准库） [08:26] 🎥 PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance（提示引导下的多样化视频序列理解） [09:08] ⚖ "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization（给我BF16还是给我死亡？LLM量化中的精度-性能权衡） [09:48] 🌌 Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models（解码暗物质：用于解释基础模型中罕见概念的专用稀疏自编码器） [10:36] 🎨 MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D（MVPaint：同步多视角扩散用于3D绘画） [11:14] 🌍 Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks（天鹅与阿拉伯MTEB：方言感知、以阿拉伯语为中心、跨语言和跨文化的嵌入模型与基准）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

12分钟

92

8个月前

2024.11.04 每日AI论文 | OS-ATLAS提升GUI代理性能，CAF优化生成模型效率。

本期的 17 篇论文如下： [00:25] 🤖 OS-ATLAS: A Foundation Action Model for Generalist GUI Agents（OS-ATLAS：通用GUI代理的基础动作模型） [01:07] ⚙ Constant Acceleration Flow（恒定加速度流） [01:53] 🍅 TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models（番茄：评估多模态基础模型在视觉时间推理能力） [02:33] 🎨 Randomized Autoregressive Visual Generation（随机自回归视觉生成） [03:10] 🧠 Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation（边学习边适应：通过智能工具使用适应性将LLMs应用于科学问题） [03:50] 📚 Personalization of Large Language Models: A Survey（大型语言模型的个性化：综述） [04:29] 🖼 In-Context LoRA for Diffusion Transformers（上下文LoRA用于扩散变换器） [05:09] ⚡ SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models（SambaMixer：使用Mamba状态空间模型预测锂离子电池健康状态） [05:54] 🤖 Survey of User Interface Design and Interaction Techniques in Generative AI Applications（生成式AI应用中的用户界面设计与交互技术综述） [06:32] 🧶 HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models（HelloMeme：将空间编织注意力整合到扩散模型中以嵌入高层次和丰富保真度的条件） [07:07] 🌐 M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation（M2rc-Eval：大规模多语言仓库级代码补全评估） [07:44] 🌆 CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes（城市高斯V2：大规模场景的高效几何精确重建） [08:22] 🔄 GPT or BERT: why not both?（GPT还是BERT：为何不两者兼得？） [09:02] 🎭 Face Anonymization Made Simple（面部匿名化变得简单） [09:40] 📊 Zipfian Whitening（齐夫白化） [10:19] 📚 WikiNER-fr-gold: A Gold-Standard NER Corpus（WikiNER-fr-gold：一个金标准命名实体识别语料库） [10:53] 🧠 GRS-QA -- Graph Reasoning-Structured Question Answering Dataset（图推理结构化问答数据集）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

87

8个月前

【周末特辑】11月第1周最火AI论文 | 多模态遗忘新基准CLEAR，GPT-4o系统卡片详解。

本期的 5 篇论文如下： [00:41] TOP1(🔥191) | 🧠 CLEAR: Character Unlearning in Textual and Visual Modalities（CLEAR：文本与视觉模态中的字符遗忘） [02:58] TOP2(🔥70) | 🤖 GPT-4o System Card（GPT-4o系统卡片） [04:50] TOP3(🔥50) | 🔍 Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders（解构SDXL Turbo：使用稀疏自编码器解释文本到图像模型） [06:53] TOP4(🔥49) | 🗣 CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation（CORAL：多轮对话增强生成基准测试） [08:44] TOP5(🔥48) | 🚀 ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting（ROCKET-1：利用视觉-时间上下文提示掌握开放世界交互）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

11分钟

99+

8个月前

2024.11.01 每日AI论文 | 稀疏自编码器提升图像模型可解释性，梯度视角揭示LLMs层级差异。

本期的 11 篇论文如下： [00:27] 🔍 Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders（解构SDXL Turbo：使用稀疏自编码器解释文本到图像模型） [01:05] 🧠 What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective（LLMs训练中快速与慢速思考的层级差异：梯度视角） [01:43] 🔍 A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents（基于指针网络的多标签多类别意图联合提取与检测方法） [02:23] 🔄 Constraint Back-translation Improves Complex Instruction Following of Large Language Models（约束反向翻译提升大型语言模型复杂指令遵循能力） [02:59] 📄 Language Models can Self-Lengthen to Generate Long Texts（语言模型能够自我延长以生成长文本） [03:35] 📊 BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays（BenchX：胸部X光片医学视觉-语言预训练统一基准框架） [04:17] 💾 BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments（BitStack：在可变内存环境中压缩大型语言模型的细粒度大小控制） [05:04] 🤖 Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks（探索未知：基于聊天的个性化探索任务协作界面） [05:40] 🤖 SelfCodeAlign: Self-Alignment for Code Generation（自代码对齐：代码生成中的自对齐方法） [06:18] 🎥 DELTA: Dense Efficient Long-range 3D Tracking for any video（DELTA：高效密集长程3D视频追踪） [06:57] 🎥 Learning Video Representations without Natural Videos（无需自然视频即可学习视频表示）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

7分钟

94

8个月前

2024.10.31 每日AI论文 | 多轮对话评估新基准，机器人任务高效推理模型。

本期的 5 篇论文如下： [00:29] 🗣 CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation（CORAL：多轮对话增强生成基准测试） [01:09] 🤖 A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks（大型递归动作模型：xLSTM为机器人任务实现快速推理） [01:50] 🔍 Stealing User Prompts from Mixture of Experts（从混合专家模型中窃取用户提示） [02:26] 🩺 AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels（自动医疗信息检索：无需相关标签的有效零样本检索） [02:58] 🔄 TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters（TokenFormer：重新思考Transformer的扩展与模型参数的标记化）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

4分钟

99+

9个月前

2024.10.30 每日AI论文 | 多模态遗忘挑战大，AutoKaggle提升效率。

本期的 8 篇论文如下： [00:33] 🧠 CLEAR: Character Unlearning in Textual and Visual Modalities（CLEAR：文本与视觉模态中的字符遗忘） [01:10] 🤖 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions（AutoKaggle：一种用于自主数据科学竞赛的多智能体框架） [01:46] 🤖 SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization（社交GPT：通过贪婪段优化提示LLMs进行社交关系推理） [02:26] 🌐 OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization（开放式网络航海者：通过迭代现实世界探索、反馈和优化构建多模态网络代理） [03:13] 🧠 Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning（Flow-DPO：通过在线多智能体学习提升LLM数学推理能力） [03:52] 🚀 ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference（ShadowKV：高吞吐量长上下文LLM推理的KV缓存优化） [04:31] 🤖 Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset（机器人预训练机器人：基于大规模机器人数据集的以操作为中心的机器人表示） [05:17] 🤖 Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning（基于人机协作强化学习的精确灵巧机器人操作）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

6分钟

74

9个月前

2024.10.29 每日AI论文 | 波兰语模型性能提升，异构代理系统创新。

本期的 17 篇论文如下： [00:24] 🇵 Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation（Bielik 7B v0.1：波兰语言模型——开发、洞察与评估） [01:00] 🤖 AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant（AgentStore：可扩展的异构代理作为专业化通才计算机助手集成） [01:39] 🤖 GPT-4o System Card（GPT-4o系统卡片） [02:21] 📄 Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction（文档解析揭秘：结构化信息提取的技术、挑战与前景） [03:08] 🤖 LongReward: Improving Long-context Large Language Models with AI Feedback（长奖励：通过AI反馈提升长上下文大语言模型） [03:43] 🎥 MarDini: Masked Autoregressive Diffusion for Video Generation at Scale（MarDini：大规模视频生成的掩码自回归扩散模型） [04:22] 🌟 DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation（DreamClear：高容量真实世界图像修复与隐私安全数据集构建） [05:10] 🧩 GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation（GrounDiT：基于噪声补丁移植的扩散变换器空间定位） [05:49] 📚 A Survey of Small Language Models（小语言模型综述） [06:23] 💾 COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training（COAT：压缩优化器状态和激活以实现高效的FP8训练） [06:58] ⚡ Fast Best-of-N Decoding via Speculative Rejection（基于推测拒绝的快速最佳N解码） [07:36] 🔍 Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines（视觉搜索助手：赋能视觉-语言模型作为多模态搜索引擎） [08:25] 🎥 LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior（LARP：利用学习到的自回归生成先验进行视频标记化） [09:00] 🤖 Neural Fields in Robotics: A Survey（机器人学中的神经场：综述） [09:40] 🗣 Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction（对话2流程：预训练软对比动作驱动句子嵌入用于自动对话流程提取） [10:15] 🩺 Language Models And A Second Opinion Use Case: The Pocket Professional（语言模型与第二意见应用案例：口袋专家） [10:55] 🤖 Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation（利用局部性提升机器人操作的样本效率）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

12分钟

86

9个月前

2024.10.28 每日AI论文 | 视觉-时间提示提升交互，连续扩散模型优化语音合成

本期的 13 篇论文如下： [00:25] 🚀 ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting（ROCKET-1：利用视觉-时间上下文提示掌握开放世界交互） [01:14] 🗣 Continuous Speech Synthesis using per-token Latent Diffusion（基于每标记潜在扩散的连续语音合成） [01:55] ⚡ Teach Multimodal LLMs to Comprehend Electrocardiographic Images（教授多模态大语言模型理解心电图图像） [02:39] 🌐 Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data（无限多模态：通过大规模高质量指令数据扩展多模态性能） [03:23] ⚡ FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality（FasterCache：无训练视频扩散模型加速与高质量生成） [03:56] 🎧 MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark（大规模多任务音频理解与推理基准） [04:34] 🧠 Counting Ability of Large Language Models and Impact of Tokenization（大型语言模型的计数能力及其对分词的影响） [05:08] 🧠 Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning（通过先决学习利用虚构合成数据提升LLM事实性） [05:46] 🤖 Reflection-Bench: probing AI intelligence with reflection（反射-基准：通过反射探测AI智能） [06:23] 🤖 Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback（混合偏好：学习路由实例以进行人机反馈） [06:57] 🔍 Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration（利用未标注的先验数据进行高效在线探索） [07:35] 🔍 Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance（LLM是否优于报告？检测标签错误并减轻其对模型性能的影响） [08:15] 🤖 Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling（基于图神经网络的动态三维高斯跟踪用于神经动力学建模）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

9分钟

86

9个月前

【周末特辑】10月第4周最火AI论文 | 少样本NeRF高效收敛，长视频分割精度提升。

本期的 5 篇论文如下： [00:44] TOP1(🔥79) | ⚡ FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors（节俭NeRF：无学习先验的少样本新视角合成快速收敛） [02:42] TOP2(🔥60) | 🌳 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree（SAM2Long：通过无训练记忆树增强SAM 2以实现长视频分割） [04:19] TOP3(🔥58) | 🚀 Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss（打破内存壁垒：对比损失的近无限批量规模扩展） [06:11] TOP4(🔥55) | 🤖 CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution（指南针评判者-1：一体化评判模型助力模型评估与进化） [08:28] TOP5(🔥52) | 💼 UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models（UCFE：面向用户的大语言模型金融专业能力基准）【关注我们】您还可以在以下平台找到我们，获得播客内容以外更多信息小红书: AI速递

10分钟

99+

9个月前