Episode list: HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest) - EarsOnMe

2025.01.13 | OmniManip enables general robotic manipulation; VideoRAG improves retrieval-augmented generation over video.

The 10 papers in this episode:
[00:24] 🤖 OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
[01:02] 🎥 VideoRAG: Retrieval-Augmented Generation over Video Corpus
[01:38] 🎥 OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
[02:26] 🧠 LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
[03:01] 🧠 Enabling Scalable Oversight via Self-Evolving Critic
[03:34] 🎥 ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
[04:09] 🎥 Multi-subject Open-set Personalization in Video Generation
[04:47] 🔍 ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
[05:23] 🤖 Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
[06:00] 🦠 Infecting Generative AI With Viruses

[Follow us] You can also find us on the following platform for more than the podcast itself. Xiaohongshu: AI速递

7 min

99+

9 months ago

[Weekend Special] Hottest AI papers of week 1 of January | Small models outperform large ones; REINFORCE++ simplifies alignment

HuggingFace Daily AI Paper Digest

The 5 papers in this episode:
[00:39] TOP1(🔥173) | 🧠 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
[03:03] TOP2(🔥71) | 🚀 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
[05:17] TOP3(🔥63) | 🧠 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
[07:35] TOP4(🔥57) | 🔬 Agent Laboratory: Using LLM Agents as Research Assistants
[09:41] TOP5(🔥52) | 🌍 Cosmos World Foundation Model Platform for Physical AI

12 min

99+

9 months ago

2025.01.10 Daily AI Papers | A simplified GAN training recipe improves performance; autoregressive pre-training from video proves competitive.

HuggingFace Daily AI Paper Digest

The 7 papers in this episode:
[00:23] 🧠 The GAN is dead; long live the GAN! A Modern GAN Baseline
[01:02] 🎥 An Empirical Study of Autoregressive Pre-training from Videos
[01:49] 🚗 Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
[02:32] 🔍 On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
[03:14] 🌍 Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
[03:50] 📜 Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
[04:26] 🔒 Entropy-Guided Attention for Private LLMs

5 min

99+

9 months ago

2025.01.09 Daily AI Papers | Small models self-evolve to surpass GPT-3; multimodal models improve math reasoning.

HuggingFace Daily AI Paper Digest

The 11 papers in this episode:
[00:25] 🧠 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
[01:06] 🧠 URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
[01:45] 🧠 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
[02:25] 🔬 Agent Laboratory: Using LLM Agents as Research Assistants
[03:02] 🔬 LLM4SR: A Survey on Large Language Models for Scientific Research
[03:44] 🔍 GeAR: Generation Augmented Retrieval
[04:22] 🤖 InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
[05:02] 🐦 Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
[05:41] 🖼 SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
[06:17] 🧠 DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
[06:55] 🌳 EpiCoder: Encompassing Diversity and Complexity in Code Generation

7 min

99+

9 months ago

2025.01.08 Daily AI Papers | REINFORCE++ makes LLM alignment more efficient; MotionBench improves video motion understanding

HuggingFace Daily AI Paper Digest

The 11 papers in this episode:
[00:24] 🚀 REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
[01:00] 🎥 MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
[01:40] 🔍 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[02:21] 🌍 Cosmos World Foundation Model Platform for Physical AI
[03:01] 🔍 LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
[03:40] 🎥 Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
[04:22] 🎥 MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
[05:05] 📊 PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
[05:42] 🎭 MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
[06:17] 🎥 Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
[06:52] 🐬 Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

7 min

86

9 months ago

2025.01.07 Daily AI Papers | STAR improves spatio-temporal consistency in video super-resolution; BoostStep strengthens LLM math reasoning.

HuggingFace Daily AI Paper Digest

The 16 papers in this episode:
[00:24] 🎥 STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
[01:06] 🧮 BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
[01:44] 🤖 Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
[02:19] 🧠 Personalized Graph-Based Retrieval for Large Language Models
[02:54] 🧠 Test-time Computing: from System-1 Thinking to System-2 Thinking
[03:34] 🦠 METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
[04:13] 🎥 GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
[04:48] 🎥 Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
[05:27] 🎥 TransPixar: Advancing Text-to-Video Generation with Transparency
[06:06] 🎥 Ingredients: Blending Custom Photos with Video Diffusion Transformers
[06:45] 🔍 DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
[07:24] 🛡 Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
[08:04] 🔍 ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
[08:43] 🔍 Scaling Laws for Floating Point Quantization Training
[09:19] 🎤 Samba-asr state-of-the-art speech recognition leveraging structured state-space models
[09:59] 🎨 AutoPresent: Designing Structured Visuals from Scratch

11 min

99+

10 months ago

2025.01.06 Daily AI Papers | EnerVerse improves planning for robotic manipulation; VITA-1.5 improves real-time vision and speech interaction.

HuggingFace Daily AI Paper Digest

The 8 papers in this episode:
[00:24] 🤖 EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
[00:58] 🤖 VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
[01:33] 🤔 Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
[02:11] 🤖 SDPO: Segment-Level Direct Preference Optimization for Social Agents
[02:51] 🎨 VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
[03:31] 🧬 Graph Generative Pre-trained Transformer
[04:04] 🌍 LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
[04:44] 🔬 BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

5 min

99+

10 months ago

[Month-End Special] Hottest AI papers of December | Qwen2.5 raises LLM performance; Apollo makes video understanding more efficient.

HuggingFace Daily AI Paper Digest

The 10 papers in this episode:
[00:31] TOP1(🔥335) | 🤖 Qwen2.5 Technical Report
[02:44] TOP2(🔥136) | 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[05:01] TOP3(🔥123) | 🚀 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[07:18] TOP4(🔥121) | 🔄 PaliGemma 2: A Family of Versatile VLMs for Transfer
[09:38] TOP5(🔥116) | 🚀 Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
[12:21] TOP6(🔥108) | 🚀 SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
[14:42] TOP7(🔥105) | 🔍 VisionZip: Longer is Better but Not Necessary in Vision Language Models
[16:51] TOP8(🔥96) | 🧠 Phi-4 Technical Report
[18:55] TOP9(🔥92) | 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[21:02] TOP10(🔥91) | 🧠 Are Your LLMs Capable of Stable Reasoning?

23 min

99+

10 months ago

[Weekend Special] Hottest AI papers of week 5 of December | Better medical reasoning; automated GUI trajectory construction.

HuggingFace Daily AI Paper Digest

The 5 papers in this episode:
[00:35] TOP1(🔥83) | 🧠 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
[02:49] TOP2(🔥65) | 🤖 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[04:50] TOP3(🔥63) | 🎨 1.58-bit FLUX (described as the first method to successfully quantize a state-of-the-art text-to-image model)
[07:00] TOP4(🔥60) | 🔍 Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
[09:02] TOP5(🔥53) | 📚 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

11 min

99+

10 months ago

2025.01.03 Daily AI Papers | A multimodal textbook improves vision-language model performance; VideoAnydoor achieves high-fidelity video object insertion

HuggingFace Daily AI Paper Digest

The 17 papers in this episode:
[00:24] 📚 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
[01:02] 🎥 VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
[01:39] 🎥 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
[02:13] 🏆 CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
[02:52] 🎨 Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[03:29] 🤖 ProgCo: Program Helps Self-Correction of Large Language Models
[04:03] 🗺 MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
[04:41] 🤖 A3: Android Agent Arena for Mobile GUI Agents
[05:21] 🧪 Dynamic Scaling of Unit Tests for Code Reward Modeling
[05:57] 🛡 MLLM-as-a-Judge for Image Safety without Human Labeling
[06:40] 🎥 LTX-Video: Realtime Video Latent Diffusion
[07:15] 🗺 MapQaTor: A System for Efficient Annotation of Map Query Datasets
[07:51] 🔍 Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
[08:29] 🎥 SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
[09:13] 🤖 SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
[09:50] 🧠 Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
[10:27] 📊 Population Aware Diffusion for Time Series Generation

11 min

99+

10 months ago

2025.01.02 Daily AI Papers | Automated GUI agent trajectory construction; a language model optimized for reasoning tasks.

HuggingFace Daily AI Paper Digest

The 2 papers in this episode:
[00:26] 🤖 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[01:10] 🧠 Xmodel-2 Technical Report

2 min

99+

10 months ago

2024.12.31 Daily AI Papers | Explanatory instructions improve generalization across vision tasks; multimodal models improve generalization in medical imaging.

HuggingFace Daily AI Paper Digest

The 10 papers in this episode:
[00:25] 🔍 Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
[01:13] 🧠 On the Compositional Generalization of Multimodal LLMs for Medical Imaging
[02:02] ⚙ Efficiently Serving LLM Reasoning Programs with Certaindex
[02:44] 🎨 Edicho: Consistent Image Editing in the Wild
[03:22] 🎵 TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
[04:04] 🎥 Bringing Objects to Life: 4D generation from 3D objects
[04:47] 🧠 Facilitating large language model Russian adaptation with Learned Embedding Propagation
[05:25] 🤖 HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
[06:12] 🤖 Training Software Engineering Agents and Verifiers with SWE-Gym
[06:52] 🧠 OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

7 min

99+

10 months ago