HuggingFace Daily AI Papers - Episode List

2025.12.30 | ERC couples routers and experts; LiveTalk real-time video dialogue

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:24] 🔗 Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
[01:07] 🎬 LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
[01:55] 🌍 Yume-1.5: A Text-Controlled Interactive World Generation Model
[02:30] 🔍 SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
[02:59] 🔮 Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
[03:40] 🎯 SpotEdit: Selective Region Editing in Diffusion Transformers
[04:23] 🚀 Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
[05:09] 🔍 GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
[05:56] 🤖 Act2Goal: From World Model To General Goal-conditioned Policy
[06:31] ⚡ Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
[06:59] 🌐 Web World Models
[07:34] 🚀 DiRL: An Efficient Post-Training Framework for Diffusion Language Models
[08:19] 🎬 Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
[09:02] 🧠 Training AI Co-Scientists Using Rubric Rewards
[09:39] 🧩 Monadic Context Engineering

[Follow us] You can also find us on the following platforms for more content beyond the podcast. Xiaohongshu: AI速递

10 min

45

4 months ago

2025.12.29 | Bird's-eye retrieval boosts small models; 4D diffusion inserts realistic objects in one click

HuggingFace Daily AI Papers

This episode covers the following 13 papers:

[00:27] 🧠 Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
[01:07] 🎬 InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
[01:46] 🤖 MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
[02:22] 👁 UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
[03:04] 🎨 ProEdit: Inversion-based Editing From Prompts Done Right
[03:58] ⏱ TimeBill: Time-Budgeted Inference for Large Language Models
[04:37] 🧠 See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
[05:16] 🌦 Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding
[05:48] 🧠 SVBench: Evaluation of Video Generation Models on Social Reasoning
[06:27] 🔍 InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
[07:15] 🎨 SlideTailor: Personalized Presentation Slide Generation for Scientific Papers
[08:11] 🤖 SWE-RM: Execution-free Feedback For Software Engineering Agents
[08:48] ⚡ A 58-Addition, Rank-23 Scheme for General 3x3 Matrix Multiplication

9 min

88

4 months ago

2025.12.25 | 4D dynamic understanding refreshes VLMs; 200x faster HD video generation on a single GPU

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:20] 🧠 Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
[01:11] ⚡ TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
[01:52] 🧭 T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
[02:38] 🎬 DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
[03:21] 🔍 Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
[04:07] 🎬 HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
[04:52] 🚀 Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
[05:38] 🔍 TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
[06:12] 🚀 NVIDIA Nemotron 3: Efficient and Open Intelligence
[06:57] 🎬 Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
[07:27] 🎬 Streaming Video Instruction Tuning
[08:02] 🧠 Multi-hop Reasoning via Early Knowledge Alignment
[08:43] 📊 SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
[09:24] 🏆 LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

10 min

82

4 months ago

2025.12.24 | Semantic blueprints speed up video generation; layer-by-layer analysis distills strong policies

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:19] 🎬 SemanticGen: Video Generation in Semantic Space
[01:01] 🔍 Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
[01:48] 🧠 SpatialTree: How Spatial Abilities Branch Out in MLLMs
[02:23] 🤖 LongVideoAgent: Multi-Agent Reasoning with Long Videos
[03:06] 🧠 MemEvolve: Meta-Evolution of Agent Memory Systems
[03:46] 🔍 Step-DeepResearch Technical Report
[04:22] 🎧 SAM Audio: Segment Anything in Audio
[05:00] 🚀 INTELLECT-3: Technical Report
[05:30] 🔍 FaithLens: Detecting and Explaining Faithfulness Hallucination
[06:07] 🧠 Reinforcement Learning for Self-Improving Agent with Skill Library
[06:53] 📊 QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
[07:38] 🔊 Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
[08:18] 🧠 Active Intelligence in Video Avatars via Closed-loop World Modeling
[08:55] 🔬 Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
[09:32] ⚠ Toxicity Ahead: Forecasting Conversational Derailment on GitHub

10 min

99+

4 months ago

2025.12.23 | Data factories boost efficiency; the Prism Hypothesis unifies representations

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:22] ⚙ DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
[01:04] 🔍 The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
[01:50] 🎬 Region-Constraint In-Context Generation for Instructional Video Editing
[02:33] 🎥 Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
[03:08] 🔍 QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
[03:58] 🤔 Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
[04:35] 🧭 LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
[05:13] 🎬 WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
[06:08] 🔍 UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
[06:45] 🧬 GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
[07:22] 🎨 Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
[07:56] ⚡ LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
[08:38] 📱 MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
[09:20] ⚖ Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
[10:00] 🎬 StoryMem: Multi-shot Long Video Storytelling with Memory

10 min

99+

4 months ago

2025.12.22 | PhysBrain teaches AI hands-on skills from egocentric video; LLMs remain far from scientist-level AI

HuggingFace Daily AI Papers

This episode covers the following 15 papers:

[00:24] 🧠 PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
[01:05] 🔬 Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
[01:34] 🧠 When Reasoning Meets Its Laws
[02:16] 🧠 Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
[03:02] 🧠 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
[03:51] 🎨 Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
[04:30] ⚖ Are We on the Right Way to Assessing LLM-as-a-Judge?
[05:05] 📡 RadarGen: Automotive Radar Point Cloud Generation from Cameras
[05:54] 🔬 Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
[06:41] 🎬 HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
[07:26] 🔍 GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
[08:06] ⚙ SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
[08:39] 🧠 Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
[09:14] ⚡ StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
[09:48] 🤖 An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

11 min

99+

4 months ago

2025.12.19 | Kling-Omni unifies video generation; LLaDA2.0 scales diffusion LMs to 100B

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:26] 🎬 Kling-Omni Technical Report
[01:02] 🚀 LLaDA2.0: Scaling Up Diffusion Language Models to 100B
[01:41] 🔮 Next-Embedding Prediction Makes Strong Vision Learners
[02:27] 👓 StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
[02:58] 🎬 Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
[03:34] 🔭 Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
[04:11] 📸 Generative Refocusing: Flexible Defocus Control from a Single Image
[04:56] 🤖 Adaptation of Agentic AI
[05:36] ⚗ Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
[06:12] 🛡 DeContext as Defense: Safe Image Editing in Diffusion Transformers
[06:58] 🧭 N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
[07:49] 🎨 The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
[08:30] 🔧 AdaTooler-V: Adaptive Tool-Use for Images and Videos
[09:19] 🤔 Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

10 min

99+

4 months ago

2025.12.18 | Calibrated step rewards cut costs; diffusion drafts verified autoregressively speed up decoding

HuggingFace Daily AI Papers

This episode covers the following 14 papers:

[00:25] 🤖 Step-GUI Technical Report
[00:59] ⚡ DEER: Draft with Diffusion, Verify with Autoregressive Models
[01:31] ⚡ Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
[02:10] 🚀 HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
[02:48] 🎬 IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
[03:30] 🔍 Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
[04:03] 🧠 Universal Reasoning Model
[04:45] 🔍 Robust and Calibrated Detection of Authentic Multimedia Content
[05:33] 🧭 Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
[06:14] 🌍 FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
[06:54] 📊 MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
[07:47] 🔄 DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
[08:24] 🧠 SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
[09:02] 🎬 End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

10 min

99+

4 months ago
