HuggingFace Daily AI Paper Digest - Episode List

[Month-End Special] November's Hottest AI Papers | The full Kandinsky 5.0 family goes open source; video generation lets models think as they play

HuggingFace Daily AI Paper Digest

The 10 papers in this episode:

[00:35] TOP1 (🔥219) | 🎨 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
[02:45] TOP2 (🔥207) | 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[04:58] TOP3 (🔥191) | 🌍 Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
[07:26] TOP4 (🔥166) | ⚡ ROOT: Robust Orthogonalized Optimizer for Neural Network Training
[09:37] TOP5 (🔥156) | 🚀 MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
[11:54] TOP6 (🔥151) | 🧠 General Agentic Memory Via Deep Research
[13:55] TOP7 (🔥131) | 🏅 P1: Mastering Physics Olympiads with Reinforcement Learning
[16:01] TOP8 (🔥131) | 🍲 Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance
[18:03] TOP9 (🔥126) | 🧠 Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
[20:14] TOP10 (🔥121) | 🚀 Diffusion Language Models are Super Data Learners

[Follow us] You can also find us on the following platforms for more content beyond the podcast. Xiaohongshu: AI速递

23 min
99+
5 months ago

2025.12.02 | Code intelligence lands in four steps; LongVT delivers precise long-video understanding


The 15 papers in this episode:

[00:20] 🧠 From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
[01:05] 🎬 LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[01:43] 🔍 Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
[02:21] 🧠 Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
[02:59] 🔍 How Far Are We from Genuinely Useful Deep Research Agents?
[03:47] ⚖ What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards
[04:24] 🔍 The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
[05:14] 🎬 Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
[05:58] 🔗 TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
[06:41] 🧠 Rectifying LLM Thought from Lens of Optimization
[07:16] ⚡ Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
[07:54] 🚀 LFM2 Technical Report
[08:31] 🤖 GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
[09:09] 🎬 InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
[09:44] ⚡ VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

10 min
99+
5 months ago

2025.12.01 | Z-Image takes the crown with few parameters; REASONEDIT tops the charts by thinking before drawing


The 15 papers in this episode:

[00:26] 🚀 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
[01:00] 🤔 REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
[01:25] 🎬 AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
[01:59] 🌉 Vision Bridge Transformer at Scale
[02:35] 🔍 Architecture Decoupling Is Not All You Need For Unified Multimodal Model
[03:23] ⚡ DiP: Taming Diffusion Models in Pixel Space
[03:49] 🧠 Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
[04:19] 🤖 DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
[05:02] ⚡ Adversarial Flow Models
[05:29] 🔬 Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
[06:10] 🎥 Captain Safari: A World Engine
[06:43] 🌍 World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
[07:20] 🔍 The Collapse of Patches
[07:50] 🔍 RefineBench: Evaluating Refinement Capability of Language Models via Checklists
[08:23] 🦷 OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

9 min
99+
5 months ago

2025.11.27 | Russian-language multimodal evaluation fills a gap; latent collaboration brings a 14% speedup


The 15 papers in this episode:

[00:22] 🔍 Multimodal Evaluation of Russian-language Architectures
[01:15] 🧠 Latent Collaboration in Multi-Agent Systems
[01:47] 🌍 Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
[02:18] 🎭 Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
[03:10] 📄 NVIDIA Nemotron Parse 1.1
[03:46] 🧠 Monet: Reasoning in Latent Visual Space Beyond Images and Language
[04:25] ⚡ Terminal Velocity Matching
[05:03] 📊 Revisiting Generalization Across Difficulty Levels: It's Not So Easy
[05:42] 🤖 MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
[06:25] ⚡ Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs
[06:59] 🎮 UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
[07:47] 🧩 SPHINX: A Synthetic Environment for Visual Perception and Reasoning
[08:33] ⚡ Block Cascading: Training Free Acceleration of Block-Causal Video Models
[09:12] 🏙 RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale
[09:58] 📊 I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation

11 min
87
5 months ago

2025.11.26 | LLM-powered evolutionary framework goes open source; MedSAM3 understands clinical concepts for precise segmentation


The 15 papers in this episode:

[00:17] 🧬 GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms
[00:57] 🔬 MedSAM3: Delving into Segment Anything with Medical Concepts
[01:34] 🔍 Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
[02:03] 🎨 iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
[02:38] 🕺 SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
[03:18] 🔍 Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
[04:04] 🤖 GigaWorld-0: World Models as Data Engine to Empower Embodied AI
[04:44] 🎯 Soft Adaptive Policy Optimization
[05:14] 🎬 UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
[05:55] 🎯 SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
[06:51] 🎨 OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation
[07:41] 🎬 ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
[08:13] 🖼 VQ-VA World: Towards High-Quality Visual Question-Visual Answering
[09:06] 🔍 HunyuanOCR Technical Report
[09:48] 🏙 MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

11 min
95
5 months ago

2025.11.25 | Just-in-time compilation keeps memory lossless; AutoEnv auto-selects environments for a 20% gain


The 15 papers in this episode:

[00:25] 🧠 General Agentic Memory Via Deep Research
[00:52] 🧪 AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
[01:24] 🤖 Computer-Use Agents as Judges for Generative User Interface
[01:55] 🎨 DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
[02:24] 🎨 UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
[03:10] 🔍 DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
[03:46] 🎬 In-Video Instructions: Visual Signals as Generative Control
[04:24] 📊 Budget-Aware Tool-Use Enables Effective Agent Scaling
[05:12] 🎬 Plan-X: Instruct Video Generation via Semantic Planning
[05:54] 🧪 M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
[06:25] 🤖 Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
[07:24] 🎬 HunyuanVideo 1.5 Technical Report
[07:56] 🧠 Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
[08:36] 🧠 MIST: Mutual Information Via Supervised Training
[09:07] 🎨 Controllable Layer Decomposition for Reversible Multi-Layer Image Generation

10 min
99+
5 months ago

2025.11.24 | Open-source 7B model sets a new bar for multimodal reasoning; GeoVista's small model nails geolocalization


The 15 papers in this episode:

[00:21] 🧠 OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
[01:04] 🌍 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
[01:41] 🎯 SAM 3: Segment Anything with Concepts
[02:31] 📊 Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
[03:09] 🧠 O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
[03:43] 🦜 Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
[04:26] 🧠 RynnVLA-002: A Unified Vision-Language-Action and World Model
[05:19] 🧠 VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
[05:51] 🌍 WorldGen: From Text to Traversable and Interactive 3D Worlds
[06:34] 🎨 Loomis Painter: Reconstructing the Painting Process
[07:06] 🔮 Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
[07:48] 🎨 InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
[08:21] 🔬 OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
[09:07] 🧬 MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
[09:41] 🔍 Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

10 min
99+
5 months ago

2025.11.21 | V-ReasonBench tests video models' reasoning; Step-Audio-R1 makes speech models stronger the more they "think"


The 15 papers in this episode:

[00:22] 📊 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
[01:06] 🧠 Step-Audio-R1 Technical Report
[01:48] 🧭 Scaling Spatial Intelligence with Multimodal Foundation Models
[02:18] 🎬 First Frame Is the Place to Go for Video Content Customization
[02:49] 🎬 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
[03:29] 🔮 SAM 3D: 3Dfy Anything in Images
[04:03] 🚀 MiMo-Embodied: X-Embodied Foundation Model Technical Report
[04:38] 🧠 Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
[05:10] 🏆 TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
[05:53] 🌀 Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
[06:26] 🚀 SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
[07:09] 🎬 TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
[07:46] 🔬 SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
[08:23] 🎨 NaTex: Seamless Texture Generation as Latent Color Diffusion
[08:58] 📐 PartUV: Part-Based UV Unwrapping of 3D Meshes

9 min
83
5 months ago
