HuggingFace Daily AI Paper Express - Episode List

2025.10.24 | AdaSPEC selects 40% of tokens for a 20% speedup; AutoPage generates interactive web pages for 10 cents

HuggingFace Daily AI Paper Express

The 15 papers in this episode:
[00:23] 🎯 AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
[00:57] 🤖 Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
[01:35] 🔍 Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
[02:06] 🎬 HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
[02:52] 🌀 Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
[03:33] 💎 Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
[04:06] ⚖ The Massive Legal Embedding Benchmark (MLEB)
[04:48] 🔍 DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
[05:33] 🕵 Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
[06:12] 🤖 Search Self-play: Pushing the Frontier of Agent Capability without Supervision
[06:56] 🎭 Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations
[07:42] 🖼 LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
[08:10] 🎧 SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models
[08:51] 🖼 ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
[09:39] 🧩 Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets
[Follow us] You can also find us on the following platform for more information beyond the podcast. Xiaohongshu: AI速递

10 minutes
99+
6 months ago

2025.11.07 | A new paradigm for video reasoning; interacting with images to boost thinking

The 12 papers in this episode:
[00:21] 🎬 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
[00:58] 🧠 V-Thinker: Interactive Thinking with Images
[01:39] 🧠 Scaling Agent Learning via Experience Synthesis
[02:23] 🧠 Cambrian-S: Towards Spatial Supersensing in Video
[03:06] 🖥 GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
[03:51] 📄 NVIDIA Nemotron Nano V2 VL (an efficient vision-language model for document and long-video understanding)
[04:28] 🎟 The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
[05:12] 🕵 Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
[05:48] ⚽ Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots
[06:18] 🔍 Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
[06:53] 🎧 How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
[07:32] 🚀 RDMA Point-to-Point Communication for LLM Systems

8 minutes
95
6 months ago

2025.11.06 | Diffusion models are data-efficient; lip-synced audio-video generation

The 9 papers in this episode:
[00:17] 🚀 Diffusion Language Models are Super Data Learners
[01:06] 🎬 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
[01:42] 🧩 LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
[02:25] 📊 Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
[03:15] 📊 TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
[03:46] 🦾 Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
[04:30] 🧠 MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
[05:06] 📈 LiveTradeBench: Seeking Real-World Alpha with Large Language Models
[05:55] 🔍 Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

7 minutes
64
6 months ago

2025.11.05 | Vector sketches test coding ability; draw first, then think, to fill in visual gaps

The 15 papers in this episode:
[00:21] 🖼 VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
[01:12] 🧠 When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
[01:48] ⚖ When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
[02:36] 🪙 Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
[03:11] 🧠 Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
[03:49] 👁 Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
[04:33] 🎨 LTD-Bench: Evaluating Large Language Models by Letting Them Draw
[05:15] 🤖 TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
[06:01] 🗜 Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
[06:46] 🏆 CodeClash: Benchmarking Goal-Oriented Software Engineering
[07:29] 🎭 VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
[08:03] 🧠 BRAINS: A Retrieval-Augmented System for Alzheimer's Detection and Monitoring
[08:42] 📊 ChartM³: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
[09:45] 📊 TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
[10:17] 🤖 iFlyBot-VLA Technical Report (a new framework for large-scale vision-language-action models)

11 minutes
99+
6 months ago

2025.11.04 | Ultra-sparse MoE activates a trillion parameters; vision models beat GNNs by looking at graphs

The 15 papers in this episode:
[00:23] 🧠 Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
[01:03] 👁 The Underappreciated Power of Vision Models for Graph Structural Understanding
[01:38] 💡 UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
[02:37] 🕸 Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph
[03:11] 🤖 PHUMA: Physically-Grounded Humanoid Locomotion Dataset
[03:48] 🔭 ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
[04:30] 🧠 UniREditBench: A Unified Reasoning-based Image Editing Benchmark
[05:23] 🔄 ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
[06:04] 🌍 Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
[06:44] 🌍 World Simulation with Video Foundation Models for Physical AI
[07:20] 🧠 TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
[08:03] 🧭 NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
[08:45] 📏 Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
[09:23] 🧭 Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
[10:07] 🐱 LongCat-Flash-Omni Technical Report (a 560-billion-parameter open-source omni-modal model for real-time audio-visual interaction)

11 minutes
99+
6 months ago

2025.11.03 | OS-Sentinel guards mobile operation safety in real time; ThinkMorph lets small models draw while they think

The 15 papers in this episode:
[00:21] 🛡 OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
[01:13] 🧠 ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
[01:49] ⚔ INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
[02:38] 🤖 π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
[03:26] 🚀 Continuous Autoregressive Language Models
[03:54] 🧭 Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
[04:37] 🎯 HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration
[05:15] 🎯 Defeating the Training-Inference Mismatch via FP16
[05:52] 🪜 Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
[06:28] 🧭 Revisiting Multimodal Positional Encoding in Vision-Language Models
[07:09] ⚡ Higher-order Linear Attention
[07:55] 🌐 Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
[08:36] 🔬 The Denario project: Deep knowledge AI agents for scientific discovery
[09:14] 🎯 Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
[09:51] 🏙 Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

11 minutes
99+
6 months ago

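[Month-End Special] October's Hottest AI Papers | The Dragon Hatchling (BDH) is sparse and interpretable; a tiny 7M-parameter recursive model crushes large models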

The 10 papers in this episode:
[00:30] TOP1 (🔥522) | 🐣 The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
[02:31] TOP2 (🔥462) | 🧠 Less is More: Recursive Reasoning with Tiny Networks
[04:48] TOP3 (🔥255) | 🌱 Agent Learning via Early Experience
[07:04] TOP4 (🔥182) | 🔄 Scaling Latent Reasoning via Looped Language Models
[09:11] TOP5 (🔥170) | 🔥 MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
[11:18] TOP6 (🔥169) | 🚀 QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
[13:10] TOP7 (🔥167) | 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
[15:38] TOP8 (🔥160) | 🧠 Diffusion Transformers with Representation Autoencoders
[17:59] TOP9 (🔥144) | 🧠 A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
[20:09] TOP10 (🔥142) | 🎯 Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

22 minutes
99+
6 months ago

2025.10.31 | Emu3.5 unifies prediction across space and time; diffusion prompts drive robots

The 15 papers in this episode:
[00:26] 🌍 Emu3.5: Native Multimodal Models are World Learners
[01:04] 🤖 Exploring Conditions for Diffusion models in Robotic Control
[01:42] 🎬 Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
[02:22] ⚡ Kimi Linear: An Expressive, Efficient Attention Architecture
[02:55] 🧮 AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
[03:35] 🕺 The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
[04:17] 🌐 Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
[04:42] 🌍 OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
[05:21] 🤝 The Era of Agentic Organization: Learning to Organize with Language Models
[05:57] 🧠 Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
[06:32] 🕹 Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
[07:10] 🏥 EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
[07:55] 📄 OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
[08:38] 🎯 MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency
[09:09] 🤖 Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets

10 minutes
99+
6 months ago

2025.10.30 | A 7B model punches above its weight at writing code from images; RL unlocks thinking with videos

The 15 papers in this episode:
[00:22] 👁 JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
[01:00] 🧠 Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
[01:55] 🔄 ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
[02:42] 🔄 Scaling Latent Reasoning via Looped Language Models
[03:22] 🧠 Reasoning-Aware GRPO using Process Mining
[03:52] 🎬 VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning
[04:33] 🏆 The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
[05:11] 🖼 RegionE: Adaptive Region-Aware Generation for Efficient Image Editing
[06:22] 🎮 ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
[06:58] 🧭 Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks
[07:44] 🔗 PairUni: Pairwise Training for Unified Multimodal Language Models
[08:33] ⚡ Parallel Loop Transformer for Efficient Test-Time Computation Scaling
[09:08] 🚗 Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
[09:55] 🧬 ODesign: A World Model for Biomolecular Interaction Design
[10:31] 🧬 Evolving Diagnostic Agents in a Virtual Clinical Environment

11 minutes
96
6 months ago

2025.10.29 | Tongyi DeepResearch report; a small model folds its memory to beat a 671B giant

The 10 papers in this episode:
[00:23] 🔍 Tongyi DeepResearch Technical Report (an agentic large model for long-horizon deep information-seeking tasks)
[01:00] 🧠 AgentFold: Long-Horizon Web Agents with Proactive Context Management
[01:36] 🤖 RoboOmni: Proactive Robot Manipulation in Omni-modal Context
[02:33] 🎮 Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
[03:05] 🎬 Uniform Discrete Diffusion with Metric Path for Video Generation
[03:42] 🛠 OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
[04:28] 🎨 Group Relative Attention Guidance for Image Editing
[05:14] 🚀 WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
[06:04] 🧭 Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
[07:01] 🧠 ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

8 minutes
99+
6 months ago

2025.10.28 | Point Transformer learns spatial representations through label-free alignment; recursive code unifies coarse and fine granularity

The 15 papers in this episode:
[00:23] 🎼 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
[01:06] 🧩 ReCode: Unify Plan and Action for Universal Granularity Control
[01:44] 🤖 A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
[02:23] 🌾 FARMER: Flow AutoRegressive Transformer over Pixels
[03:07] 🤖 VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
[03:45] 🎭 Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
[04:17] 🤖 ACG: Action Coherence Guidance for Flow-based VLA models
[04:56] 🔍 E²Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
[05:40] 🌐 Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
[06:30] 🔍 PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
[07:06] 🧠 Knocking-Heads Attention (letting attention heads "knock" on each other)
[07:42] 🧩 IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
[08:30] 🎯 The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
[09:14] 🥯 LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
[09:51] 🧠 LimRank: Less is More for Reasoning-Intensive Information Reranking

11 minutes
84
6 months ago
