2025.05.09 | A survey of multimodal reasoning model development; an evaluation framework for general-purpose intelligence proposed

The 15 papers in this episode:
[00:22] 🧠 Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
[00:57] 🤖 On Path to Multimodal Generalist: General-Level and General-Bench
[01:40] 🤖 Flow-GRPO: Training Flow Matching Models via Online RL
[02:23] 🧠 Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
[03:05] 🧠 Scalable Chain of Thoughts via Elastic Reasoning
[03:41] 🔍 FG-CLIP: Fine-Grained Visual and Textual Alignment
[04:19] 🏞 3D Scene Generation: A Survey
[05:02] 🧮 ICon: In-Context Contribution for Automatic Data Selection
[05:39] 🎬 StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
[06:19] 🤖 LiftFeat: 3D Geometry-Aware Local Feature Matching
[06:56] 🧱 Generating Physically Stable and Buildable LEGO Designs from Text
[07:38] 🧠 X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
[08:22] 🌐 Crosslingual Reasoning through Test-Time Scaling
[09:04] 🖼 PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
[09:42] 🌐 BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

[Follow us] You can also find us on the following platform for more than the podcast itself. Xiaohongshu (小红书): AI速递

10 min
73
3 days ago

2025.05.08 | Unified multimodal models show strong integration potential; ZeroSearch improves LLM efficiency.

The 14 papers in this episode:
[00:21] 💡 Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
[01:02] 🤖 ZeroSearch: Incentivize the Search Capability of LLMs without Searching
[01:50] 🤔 Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
[02:31] 🎬 HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
[03:15] 🧩 PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer
[04:04] 🤖 Benchmarking LLMs' Swarm intelligence
[04:49] 🤔 Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
[05:26] 🤖 OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
[05:58] 🌐 OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
[06:36] 🖥 OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
[07:19] 🧠 Knowledge Augmented Complex Problem Solving with Large Language Models: A Survey
[08:04] 🎛 R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
[08:48] 🤝 Cognitio Emergens: Agency, Dimensions, and Dynamics in Human-AI Knowledge Co-Creation
[09:26] 📹 Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection

10 min
95
4 days ago

2025.05.07 | Multimodal chain-of-thought improves model performance; zero-data self-play strengthens reasoning.

The 14 papers in this episode:
[00:24] 🧠 Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
[01:10] 🤖 Absolute Zero: Reinforced Self-play Reasoning with Zero Data
[01:52] 🤸 FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
[02:33] 🚀 RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
[03:07] 🚀 RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
[03:45] 👁 Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading
[04:30] 🗜 An Empirical Study of Qwen3 Quantization
[05:09] ⚽ Multi-Agent System for Comprehensive Soccer Understanding
[05:52] 🗣 VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
[06:36] 🗺 Geospatial Mechanistic Interpretability of Large Language Models
[07:12] 🧑 InfoVids: Reimagining the Viewer Experience with Alternative Visualization-Presenter Relationships
[07:54] 🤖 Invoke Interfaces Only When Needed: Adaptive Invocation for Large Language Models in Question Answering
[08:32] 🥽 HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
[09:18] 🤖 Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

10 min
99+
5 days ago

2025.05.06 | Voila enables low-latency full-duplex conversation; RM-R1 improves reasoning-based reward modeling for large models.

The 15 papers in this episode:
[00:22] 🤖 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
[01:09] 🤔 RM-R1: Reward Modeling as Reasoning
[01:52] 🧠 Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
[02:32] 🧮 FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
[03:17] ✂ ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations
[03:59] 🧠 Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
[04:39] 🚀 Practical Efficiency of Muon for Pretraining
[05:18] ⚙ A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
[06:01] 🤖 R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
[06:44] 🤔 Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents
[07:24] 🤖 SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations
[08:03] 🤖 Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
[08:50] 🖼 SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
[09:30] 🧮 Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities
[10:11] 🎨 Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

11 min
99+
6 days ago

2025.05.01 | A new solution to the Arabic diacritization problem; deep reasoning models gain capability

The 14 papers in this episode:
[00:21] 🗣 Sadeed: Advancing Arabic Diacritization Through Small Language Model
[01:05] 🔎 WebThinker: Empowering Large Reasoning Models with Deep Research Capability
[01:43] 🧮 Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
[02:20] 💡 Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
[03:00] 🤔 Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
[03:38] 🧠 Phi-4-reasoning Technical Report
[04:21] 🧩 COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
[04:59] 💡 Taming the Titans: A Survey of Efficient LLM Inference Serving
[05:34] 🤖 Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
[06:09] 🤖 RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
[06:49] 🎬 ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction
[07:32] 🛡 Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
[08:08] 🩻 UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
[08:53] 🗳 Selecting Optimal Candidate Profiles in Adversarial Environments Using Conjoint Analysis and Machine Learning

9 min
65
1 week ago

2025.04.30 | Multimodal retrieval-augmented generation; one-shot reinforcement learning improves reasoning.

The 12 papers in this episode:
[00:24] 🔍 UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
[01:06] 🧠 Reinforcement Learning for Reasoning in Large Language Models with One Training Example
[01:52] 🧠 ReasonIR: Training Retrievers for Reasoning Tasks
[02:31] 🤖 Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
[03:20] 🤖 TesserAct: Learning 4D Embodied World Models
[04:01] 🎭 The Leaderboard Illusion
[04:37] 🖼 YoChameleon: Personalized Vision and Language Generation
[05:17] 🛡 Certified Mitigation of Worst-Case LLM Copyright Infringement
[05:50] 🎭 ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
[06:29] 🧩 X-Fusion: Introducing New Modality to Frozen Large Language Models
[07:14] 🎭 Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
[07:53] 🌳 TreeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering

8 min
99+
1 week ago

2025.04.29 | RepText improves multilingual text rendering; LLMs advance phone GUI automation.

The 11 papers in this episode:
[00:23] ✍ RepText: Rendering Visual Text via Replicating
[01:02] 📱 LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
[01:44] 🔐 CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
[02:30] 🤔 Clinical knowledge in LLMs does not translate to human interactions
[03:16] ⬇ Group Downsampling with Equivariant Anti-aliasing
[03:59] 📐 TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
[04:39] 🤖 SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
[05:30] 🖼 Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
[06:15] 🚀 MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
[06:49] 🔑 ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers
[07:30] 💡 ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development

8 min
99+
1 week ago

2025.04.28 | Better understanding of camera motion in video; multimodal reasoning models optimized

The 11 papers in this episode:
[00:22] 🎥 Towards Understanding Camera Motions in Any Video
[01:04] 🧠 Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
[01:49] 💡 BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
[02:28] 🌍 VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
[03:13] 🗣 Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
[03:48] 🤔 The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
[04:23] 🎬 Subject-driven Video Generation via Disentangled Identity and Motion
[05:00] 🧠 DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models
[05:34] 🔲 DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
[06:12] 🔊 Kimi-Audio Technical Report
[06:43] 🇮🇹 Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

8 min
99+
2 weeks ago