2025.06.24 | A new normals-and-lighting method improves detail; multimodal generation models excel.

This episode covers the following 15 papers:
[00:24] 💡 Light of Normals: Unified Feature Representation for Universal Photometric Stereo
[01:00] 🎨 OmniGen2: Exploration to Advanced Multimodal Generation
[01:39] ✍ LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
[02:17] 🎭 Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
[02:58] 🧠 RLPR: Extrapolating RLVR to General Domains without Verifiers
[03:36] 🧠 ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
[04:11] 🤖 OAgents: An Empirical Study of Building Effective Agents
[04:52] 🖼 Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
[05:31] 🎬 VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
[06:06] 🧑 LettinGo: Explore User Profile Generation for Recommendation System
[06:48] 🔀 ReDit: Reward Dithering for Improved LLM Policy Optimization
[07:29] 💡 FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
[08:08] 🎬 ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs
[08:47] 🖼 Auto-Regressively Generating Multi-View Consistent Images
[09:35] 💡 SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

10 minutes · 63 · 1 month ago

2025.06.23 | DnD reduces compute overhead; vision guidance improves RAG performance.

This episode covers the following 12 papers:
[00:23] 🧲 Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights
[01:04] 🖼 Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
[01:49] 🔀 PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models
[02:30] 🤖 VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
[03:08] 🎮 Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition
[03:48] 🖼 DreamCube: 3D Panorama Generation via Multi-plane Synchronization
[04:26] 🖼 Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
[05:06] 💽 InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
[05:48] 🖼 Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
[06:36] 🧠 UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
[07:16] ⚖ Reranking-based Generation for Unbiased Perspective Summarization
[07:52] 🚗 Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

9 minutes · 92 · 1 month ago

2025.06.19 | The Sekai dataset advances video generation; prototype reasoning strengthens LLM generalization.

This episode covers the following 15 papers:
[00:22] 🌍 Sekai: A Video Dataset towards World Exploration
[01:02] 💡 ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
[01:43] 💡 GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
[02:24] 🗣 BUT System for the MLC-SLM Challenge
[03:10] 🤖 Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
[03:57] 💡 Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
[04:43] 🔬 SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
[05:26] 🚀 Truncated Proximal Policy Optimization
[06:04] 🖼 PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
[06:37] 🖼 CoMemo: LVLMs Need Image Context with Image Memory
[07:21] 🤖 SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence
[08:01] 🧠 MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
[08:45] 🛡 OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
[09:34] 🏞 ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
[10:09] 🤝 FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models

11 minutes · 66 · 1 month ago

2025.06.18 | MultiFinBen reveals the limits of financial models; test-time compute boosts LLM agent performance.

This episode covers the following 15 papers:
[00:23] 📊 MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
[01:03] 🤖 Scaling Test-time Compute for LLM Agents
[01:38] 🎼 CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
[02:16] 💬 LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
[02:57] 🤔 Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
[03:40] 🧠 Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
[04:20] 🗣 Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
[05:02] ⚕ Efficient Medical VIE via Reinforcement Learning
[05:40] 🤔 Reasoning with Exploration: An Entropy Perspective
[06:18] 🧠 QFFT, Question-Free Fine-Tuning for Adaptive Reasoning
[06:52] 🎨 Align Your Flow: Scaling Continuous-Time Flow Map Distillation
[07:27] 🧪 Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
[08:07] 🤖 Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
[08:58] 🛠 CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
[09:38] 📊 xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations

10 minutes · 89 · 1 month ago

2025.06.17 | MiniMax-M1 boosts reasoning performance; a new cognitive test for multimodal models.

This episode covers the following 15 papers:
[00:22] 💡 MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
[01:00] 🔬 Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
[01:47] 🧐 DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
[02:28] 🧠 DoTA-RAG: Dynamic of Thought Aggregation RAG (a retrieval-augmented generation system for large-scale web knowledge indexing)
[03:08] 🧠 Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
[03:52] 💡 Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency
[04:28] 🤖 TaskCraft: Automated Generation of Agentic Tasks
[05:04] 🤯 Discrete Diffusion in Large Language and Multimodal Models: A Survey
[05:42] 🪞 Test3R: Learning to Reconstruct 3D at Test Time
[06:25] 🖼 VGR: Visual Grounded Reasoning
[07:06] 🤖 PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization
[07:50] 🤖 From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
[08:32] 🤖 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
[09:11] 🧠 Language Surgery in Multilingual Large Language Models
[09:44] 🤖 AI Agent Behavioral Science

10 minutes · 99+ · 1 month ago

2025.06.16 | Cross-modal synthesis of novel-view images; red-teaming policy-adherent agents.

This episode covers the following 15 papers:
[00:23] 🖼 Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
[01:02] 🛡 Effective Red-Teaming of Policy-Adherent Agents
[01:39] 🔄 The Diffusion Duality
[02:20] 🤖 LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
[03:09] 🧠 pLSTM: parallelizable Linear Source Transition Mark networks
[03:50] 🖼 A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
[04:36] 🧠 Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache
[05:16] 🤖 SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending
[06:00] 🧠 SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning
[06:42] 🛡 Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning
[07:17] 🎬 DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
[07:59] ⚙ Configurable Preference Tuning with Rubric-Guided Synthetic Data
[08:41] 👁 ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
[09:29] 🔄 A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
[10:16] 🔍 Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

11 minutes · 69 · 1 month ago

2025.06.13 | A new paradigm for medical reasoning models; automated construction of software-engineering datasets.

This episode covers the following 15 papers:
[00:22] 🩺 ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
[01:12] 🏭 SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
[01:55] 🖼 Text-Aware Image Restoration with Diffusion Models
[02:36] 🎬 VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
[03:22] 🎬 AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
[04:09] 🧮 Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
[04:52] 🎮 Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
[05:27] 🧠 Magistral (Mistral's first reasoning model)
[06:07] 🤖 AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
[06:53] 🎨 PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
[07:43] 🎬 VideoDeepResearch: Long Video Understanding With Agentic Tool Using
[08:22] 🚫 ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
[09:01] 🎨 CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
[09:48] 💡 Resa: Transparent Reasoning Models via SAEs
[10:30] 🤖 Ming-Omni: A Unified Multimodal Model for Perception and Generation

11 minutes · 93 · 1 month ago

2025.06.12 | Confidence-based fine-tuning lifts model performance; efficient optimization of video generation models.

This episode covers the following 13 papers:
[00:23] 🧠 Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
[01:07] 🎬 Seedance 1.0: Exploring the Boundaries of Video Generation Models
[01:50] 🥽 PlayerOne: Egocentric World Simulator
[02:30] 🎬 Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
[03:15] 🤖 ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
[03:48] 🧠 SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
[04:25] 🧪 SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
[05:10] 🎶 Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
[05:52] 🎭 InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
[06:34] 🤖 SAFE: Multitask Failure Detection for Vision-Language-Action Models
[07:14] 🧠 Reparameterized LLM Training via Orthogonal Equivalence Transformation
[07:56] 👁 MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
[08:39] 🌱 Branched Schrödinger Bridge Matching

9 minutes · 84 · 1 month ago

2025.06.11 | LLMs show geopolitical bias; RuleReasoner improves reasoning efficiency.

This episode covers the following 15 papers:
[00:22] 🌍 Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models
[01:09] 🤖 RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
[01:48] 🖼 Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
[02:30] 🎬 Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
[03:08] 🧮 Solving Inequality Proofs with Large Language Models
[03:49] 🤖 Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation
[04:25] 🖼 Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
[05:05] 🤖 Aligning Text, Images, and 3D Structure Token-by-Token
[05:51] 🔍 ECoRAG: Evidentiality-guided Compression for Long Context RAG
[06:28] 🎬 DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
[07:14] 🖼 Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
[08:06] 🗜 Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor
[08:46] 🤖 Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
[09:21] 🧩 MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models
[09:58] 📚 Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
【Follow us】You can also find us on the platform below for more information beyond the podcast. Xiaohongshu: AI速递

11 minutes · 85 · 1 month ago