2025.04.22 | LUFFY boosts reasoning performance; FlowReasoner improves system adaptability.

This episode covers the following 15 papers:
[00:25] 🧠 Learning to Reason under Off-Policy Guidance
[01:00] 🤖 FlowReasoner: Reinforcing Query-Level Meta-Agents
[01:40] 🦅 Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
[02:22] 🧰 ToolRL: Reward is All Tool Learning Needs
[03:07] 🌐 SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation
[03:39] 🎨 StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
[04:18] 🤖 X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
[04:57] 🤖 UFO2: The Desktop AgentOS
[05:34] 🧑 LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
[06:18] 👀 Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
[07:02] 🤖 InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
[07:42] 🕹 EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
[08:23] 📱 LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
[09:06] 🖼 LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
[09:50] 🎵 DRAGON: Distributional Rewards Optimize Diffusion Generative Models
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

10 min
41
1 day ago

2025.04.21 | Reinforcement learning does not elicit reasoning abilities beyond the base model; MIG optimizes data selection for instruction tuning.

This episode covers the following 9 papers:
[00:22] 🤔 Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
[00:59] 🧠 MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
[01:41] 🤔 Could Thinking Multilingually Empower LLM Reasoning?
[02:25] 🏙 AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
[03:09] 🏠 HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
[03:52] 💡 NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes
[04:30] 🧠 It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
[05:07] 🏞 Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
[05:51] 🧠 Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

6 min
90
2 days ago

2025.04.18 | CLIMB improves domain model performance; antidistillation sampling protects models from unauthorized distillation.

This episode covers the following 15 papers:
[00:23] 🗂 CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
[01:03] 🧪 Antidistillation Sampling
[01:41] 🤝 A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
[02:26] 🎬 Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
[03:02] 🤖 Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
[03:43] 🧠 WORLDMEM: Long-term Consistent World Simulation with Memory
[04:27] 🎬 VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
[05:01] 🤖 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
[05:43] 🎨 DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
[06:20] 📊 ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
[07:07] 🤖 Exploring Expert Failures Improves LLM Agent Tuning
[07:48] 🎨 InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
[08:26] 📸 CCMNet: Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy
[09:06] 🎬 FocusedAD: Character-centric Movie Audio Description
[09:39] 🤔 Retrieval-Augmented Generation with Conflicting Evidence
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

10 min
85
5 days ago

2025.04.17 | ColorBench probes VLMs' color understanding; BitNet improves computational efficiency.

This episode covers the following 11 papers:
[00:27] 🎨 ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
[01:09] 💡 BitNet b1.58 2B4T Technical Report
[01:50] 🎨 Cobra: Efficient Line Art COlorization with BRoAder References
[02:28] 🚀 AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
[03:05] 🗣 SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
[03:51] 🧰 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
[04:31] 🚀 REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
[05:09] 📹 Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
[05:51] 🤖 Robust and Fine-Grained Detection of AI Generated Texts
[06:34] 🧠 Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
[07:18] 🖼 BlockGaussian: Efficient Large-Scale Scene Novel View Synthesis via Adaptive Block-Based Gaussian Splatting
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

8 min
85
6 days ago

2025.04.16 | Genius improves LLM reasoning; xVerify verifies reasoning models efficiently.

This episode covers the following 15 papers:
[00:22] 🧠 Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
[01:06] ✅ xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
[01:52] 🖼 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
[02:37] ✅ Heimdall: test-time scaling on the generative verification
[03:23] 🎨 Seedream 3.0 Technical Report
[04:07] 📊 How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
[04:54] 🎮 TextArena (a collection of competitive text games for training and evaluating agentic behavior in large language models)
[05:43] 🧠 The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
[06:22] 🤖 Efficient Process Reward Model Training via Active Learning
[07:01] 🚀 Efficient Generative Model Training via Embedded Representation Warmup
[07:43] 🎥 NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
[08:23] 🧠 A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
[09:00] 🧮 DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
[09:43] 🚗 Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion
[10:25] 📹 PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

11 min
99+
1 week ago

2025.04.15 | Multimodal models get a performance boost; LLM inference is accelerated on low-resource hardware.

This episode covers the following 15 papers:
[00:23] 🖼 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
[01:03] 🏠 PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
[01:46] 🖼 FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
[02:26] 🤔 VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
[03:07] 🤖 Iterative Self-Training for Code Generation via Reinforced Re-Ranking
[03:51] 🎬 Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
[04:28] 🤖 AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
[05:13] 🧠 S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
[05:56] 🤔 Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
[06:42] 🤖 DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
[07:22] 🌍 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
[08:11] 🤖 Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
[08:56] 💡 TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
[09:40] 🧪 LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
[10:21] 🛡 EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

11 min
99+
1 week ago

2025.04.14 | Cost-effective video generation; autoregressive image generation scales up.

This episode covers the following 13 papers:
[00:24] 🎬 Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
[01:00] 🖼 GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
[01:42] 🎮 MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
[02:25] 🖼 PixelFlow: Pixel-Space Generative Models with Flow
[03:05] 🤖 SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
[03:51] 🎨 FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation
[04:30] 🎬 In-2-4D: Inbetweening from Two Single-View Images to 4D Generation
[05:05] 🤔 ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
[05:42] 🚀 Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
[06:21] 🤔 Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
[07:11] 🛡 SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
[07:52] 🤝 CoRAG: Collaborative Retrieval-Augmented Generation
[08:29] 🤝 InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

9 min
99+
1 week ago

2025.04.11 | Kimi-VL delivers strong performance; VCR-Bench evaluates reasoning bottlenecks.

This episode covers the following 14 papers:
[00:22] 🧠 Kimi-VL Technical Report
[01:05] 🎬 VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
[01:54] 🖼 MM-IFEngine: Towards Multimodal Instruction Following
[02:35] 🖼 VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
[03:15] 🤔 DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
[03:54] 🧩 HoloPart: Generative 3D Part Amodal Segmentation
[04:36] 🤖 C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing
[05:11] 🤖 MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
[05:58] 🖼 Scaling Laws for Native Multimodal Models
[06:30] 🧠 SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
[07:16] 🖼 Towards Visual Text Grounding of Multimodal Large Language Model
[07:57] 🤖 MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
[08:39] 🧭 Compass Control: Multi Object Orientation Control for Text-to-Image Generation
[09:22] 📍 TAPNext: Tracking Any Point (TAP) as Next Token Prediction
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

10 min
99+
1 week ago

2025.04.10 | DDT improves image generation quality; GenDoP optimizes camera trajectory generation.

This episode covers the following 15 papers:
[00:25] 🎨 DDT: Decoupled Diffusion Transformer
[01:05] 🎬 GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
[01:49] 🔍 OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
[02:28] 🖼 A Unified Agentic Framework for Evaluating Conditional Image Generation
[03:11] 🤔 Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
[03:57] 🗣 FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
[04:34] 🧐 A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
[05:15] 🖼 OmniCaptioner: One Captioner to Rule Them All
[05:57] 🧩 Are We Done with Object-Centric Learning?
[06:35] 🤖 Self-Steering Language Models
[07:09] 🇷 RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts
[07:51] 🤖 Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
[08:30] 👂 DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
[09:05] 🤖 VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
[09:47] 🤖 WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

10 min
71
1 week ago

2025.04.09 | OmniSVG generates high-quality SVG graphics; Skywork R1V excels at multimodal reasoning.

This episode covers the following 13 papers:
[00:22] 🎨 OmniSVG: A Unified Scalable Vector Graphics Generation Model
[01:02] 🧠 Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
[01:42] 🖼 An Empirical Study of GPT-4o Image Generation Capabilities
[02:22] 🚀 Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
[03:03] 🎨 Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
[03:46] 🧠 COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
[04:24] 🤔 Generative Evaluation of Complex Reasoning in Large Language Models
[05:14] 🎨 Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
[05:53] 🎮 V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
[06:32] 🧩 CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
[07:15] 🖼 HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance
[07:57] 💡 Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
[08:41] 🤖 Leanabell-Prover: Posttraining Scaling in Formal Reasoning
【Follow us】You can also find us on the platforms below for more beyond the podcast. 小红书 (Xiaohongshu): AI速递

9 min
95
2 weeks ago
