2025.04.08 | Minute-long AI video generation; small models outperform large ones

The 15 papers in this episode:
[00:21] 🎬 One-Minute Video Generation with Test-Time Training
[01:03] 💡 SmolVLM: Redefining small and efficient multimodal models
[01:39] 🖼 URECA: Unique Region Caption Anything
[02:17] 🧰 T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
[03:02] 🖼 Concept Lancet: Image Editing with Compositional Representation Transplant
[03:41] 🤔 Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
[04:26] 📰 LiveVQA: Live Visual Knowledge Seeking
[05:08] 🎨 Gaussian Mixture Flow Matching Models
[05:47] 💡 VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
[06:26] 🕵 Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
[07:17] 🧰 DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
[07:54] ⚕ Clinical ModernBERT: An efficient and long context encoder for biomedical text
[08:28] 🐍 Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
[09:12] 🤖 BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation
[09:48] 🛡 JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

[Follow us] You can also find us on the following platform for more than the podcast itself. Xiaohongshu: AI速递

10 min
99+
1 month ago

2025.04.07 | A multilingual benchmark exposes the limits of LLMs' cross-language generalization; a new agentic method improves planning efficiency and adaptability.

The 15 papers in this episode:
[00:23] 🛠 Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
[01:07] 🧠 Agentic Knowledgeable Self-awareness
[01:49] 🧮 MegaMath: Pushing the Limits of Open Math Corpora
[02:32] 🤖 SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
[03:20] 🖼 MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
[04:03] 🖼 VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
[04:42] 🔄 TransMamba: Flexibly Switching between Transformer and Mamba
[05:21] 🤖 APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
[05:59] 🧑 HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
[06:39] 💡 Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization
[07:20] 👂 EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling
[08:02] 🫁 MedSAM2: Segment Anything in 3D Medical Images and Videos
[08:47] ⚖ BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models
[09:35] 🚄 Slow-Fast Architecture for Video Multi-Modal Large Language Models
[10:14] 🎨 SPF-Portrait: Towards Pure Portrait Customization with Semantic Pollution-Free Fine-tuning

11 min
99+
1 month ago

[Month-End Special] March's hottest AI papers | Sparse autoencoders improve artificial-text detection; Dynamic Tanh optimizes Transformers

The 10 papers in this episode:
[00:42] TOP1 (🔥226) | 🤖 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
[03:07] TOP2 (🔥153) | 🧠 Transformers without Normalization
[04:59] TOP3 (🔥136) | 🎥 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
[07:51] TOP4 (🔥135) | 🦢 RWKV-7 "Goose" with Expressive Dynamic State Evolution
[11:11] TOP5 (🔥130) | 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[13:27] TOP6 (🔥129) | 🇷🇺 RuCCoD: Towards Automated ICD Coding in Russian
[15:41] TOP7 (🔥120) | 🤖 Qwen2.5-Omni Technical Report
[18:17] TOP8 (🔥114) | 🌐 Unified Reward Model for Multimodal Understanding and Generation
[20:30] TOP9 (🔥113) | 🤖 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
[22:29] TOP10 (🔥112) | 🧠 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

25 min
99+
1 month ago

2025.04.04 | Agents improve autonomously; reasoning matters for visual editing.

The 15 papers in this episode:
[00:19] 🧠 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
[01:01] 🖼 Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
[01:41] 🖼 GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
[02:25] 🤖 Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
[03:08] 🗣 Scaling Analysis of Interleaved Speech-Text Language Models
[03:52] 🎬 SkyReels-A2: Compose Anything in Video Diffusion Transformers
[04:36] 🧊 ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
[05:13] 📉 ZClip: Adaptive Spike Mitigation for LLM Pre-Training
[05:50] 🧠 Inference-Time Scaling for Generalist Reward Modeling
[06:32] 🗣 Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
[07:12] ⏱ Efficient Model Selection for Time Series Forecasting via LLMs
[07:55] 🤖 Scaling Laws in Scientific Discovery with AI and Robot Scientists
[08:35] 🧠 Instruction-Guided Autoregressive Neural Network Parameter Generation
[09:18] 🤖 GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
[10:01] 🧠 Interpreting Emergent Planning in Model-Free Reinforcement Learning

11 min
99+
1 month ago

2025.04.03 | MergeVQ efficiently generates high-quality images; R1-Zero-like training improves visual-spatial reasoning.

The 15 papers in this episode:
[00:23] 🎨 MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
[01:00] 🧠 Improved Visual-Spatial Reasoning via R1-Zero-Like Training
[01:45] 🎮 AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
[02:25] 🎬 VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
[03:03] 🎭 DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
[03:42] 🧐 Understanding R1-Zero-Like Training: A Critical Perspective
[04:28] 🎬 Towards Physically Plausible Video Generation via VLM Planning
[05:09] 🤖 PaperBench: Evaluating AI's Ability to Replicate AI Research
[05:49] 🤖 ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
[06:31] 💡 ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
[07:11] 💃 Articulated Kinematics Distillation from Video Diffusion Models
[07:51] 🛡 Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
[08:32] 👁 DASH: Detection and Assessment of Systematic Hallucinations of VLMs
[09:11] 🖼 Boost Your Human Image Generation Model via Direct Preference Optimization
[09:47] 👁 LSNet: See Large, Focus Small

10 min
87
1 month ago

2025.04.02 | More precise video generation; reinforcement learning strengthens video understanding.

The 15 papers in this episode:
[00:21] 🎬 Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
[01:01] 🎬 Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
[01:48] ⚖ JudgeLRM: Large Reasoning Models as a Judge
[02:30] 🤖 CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
[03:13] 💡 Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
[04:02] 🎥 GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
[04:48] 💻 Z1: Efficient Test-time Scaling with Code
[05:26] 🤖 Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
[06:08] 💃 MixerMDM: Learnable Composition of Human Motion Diffusion Models
[06:46] 🏢 Command A: An Enterprise-Ready Large Language Model
[07:31] 💡 Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
[08:09] 🎬 OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
[08:53] 🤯 Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
[09:40] 🖼 Scaling Language-Free Visual Representation Learning
[10:23] 🤔 When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

11 min
99+
1 month ago

2025.04.01 | A new method for multi-text rendering; movie-grade talking character synthesis

The 15 papers in this episode:
[00:22] 🖼 TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes
[00:59] 🎬 MoCha: Towards Movie-Grade Talking Character Synthesis
[01:39] 🔍 What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
[02:16] 🤖 Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
[03:05] 🧠 RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
[03:48] 🧠 Effectively Controlling Reasoning Models through Thinking Intervention
[04:32] 💡 Query and Conquer: Execution-Guided SQL Generation
[05:15] ✍ SketchVideo: Sketch-based Video Generation and Editing
[06:04] 🚨 TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
[06:57] 💡 Efficient Inference for Large Reasoning Models: A Survey
[07:40] 🤖 Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code
[08:29] 🧪 Expanding RL with Verifiable Rewards Across Diverse Domains
[09:11] ✨ Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
[09:50] 🤖 TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
[10:30] 🇰🇷 KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

11 min
99+
2 months ago

2025.03.31 | Fewer tokens, better domain efficiency.

The 15 papers in this episode:
[00:22] 💡 AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation
[01:01] 🤖 Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
[01:41] 🤔 Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
[02:19] 💡 A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
[02:58] 🖼 ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
[03:44] 🧠 OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
[04:25] 🔄 ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
[04:59] 🎬 Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
[05:37] 🧪 PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
[06:24] 🗣 Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
[07:03] 🎬 Segment Any Motion in Videos
[07:42] 🖼 Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
[08:28] 🖼 Your ViT is Secretly an Image Segmentation Model
[09:04] 🤔 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
[09:48] 💡 A Refined Analysis of Massive Activations in LLMs

10 min
99+
2 months ago

2025.03.28 | Better video reasoning; improved GUI action prediction

The 15 papers in this episode:
[00:22] 🧠 Video-R1: Reinforcing Video Reasoning in MLLMs
[01:02] 📱 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
[01:41] 🤯 Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
[02:25] 🎬 VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
[03:05] 🖼 LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
[03:38] 🤖 Large Language Model Agent: A Survey on Methodology, Applications and Challenges
[04:23] 🧠 ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
[05:01] 🖼 Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
[05:48] 🤖 Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
[06:27] 💡 ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
[07:12] 🚀 Optimal Stepsize for Diffusion Sampling
[07:46] 🤔 Exploring the Evolution of Physics Cognition in Video Generation: A Survey
[08:24] 🎤 FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
[09:01] 🗣 ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
[09:40] 🧠 ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging

10 min
92
2 months ago

2025.03.27 | Dita excels as a cross-modal policy; Qwen2.5-Omni responds in real time across modalities.

The 15 papers in this episode:
[00:26] 🤖 Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
[01:07] 🤖 Qwen2.5-Omni Technical Report
[01:46] 🧩 LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
[02:35] 🎬 Wan: Open and Advanced Large-Scale Video Generative Models
[03:24] 💡 Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
[04:04] 🔍 Open Deep Search: Democratizing Search with Open-source Reasoning Agents
[04:44] 🖼 GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
[05:24] 📊 BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
[06:01] 🤖 Gemini Robotics: Bringing AI into the Physical World
[06:39] 🧠 MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
[07:22] 🚀 AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
[07:54] 🖼 ViLBench: A Suite for Vision-Language Process Reward Modeling
[08:33] 💾 LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
[09:12] 🚗 ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
[09:55] 🖼 Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

11 min
99+
2 months ago