The 21 papers in this episode:
[00:24] 🤖 CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
[01:11] 🌲 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
[01:55] 🌐 PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
[02:37] 🤖 AutoTrain: No-code training for state-of-the-art models
[03:10] ⚡ FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
[03:56] 📊 Baichuan Alignment Technical Report
[04:39] 🌍 Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
[05:21] 🔍 RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
[06:05] 📚 Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
[06:41] 🔍 Pre-training Distillation for Large Language Models: A Design Space Exploration
[07:16] 🔬 Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
[07:55] 🔄 SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
[08:31] 📚 Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
[09:11] 🤖 Zero-shot Model-based Reinforcement Learning using Large Language Models
[09:53] 🗣 Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
[10:28] 🧠 CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
[11:12] 🛠 Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
[11:58] 🧠 Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
[12:45] 🌍 Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
[13:25] 🗣 DM-Codec: Distilling Multimodal Representations for Speech Tokenization
[14:17] 🧠 In-context learning and Occam's razor
[Follow us] You can also find us on the following platform for more than the podcast itself. Xiaohongshu (小红书): AI速递
The 12 papers in this episode:
[00:27] 🌐 Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
[01:11] 👗 MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
[01:48] 💼 UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
[02:37] 🧠 NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
[03:12] 🧠 SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
[03:54] 📊 Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
[04:25] 🌐 Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion
[05:08] 🎥 DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
[05:50] 🔄 A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
[06:31] 🧬 DPLM-2: A Multimodal Diffusion Protein Language Model
[07:12] 📰 Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
[07:56] 🧠 How Do Training Methods Influence the Utilization of Vision Models?
The 5 papers in this episode:
[00:45] TOP1(🔥80) | 🌐 Baichuan-Omni Technical Report
[02:20] TOP2(🔥58) | 📊 MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
[04:20] TOP3(🔥58) | 🎥 Movie Gen: A Cast of Media Foundation Models
[06:27] TOP4(🔥53) | 🤖 LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
[08:23] TOP5(🔥48) | 🌐 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
The 31 papers in this episode:
[00:23] 📊 MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
[01:02] 🎥 Movie Gen: A Cast of Media Foundation Models
[01:35] 📱 MobA: A Two-Level Agent System for Efficient Mobile Task Automation
[02:18] 🌐 Harnessing Webpage UIs for Text-Rich Visual Understanding
[02:59] 🔄 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
[03:29] 🩺 MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
[04:04] 📊 A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models
[04:46] 🔄 PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
[05:23] 🔍 BenTo: Benchmark Task Reduction with In-Context Transferability
[06:03] 🎥 DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
[06:49] 🧠 MoH: Multi-Head Attention as Mixture-of-Head Attention
[07:28] 🎥 VidPanos: Generative Panoramic Videos from Casual Panning Videos
[08:03] 📉 FlatQuant: Flatness Matters for LLM Quantization
[08:44] 🔄 Retrospective Learning from Interactions
[09:22] 🔄 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
[10:06] 🖼 Can MLLMs Understand the Deep Implication Behind Chinese Images?
[10:43] 📱 MedMobile: A mobile-sized language model with expert-level clinical capabilities
[11:22] 🌍 WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
[12:04] 🤖 Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
[12:48] 🔄 LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning
[13:29] 🔒 AERO: Softmax-Only LLMs for Efficient Private Inference
[14:12] 🌐 γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
[14:45] 🌐 Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
[15:24] 🎶 MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
[16:05] 🔒 Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
[16:48] 📚 SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation
[17:27] 🗺 Roadmap towards Superhuman Speech Understanding using Large Language Models
[18:05] 🔄 Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
[18:47] 🤖 TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
[19:25] 🔬 Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models
[20:05] 📚 Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
The 19 papers in this episode:
[00:28] 🧠 HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
[01:15] 🎥 VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
[01:50] 🧠 The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
[02:31] 🤖 Revealing the Barriers of Language Agents in Planning
[03:15] 📄 DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
[03:56] ⚙ Large Language Model Evaluation via Matrix Nuclear-Norm
[04:38] 🧬 Exploring Model Kinship for Merging Large Language Models
[05:15] 📊 ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
[05:50] ⚡ ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
[06:31] 📄 Improving Long-Text Alignment for Text-to-Image Diffusion Models
[07:11] 🔄 Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
[07:55] 🛡 Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
[08:34] 🔍 Tracking Universal Features Through Fine-Tuning and Model Merging
[09:08] 🔄 Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
[09:46] 🧠 Neural Metamorphosis
[10:25] 🌍 WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
[11:09] 🌐 OMCAT: Omni Context Aware Transformer
[11:44] ⏳ ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
[12:22] 📚 DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
The 14 papers in this episode:
[00:26] 🤖 MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
[01:07] 🛠 MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
[01:47] 📚 LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
[02:25] 🛡 SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
[03:01] 📹 LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
[03:44] 🧠 What Matters in Transformers? Not All Attention is Needed
[04:18] 🌟 GS^3: Efficient Relighting with Triple Gaussian Splatting
[04:51] 🤯 Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
[05:31] 🌍 Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
[06:08] 🚀 SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
[06:43] 📊 Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
[07:14] 🤖 Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
[07:58] 🔄 Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt
[08:37] 🌍 Towards Natural Image Matting in the Wild via Real-Scenario Prior
The 15 papers in this episode:
[00:24] 🌐 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
[01:06] 🤖 LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
[02:01] 🔍 Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
[02:36] 📊 MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
[03:12] 🎥 Animate-X: Universal Character Image Animation with Enhanced Motion Representation
[04:02] 📚 Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
[04:44] 📚 LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
[05:29] 🎥 Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
[06:09] ⏳ TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
[06:58] 🌊 Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
[07:40] 📊 Rethinking Data Selection at Scale: Random Selection is Almost All You Need
[08:26] 🌲 Tree of Problems: Improving structured problem solving with compositionality
[09:13] 📺 TVBench: Redesigning Video-Language Evaluation
[09:54] 🤖 Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
[10:29] 📚 LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
The 16 papers in this episode:
[00:25] 🌐 Baichuan-Omni Technical Report
[00:59] 🖼 Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
[01:41] 🔧 From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
[02:17] 🎨 EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
[02:53] 🧠 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
[03:34] 📏 PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
[04:11] 🌐 Semantic Score Distillation Sampling for Compositional Text-to-3D Generation
[04:47] 🧠 SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
[05:29] 🔄 Mechanistic Permutability: Match Features Across Layers
[06:07] 🤖 Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
[06:45] ⚡ KV Prediction for Improved Time to First Token
[07:30] 🌐 ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
[08:13] 🚨 MiRAGeNews: Multimodal Realistic AI-Generated News Detection
[08:52] 🤖 DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
[09:30] 📈 I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
[10:12] 🧠 Mentor-KD: Making Small Language Models Better Multi-step Reasoners
The 5 papers in this episode:
[00:37] TOP1(🔥128) | 🔍 Differential Transformer
[02:38] TOP2(🔥125) | ⚡ Addition is All You Need for Energy-efficient Language Models
[04:13] TOP3(🔥84) | 🌐 Aria: An Open Multimodal Native Mixture-of-Experts Model
[06:18] TOP4(🔥73) | 🤖 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
[08:25] TOP5(🔥63) | 👤 Personalized Visual Instruction Tuning
The 21 papers in this episode:
[00:25] 🧮 MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
[01:09] 🚀 PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
[01:59] 🤖 MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
[02:33] 🎨 DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
[03:03] 🔄 Benchmarking Agentic Workflow Generation
[03:44] 🤖 Agent S: An Open Agentic Framework that Uses Computers Like a Human
[04:23] 🔄 Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
[04:55] 🤖 Intriguing Properties of Large Language and Vision Models
[05:35] 🎥 Progressive Autoregressive Video Diffusion Models
[06:26] 🌲 Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
[07:10] 🌐 Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
[07:50] 🤖 GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
[08:36] 🧩 SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
[09:15] 🔄 Emergent properties with repeated examples
[09:57] 🤖 Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
[10:40] 🎲 Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
[11:14] 🌐 Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
[11:58] 🧬 LPZero: Language Model Zero-cost Proxy Search from Zero
[12:41] 🌐 MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting
[13:15] 🔍 Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
[13:51] 🖼 DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
The 10 papers in this episode:
[00:40] TOP1(🔥129) | 🤖 Training Language Models to Self-Correct via Reinforcement Learning
[02:41] TOP2(🔥121) | 🚀 Qwen2.5-Coder Technical Report
[04:44] TOP3(🔥96) | 🌐 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
[06:30] TOP4(🔥95) | 🖼 Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
[08:23] TOP5(🔥86) | 🧠 Attention Heads of Large Language Models: A Survey
[10:17] TOP6(🔥85) | 🎥 Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
[11:56] TOP7(🔥81) | 🌐 OmniGen: Unified Image Generation
[13:51] TOP8(🔥81) | 🧠 Emu3: Next-Token Prediction is All You Need
[15:45] TOP9(🔥78) | 📄 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
[17:59] TOP10(🔥77) | 🧠 OLMoE: Open Mixture-of-Experts Language Models
The 43 papers in this episode:
[00:23] 🤖 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
[01:09] 👤 Personalized Visual Instruction Tuning
[01:48] 🌍 Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
[02:35] 🖼 IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
[03:17] 🔍 Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
[03:54] 🌐 Aria: An Open Multimodal Native Mixture-of-Experts Model
[04:29] 🌐 Pixtral 12B
[05:09] 🎥 Pyramidal Flow Matching for Efficient Video Generative Modeling
[05:49] 🔗 Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
[06:29] 🎥 MM-Ego: Towards Building Egocentric Multimodal LLMs
[07:07] 🔄 One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
[07:51] 📖 Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
[08:33] 🚀 Self-Boosting Large Language Models with Synthetic Preference Data
[09:13] 🚀 Falcon Mamba: The First Competitive Attention-free 7B Language Model
[09:53] 🎨 TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
[10:24] ⏳ Temporal Reasoning Transfer from Text to Video
[10:54] 🎥 TRACE: Temporal Grounding Video LLM via Causal Event Modeling
[11:30] 📊 Data Selection via Optimal Control for Language Models
[12:07] 🤖 Response Tuning: Aligning Large Language Models without Instruction
[12:49] 🤖 CursorCore: Assist Programming through Aligning Anything
[13:36] 🎥 ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
[14:16] 🗣 Mixed-Session Conversation with Egocentric Memory
[14:57] 🎮 ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
[15:41] 🔓 AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
[16:26] 🎥 T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
[17:00] 📖 Collective Critics for Creative Story Generation
[17:36] 🎵 Diversity-Rewarded CFG Distillation
[18:16] 🧠 Retrieval-Augmented Decision Transformer: External Memory for In-context RL
[18:57] 🎙 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
[19:32] 🎹 FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance
[20:20] 🧠 Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
[21:01] 🧬 Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning
[21:38] 🎥 BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way
[22:21] 🚨 Multimodal Situational Safety
[22:56] 💥 Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
[23:38] 🛠 Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach
[24:18] 🌐 Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control
[24:55] 🤖 TinyEmo: Scaling down Emotional Reasoning via Metric Projection
[25:29] 🧠 MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders
[26:08] 🎭 TextToon: Real-Time Text Toonify Head Avatar from Single Video
[26:49] 🤖 Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
[27:28] 📊 MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
[28:03] 🧠 Does Spatial Cognition Emerge in Frontier Models?