Episode list: HuggingFace 每日AI论文速递 (HuggingFace Daily AI Papers Express) - EarsOnMe | Discover and listen to popular podcasts from Xiaoyuzhou (小宇宙)

2025.03.26 | Better video prediction performance; strong gains from multimodal pre-training.

The 15 papers in this episode:
[00:22] 🎬 Long-Context Autoregressive Video Modeling with Next-Frame Prediction
[01:01] 🖼 CoMP: Continual Multimodal Pre-training for Vision Foundation Models
[01:42] 🎬 Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
[02:28] 📈 Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
[03:14] 🖼 Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
[03:54] 🖼 Scaling Vision Pre-Training to 4K Resolution
[04:33] 🤔 Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
[05:15] 🖼 CoLLM: A Large Language Model for Composed Image Retrieval
[05:53] 🤖 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
[06:35] 🖼 Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
[07:13] 🔍 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
[07:54] 🛡 LookAhead Tuning: Safer Language Models via Partial Answer Previews
[08:38] 💡 Frequency Dynamic Convolution for Dense Image Prediction
[09:18] 🖼 LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
[09:51] 🧬 Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
[Follow us] You can also find us on the following platform for more information beyond the podcast: Xiaohongshu (小红书): AI速递

10 minutes

99+

4 weeks ago

2025.03.25 | Sparse autoencoders interpret reasoning features in LLMs; interactive video breaks new ground

The 15 papers in this episode:
[00:24] 🧠 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
[01:03] 🎮 Position: Interactive Generative Video as Next-Generation Game Engine
[01:47] 🎬 Video-T1: Test-Time Scaling for Video Generation
[02:35] 🌐 Aether: Geometric-Aware Unified World Modeling
[03:11] 🧠 SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
[03:51] 🎬 OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models
[04:31] 🤖 Judge Anything: MLLM as a Judge Across Any Modality
[05:16] 💡 LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
[05:57] 🖼 Equivariant Image Modeling
[06:37] 🚀 Training-free Diffusion Acceleration with Bottleneck Sampling
[07:11] ✨ CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models
[07:59] 🤔 Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
[08:39] 🚄 FFN Fusion: Rethinking Sequential Computation in Large Language Models
[09:20] 🛡 Defeating Prompt Injections by Design
[10:00] 🤝 AgentRxiv: Towards Collaborative Autonomous Research

11 minutes

99+

4 weeks ago

2025.03.24 | Multi-agent collaboration improves performance; Socratic dialogue optimizes prompts.

The 15 papers in this episode:
[00:22] 🧠 MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
[01:09] 🤖 MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
[01:55] 🤖 RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
[02:38] 🧮 When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
[03:21] 🌉 Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
[03:55] 🧠 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
[04:37] ✍ Modifying Large Language Model Post-Training for Diverse Creative Writing
[05:21] 🧮 MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
[06:05] 🎬 Enabling Versatile Controls for Video Diffusion Models
[06:48] 🎬 ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
[07:27] 🖼 Single Image Iterative Subject-driven Generation and Editing
[08:12] 🎨 When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
[08:56] ⚖ From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
[09:37] 🚀 FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
[10:13] 🗣 TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting

11 minutes

99+

4 weeks ago

[Weekend Special] Hottest AI papers of the 3rd week of March | Sequence modeling innovation, video rendering breakthrough

The 5 papers in this episode:
[00:37] TOP1(🔥118) | 🦢 RWKV-7 "Goose" with Expressive Dynamic State Evolution
[02:36] TOP2(🔥115) | 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[05:18] TOP3(🔥89) | 🤖 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
[07:45] TOP4(🔥84) | 🎥 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
[10:28] TOP5(🔥79) | 🎨 PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

12 minutes

99+

1 month ago

2025.03.21 | Distillation makes super-resolution more efficient; optimized reasoning reduces compute.

The 15 papers in this episode:
[00:23] 🖼 One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
[01:01] 🤔 Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
[01:38] 🚀 Unleashing Vecset Diffusion Model for Fast Shape Generation
[02:18] 🤖 Survey on Evaluation of LLM-based Agents
[02:56] 🎨 DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
[03:33] 🤖 Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
[04:14] 🖼 Scale-wise Distillation of Diffusion Models
[04:54] 🗜 Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
[05:36] 🧮 MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion
[06:17] 🖼 InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
[06:56] 🎮 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
[07:41] 🧠 CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
[08:26] 🖼 Ultra-Resolution Adaptation with Ease
[09:04] 🎨 Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
[09:48] 🎬 MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

10 minutes

99+

1 month ago

2025.03.20 | Adaptive foresight sampling optimizes inference; reinforcement learning improves 3D mesh quality

The 15 papers in this episode:
[00:23] 🔍 φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
[01:08] 🎨 DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
[01:51] 🌷 TULIP: Towards Unified Language-Image Pretraining
[02:26] 🤖 Cube: A Roblox View of 3D Intelligence
[03:06] 📱 Efficient Personalization of Quantized Diffusion Model without Backpropagation
[03:48] 🎬 Temporal Regularization Makes Your Video Generator Stronger
[04:21] 🤖 STEVE: A Step Verification Pipeline for Computer-use Agent Training
[04:59] 🖼 LEGION: Learning to Ground and Explain for Synthetic Image Detection
[05:41] 🎶 MusicInfuser: Making Video Diffusion Listen and Dance
[06:24] 👋 ViSpeak: Visual Instruction Feedback in Streaming Videos
[07:03] 🧠 GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
[07:46] 👁 Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
[08:32] 🗣 Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
[09:09] 🤖 ELTEX: A Framework for Domain-Driven Synthetic Data Generation
[09:52] 🧪 CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

10 minutes

91

1 month ago

2025.03.19 | Advantages of dynamic sequence modeling; challenges in video generation and understanding

The 15 papers in this episode:
[00:21] 🦢 RWKV-7 "Goose" with Expressive Dynamic State Evolution
[00:55] 🤯 Impossible Videos
[01:38] 🎨 Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
[02:17] 🤖 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
[02:58] 🧠 DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
[03:39] 🖼 CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
[04:25] 🤖 Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
[05:13] 🧠 Frac-Connections: Fractional Extension of Hyper-Connections
[05:52] 🌍 Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
[06:30] 🧐 MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
[07:13] 🤖 Aligning Multimodal LLM with Human Preference: A Survey
[07:51] ⏱ Measuring AI Ability to Complete Long Tasks
[08:38] 🎭 Concat-ID: Towards Universal Identity-Preserving Video Synthesis
[09:13] 🖼 FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis
[09:50] 🤔 Temporal Consistency for LLM Reasoning Process Error Identification

10 minutes

99+

1 month ago

2025.03.18 | A new method for video generation; a new framework for humanoid robots

The 15 papers in this episode:
[00:21] 🎥 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
[01:10] 🤖 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
[01:49] 🖼 DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
[02:38] 🖼 Edit Transfer: Learning Image Editing via Vision In-Context Relations
[03:12] 🖼 Personalize Anything for Free with Diffusion Transformer
[03:53] 🎬 WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
[04:30] 🎨 BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
[05:14] 🛡 reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
[05:54] 🔬 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
[06:31] 🧠 Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
[07:09] 🤖 Free-form language-based robotic reasoning and grasping
[07:45] 🧠 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
[08:35] 🤔 V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
[09:18] 🎬 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
[09:51] 🖼 Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

10 minutes

99+

1 month ago

2025.03.17 | Generating new camera trajectories; sparsity improves image quality

The 15 papers in this episode:
[00:25] 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[01:11] 💡 PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
[01:50] 🤖 Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning
[02:38] 📊 Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
[03:25] 🤖 API Agents vs. GUI Agents: Divergence and Convergence
[03:57] 🛡 Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks
[04:47] 🎬 Large-scale Pre-training for Grounded Video Caption Generation
[05:31] 🌉 FlowTok: Flowing Seamlessly Across Text and Image Tokens
[06:08] ⚕ TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
[06:47] 🤔 Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
[07:27] 📸 VGGT: Visual Geometry Grounded Transformer
[08:14] 🦜 Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
[08:52] 🖼 Neighboring Autoregressive Modeling for Efficient Visual Generation
[09:26] 🔬 ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
[10:02] 🖼 ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

11 minutes

99+

1 month ago

[Weekend Special] Hottest AI papers of the 2nd week of March | Sparse autoencoders improve text detection; automated ICD coding improves healthcare efficiency.

The 5 papers in this episode:
[00:44] TOP1(🔥208) | 🤖 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
[03:15] TOP2(🔥122) | 🇷🇺 RuCCoD: Towards Automated ICD Coding in Russian
[05:35] TOP3(🔥104) | 🌐 Unified Reward Model for Multimodal Understanding and Generation
[07:58] TOP4(🔥89) | 🌏 Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
[10:21] TOP5(🔥73) | 🧠 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

12 minutes

99+

1 month ago

2025.03.14 | CoSTA* makes multi-turn editing more efficient; the Silent Branding Attack exposes diffusion model vulnerabilities.

The 15 papers in this episode:
[00:25] 🖼 CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
[01:03] 🎭 Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
[01:45] 🌍 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
[02:30] 🗺 Charting and Navigating Hugging Face's Model Atlas
[03:14] 🧠 GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
[03:48] 🎨 CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
[04:29] 🧠 Transformers without Normalization
[05:06] 🌐 GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
[05:50] 🤖 New Trends for Modern Machine Translation with Large Reasoning Models
[06:32] 📝 Shifting Long-Context LLMs Research from Input to Output
[07:09] 🌐 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
[07:54] 🧠 DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
[08:35] 🐱 Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark
[09:20] 🎥 Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
[10:01] 🎥 Long Context Tuning for Video Generation

11 minutes

85

1 month ago

2025.03.13 | Lower compute requirements for video diffusion models; higher-quality multi-view video generation.

The 15 papers in this episode:
[00:20] 🎥 TPDiff: Temporal Pyramid Video Diffusion Model
[00:58] 🎥 Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
[01:42] 🧠 Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
[02:18] 🎯 RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling
[02:55] 🧠 GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
[03:36] 📄 More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
[04:19] 💃 Motion Anything: Any to Motion Generation
[05:15] 📊 WildIFEval: Instruction Following in the Wild
[05:49] 📹 VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
[06:29] 🤖 Quantizing Large Language Models for Code Generation: A Differentiated Replication
[07:13] 🧠 Cost-Optimal Grouped-Query Attention for Long-Context LLMs
[07:53] 🧬 Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation
[08:33] 🔄 Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
[09:15] 🔄 Self-Taught Self-Correction for Small Language Models
[09:49] 🧩 MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

10 minutes

99+

1 month ago