This episode covers the following 15 papers:
[00:25] 🎥 ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
[01:11] 💡 PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
[01:50] 🤖 Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning
[02:38] 📊 Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
[03:25] 🤖 API Agents vs. GUI Agents: Divergence and Convergence
[03:57] 🛡 Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks
[04:47] 🎬 Large-scale Pre-training for Grounded Video Caption Generation
[05:31] 🌉 FlowTok: Flowing Seamlessly Across Text and Image Tokens
[06:08] ⚕ TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
[06:47] 🤔 Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
[07:27] 📸 VGGT: Visual Geometry Grounded Transformer
[08:14] 🦜 Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
[08:52] 🖼 Neighboring Autoregressive Modeling for Efficient Visual Generation
[09:26] 🔬 ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
[10:02] 🖼 ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
【Follow Us】 You can also find us on the following platform for more beyond the podcast — Xiaohongshu: AI速递
This episode covers the following 5 papers:
[00:44] TOP1 (🔥208) | 🤖 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
[03:15] TOP2 (🔥122) | 🇷🇺 RuCCoD: Towards Automated ICD Coding in Russian
[05:35] TOP3 (🔥104) | 🌐 Unified Reward Model for Multimodal Understanding and Generation
[07:58] TOP4 (🔥89) | 🌏 Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
[10:21] TOP5 (🔥73) | 🧠 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
This episode covers the following 15 papers:
[00:25] 🖼 CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
[01:03] 🎭 Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
[01:45] 🌍 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
[02:30] 🗺 Charting and Navigating Hugging Face's Model Atlas
[03:14] 🧠 GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
[03:48] 🎨 CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
[04:29] 🧠 Transformers without Normalization
[05:06] 🌐 GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
[05:50] 🤖 New Trends for Modern Machine Translation with Large Reasoning Models
[06:32] 📝 Shifting Long-Context LLMs Research from Input to Output
[07:09] 🌐 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
[07:54] 🧠 DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
[08:35] 🐱 Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark
[09:20] 🎥 Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
[10:01] 🎥 Long Context Tuning for Video Generation
This episode covers the following 15 papers:
[00:20] 🎥 TPDiff: Temporal Pyramid Video Diffusion Model
[00:58] 🎥 Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
[01:42] 🧠 Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
[02:18] 🎯 RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling
[02:55] 🧠 GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
[03:36] 📄 More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
[04:19] 💃 Motion Anything: Any to Motion Generation
[05:15] 📊 WildIFEval: Instruction Following in the Wild
[05:49] 📹 VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
[06:29] 🤖 Quantizing Large Language Models for Code Generation: A Differentiated Replication
[07:13] 🧠 Cost-Optimal Grouped-Query Attention for Long-Context LLMs
[07:53] 🧬 Multimodal Language Modeling for High-Accuracy Single Cell Transcriptomics Analysis and Generation
[08:33] 🔄 Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
[09:15] 🔄 Self-Taught Self-Correction for Small Language Models
[09:49] 🧩 MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
This episode covers the following 15 papers:
[00:23] 🌏 Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
[01:04] 🧠 LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
[01:43] 🎵 YuE: Scaling Open Foundation Models for Long-Form Music Generation
[02:17] 👤 Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
[02:59] 🎥 MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
[03:42] 🧠 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
[04:19] 🌐 Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
[05:03] 🌐 Gemini Embedding: Generalizable Embeddings from Gemini
[05:45] 🧠 Implicit Reasoning in Transformers is Reasoning through Shortcuts
[06:21] 🌟 LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
[07:06] 🎥 Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
[07:44] 🧠 Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
[08:30] 🌐 OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
[09:14] 🧠 CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
[09:52] 🎥 Video Action Differencing
This episode covers the following 11 papers:
[00:25] 🤖 Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
[01:00] 🧠 SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
[01:43] 🧠 MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
[02:27] 📝 Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
[03:09] 🎬 Automated Movie Generation via Multi-Agent CoT Planning
[03:44] 🔒 FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates
[04:18] 🔥 DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
[04:53] 🚀 EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
[05:38] 🛠 FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
[06:15] 🚗 AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
[07:01] 📚 SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
This episode covers the following 20 papers:
[00:19] 🌐 Unified Reward Model for Multimodal Understanding and Generation
[01:04] 🇷🇺 RuCCoD: Towards Automated ICD Coding in Russian
[01:41] 🌍 EuroBERT: Scaling Multilingual Encoders for European Languages
[02:28] 🗣 S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information
[03:08] 🧠 Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
[03:47] 🧠 Forgetting Transformer: Softmax Attention with a Forget Gate
[04:28] 🧠 R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
[05:19] 🎥 VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
[06:04] 🎭 R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
[06:50] 🎥 TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
[07:26] 🌊 ProReflow: Progressive Reflow with Decomposed Velocity
[08:11] 🤖 BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities
[08:50] 🧠 An Empirical Study on Eliciting and Improving R1-like Reasoning Models
[09:27] 🧠 Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
[10:13] 🧠 TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
[10:56] 🧑 LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding
[11:41] 🔄 Learning from Failures in Multi-Attempt Reinforcement Learning
[12:20] 🔍 SAGE: A Framework of Precise Retrieval for RAG
[13:01] 🧠 R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
[13:39] 🤖 Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
This episode covers the following 5 papers:
[00:35] TOP1 (🔥64) | 🧠 Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
[02:30] TOP2 (🔥58) | 🛠 START: Self-taught Reasoner with Tools
[04:36] TOP3 (🔥57) | 🧠 Visual-RFT: Visual Reinforcement Fine-Tuning
[06:40] TOP4 (🔥52) | 🌍 Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
[09:03] TOP5 (🔥51) | 📊 Predictive Data Selection: The Data That Predicts Is the Data That Teaches
This episode covers the following 18 papers:
[00:21] 🛠 START: Self-taught Reasoner with Tools
[01:03] 👓 EgoLife: Towards Egocentric Life Assistant
[01:39] 📞 LLM as a Broken Telephone: Iterative Generation Distorts Information
[02:14] 🧠 LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
[02:51] 🔄 HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
[03:34] 🎥 Token-Efficient Long Video Understanding for Multimodal LLMs
[04:14] 🧠 FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
[04:58] 🎮 PokéChamp: an Expert-level Minimax Language Agent
[05:42] 🎧 Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
[06:21] 📊 IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
[07:02] 📊 Identifying Sensitive Weights via Post-quantization Integral
[07:46] 📏 L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
[08:22] 🎥 The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
[09:05] 🤖 Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
[09:48] 🚀 Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
[10:33] 🧠 Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
[11:13] 🤖 Combining Flow Matching and Transformers for Efficient Solution of Bayesian Inverse Problems
[11:54] 🚫 Understanding and Predicting Derailment in Toxic Conversations on GitHub
This episode covers the following 17 papers:
[00:24] 🌍 Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
[01:11] 🧠 ABC: Achieving Better Control of Multimodal Embeddings using VLMs
[01:47] 🩺 Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
[02:24] 🎥 GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
[03:02] 🧠 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
[03:43] 🧠 CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
[04:26] 📄 QE4PE: Word-level Quality Estimation for Human Post-Editing
[05:08] 🗣 Exploring Rewriting Approaches for Different Conversational Tasks
[05:43] 🧠 Process-based Self-Rewarding Language Models
[06:23] 🤖 Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective
[07:00] 🌐 Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases
[07:40] 🛠 Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
[08:22] 🤖 FLAME: A Federated Learning Benchmark for Robotic Manipulation
[09:01] 🛡 Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
[09:53] 🤖 CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs
[10:36] 🚗 Interact, Instruct to Improve: A LLM-Driven Parallel Actor-Reasoner Framework for Enhancing Autonomous Vehicle Interactions
[11:14] 🇨🇭 SwiLTra-Bench: The Swiss Legal Translation Benchmark
This episode covers the following 18 papers:
[00:21] 🚀 MPO: Boosting LLM Agents with Meta Plan Optimization
[00:59] 🤖 Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
[01:43] 🧩 LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
[02:26] 📚 Wikipedia in the Era of LLMs: Evolution and Risks
[03:06] 🚀 PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
[03:50] 🔄 Iterative Value Function Optimization for Guided Decoding
[04:33] 🤖 MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
[05:19] ⚡ FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
[05:58] 🧐 SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
[06:45] 🖼 RectifiedHR: Enable Efficient High-Resolution Image Generation via Energy Rectification
[07:18] 🌐 UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
[07:56] 🧠 ATLaS: Agent Tuning via Learning Critical Steps
[08:41] 🤖 Language Models can Self-Improve at State-Value Estimation for Better Search
[09:24] 🔧 IterPref: Focal Preference Learning for Code Generation via Iterative Debugging
[10:15] 🔬 SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models
[10:56] 🌐 Improve Representation for Imbalanced Regression through Geometric Constraints
[11:35] 🎯 Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
[12:16] 🤖 AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
This episode covers the following 20 papers:
[00:21] 🧠 Visual-RFT: Visual Reinforcement Fine-Tuning
[01:05] 🌐 Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
[01:43] 🧠 Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
[02:25] 🎥 OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
[03:04] 🤔 When an LLM is apprehensive about its answers -- and when its uncertainty is justified
[03:46] 🎵 DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
[04:28] 🐯 Liger: Linearizing Large Language Models to Gated Recurrent Structures
[05:05] 📊 Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions
[05:50] 🧠 Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
[06:28] ⚡ Speculative Ad-hoc Querying
[07:15] ⚡ DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
[07:52] 🎨 Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
[08:31] 🧠 Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia
[09:10] ⚡ From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
[09:47] 🔍 Large-Scale Data Selection for Instruction Tuning
[10:26] 🌐 SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
[11:01] 🤖 CodeArena: A Collective Evaluation Platform for LLM Code Generation
[11:47] 🎥 VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
[12:42] 🎙 PodAgent: A Comprehensive Framework for Podcast Generation
[13:18] 🏠 Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model