This episode covers the following 21 papers:

[00:26] 🚀 Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
[01:09] 🔄 LOGO -- Long cOntext aliGnment via efficient preference Optimization
[01:45] 🧠 Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
[02:30] 🤔 Can Knowledge Editing Really Correct Hallucinations?
[03:17] 🎮 Unbounded: A Generative Infinite Game of Character Life Simulation
[04:02] 🎥 Framer: Interactive Frame Interpolation
[04:48] 📊 Distill Visual Chart Reasoning Ability from LLMs to MLLMs
[05:35] 📉 Why Does the Effective Context Length of LLMs Fall Short?
[06:14] 🔒 Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
[06:52] 🔧 Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
[07:27] 🌍 CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
[08:09] 📊 Should We Really Edit Language Models? On the Evaluation of Edited Language Models
[08:43] 🌐 ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
[09:20] 🌐 WAFFLE: Multi-Modal Model for Automated Front-End Development
[09:52] 📚 CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models
[10:30] 🔄 Stable Consistency Tuning: Understanding and Improving Consistency Models
[11:10] 🧮 Language Models are Symbolic Learners in Arithmetic
[12:00] 🐍 Taipan: Efficient and Expressive State Space Language Models with Selective Attention
[12:44] 🔄 Value Residual Learning For Alleviating Attention Concentration In Transformers
[13:23] 📚 Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits
[14:03] 🤖 Data Scaling Laws in Imitation Learning for Robotic Manipulation

This episode covers the following 10 papers:

[00:25] 🖼 MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
[01:09] 🌍 WorldSimBench: Towards Video Generation Models as World Simulators
[01:47] 🌊 Scaling Diffusion Language Models via Adaptation from Autoregressive Models
[02:20] 📱 Lightweight Neural App Control
[03:01] 🏠 ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
[03:47] 🖼 Scalable Ranked Preference Optimization for Text-to-Image Generation
[04:23] 🌆 DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes
[05:05] 🩺 MedINST: Meta Dataset of Biomedical Instructions
[05:52] 🌍 M-RewardBench: Evaluating Reward Models in Multilingual Settings
[06:27] 📊 TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

This episode covers the following 8 papers:

[00:27] 🔍 PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
[01:09] 🌟 SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
[01:48] 🤖 Aligning Large Language Models via Self-Steering Optimization
[02:30] 🇯🇵 JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
[03:11] 🧬 EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search
[03:53] 🧠 MiniPLM: Knowledge Distillation for Pre-Training Language Models
[04:30] 🔍 Mitigating Object Hallucination via Concentric Causal Attention
[05:19] 🧠 Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes

This episode covers the following 21 papers:

[00:24] 🤖 CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
[01:11] 🌲 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
[01:55] 🌐 PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
[02:37] 🤖 AutoTrain: No-code training for state-of-the-art models
[03:10] ⚡ FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors
[03:56] 📊 Baichuan Alignment Technical Report
[04:39] 🌍 Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
[05:21] 🔍 RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
[06:05] 📚 Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
[06:41] 🔍 Pre-training Distillation for Large Language Models: A Design Space Exploration
[07:16] 🔬 Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
[07:55] 🔄 SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
[08:31] 📚 Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
[09:11] 🤖 Zero-shot Model-based Reinforcement Learning using Large Language Models
[09:53] 🗣 Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
[10:28] 🧠 CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
[11:12] 🛠 Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
[11:58] 🧠 Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
[12:45] 🌍 Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
[13:25] 🗣 DM-Codec: Distilling Multimodal Representations for Speech Tokenization
[14:17] 🧠 In-context learning and Occam's razor

This episode covers the following 12 papers:

[00:27] 🌐 Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
[01:11] 👗 MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models
[01:48] 💼 UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
[02:37] 🧠 NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
[03:12] 🧠 SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
[03:54] 📊 Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts
[04:25] 🌐 Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion
[05:08] 🎥 DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
[05:50] 🔄 A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
[06:31] 🧬 DPLM-2: A Multimodal Diffusion Protein Language Model
[07:12] 📰 Context is Key (NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
[07:56] 🧠 How Do Training Methods Influence the Utilization of Vision Models?

This episode covers the following 5 papers:

[00:45] TOP1(🔥80) | 🌐 Baichuan-Omni Technical Report
[02:20] TOP2(🔥58) | 📊 MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
[04:20] TOP3(🔥58) | 🎥 Movie Gen: A Cast of Media Foundation Models
[06:27] TOP4(🔥53) | 🤖 LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
[08:23] TOP5(🔥48) | 🌐 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

This episode covers the following 31 papers:

[00:23] 📊 MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
[01:02] 🎥 Movie Gen: A Cast of Media Foundation Models
[01:35] 📱 MobA: A Two-Level Agent System for Efficient Mobile Task Automation
[02:18] 🌐 Harnessing Webpage UIs for Text-Rich Visual Understanding
[02:59] 🔄 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
[03:29] 🩺 MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
[04:04] 📊 A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models
[04:46] 🔄 PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
[05:23] 🔍 BenTo: Benchmark Task Reduction with In-Context Transferability
[06:03] 🎥 DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
[06:49] 🧠 MoH: Multi-Head Attention as Mixture-of-Head Attention
[07:28] 🎥 VidPanos: Generative Panoramic Videos from Casual Panning Videos
[08:03] 📉 FlatQuant: Flatness Matters for LLM Quantization
[08:44] 🔄 Retrospective Learning from Interactions
[09:22] 🔄 Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
[10:06] 🖼 Can MLLMs Understand the Deep Implication Behind Chinese Images?
[10:43] 📱 MedMobile: A mobile-sized language model with expert-level clinical capabilities
[11:22] 🌍 WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
[12:04] 🤖 Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
[12:48] 🔄 LoLDU: Low-Rank Adaptation via Lower-Diag-Upper Decomposition for Parameter-Efficient Fine-Tuning
[13:29] 🔒 AERO: Softmax-Only LLMs for Efficient Private Inference
[14:12] 🌐 γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
[14:45] 🌐 Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
[15:24] 🎶 MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
[16:05] 🔒 Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
[16:48] 📚 SBI-RAG: Enhancing Math Word Problem Solving for Students through Schema-Based Instruction and Retrieval-Augmented Generation
[17:27] 🗺 Roadmap towards Superhuman Speech Understanding using Large Language Models
[18:05] 🔄 Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
[18:47] 🤖 TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
[19:25] 🔬 Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models
[20:05] 📚 Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

This episode covers the following 19 papers:

[00:28] 🧠 HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
[01:15] 🎥 VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
[01:50] 🧠 The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
[02:31] 🤖 Revealing the Barriers of Language Agents in Planning
[03:15] 📄 DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
[03:56] ⚙ Large Language Model Evaluation via Matrix Nuclear-Norm
[04:38] 🧬 Exploring Model Kinship for Merging Large Language Models
[05:15] 📊 ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
[05:50] ⚡ ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression
[06:31] 📄 Improving Long-Text Alignment for Text-to-Image Diffusion Models
[07:11] 🔄 Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
[07:55] 🛡 Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
[08:34] 🔍 Tracking Universal Features Through Fine-Tuning and Model Merging
[09:08] 🔄 Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
[09:46] 🧠 Neural Metamorphosis
[10:25] 🌍 WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
[11:09] 🌐 OMCAT: Omni Context Aware Transformer
[11:44] ⏳ ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
[12:22] 📚 DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities

This episode covers the following 14 papers:

[00:26] 🤖 MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
[01:07] 🛠 MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
[01:47] 📚 LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
[02:25] 🛡 SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
[03:01] 📹 LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
[03:44] 🧠 What Matters in Transformers? Not All Attention is Needed
[04:18] 🌟 GS^3: Efficient Relighting with Triple Gaussian Splatting
[04:51] 🤯 Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
[05:31] 🌍 Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
[06:08] 🚀 SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
[06:43] 📊 Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
[07:14] 🤖 Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation
[07:58] 🔄 Empirical Study of Mutual Reinforcement Effect and Application in Few-shot Text Classification Tasks via Prompt
[08:37] 🌍 Towards Natural Image Matting in the Wild via Real-Scenario Prior

This episode covers the following 15 papers:

[00:24] 🌐 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
[01:06] 🤖 LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
[02:01] 🔍 Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
[02:36] 📊 MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
[03:12] 🎥 Animate-X: Universal Character Image Animation with Enhanced Motion Representation
[04:02] 📚 Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
[04:44] 📚 LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
[05:29] 🎥 Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
[06:09] ⏳ TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
[06:58] 🌊 Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations
[07:40] 📊 Rethinking Data Selection at Scale: Random Selection is Almost All You Need
[08:26] 🌲 Tree of Problems: Improving structured problem solving with compositionality
[09:13] 📺 TVBench: Redesigning Video-Language Evaluation
[09:54] 🤖 Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies
[10:29] 📚 LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

This episode covers the following 16 papers:

[00:25] 🌐 Baichuan-Omni Technical Report
[00:59] 🖼 Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
[01:41] 🔧 From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
[02:17] 🎨 EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
[02:53] 🧠 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
[03:34] 📏 PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness
[04:11] 🌐 Semantic Score Distillation Sampling for Compositional Text-to-3D Generation
[04:47] 🧠 SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights
[05:29] 🔄 Mechanistic Permutability: Match Features Across Layers
[06:07] 🤖 Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
[06:45] ⚡ KV Prediction for Improved Time to First Token
[07:30] 🌐 ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
[08:13] 🚨 MiRAGeNews: Multimodal Realistic AI-Generated News Detection
[08:52] 🤖 DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
[09:30] 📈 I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
[10:12] 🧠 Mentor-KD: Making Small Language Models Better Multi-step Reasoners

This episode covers the following 5 papers:

[00:37] TOP1(🔥128) | 🔍 Differential Transformer
[02:38] TOP2(🔥125) | ⚡ Addition is All You Need for Energy-efficient Language Models
[04:13] TOP3(🔥84) | 🌐 Aria: An Open Multimodal Native Mixture-of-Experts Model
[06:18] TOP4(🔥73) | 🤖 GLEE: A Unified Framework and Benchmark for Language-based Economic Environments
[08:25] TOP5(🔥63) | 👤 Personalized Visual Instruction Tuning

【Follow us】
You can also find us on the following platforms for more information beyond the podcast:
小红书: AI速递