2025.02.27 | Kanana improves Korean-English bilingual efficiency; GHOST 2.0 achieves high-fidelity head transfer.

This episode covers the following 18 papers:

[00:23] 🌐 Kanana: Compute-efficient Bilingual Language Models
[00:54] 👤 GHOST 2.0: generative high-fidelity one shot transfer of heads
[01:43] 🎥 TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
[02:21] 🤖 Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
[03:02] 🤖 Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
[03:47] 🌍 Language Models' Factuality Depends on the Language of Inquiry
[04:27] 🧠 Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
[05:11] 🤖 Towards an AI co-scientist
[05:52] 🇬🇷 Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
[06:38] 🤖 VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
[07:12] 📏 Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
[07:52] 📚 Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
[08:35] 🛡 AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
[09:23] 🧠 BIG-Bench Extra Hard
[10:07] 🔍 CritiQ: Mining Data Quality Criteria from Human Preferences
[10:44] 🔬 MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra
[11:28] 📄 PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
[12:08] 🧠 DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps

[Follow us] You can also find us on the following platforms for more beyond the podcast. Xiaohongshu (小红书): AI速递

13 min · 95 plays · 1 month ago

2025.02.26 | OmniAlign-V improves multimodal model alignment; SpargeAttn accelerates attention computation

This episode covers the following 14 papers:

[00:23] 🤖 OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
[01:06] ⚡ SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
[01:53] 🖼 KV-Edit: Training-Free Image Editing for Precise Background Preservation
[02:32] 🌈 ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
[03:08] 🤖 SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
[03:51] 📊 Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
[04:30] 🧠 Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
[05:11] 🔄 K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
[05:51] 🌐 WebGames: Challenging General-Purpose Web-Browsing AI Agents
[06:29] 🧠 Introducing Visual Perception Token into Multimodal Large Language Model
[07:07] 🎰 The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
[07:47] 🧠 AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
[08:26] 🔍 LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models
[09:07] 🧠 Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

10 min · 99+ plays · 1 month ago

2025.02.25 | Innovations in long-context optimization; efficient, general-purpose visual diffusion.

This episode covers the following 20 papers:

[00:27] 📖 Thus Spake Long-Context Large Language Model
[01:09] 🌈 DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
[01:48] 🚀 Slamming: Training a Speech Language Model on One GPU in a Day
[02:32] 🎥 VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
[03:11] 🎧 Audio-FLAN: A Preliminary Release
[03:43] 🧠 CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
[04:28] 🎨 GCC: Generative Color Constancy via Diffusing a Color Checker
[05:11] 📊 Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
[05:57] 🚀 Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
[06:38] 🧠 Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
[07:23] 🎥 RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
[08:01] 📱 Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
[08:45] ⏳ Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
[09:31] 🤖 Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
[10:02] 🔄 Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
[10:43] 📝 Can Community Notes Replace Professional Fact-Checkers?
[11:24] 📈 Forecasting Open-Weight AI Model Growth on Hugging Face
[12:08] 🔑 Beyond Release: Access Considerations for Generative AI Systems
[12:49] 🌐 TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning
[13:30] 💃 X-Dancer: Expressive Music to Human Dance Video Generation

14 min · 99+ plays · 1 month ago

2025.02.24 | Efficient automated academic surveys; the hidden role of punctuation

This episode covers the following 20 papers:

[00:23] 📚 SurveyX: Academic Survey Automation via Large Language Models
[01:10] 🔍 LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
[01:50] 🚗 MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
[02:28] 🧬 Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
[03:12] 🎨 PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
[03:55] 🔗 VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
[04:42] 📌 SIFT: Grounding LLM Reasoning in Contexts via Stickers
[05:27] 🧠 LightThinker: Thinking Step-by-Step Compression
[05:59] 🗂 StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
[06:48] 🛡 Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
[07:40] 📚 KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
[08:30] 🧬 ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation
[09:11] 🧠 MoBA: Mixture of Block Attention for Long-Context LLMs
[09:49] 🤖 InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
[10:37] 🧠 The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
[11:20] 📚 Evaluating Multimodal Generative AI with Korean Educational Standards
[11:54] ⚠ Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
[12:29] ⚡ One-step Diffusion Models with $f$-Divergence Distribution Matching
[13:09] 🧠 Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
[13:52] 🧠 MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

15 min · 99+ plays · 1 month ago

2025.02.21 | New frameworks for evaluating AI agents; LLM performance varies sharply across disciplines.

This episode covers the following 20 papers:

[00:26] 🧠 MLGym: A New Framework and Benchmark for Advancing AI Research Agents
[01:18] 📚 SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
[02:04] 🌐 SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[02:52] 🧠 How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
[03:49] 🚀 S*: Test Time Scaling for Code Generation
[04:35] ⏳ Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
[05:28] 📄 LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
[06:17] 🧠 Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
[07:13] 🖥 PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
[08:07] 🧠 S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
[09:01] 🧠 Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
[09:55] 🎥 Dynamic Concepts Personalization from Single Videos
[10:38] 🖼 Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
[11:23] 🌍 NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
[12:13] 🧠 AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
[13:06] 🌍 How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
[13:52] 🌍 Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
[14:55] 🌐 RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
[15:54] 🧠 Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
[16:41] 🤖 LLM-based User Profile Management for Recommender System

18 min · 99+ plays · 2 months ago

2024.12.18 Daily AI Papers | Reasoning ability still needs improvement; multimodal models need optimization.

This episode covers the following 8 papers:

[00:24] 🧠 Are Your LLMs Capable of Stable Reasoning?
[01:06] 📊 Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
[01:52] 📊 OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
[02:33] 🧠 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
[03:16] 🤖 Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
[04:00] 📊 VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
[04:39] 🤔 When to Speak, When to Abstain: Contrastive Decoding with Abstention
[05:18] 🎥 MIVE: New Design and Benchmark for Multi-Instance Video Editing

6 min · 43 plays · 4 months ago

2024.12.17 Daily AI Papers | Improving retrieval-augmented generation efficiency; optimizing evaluation of visual generation.

This episode covers the following 18 papers:

[00:23] 🧠 RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[01:05] ⚡ Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
[01:45] 🎨 BrushEdit: All-In-One Image Inpainting and Editing
[02:27] 🎨 ColorFlow: Retrieval-Augmented Image Sequence Colorization
[03:10] 🧩 Byte Latent Transformer: Patches Scale Better Than Tokens
[03:56] 🧠 Causal Diffusion Transformers for Generative Modeling
[04:33] 🤖 Smaller Language Models Are Better Instruction Evolvers
[05:16] 🌟 IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
[06:02] 🌳 SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
[06:47] 🌌 Wonderland: Navigating 3D Scenes from a Single Image
[07:32] 🔬 GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
[08:18] ⚡ SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
[09:06] 🧠 Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
[09:46] 👩 StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
[10:35] 🌐 MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
[11:19] 🎵 Whisper-GPT: A Hybrid Representation Audio Large Language Model
[12:10] 🤖 TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
[13:01] 🔒 Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning

14 min · 43 plays · 4 months ago

2024.12.16 Daily AI Papers | New breakthroughs in video understanding; AI explores 3D environments.

This episode covers the following 14 papers:

[00:23] 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[01:11] 🌍 GenEx: Generating an Explorable World
[01:50] 🌐 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
[02:37] 🩺 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
[03:21] 🤖 Large Action Models: From Inception to Implementation
[04:09] 🎥 InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
[04:56] 🌟 FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
[05:42] 🎯 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
[06:21] 🔥 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
[07:09] 🎵 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
[07:56] 🎨 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
[08:44] 📊 SCBench: A KV Cache-Centric Analysis of Long-Context Methods
[09:27] 🧠 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
[10:05] 🩺 Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

11 min · 68 plays · 4 months ago

2024.12.13 Daily AI Papers | Multimodal systems improve long-term interaction; phi-4 optimizes STEM question answering.

This episode covers the following 23 papers:

[00:23] 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[01:03] 🧠 Phi-4 Technical Report
[01:43] 🧠 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
[02:27] 🌐 Multimodal Latent Language Modeling with Next-Token Diffusion
[03:10] 🌐 EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
[03:57] 🌐 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
[04:43] 🌟 Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
[05:24] 📱 SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
[06:02] 🔬 PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
[06:49] 📊 Learned Compression for Compressed Learning
[07:32] 🎙 Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
[08:20] 📊 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
[09:08] 👀 Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
[10:02] 🧠 JuStRank: Benchmarking LLM Judges for System Ranking
[10:43] 🧠 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
[11:34] 📚 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
[12:16] 🔗 Word Sense Linking: Disambiguating Outside the Sandbox
[12:58] 🌐 FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
[13:42] 🎥 DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
[14:26] 🖼 LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
[15:21] 🧭 SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[16:05] 🌟 Arbitrary-steps Image Super-resolution via Diffusion Inversion
[16:46] 📚 Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

18 min · 64 plays · 4 months ago

2024.12.12 Daily AI Papers | Breakthrough in multi-view video generation; models for complex scenes improve

This episode covers the following 14 papers:

[00:23] 🎥 SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
[01:07] 🌐 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
[01:51] 🌐 POINTS1.5: Building a Vision-Language Model towards Real World Applications
[02:28] 🎨 Learning Flow Fields in Attention for Controllable Person Image Generation
[03:11] 🎥 StyleMaster: Stylize Your Video with Artistic Generation and Translation
[04:00] 🔍 Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
[04:46] 🎥 StreamChat: Chatting with Streaming Video
[05:28] 🧠 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
[06:12] 🏃 Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
[07:01] 🧠 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
[07:40] 🖼 FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
[08:17] 🎨 StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
[09:03] 🌍 MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
[09:50] 🚀 Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

11 min · 99+ plays · 4 months ago