The 6 papers in this episode:
[00:26] 🧠 Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
[01:04] 🤖 ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
[01:43] 👕 TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
[02:24] 🎥 Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
[03:15] 🤖 Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
[03:49] 📄 LongKey: Keyphrase Extraction for Long Documents
【Follow us】 You can also find us on the following platform for more than what the podcast covers. Xiaohongshu: AI速递
The 21 papers in this episode:
[00:24] 🖼 ROICtrl: Boosting Instance Control for Visual Generation
[01:08] 🎥 CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
[01:55] 📚 Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment
[02:38] 🌐 MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
[03:21] 🤖 Large Language Model-Brained GUI Agents: A Survey
[03:57] 🎨 DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
[04:35] ⚡ Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
[05:14] 🎥 Identity-Preserving Text-to-Video Generation by Frequency Decomposition
[05:47] 🚗 DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
[06:31] 🔺 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
[07:10] 🎭 Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
[07:48] 🎛 Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
[08:26] 🦖 ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
[09:26] 🧍 UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
[10:06] 🧠 Optimizing Brain Tumor Segmentation with MedNeXt: BraTS 2024 SSA and Pediatrics
[10:43] ⏱ Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
[11:27] 🎙 VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
[12:03] 🌟 Adaptive Blind All-in-One Image Restoration
[12:39] 🛡 Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
[13:18] 🎥 Video-Guided Foley Sound Generation with Multimodal Controls
[13:48] 📚 Training and Evaluating Language Models with Template-based Data Generation
The 18 papers in this episode:
[00:28] 🖥 ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[01:08] 🎥 Pathways on the Image Manifold: Image Editing via Video Generation
[01:45] ⭐ Star Attention: Efficient LLM Inference over Long Sequences
[02:24] ⚡ Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
[03:01] 📊 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
[03:44] 🎨 TEXGen: a Generative Diffusion Model for Mesh Textures
[04:27] 🎨 SketchAgent: Language-Driven Sequential Sketch Generation
[05:11] 🔄 Learning 3D Representations from Procedural 3D Programs
[05:55] 🧠 VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
[06:50] 🔄 SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
[07:27] 🖼 FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
[08:09] 🎨 DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
[08:41] 📹 SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
[09:19] 📉 Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens
[10:05] 🧬 MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
[10:40] 👕 Controllable Human Image Generation with Personalized Multi-Garments
[11:12] 🤖 Visual Counter Turing Test (VCT^2): Discovering the Challenges for AI-Generated Image Detection and Introducing Visual AI Index (V_AI)
[11:55] 🎥 AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation
The 21 papers in this episode:
[00:26] 🌐 Material Anything: Generating Materials for Any 3D Object via Diffusion
[01:05] 🎨 Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
[01:48] 🤖 From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
[02:22] 🌐 Knowledge Transfer Across Modalities with Natural Language Supervision
[03:00] 🧠 MH-MoE: Multi-Head Mixture-of-Experts
[03:34] 🎥 DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
[04:13] 🌐 One Diffusion to Generate Them All
[04:54] 👁 VisualLens: Personalization through Visual History
[05:34] 🔍 Factorized Visual Tokenization and Generation
[06:15] 🔍 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
[07:00] 🩺 GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
[07:39] 🌐 SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
[08:25] 🔄 From CISC to RISC: language-model guided assembly transpilation
[09:03] ⚙ Cautious Optimizers: Improving Training with One Line of Code
[09:49] 🤖 The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
[10:30] 🔮 Predicting Emergent Capabilities by Finetuning
[11:04] 📊 SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation
[11:48] 🩺 Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
[12:25] 🤔 LLMs Do Not Think Step-by-step In Implicit Reasoning
[13:00] 🌐 Best of Both Worlds: Advantages of Hybrid Graph Sequence Models
[13:34] 🔗 Edge Weight Prediction For Category-Agnostic Pose Estimation
The 14 papers in this episode:
[00:26] 🎨 Style-Friendly SNR Sampler for Style-Driven Generation
[01:08] 🚀 TÜLU 3: Pushing Frontiers in Open Language Model Post-Training
[01:53] 🌐 OminiControl: Minimal and Universal Control for Diffusion Transformer
[02:31] 🛡 A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
[03:08] 🧠 Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
[03:49] 🎥 VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
[04:29] 🎮 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
[05:13] 🎥 Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
[05:56] 👴 MyTimeMachine: Personalized Facial Age Transformation
[06:34] 🎥 Novel View Extrapolation with Video Diffusion Priors
[07:10] 🎥 VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
[07:54] ☁ Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images
[08:31] 🤖 One to rule them all: natural language to bind communication, perception and action
[09:15] 🤖 WildLMa: Long Horizon Loco-Manipulation in the Wild
The 5 papers in this episode:
[00:41] TOP1(🔥93) | 🧠 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
[02:41] TOP2(🔥55) | 🌍 Generative World Explorer
[05:00] TOP3(🔥44) | 🧠 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
[07:11] TOP4(🔥41) | 📚 RedPajama: an Open Dataset for Training Large Language Models
[09:20] TOP5(🔥41) | ⚡ SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
The 14 papers in this episode:
[00:26] 🧠 Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
[01:12] 🌐 Multimodal Autoregressive Pre-training of Large Vision Encoders
[01:55] 🧠 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
[02:40] 🧠 Hymba: A Hybrid-head Architecture for Small Language Models
[03:22] 🚀 Ultra-Sparse Memory Network
[03:58] 📚 OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
[04:47] 🧠 Natural Language Reinforcement Learning
[05:26] 🧠 Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
[06:08] 🤖 Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
[06:46] 🌊 Stable Flow: Vital Layers for Training-Free Image Editing
[07:25] 🌐 UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
[08:03] 🚗 MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
[08:44] 🧠 Patience Is The Key to Large Language Model Reasoning
[09:18] 🌐 Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
The 8 papers in this episode:
[00:28] ⚡ SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
[01:10] 📹 VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
[01:51] 🎮 VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
[02:33] 🎯 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
[03:10] 🌐 Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
[03:52] 🔄 When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
[04:34] 🎨 Stylecodes: Encoding Stylistic Information For Image Generation
[05:11] 🩺 ORID: Organ-Regional Information Driven Framework for Radiology Report Generation
The 7 papers in this episode:
[00:33] ⚡ Continuous Speculative Decoding for Autoregressive Image Generation
[01:14] 📚 RedPajama: an Open Dataset for Training Large Language Models
[01:58] 🤖 Soft Robotic Dynamic In-Hand Pen Spinning
[02:39] 🚀 ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
[03:13] 🔒 Building Trust: Foundations of Security, Safety and Transparency in AI
[03:46] 🔍 SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
[04:24] 📊 Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
The 16 papers in this episode:
[00:25] 📱 BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
[01:06] 🌍 Generative World Explorer
[01:43] 🔍 Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
[02:24] 🎥 AnimateAnything: Consistent and Controllable Animation for Video Generation
[03:08] 🧠 Top-$nσ$: Not All Logits Are You Need
[03:55] 🧠 Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
[04:40] ⚡ SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
[05:19] 📚 Drowning in Documents: Consequences of Scaling Reranker Inference
[06:00] 🩺 Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
[06:37] 📱 SlimLM: An Efficient Small Language Model for On-Device Document Assistance
[07:19] 🎥 VeGaS: Video Gaussian Splatting
[07:50] 🔄 Adaptive Decoding via Latent Preference Optimization
[08:27] 🎥 StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
[09:11] 🇩🇪 LLäMmlein: Compact and Competitive German-Only Language Models from Scratch
[09:43] 👕 FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
[10:18] 📜 Evaluating the role of `Constitutions' for learning from AI feedback
The 6 papers in this episode:
[00:28] 🧠 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
[01:14] 🎨 Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
[01:51] 🌐 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
[02:25] 🌅 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
[03:00] 📖 Number it: Temporal Grounding Videos like Flipping Manga
[03:45] 🌍 Xmodel-1.5: An 1B-scale Multilingual LLM
The 5 papers in this episode:
[00:44] TOP1(🔥54) | 🖼 Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
[02:31] TOP2(🔥44) | 🤖 Large Language Models Can Self-Improve in Long-context Reasoning
[04:15] TOP3(🔥43) | 🌐 LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
[06:12] TOP4(🔥42) | 🎨 OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
[08:01] TOP5(🔥42) | 📚 M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework