This episode covers the following 14 papers:

[00:23] 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[01:11] 🌍 GenEx: Generating an Explorable World
[01:50] 🌐 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
[02:37] 🩺 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
[03:21] 🤖 Large Action Models: From Inception to Implementation
[04:09] 🎥 InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
[04:56] 🌟 FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
[05:42] 🎯 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
[06:21] 🔥 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
[07:09] 🎵 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
[07:56] 🎨 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
[08:44] 📊 SCBench: A KV Cache-Centric Analysis of Long-Context Methods
[09:27] 🧠 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
[10:05] 🩺 Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

[Follow us]
You can also find us on the following platforms for more information beyond the podcast:
Xiaohongshu: AI速递
This episode covers the following 5 papers:

[00:43] TOP1 (🔥95) | 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[03:01] TOP2 (🔥65) | 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[05:09] TOP3 (🔥64) | 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[07:29] TOP4 (🔥61) | 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[09:46] TOP5 (🔥53) | 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
This episode covers the following 23 papers:

[00:23] 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[01:03] 🧠 Phi-4 Technical Report
[01:43] 🧠 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
[02:27] 🌐 Multimodal Latent Language Modeling with Next-Token Diffusion
[03:10] 🌐 EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
[03:57] 🌐 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
[04:43] 🌟 Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
[05:24] 📱 SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
[06:02] 🔬 PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
[06:49] 📊 Learned Compression for Compressed Learning
[07:32] 🎙 Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
[08:20] 📊 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
[09:08] 👀 Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
[10:02] 🧠 JuStRank: Benchmarking LLM Judges for System Ranking
[10:43] 🧠 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
[11:34] 📚 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
[12:16] 🔗 Word Sense Linking: Disambiguating Outside the Sandbox
[12:58] 🌐 FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
[13:42] 🎥 DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
[14:26] 🖼 LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
[15:21] 🧭 SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[16:05] 🌟 Arbitrary-steps Image Super-resolution via Diffusion Inversion
[16:46] 📚 Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
This episode covers the following 14 papers:

[00:23] 🎥 SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
[01:07] 🌐 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
[01:51] 🌐 POINTS1.5: Building a Vision-Language Model towards Real World Applications
[02:28] 🎨 Learning Flow Fields in Attention for Controllable Person Image Generation
[03:11] 🎥 StyleMaster: Stylize Your Video with Artistic Generation and Translation
[04:00] 🔍 Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
[04:46] 🎥 StreamChat: Chatting with Streaming Video
[05:28] 🧠 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
[06:12] 🏃 Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
[07:01] 🧠 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
[07:40] 🖼 FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
[08:17] 🎨 StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
[09:03] 🌍 MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
[09:50] 🚀 Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
This episode covers the following 23 papers:

[00:25] 🧑 Evaluating and Aligning CodeLLMs on Human Preference
[01:19] 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[01:59] 🎨 DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
[02:39] 🔒 Hidden in the Noise: Two-Stage Robust Watermarking for Images
[03:19] 🎥 UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
[04:04] 📄 OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
[04:50] 🎨 FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
[05:32] 🎥 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
[06:09] 🧠 Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
[06:55] 🧠 Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
[07:41] 🎥 Video Motion Transfer with Diffusion Transformers
[08:23] 🚀 EMOv2: Pushing 5M Vision Model Frontier
[09:02] 🛡 Granite Guardian
[09:44] 🌟 ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
[10:30] 🎥 ObjCtrl-2.5D: Training-free Object Control with Camera Poses
[11:21] 🚀 LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
[12:12] 📱 MoViE: Mobile Diffusion for Video Editing
[12:46] 🧬 Chimera: Improving Generalist Model with Domain-Specific Experts
[13:28] 🌐 Fully Open Source Moxin-7B Technical Report
[14:09] 📱 Mobile Video Diffusion
[14:45] 🤖 Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation
[15:24] 🤖 Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
[16:15] 🔒 A New Federated Learning Framework Against Gradient Inversion Attacks
This episode covers the following 9 papers:

[00:23] 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
[01:13] 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[01:58] 🧠 Training Large Language Models to Reason in a Continuous Latent Space
[02:38] 🌐 Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
[03:22] 🎥 Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
[04:09] 🎥 You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
[04:53] 🌍 Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space
[05:31] 🌐 Robust Multi-bit Text Watermark with LLM-based Paraphrasers
[06:15] 🤖 CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
This episode covers the following 11 papers:

[00:27] 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[00:58] 🎥 LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
[01:41] 🧠 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
[02:24] 🤖 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
[03:26] 🤖 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
[04:10] 🚀 APOLLO: SGD-like Memory, AdamW-level Performance
[04:49] ⚡ SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
[05:26] 🎥 GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
[06:07] ⏱ Mind the Time: Temporally-Controlled Multi-Event Video Generation
[06:42] 🏠 2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constraints for High-Fidelity Indoor Scene Reconstruction
[07:20] 🗣 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
This episode covers the following 5 papers:

[00:40] TOP1 (🔥102) | 🚀 SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
[02:39] TOP2 (🔥100) | 🔄 PaliGemma 2: A Family of Versatile VLMs for Transfer
[04:40] TOP3 (🔥64) | 🔍 VisionZip: Longer is Better but Not Necessary in Vision Language Models
[06:14] TOP4 (🔥60) | 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
[08:19] TOP5 (🔥54) | 🎥 VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
This episode covers the following 23 papers:

[00:23] 🔍 VisionZip: Longer is Better but Not Necessary in Vision Language Models
[01:03] 🤖 Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
[01:43] 🖥 Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
[02:27] 🔊 A Noise is Worth Diffusion Guidance
[03:04] 📊 Evaluating Language Models as Synthetic Data Generators
[03:48] 🌐 Structured 3D Latents for Scalable and Versatile 3D Generation
[04:26] 🌐 MV-Adapter: Multi-view Consistent Image Generation Made Easy
[05:05] 🖼 Negative Token Merging: Image-based Adversarial Feature Guidance
[05:41] 🌐 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
[06:18] 📈 Densing Law of LLMs
[06:59] 🌌 Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[07:37] ⚽ Towards Universal Soccer Video Understanding
[08:15] 🎨 HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
[08:53] 👗 AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
[09:35] 🌍 Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
[10:11] 🌐 Personalized Multimodal Large Language Models: A Survey
[10:55] ⚡ ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
[11:36] 🧠 MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities
[12:14] 🧠 Discriminative Fine-tuning of LVLMs
[12:48] 🧠 Monet: Mixture of Monosemantic Experts for Transformers
[13:24] 🌊 OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
[13:59] 🧠 KV Shifting Attention Enhances Language Modeling
[14:40] 🌍 Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
This episode covers the following 15 papers:

[00:24] 🚀 SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
[01:06] 🎥 Imagine360: Immersive 360 Video Generation from Perspective Anchor
[01:40] 🚗 Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
[02:13] 🔄 PaliGemma 2: A Family of Versatile VLMs for Transfer
[02:52] 🌊 TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
[03:31] 🌐 VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
[04:05] 🌐 NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
[04:49] 🎥 Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
[05:34] 🔍 CleanDIFT: Diffusion Features without Noise
[06:11] 🎨 MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
[06:53] 🎥 One Shot, One Talk: Whole-body Talking Avatar from a Single Image
[07:33] 📹 Mimir: Improving Video Diffusion Models for Precise Text Understanding
[08:07] 🎨 NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
[08:47] 🧩 Weighted-Reward Preference Optimization for Implicit Model Fusion
[09:37] 🔍 Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
This episode covers the following 15 papers:

[00:24] 🎥 VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
[01:04] 🧠 Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
[01:45] 🔄 Free Process Rewards without Process Labels
[02:30] 🎧 AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
[03:04] 🤖 MALT: Improving Reasoning with Multi-Agent LLM Training
[03:45] 🎥 OmniCreator: Self-Supervised Unified Generation with Universal Editing
[04:23] 🌴 Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
[05:08] 📚 OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
[05:51] 📊 Scaling Image Tokenizers with Grouped Spherical Quantization
[06:27] 🌐 LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
[07:09] ⚙ A dynamic parallel method for performance optimization on hybrid CPUs
[08:00] 🌐 MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
[08:46] 🎥 Motion Prompting: Controlling Video Generation with Motion Trajectories
[09:27] 🎥 VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
[10:01] 🤖 Generating a Low-code Complete Workflow via Task Decomposition and RAG
This episode covers the following 24 papers:

[00:23] 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
[00:58] 📊 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[01:32] 🖼 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
[02:09] 🎥 Open-Sora Plan: Open-Source Large Video Generation Model
[02:55] 🎥 TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
[03:37] 🤖 o1-Coder: an o1 Replication for Coding
[04:12] 🤖 SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
[04:49] 🎥 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
[05:38] 🔍 TinyFusion: Diffusion Transformers Learned Shallow
[06:19] 🔍 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
[06:52] 🎙 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
[07:32] 🚀 Efficient Track Anything
[08:15] 🌊 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
[08:50] 🎥 Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
[09:33] 📹 WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
[10:11] 🔍 VLSBench: Unveiling Visual Leakage in Multimodal Safety
[10:51] 🧠 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
[11:41] 🎮 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
[12:14] 🗣 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
[12:51] 🌍 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
[13:28] 🎨 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge
[14:02] 📈 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
[14:41] 🌐 World-consistent Video Diffusion with Explicit 3D Modeling
[15:22] 🔊 Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning