The 9 papers in this episode:
[00:26] 🧠 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
[01:11] 🖼 DepthLab: From Partial to Complete
[01:54] 📊 Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
[02:35] 🎥 DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
[03:26] 🤔 In Case You Missed It: ARC 'Challenge' Is Not That Challenging
[04:02] 🧠 ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
[04:41] 🧩 PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
[05:20] 🧠 SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
[06:02] 🧠 Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
[Follow us] You can also find us on the following platform, with more information beyond the podcast content.
Xiaohongshu: AI速递
The 16 papers in this episode:
[00:24] 🔄 B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
[01:04] 🛡 RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response
[01:43] 🧠 Diving into Self-Evolving Training for Multimodal Reasoning
[02:29] ⚡ Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
[03:12] 🎥 Large Motion Video Autoencoding with Cross-modal Video VAE
[03:56] 🧠 Deliberation in Latent Space via Differentiable Cache Augmentation
[04:41] 📚 Revisiting In-Context Learning with Long Context Language Models
[05:25] 🧠 Outcome-Refining Process Supervision for Code Generation
[06:11] 🧠 DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
[06:48] 📚 LearnLM: Improving Gemini for Learning
[07:33] ⚠ Agent-SafetyBench: Evaluating the Safety of LLM Agents
[08:15] 🧠 OpenAI o1 System Card
[09:03] 🧠 NILE: Internal Consistency Alignment in Large Language Models
[09:45] 🤖 OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
[10:26] 🗣 Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
[10:59] 🌙 PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
The 10 papers in this episode:
[00:22] ⚡ Parallelized Autoregressive Visual Generation
[01:05] 🧠 Offline Reinforcement Learning for LLM Multi-Step Reasoning
[01:43] 🔑 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
[02:30] 🚀 CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
[03:14] 🎥 Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
[04:01] 🧠 MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
[04:37] 🌍 LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
[05:23] 🎥 Sequence Matters: Harnessing Video Models in 3D Super-Resolution
[06:21] 🇳🇱 Fietje: An open, efficient LLM for Dutch
[07:14] 👤 IDOL: Instant Photorealistic 3D Human Creation from a Single Image
The 5 papers in this episode:
[00:40] TOP1(🔥252) | 🤖 Qwen2.5 Technical Report
[02:31] TOP2(🔥127) | 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[04:30] TOP3(🔥86) | 🚀 Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
[06:59] TOP4(🔥82) | 🌍 GenEx: Generating an Explorable World
[08:58] TOP5(🔥79) | 🧠 Are Your LLMs Capable of Stable Reasoning?
The 14 papers in this episode:
[00:22] 🤖 Qwen2.5 Technical Report
[01:00] 🧠 Progressive Multimodal Reasoning via Active Retrieval
[01:39] 🌐 MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
[02:26] 🧠 LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
[03:15] 📊 How to Synthesize Text Data without Model Collapse?
[03:56] 🌊 Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
[04:37] 🎥 LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
[05:20] 🖼 Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
[06:05] 🌐 DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
[06:46] 🧠 AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
[07:33] 🧠 Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
[08:14] 🖼 UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
[08:54] 🧪 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
[09:36] 🕺 Move-in-2D: 2D-Conditioned Human Motion Generation
The 18 papers in this episode:
[00:24] 🤖 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
[01:06] 🎥 AniDoc: Animation Creation Made Easier
[01:44] 👗 FashionComposer: Compositional Fashion Image Generation
[02:28] 🤖 Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[03:05] 🌐 Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
[03:42] 🔄 Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
[04:26] 🤖 GUI Agents: A Survey
[05:12] 🌍 AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
[05:51] 📊 RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
[06:40] 🧠 LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
[07:30] 🤖 Learning from Massive Human Videos for Universal Humanoid Pose Control
[08:05] 🤖 ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
[08:49] 🎥 VidTok: A Versatile and Open-Source Video Tokenizer
[09:28] 🧠 Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:13] 🔄 CAD-Recode: Reverse Engineering CAD Code from Point Clouds
[10:54] 🤖 AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
[11:39] 🤖 Alignment faking in large language models
[12:19] ⚡ FastVLM: Efficient Vision Encoding for Vision Language Models
The 8 papers in this episode:
[00:24] 🧠 Are Your LLMs Capable of Stable Reasoning?
[01:06] 📊 Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
[01:52] 📊 OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
[02:33] 🧠 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
[03:16] 🤖 Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
[04:00] 📊 VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
[04:39] 🤔 When to Speak, When to Abstain: Contrastive Decoding with Abstention
[05:18] 🎥 MIVE: New Design and Benchmark for Multi-Instance Video Editing
The 18 papers in this episode:
[00:23] 🧠 RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[01:05] ⚡ Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
[01:45] 🎨 BrushEdit: All-In-One Image Inpainting and Editing
[02:27] 🎨 ColorFlow: Retrieval-Augmented Image Sequence Colorization
[03:10] 🧩 Byte Latent Transformer: Patches Scale Better Than Tokens
[03:56] 🧠 Causal Diffusion Transformers for Generative Modeling
[04:33] 🤖 Smaller Language Models Are Better Instruction Evolvers
[05:16] 🌟 IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
[06:02] 🌳 SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
[06:47] 🌌 Wonderland: Navigating 3D Scenes from a Single Image
[07:32] 🔬 GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
[08:18] ⚡ SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
[09:06] 🧠 Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
[09:46] 👩 StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
[10:35] 🌐 MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
[11:19] 🎵 Whisper-GPT: A Hybrid Representation Audio Large Language Model
[12:10] 🤖 TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
[13:01] 🔒 Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning
The 14 papers in this episode:
[00:23] 🎥 Apollo: An Exploration of Video Understanding in Large Multimodal Models
[01:11] 🌍 GenEx: Generating an Explorable World
[01:50] 🌐 SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
[02:37] 🩺 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
[03:21] 🤖 Large Action Models: From Inception to Implementation
[04:09] 🎥 InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
[04:56] 🌟 FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
[05:42] 🎯 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
[06:21] 🔥 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
[07:09] 🎵 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
[07:56] 🎨 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
[08:44] 📊 SCBench: A KV Cache-Centric Analysis of Long-Context Methods
[09:27] 🧠 SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs
[10:05] 🩺 Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images
The 5 papers in this episode:
[00:43] TOP1(🔥95) | 🌐 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
[03:01] TOP2(🔥65) | 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[05:09] TOP3(🔥64) | 🧠 Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
[07:29] TOP4(🔥61) | 🎥 STIV: Scalable Text and Image Conditioned Video Generation
[09:46] TOP5(🔥53) | 🧮 ProcessBench: Identifying Process Errors in Mathematical Reasoning
The 23 papers in this episode:
[00:23] 🎥 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
[01:03] 🧠 Phi-4 Technical Report
[01:43] 🧠 Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
[02:27] 🌐 Multimodal Latent Language Modeling with Next-Token Diffusion
[03:10] 🌐 EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
[03:57] 🌐 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
[04:43] 🌟 Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
[05:24] 📱 SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
[06:02] 🔬 PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations
[06:49] 📊 Learned Compression for Compressed Learning
[07:32] 🎙 Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
[08:20] 📊 RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
[09:08] 👀 Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
[10:02] 🧠 JuStRank: Benchmarking LLM Judges for System Ranking
[10:43] 🧠 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
[11:34] 📚 The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
[12:16] 🔗 Word Sense Linking: Disambiguating Outside the Sandbox
[12:58] 🌐 FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
[13:42] 🎥 DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
[14:26] 🖼 LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
[15:21] 🧭 SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
[16:05] 🌟 Arbitrary-steps Image Super-resolution via Diffusion Inversion
[16:46] 📚 Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
The 14 papers in this episode:
[00:23] 🎥 SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
[01:07] 🌐 LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
[01:51] 🌐 POINTS1.5: Building a Vision-Language Model towards Real World Applications
[02:28] 🎨 Learning Flow Fields in Attention for Controllable Person Image Generation
[03:11] 🎥 StyleMaster: Stylize Your Video with Artistic Generation and Translation
[04:00] 🔍 Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
[04:46] 🎥 StreamChat: Chatting with Streaming Video
[05:28] 🧠 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
[06:12] 🏃 Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
[07:01] 🧠 KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models
[07:40] 🖼 FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
[08:17] 🎨 StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
[09:03] 🌍 MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
[09:50] 🚀 Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel