Host
Episode description
Source: Xiaoyuzhou (小宇宙)
【Contents】
The 15 papers in this episode are:
[00:24] 🌍 World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
[01:29] 🏢 From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
[02:26] 🧠 ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
[03:23] 🛡 Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
[04:12] 🖼 Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
[05:02] 🤖 ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
[06:20] ✍ SketchVLM: Vision language models can annotate images to explain thoughts and guide users
[07:17] 🔬 Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
[08:24] ⚖ Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
[09:20] 🔀 Efficient Agent Evaluation via Diversity-Guided User Simulation
[10:02] ⚡ For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs
[11:04] 🎬 OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer
[12:03] 📷 UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
[12:49] 📄 TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
[13:56] 🔄 How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
【Follow us】
You can also find us on the following platforms for more information beyond the podcast:
Xiaohongshu (小红书): AI速递