Agent Program: Concepts & Practices
2025-03-24
This seminar covers the key concepts and practices of Agent Programs, which leverage Large Language Models (LLMs) as a core component. We will contrast applications with static workflows (e.g., RAG) against agentic programs that make dynamic decisions at runtime (e.g., ReAct). The discussion will feature the DSPy development framework, the MIPRO technique for automatic prompt optimization, and key evaluation benchmarks such as WebVoyager. Finally, we will explore system-level optimizations (e.g., Parrot, Autellix) designed to improve performance and scheduling for both static and dynamic agentic systems.
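To make the static-vs-dynamic contrast concrete, here is a minimal sketch: the static pipeline's call graph is fixed at write time, while the ReAct-style loop lets the model choose the next tool at runtime. The `call_llm` stub, `parse_action` helper, and tool registry are illustrative assumptions, not DSPy or ReAct reference APIs.

```python
# Minimal sketch; call_llm, parse_action, and the tools dict are hypothetical.

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM client."""
    return "Final Answer: 42"

def static_pipeline(question: str, retrieve) -> str:
    """Static workflow (RAG-like): always retrieve once, then generate."""
    context = retrieve(question)
    return call_llm(f"Context: {context}\nQuestion: {question}")

def parse_action(step: str) -> tuple[str, str]:
    """Parse a model step of the form 'Action: tool_name[argument]'."""
    head = step.split("Action:", 1)[1].strip()
    name, arg = head.split("[", 1)
    return name.strip(), arg.rstrip("]")

def react_agent(question: str, tools: dict, max_steps: int = 5) -> str:
    """Dynamic agent (ReAct-like): the model picks the next tool at runtime."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        name, arg = parse_action(step)  # model-chosen tool call
        transcript += f"{step}\nObservation: {tools[name](arg)}\n"
    return "No answer within step budget."
```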
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
2025-03-10
Processing long contexts has become a critical capability for modern large language models (LLMs). However, serving long-context LLMs comes with significant inference costs due to the high memory overhead of the key-value (KV) cache. Existing work leverages dynamic sparse attention algorithms (DSAes) to mitigate the KV cache overhead, but these algorithms rely on top-k KV cache selection, which results in a trade-off between accuracy and efficiency. A larger k improves accuracy but decreases efficiency, while a smaller k boosts efficiency but compromises accuracy. To overcome this trade-off, this paper presents PSA, a Progressive Sparse Attention mechanism that integrates algorithmic innovations with system co-design to achieve both high inference accuracy and improved efficiency in LLM serving. The PSA algorithm adaptively adjusts the KV cache budget of different tokens and layers according to their real attention weight distributions, rather than relying on a fixed budget k. This enables high accuracy while minimizing KV cache usage. To further enhance execution efficiency, we introduce a pipelined iteration scheme that reduces CPU-GPU interleaving and synchronization overhead during PSA computation. Additionally, we implement unified GPU memory management that optimizes PSA's memory utilization by accounting for uneven memory requirements across different model layers. Extensive experimental results demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4× and 8.8×, and increases end-to-end serving throughput by up to 1.4× and 2.0×, compared to state-of-the-art DSAes and systems without sparse attention, respectively.
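The budget-adaptation idea can be illustrated with a small sketch (an analogue for intuition, not PSA's actual kernel): instead of a fixed top-k, keep the smallest set of cached tokens whose softmax attention mass reaches a target threshold, so peaked distributions get small budgets and flat ones get large budgets. The function name and the 95% threshold are assumptions for illustration.

```python
# Illustrative sketch, not PSA's kernel: select a per-query KV budget by
# cumulative attention mass instead of a fixed top-k.
import numpy as np

def adaptive_kv_select(scores: np.ndarray, mass: float = 0.95) -> np.ndarray:
    """scores: raw attention logits for one query over the cached tokens.
    Returns the indices of the selected KV entries."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax over the cache
    order = np.argsort(-probs)                   # most-attended tokens first
    cum = np.cumsum(probs[order])
    budget = int(np.searchsorted(cum, mass)) + 1  # smallest prefix >= mass
    return order[:budget]

# A peaked distribution selects one token; a flat one selects all four,
# mirroring how PSA varies the budget across tokens and layers.
print(adaptive_kv_select(np.array([8.0, 1.0, 0.5, 0.2])))  # -> [0]
print(adaptive_kv_select(np.array([1.0, 1.0, 1.0, 1.0])))  # -> all indices
```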
PALF: Replicated Write-Ahead Logging for Distributed Databases
2025-02-19
In recent years, distributed database systems have been widely studied and developed for their scalability, high availability, and consistency guarantees. The write-ahead logging (WAL) system is one of the most critical components of a database. Designing a replicated logging system to serve as the foundation of a distributed database that supports ACID transactions remains a challenging problem. We present PALF, a Paxos-backed Append-only Log File system, to address these challenges. The core idea of PALF is to co-design the log system with the database as a whole, supporting database-specific features and abstracting them as PALF primitives that can also serve other distributed systems. Many database features, including transaction processing, database recovery, and physical standby databases, are built on top of PALF primitives. Experiments show that PALF significantly outperforms well-known consensus protocol implementations and is fully capable of handling distributed database workloads. PALF has been deployed as a component of the OceanBase 4.0 database system and is open-sourced along with it.
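As a rough illustration of the "primitives" idea, the sketch below models an append-only log whose committed prefix drives recovery and standby replay. All names are hypothetical assumptions; PALF's real interface and its Paxos replication path are not shown.

```python
# Hedged sketch of an append-only log abstraction; not PALF's actual API.
from dataclasses import dataclass, field

@dataclass
class AppendOnlyLog:
    entries: list = field(default_factory=list)  # replicated via Paxos in PALF
    committed: int = -1  # highest quorum-acked LSN; -1 = nothing committed

    def append(self, record: bytes) -> int:
        """Append a record and return its log sequence number (LSN)."""
        self.entries.append(record)
        return len(self.entries) - 1

    def mark_committed(self, lsn: int) -> None:
        """Called once a quorum of replicas has persisted up to `lsn`."""
        self.committed = max(self.committed, lsn)

    def replay_from(self, lsn: int):
        """Iterate committed records, e.g. for recovery or a physical standby."""
        yield from self.entries[lsn : self.committed + 1]
```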