The Database Research Group was founded in 2017 and is affiliated with the Department of Computer Science and Engineering, Southern University of Science and Technology. We conduct in-depth, insightful research on data science and engineering, covering the following aspects:
(1) System: Architect novel data-intensive systems from 0 to 1 (led by Dr. Bo Tang); (2) Algorithm: Support and accelerate advanced data analytics (led by Dr. Bo Tang); (3) VIS: Enable intuitive visual understanding of data and systems (led by Dr. Qiaomu Shen); (4) AI: Improve data analytics and system performance with deep learning (led by Dr. Dan Zeng).
For more details about our research, please refer to our publications. We always welcome brilliant people to join DBGroup.
2025-04
Our work "ParaGraph: Accelerating Cross-Modal ANNS Indexing via GPU-CPU Parallelism" got accepted to DaMoN 2025. Congratulations to Yuxiang and Bo, and special thanks to AlayaDB Inc. for their support!
2025-03
Our work "OptMatch: An Efficient and Generic Neural Network-assisted Subgraph Matching Approach" got accepted to ICDE 2025. Congratulations to Wenzhe and Bo!
2025-03
We open-sourced the lightweight vector database "AlayaLite", available at https://github.com/AlayaDB-AI/AlayaLite. Congratulations to the core development team!
2025-03
Our work "VQLens: A Demonstration of Vector Query Execution Analysis" got accepted to SIGMOD 2025 demo track. Congratulations to Yansha and Bo.
2025-02
Our work "AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference" got accepted to SIGMOD 2025 Industry track. Congratulations to Yangshen and Bo!
Date
Title
Speaker
2025-03-24
Agent Program: Concepts & Practices
Renjie Liu, CS master student at DBGroup in Southern University of Science and Technology (SUSTech)
2025-03-10
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
Peiqi Yin, Ph.D. candidate at The Chinese University of Hong Kong (CUHK)
2025-02-19
PALF: Replicated Write-Ahead Logging for Distributed Databases
Quanqing Xu, Ph.D. from the Department of Computer Science at Peking University, full senior engineer, and researcher at the Database Lab, Ant Group Research Institute
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025
AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB AI. Specifically, it decouples the KV cache and attention computation from the LLM inference systems, and encapsulates them into a novel vector database system. For Model-as-a-Service (MaaS) providers, AlayaDB consumes fewer hardware resources and offers higher generation quality for various workloads with different kinds of Service Level Objectives (SLOs), compared with the existing alternative solutions (e.g., KV cache disaggregation, retrieval-based sparse attention). The crux of AlayaDB is that it abstracts the attention computation and cache management for LLM inference into a query processing procedure, and optimizes the performance via a native query optimizer. In this work, we demonstrate the effectiveness of AlayaDB via (i) three use cases from our industry partners, and (ii) extensive experimental results on LLM inference benchmarks.
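One way to picture the "attention as a query" abstraction is a top-k vector query over the cached keys: instead of attending to the full long context, retrieve only the keys most similar to the current query vector and compute attention over that subset. The following minimal NumPy sketch illustrates this general idea under that assumption; all names are hypothetical and this is not AlayaDB's actual API or algorithm.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend over only the k cached keys most similar to q.

    q: query vector of shape (d,); K, V: cached keys/values of shape (n, d).
    The top-k selection is exactly a nearest-neighbor (vector) query.
    """
    scores = K @ q                          # similarity of every cached key to q
    idx = np.argpartition(-scores, k)[:k]   # indices of the k highest scores (unordered)
    s = scores[idx] / np.sqrt(q.shape[0])   # scaled scores over the retrieved subset
    w = np.exp(s - s.max())
    w /= w.sum()                            # softmax restricted to the top-k keys
    return w @ V[idx]                       # attention output of shape (d,)
```

In this framing, choosing k (and the retrieval strategy) becomes a query-planning decision, which is where a native query optimizer can trade quality against latency.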
Tao: Improving Resource Utilization while Guaranteeing SLO in Multi-tenant Relational Database-as-a-Service
Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025
It is an open challenge for cloud database service providers to guarantee tenants' service-level objectives (SLOs) while simultaneously achieving high resource utilization. In this work, we propose a novel system, Tao, to address it. Tao consists of three key components: (i) a tasklet-based DAG generator, (ii) a tasklet-based DAG executor, and (iii) an SLO-guaranteed scheduler. The core concept in Tao is the tasklet, a coroutine-based lightweight execution unit of the physical execution plan. In particular, we first convert each SQL operator in the traditional physical execution plan into a set of fine-grained tasklets with the tasklet-based DAG generator. Then, we abstract the tasklet-based DAG execution procedure and implement the tasklet-based DAG executor using C++20 coroutines. Finally, we introduce the SLO-guaranteed scheduler for scheduling tenants' tasklets across CPU cores. This scheduler guarantees tenants' SLOs with a token bucket model and improves resource utilization with an on-demand core adjustment strategy. We build Tao on an open-source relational database, Hyrise, and conduct extensive experimental studies to demonstrate its superiority over existing solutions.
DiskGNN: Bridging I/O Efficiency and Model Accuracy for Out-of-Core GNN Training
Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025
Graph neural networks (GNNs) are models specialized for graph data and widely used in applications. To train GNNs on large graphs that exceed CPU memory, several systems have been designed to store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification when conducting random reads for node features that are smaller than a disk page, or degraded model accuracy by treating the graph as disconnected partitions. To close this gap, we build DiskGNN for high I/O efficiency and fast training without model accuracy degradation. The key technique is offline sampling, which decouples graph sampling from model computation. In particular, by conducting graph sampling beforehand for multiple mini-batches, DiskGNN acquires the node features that will be accessed during model computation and conducts pre-processing to pack the node features of each mini-batch contiguously on disk to avoid read amplification for computation. Given the feature access information acquired by offline sampling, DiskGNN also adopts designs including four-level feature store to fully utilize the memory hierarchy of GPU and CPU to cache hot node features and reduce disk access, batched packing to accelerate feature packing during pre-processing, and pipelined training to overlap disk access with other operations. We compare DiskGNN with state-of-the-art out-of-core GNN training systems. The results show that DiskGNN has more than 8× speedup over existing systems while matching their best model accuracy. DiskGNN is open-source at https://github.com/Liu-rj/DiskGNN.
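The offline-sampling idea above can be illustrated in a few lines: sample every mini-batch first, then copy each batch's feature rows into one contiguous region so training reads them sequentially instead of issuing small random page reads. The NumPy sketch below is a simplified stand-in (random node selection in place of real graph sampling; all names hypothetical), not DiskGNN's implementation.

```python
import numpy as np

def offline_sample(num_nodes, batch_size, num_batches, rng):
    """Stand-in for graph sampling: decide ahead of time which node ids
    each mini-batch will access during model computation."""
    return [rng.choice(num_nodes, size=batch_size, replace=False)
            for _ in range(num_batches)]

def pack_features(features, batches):
    """Pack each batch's feature rows contiguously (the on-disk layout),
    keeping an id -> row mapping for lookup at training time."""
    packed, offsets = [], []
    for ids in batches:
        offsets.append({int(n): i for i, n in enumerate(ids)})
        packed.append(features[ids])  # one contiguous region per mini-batch
    return packed, offsets

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 16))       # toy node-feature matrix
batches = offline_sample(1000, 64, 4, rng)
packed, offsets = pack_features(feats, batches)
# training then loads packed[b] with a single sequential read per batch
```

The trade-off is extra pre-processing work and storage for the packed copies, which DiskGNN mitigates with batched packing and by caching hot features in the GPU/CPU memory hierarchy.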
Ph.D. Candidate
Class of 2019
Ph.D. Candidate
Class of 2020
Ph.D. Candidate
Class of 2020
Ph.D. Candidate
Class of 2021
Ph.D. Candidate
Class of 2021
Ph.D. Candidate
Class of 2021
Ph.D. Candidate
Class of 2021
Ph.D. Candidate
Class of 2022
Ph.D. Candidate
Class of 2022
MPhil Candidate
Class of 2019
MPhil Candidate
Class of 2020
MPhil Candidate
Class of 2020
MPhil Candidate
Class of 2021
MPhil Candidate
Class of 2021
MPhil Candidate
Class of 2021
MPhil Candidate
Class of 2022
MPhil Candidate
Class of 2022
MPhil Candidate
Class of 2022
We always welcome brilliant people to join our group. If you would like to join DBGroup, please fill in this form and send us an email as soon as possible.
dbgroup_AT_sustech_DOT_edu_DOT_cn
DBGroup, South Tower, CoE Building
Southern University of Science and Technology
Shenzhen, China