Publications

Author: Category:

2025

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Yangshen Deng*, Zhengxin You*, Long Xiang*, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang

AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB AI. Specifically, it decouples the KV cache and attention computation from the LLM inference systems, and encapsulates them into a novel vector database system. For the Model as a Service providers (MaaS), AlayaDB consumes fewer hardware resources and offers higher generation quality for various workloads with different kinds of Service Level Objectives (SLOs), when comparing with the existing alternative solutions (e.g., KV cache disaggregation, retrieval-based sparse attention). The crux of AlayaDB is that it abstracts the attention computation and cache management for LLM inference into a query processing procedure, and optimizes the performance via a native query optimizer. In this work, we demonstrate the effectiveness of AlayaDB via (i) three use cases from our industry partners, and (ii) extensive experimental results on LLM inference benchmarks.

ParaGraph: Accelerating Graph Indexing through GPU-CPU Parallel Processing for Efficient Cross-modal ANNS

Data Management on New Hardware @ SIGMOD (DaMoN, CCF-A), 2025

Authors: Yuxiang Yang, Shiwen Chen, Yangshen Deng, Bo Tang

Cross-modal Approximate Nearest Neighbor Search (ANNS) is crucial for a growing number of applications, including search enginesand recommendation systems. However, existing vector search indexes often struggle with either poor search efficiency or slow index construction when handling cross-modal ANNS queries. To overcome these limitations, we introduce ParaGraph, a CPU-GPU co-processing system that enhances both search efficiency and construction speed for cross-modal ANNS. ParaGraph employs novel multi-round top-m projection and batched search-and-refine techniques for index construction. Additionally, it leverages modern heterogeneous hardware architectures by distributing computationally distinct tasks across the GPU and CPU, augmented with in-depth optimizations to maximize parallelism and performance. Compared to the state-of-the-art cross-modal ANNS index, ParaGraph achieves a 4.1× to 4.9× speedup in index construction and a 50% reduction in index size, while maintaining search efficiency.

Tao: Improving Resource Utilization while Guaranteeing SLO in Multi-tenant Relational Database-as-a-Service

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Haotian Liu, Runzhong Li, Ziyang Zhang, Bo Tang

It is an open challenge for cloud database service providers to guarantee tenants' service-level objectives (SLOs) and enjoy high resource utilization simultaneously. In this work, we propose a novel system Tao to overcome it. Tao consists of three key components: (i) tasklet-based DAG generator, (ii) tasklet-based DAG executor, and (iii) SLO-guaranteed scheduler. The core concept in Tao is tasklet, a coroutine-based lightweight execution unit of the physical execution plan. In particular, we first convert each SQL operator in the traditional physical execution plan into a set of fine-grained tasklets by the tasklet-based DAG generator. Then, we abstract the tasklet-based DAG execution procedure and implement the tasklet-based DAG executor using C++20 coroutines. Finally, we introduce the SLO-guaranteed scheduler for scheduling tenants' tasklets across CPU cores. This scheduler guarantees tenants' SLOs with a token bucket model and improves resource utilization with an on-demand core adjustment strategy. We build Tao on an open-sourced relational database, Hyrise, and conduct extensive experimental studies to demonstrate its superiority over existing solutions.

Athena: An Effective Learning-based Framework for Query Optimizer Performance Improvement

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Runzhong Li, Qilong Li, Haotian Liu, Rui Mao, Qing Li, Bo Tang

Recent studies have made it possible to integrate learning techniques into database systems for practical utilization. In particular, the state-of-the-art studies hook the conventional query optimizer to explore multiple execution plan candidates, then choose the optimal one with a learned model. This framework simplifies the integration of learning techniques into the database system. However, these methods still have room for improvement due to their limited plan exploration space and ineffective learning from execution plans. In this work, we propose Athena, an effective learning-based framework of query optimizer enhancer. It consists of three key components: (i) an order-centric plan explorer, (ii) a Tree-Mamba plan comparator and (iii) a time-weighted loss function. We implement Athena on top of the open-source database PostgreSQL and demonstrate its superiority via extensive experiments. Specifically, We achieve 1.75x, 1.95x, 5.69x, and 2.74x speedups over the vanilla PostgreSQL on the JOB, STATS-CEB, TPC-DS, and DSB benchmarks, respectively. Athena is 1.74x, 1.87x, 1.66x, and 2.28x faster than the state-of-the-art competitor Lero on these benchmarks. Additionally, Athena is open-sourced and it can be easily adapted to other relational database systems as all these proposed techniques in Athena are generic.

GPH: An Efficient and Effective Perfect Hashing Scheme for GPU Architecture

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Jiaping Cao, Le Xu, Man Lung Yiu, Jianbin Qin, Bo Tang

Hash tables are widely used to support fast lookup operations for various applications on key-value stores and relational databases. In recent years, hash tables have been significantly improved by utilizing the high memory bandwidth and large parallelism degree offered by Graphics Processing Units (GPUs). However, there is still a lack of comprehensive analysis of the lookup performance on existing GPU-based hash tables. In this work, we develop a micro-benchmark and devise an effective and general performance analysis model, which enables uniform and accurate lookup performance evaluation of GPU-based hash tables. Moreover, we propose GPH, a novel GPU-based hash table, to improve lookup performance with the guidance of the benchmark results from the analysis model devised above. In particular, GPH employs the perfect hashing scheme that ensures exactly 1 bucket probe for every lookup operation. Besides, we optimize the bucket requests to global memory in GPH by devising vectorization and instruction-level parallelism techniques. We also introduce the insert kernel in GPH to support dynamic updates (e.g., processing insert operations) on GPU. Experimentally, GPH achieves over 8500 million operations per second (MOPS) for lookup operation processing in both synthetic and real-world workloads, which outperforms all evaluated GPU-based hash tables.

Nezha: An Efficient Distributed Graph Processing System on Heterogeneous Hardware

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Pengjie Cui, Haotian Liu, Dong Jiang, Bo Tang, Ye Yuan

The growing scale of graph data across various applications demands efficient distributed graph processing systems. Despite the widespread use of the Scatter-Gather model for large-scale graph processing across distributed machines, the performance still can be significantly improved as the computation ability of each machine is not fully utilized and the communication costs during graph processing are expensive in the distributed environment. In this work, we propose a novel and efficient distributed graph processing system Nezha on heterogeneous hardware, where each machine is equipped with both CPU and GPU processors and all these machines in the distributed cluster are interconnected via Remote Direct Memory Access (RDMA). To reduce the communication costs, we devise an effective communication mode with a graphfriendly communication protocol in the graph-based RDMA communication adapter of Nezha. To improve the computation efficiency, we propose a multi-device cooperative execution mechanism in Nezha, which fully utilizes the CPU and GPU processors of each machine in the distributed cluster. We also alleviate the workload imbalance issue at inter-machine and intra-machine levels via the proposed workload balancer in Nezha. We conduct extensive experiments by running 4 widely-used graph algorithms on 5 graph datasets to demonstrate the superiority of Nezha over existing systems.

DiskGNN: Bridging I/O Efficiency and Model Accuracy for Out-of-Core GNN Training

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2025

Authors: Renjie Liu, Yichuan Wang, Xiao Yan, Zhenkun Cai, Minjie Wang, Haitian Jiang, Bo Tang, Jinyang Li.

Graph neural networks (GNNs) are models specialized for graph data and widely used in applications. To train GNNs on large graphs that exceed CPU memory, several systems have been designed to store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification when conducting random reads for node features that are smaller than a disk page, or degraded model accuracy by treating the graph as disconnected partitions. To close this gap, we build DiskGNN for high I/O efficiency and fast training without model accuracy degradation. The key technique is offline sampling, which decouples graph sampling from model computation. In particular, by conducting graph sampling beforehand for multiple mini-batches, DiskGNN acquires the node features that will be accessed during model computation and conducts pre-processing to pack the node features of each mini-batch contiguously on disk to avoid read amplification for computation. Given the feature access information acquired by offline sampling, DiskGNN also adopts designs including four-level feature store to fully utilize the memory hierarchy of GPU and CPU to cache hot node features and reduce disk access, batched packing to accelerate feature packing during pre-processing, and pipelined training to overlap disk access with other operations. We compare DiskGNN with state-of-the-art out-of-core GNN training systems. The results show that DiskGNN has more than 8× speedup over existing systems while matching their best model accuracy. DiskGNN is open-source at https://github.com/Liu-rj/DiskGNN.

QOVIS: Understanding and Diagnosing Query Optimizer via a Visualization-assisted Approach

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2025

Authors: Zhengxin You, Qiaomu Shen, Man Lung Yiu, Bo Tang.

Understanding and diagnosing query optimizers is crucial to guar antee the correctness and efficiency of query processing in database systems. However, achieving this is non-trivial as there are three technical challenges: (i) hundreds and thousands of query plans are generated for each query during the query optimization proce dure; (ii) the transformation logic among query plans is not easy to investigate even for expert database system developers; and (iii) navigating users to the root causes of the bugs/errors is in herently hard as the changes of the operators among query plans are missing in the query processing log. In this work, we propose QOVIS to overcome these challenges, which identifies the query optimization bugs/issues and investigates their root causes via a visualization-assisted approach. Specifically, QOVIS consists of data preprocessing layer, transformation logic computation layer, and visual analysis layer. We conduct extensive experimental studies (e.g., user study, case study, and performance study) to evaluate the efficiency and effectiveness of QOVIS. In particular, our user study (on 24 database developers and researchers) confirms that QOVIS significantly reduces the time required to investigate the bugs/er rors in the query optimizer. Moreover, the generality of QOVIS is verified by utilizing it to understand and diagnose the real-world reported bugs/errors in different query optimizers of three widely used systems: Apache Spark, Apache Hive, and DuckDB

Efficient and Effective Algorithms for A Family of Influence Maximization Problems with A Matroid Constraint

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2025

Authors: Yiqian Huang, Shiqi Zhang, Laks Lakshmanan, Wenqing Lin, Xiaokui Xiao, Bo Tang

Influence maximization (IM) is a classic problem that aims to identify a small group of critical individuals, known as seeds, who can influence the largest number of users in a social network through word-of-mouth. This problem finds important applications including viral marketing, infection detection, and misinformation containment. The conventional IM problem is typically studied with the oversimplified goal of selecting a single seed set. Many realworld scenarios call for multiple sets of seeds, particularly on social media platforms where various viral marketing campaigns need different sets of seeds to propagate effectively. To this end, previous works have formulated various IM variants, central to which is the requirement of multiple seed sets, naturally modeled as a matroid constraint. However, the current best-known solutions for these variants either offer a weak (1/2 − 𝜖)-approximation, or offer a (1 − 1/𝑒 − 𝜖)-approximation algorithm that is very expensive. We propose an efficient seed selection method called AMP, an algorithm with a (1 − 1/𝑒 − 𝜖)-approximation guarantee for this family of IM variants. To further improve efficiency, we also devise a fast implementation, called RAMP. We extensively evaluate the performance of our proposal against 6 competitors across 4 IM variants and on 7 real-world networks, demonstrating that our proposal outperforms all competitors in terms of result quality, running time, and memory usage. We have also deployed RAMP in a real industry strength application involving online gaming, where we show that our deployed solution significantly improves upon the baselines

2024

nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic Systems (vision)

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2024

Authors: Ye Yuan, Bo Tang, Tianfei Zhou, Zhiwei Zhang, Jianbin Qin

In this paper, we propose nsDB, a novel neuro-symbolic database system that integrates neural and symbolic system architectures natively to address the weaknesses of each, providing a strong database capable of data managing, model learning, and complex analytical query processing over multi-modal data. We employ a real-world NBA data analytical query as an example to illustrate the functionality of each component in nsDB and highlight the research challenges to build it. We then present the key design principles and our preliminary attempts to address them. In a nutshell, we envision that the next generation databasesystem nsDB integrates the complex neural system with the simplesymbolic system. Undoubtedly, nsDB will serve as a bridge betweendatabases with Al models, which abstracts away the Al complexitiesbut allows end users to enjoy the strong capabilities of them. Weare in the early stages of the journey to build nsDB, there are manyopening challenges, e.g., in-database model training, multi-objectivequery optimization, and database agent development. We hope theresearchers from different communities (e.g., system, architecture!database, artifcial intelligence) could tackle them together.

Accelerating Merkle Patricia Trie with GPU

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2024

Authors: Yangshen Deng, Muxi Yan, Bo Tang

Merkle Patricia Trie (MPT) is a type of trie structure that offers efficient lookup and insert operators for immutable data systems that require multi-version access and tamper-evident controls, such as blockchains and verifiable databases. The performance of these systems is critically dependent on the throughput of the underlying index structure MPT. In this paper, we present a novel approach to accelerate MPT by leveraging the massive parallelism of GPU. However, achieving it is challenging as (i) lock-free data structures are difficult to implement and (ii) traditional fine-grained locking does not scale on GPU. To address them, we first analyze the technical challenges of accelerating MPT via GPU, including node splitting conflicts and hash computing conflicts caused by parallel insert operations. We then propose a lock-free algorithm PhaseNU and a lock-based al gorithm LockNU on GPU to resolve the node splitting conflict. We also devise a decision model for users to choose the proper one for different workloads. We next propose a GPU-based hash compute algorithm PhaseHC to avoid hash computing conflicts. Last, we demonstrate the effectiveness of our proposed techniques by: (i) integrating them into both the real-world blockchain sys tem Geth and verifiable database LedgerDB, and demonstrating its superiority with corresponding workloads; and (ii) conducting extensive experimental studies on two real-world datasets and one synthetic dataset. Our proposed solutions significantly outperform the deployed MPT solution in Geth in all datasets.

CGgraph: An Ultra-fast Graph Processing System on Modern Commodity CPU-GPU Co-processor

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2024

Authors: Pengjie Cui, Haotian Liu, Bo Tang, Ye Yuan

In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academic and industrial to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the performance of existing systems can be significantly improved by addressing two prevailing challenges: GPU memory over-subscription and efficient CPU-GPU cooperative processing. In this work, we propose CGgraph, an ultra-fast CPU-GPU graph processing system to address these challenges. In particular, CGgraph overcomes GPU-memory over-subscription by extracting a subgraph which only needs to be loaded into GPU memory once, but its vertices and edges can be used in multiple iterations during the graph processing procedure. To support efficient CPU-GPU co-processing, we design a CPU-GPU cooperative processing scheme, which balances the workloads between CPU and GPU by on-demand task allocation. To evaluate the efficiency of CG-graph, we conduct extensive experiments, comparing it with 7 state-of-the-art systems using 4 well-known graph algorithms on 6 real-world graphs. Our prototype system CGgraph outperforms all existing systems, delivering up to an order of magnitude improvement. Moreover, CGgraph on a modern commodity machine with a CPU-GPU co-processor yields superior (or at the very least, comparable) performance compared to existing systems on a high-end CPU-GPU server.

CoroGraph: Bridging Cache Efficiency and Work Efficiency for Graph Algorithm Execution

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2024

Authors: Xiangyu Zhi, Xiao Yan, Bo Tang, Ziyao Yin, Y. Zhu, M. Zhou

Many systems are designed to run graph algorithms efficiently in memory but they achieve only cache efficiency or work efficiency. We tackle this fundamental trade-off in existing systems by designing CoroGraph, a system that attains both cache efficiency and work efficiency for in-memory graph processing. CoroGraph adopts a novel hybrid execution model, which generates update messages at vertex granularity to prioritize promising vertices for work efficiency, and commits updates at partition granularity to share data access for cache efficiency. To overlap the random memory access of graph algorithms with computation, CoroGraph extensively uses coroutine, i.e., a lightweight function in C++ that can yield and resume with low overhead, to prefetch the required data. A suite of designs are incorporated to reap the full benefits of coroutine, which include prefetch pipeline, cache-friendly graph format, and stop-free synchronization. We compare CoroGraph with five state-of-the-art graph algorithm systems via extensive experiments. The results show that CoroGraph yields shorter algorithm execution time than all baselines in 18 out of 20 cases, and its speedup over the best-performing baseline can be over 2x. Detailed profiling suggests that CoroGraph achieves both cache efficiency and work efficiency with a low memory stall and a small number of processed edges.

Marrying Top-k with Skyline Queries: Operators with Relaxed Preference Input and Controllable Output Size

ACM Transactions on Database Systems (TODS, CCF-A), 2024

Authors: Kyriakos Mouratidis, Keming Li, Bo Tang

The two paradigms to identify records of preference in a multi-objective setting rely either on dominance (e.g., the skyline operator) or on a utility function defined over the records’ attributes (typically using a top-k query). Despite their proliferation, each has its own palpable drawbacks. Motivated by these drawbacks, we identify three hard requirements for practical decision support, namely, personalization, controllable output size, and flexibility in preference specification. With these requirements as a guide, we combine elements from both paradigms and propose two new operators, ORD and ORU. We present a suite of algorithms for their efficient processing, dedicating more technical effort to ORU, whose nature is inherently more challenging. Specifically, besides a sophisticated algorithm for ORD, we describe two exact methods for ORU and one approximate. We perform a qualitative study to demonstrate how our operators work and evaluate the performance of our algorithms against adaptations of previous work that mimic their output.

How Does Software Prefetching Work on GPU Query Processing?

Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN, CCF-A), 2024

Authors: Yangshen Deng, Shiwen Chen, Zhaoyang Hong, Bo Tang

Improving the performance of GPU query processing is a well-studied problem in database community. However, its performance is still unsatisfactory due to the low utilization of GPU memory bandwidth. In the literature, employing software prefetching techniques to improve the bandwidth utilization is a common practice in CPU database as it overlaps computation cost and memory access latency. However, it was ignored by GPU database even though the software prefetching ability has been provided by modern GPU architecture (i.e., from NVIDIA Ampere). In order to investigate the effectiveness of software prefetching techniques on GPU query processing, we implement four software prefetching algorithms on GPU, i.e., Group Prefetch (GP), Software-Pipelined Prefetch (SPP), Asynchronous Memory Access Chaining (AMAC) and Interleaved Multi-Vectorizing (IMV) in the work. We then adapt them on hash join probe and BTree search tasks with a suite of optimizations. Last, we conduct comprehensive experiments and evaluate the performance of them. The results confirm the superiority of software prefetching techniques on GPU query processing. Specifically, they can achieve up to 1.19X speedup on hash join probe and 1.31X speedup on BTree search when compared with the implementations without software prefetching.

LearnSC: An Efficient and Unified Learning-based Framework for Subgraph Counting Problem

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2024

Authors: Wenzhe Hou, Xiang Zhao, Bo Tang

Graphs are valuable data structures used to represent complex relationships between entities in a wide range of applications, such as social networks and chemical reactions. Subgraph counting problem is a well-known hard problem, as its core subroutine, the subgraph matching, is NP-complete. In this work, we propose an efficient and unified deep learning-based solution framework LearnSC, which solves the subgraph counting problem approximately. This framework offers two key advantages: (i) it is a generic solution that is orthogonal to the existing techniques of learning-based solutions; and (ii) it is equipped with a suite of optimizations to significantly improve the accuracy of the estimated results. Our experimental results on 7 datasets demonstrate that our proposal is highly accurate, robust, and scalable, making it an excellent solution for subgraph counting problem among all statistics-based and learning-based competitors.

QSRP: Efficient Reverse k -Ranks Query Processing on High-dimensional Embeddings

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2024

Authors: Bian Zheng, Xiao Yan, Jiahao Zhang, Man Lung Yiu, Bo Tang

Reverse k nearest neighbor search (RkNNS) plays an important role in various data processing and analysis tasks, seeking to pinpoint data considering the query data q among their k nearest neighbors. As large models gain popularity, processing high-dimensional vectors has become more and more widespread. However, existing RkNNS solutions face inefficiency when handling large-scale high-dimensional vectors due to their sensitivity to data dimensions and sizes during index construction or the verification of numerous candidate results in the query phase. Motivated by these challenges and the inherent intricacies of high-dimensional data processing, in this paper, we study an approximate version of the RkNNS problem (RkANNS) for high-dimensional vectors, aiming to offer efficient and practical solutions. To this end, we propose a new proximity-graph-based index called HAMG, which enables finding the query results within k hops from q. We also present a user-friendly query algorithm on HAMG that can adaptively adjust the search scope based on the desired query recall of users. To further enhance the query process, two pruning strategies are proposed to reduce the number of candidates requiring verification. Extensive experiments validate that HAMG scales well for data dimensions and sizes, and our query algorithm improves query efficiency by up to two orders of magnitude while maintaining comparable query accuracy against existing approaches.

Fair Top-k Query on Alpha-Fairness

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2024

Authors: Hao Liu, Zheng Zhang, Raymond Chi-Wing Wong, Min Xie, Bo Tang

Thetraditional top-kquerywasproposedtoobtain asmallsubsetfromthedatabaseaccordingtotheuserpreference, which is explicitly expressedas a ranking scheme (i.e., utility function).However,apoorly-designedutilityfunctionmaycreate discrimination, which in turnmay cause harm tominority groups, e.g.,womenandethnicminorities, andthus, fairness is becomingincreasinglyimportant inmanysituations, e.g.,hiring andadmissiondecisions.Motivatedbythis,westudyfairranking toalleviatediscrimination.Wedesignafairnessmodel,called fairness, toquantifythefairnessofutilityfunctions.Wepropose anefficient exact frameworkwithabasic implementationand an improved implementationtofindthe fairestutility function withtheminimummodificationpenalty.Weconductedextensive experimentsonbothrealandsyntheticdatasets todemonstrate oureffectivenessandefficiencycomparedwiththepriorstudies.

Information Diffusion Meets Invitation Mechanism

International World Wide Web Conference (WWW, CCF-A), 2024

Authors: Shiqi Zhang, Jiachen Sun, Wenqing Lin, Xiaokui Xiao, Yiqian Huang, Bo Tang

The dissemination of information is a complex process that plays a crucial role in real-world applications, especially when intertwined with friend invitations and their ensuing responses. Traditional diffusion models, however, often do not adequately capture this invitation-aware diffusion (IAD), rendering inferior results. These models typically focus on describing the social influence process, i.e., how a user is informed by friends, but tend to overlook the subsequent behavioral changes that invitations might precipitate. To this end, we present the Independent Cascade with Invitation (ICI) model, which incorporates both the social influence process and multi-stage behavior conversions in IAD. We validate our design through an empirical study on in-game IAD. Furthermore, we conduct extensive experiments to evaluate the effectiveness of our proposal against 6 state-of-the-art models on 6 real-world datasets. In particular, we demonstrate that our solution can outperform the best competitor by up to 5× in cascade estimation and 17.2% in diffusion prediction. We deploy our proposal in the seed selection and friend ranking scenarios of Tencent's online games, where it achieves improvements of up to 170% and 20.3%, respectively.

Debiasing Recommendation with Personal Popularity

International World Wide Web Conference (WWW, CCF-A), 2024

Authors: Wentao Ning, Reynold Cheng, Xiao Yan, Ben Kao, Nan Huo, Nur Al Hasan Haldar, Bo Tang

Global popularity (GP) bias is the phenomenon that popular items are recommended much more frequently than they should be, which goes against the goal of providing personalized recommendations and harms user experience and recommendation accuracy. Many methods have been proposed to reduce GP bias but they fail to notice the fundamental problem of GP, i.e., it considers popularity from a global perspective of all users and uses a single set of popular items, and thus cannot capture the interests of individual users. As such, we propose a user-aware version of item popularity named personal popularity (PP), which identifies different popular items for each user by considering the users that share similar interests. As PP models the preferences of individual users, it naturally helps to produce personalized recommendations and mitigate GP bias. To integrate PP into recommendation, we design a general personal popularity aware counterfactual (PPAC) framework, which adapts easily to existing recommendation models. In particular, PPAC recognizes that PP and GP have both direct and indirect effects on recommendations and controls direct effects with counterfactual inference techniques for unbiased recommendations. All codes and datasets are available at https://github.com/Stevenn9981/PPAC.

The Design of a Lossless Deduplication Scheme to Eliminate Fine-grained Redundancy for JPEG Image Storage Systems.

IEEE Transactions on Computers (TOC, CCF-A), 2024

Authors: Cai Deng, Xiangyu Zou, Qi Chen, Bo Tang, Wen Xia.

Image data storage has grown explosively, so image deduplication is used to save storage by eliminating redundancy between different images. However, traditional image deduplication cannot eliminate fine-grained redundancy nor guarantee lossless results. In this work, we propose imDedup, a lossless and fine-grained deduplication scheme for JPEG image storage systems. Specifically, imDedup uses a novel sampling hash method, Feature Bitmap, to detect similar images in a fast way by utilizing the information distribution of JPEG data. Meanwhile, it uses Idelta, a novel delta encoder that incorporates image compression into deduplication, to guarantee the non-redundant data can be re-compressed via image encoding and thus improves the compression ratio. Besides, we propose the DCHash and Fixed-Point Matching (FPM) techniques to further speed up Idelta. We also propose imDedup-plus, which dynamically chooses the DCHash-based or FPM-based compressor to achieve higher throughputs without sacrificing the compression ratio. Experimental results demonstrate the superiority of the imDedup-based methods on five datasets. Compared with the state-of-the-art similarity detector and delta encoder, imDedup achieves 1.8–4.4× higher throughputs and 1.3–1.7× higher compression ratios, respectively. Besides, imDedup-plus can further achieve 1.3–2.9× higher throughputs than imDedup without sacrificing the compression ratio.

CheetahTraj: Efficient Visualization for Large Trajectory Dataset with Quality Guarantee.

IEEE Transactions on Knowledge and Data Engineering (TKDE, CCF-A), 2024

Authors: Qiaomu Shen, Chaozu Zhang, Xiao Yan, Chuan Yang, Dan Zeng, Wei Zeng, Bo Tang.

Visualizing large-scale trajectory dataset is a core subroutine for many applications. However, rendering all trajectories could result in severe visual clutter and incur long visualization delays due to large data volume. Naively sampling the trajectories reduces visualization time but usually harms visual quality, i.e., the generated visualizations may look substantially different from the exact ones without sampling. In this paper, we propose CheetahTraj, a principled sampling framework that achieves both high visualization quality and low visualization latency. We first define the visual quality function measuring the similarity between two visualizations, based on which we formulate the quality optimal sampling problem (QOSP). To solve QOSP, we design the Visual Quality Guaranteed Sampling algorithms, which reduce visual clutter while guaranteeing visual quality by considering both trajectory data distribution and human perception properties. We also develop a quad-tree-based index (InvQuad) that allows using trajectory samples computed offline for interactive online visualization. Extensive experiments including case-, user-, and quantitative-studies are conducted on three real-world trajectory datasets, and the results show that CheetahTraj consistently provides higher visual quality and better efficiency than baseline methods. Compared with visualizing all trajectories, CheetahTraj reduces the visualization latency by up to 3 orders of magnitude while avoiding visual clutter.

2023

CDSBen: Benchmarking the Performance of Storage Services in Cloud-native Database System at ByteDance.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2023

Authors: Jiashu Zhang, Wen Jiang, Bo Tang, H. Ma, L. Cao, Z. Jiang, Y. Nie, F. Wang, L. Zhang, Y. Liang.

In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate computation and storage in traditional monolithic OLTP databases. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance's cloud-native database veDB. We then elaborate the limitations of using standard benchmarks such as TPC-C and YCSB to resemble these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDS-Ben. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.

Speeding Up End-to-end Query Execution via Learning-based Progressive Cardinality Estimation.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2023

Authors: Fang Wang, Xiao Yan, Man Lung Yiu, Shuai Li, Zunyao Mao, Bo Tang.

Fast query execution requires learning-based cardinality estimators to have short inference time (as model inference time adds to end-to-end query execution time) and high estimation accuracy (which is crucial for finding good execution plan). However, existing estimators cannot meet both requirements due to the inherent tension between model complexity and estimation accuracy. We propose a novel Learning-based Progressive Cardinality Estimator (LPCE), which adopts a query re-optimization methodology. In particular, LPCE consists of an initial model (LPCE-I), which estimates cardinality before query execution, and a refinement model (LPCE-R), which progressively refines the cardinality estimations using the actual cardinalities of the executed operators. During query execution, re-optimization is triggered if the estimations of LPCE-I are found to have large errors, and more efficient execution plans are selected for the remaining operators using the refined estimations provided by LPCE-R. Both LPCE-I and LPCE-R are light-weight query-driven estimators but they achieve both good efficiency and high accuracy when used jointly. Besides designing the models for LPCE-I and LPCE-R, we also integrate re-optimization and LPCE into PostgreSQL, a popular database engine. Extensive experiments show that LPCE yields shorter end-to-end query execution time than state-of-the-art learning-based estimators.

EAR-Oracle: On Efficient Indexing for Distance Queries between Arbitrary Points on Terrain Surface.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2023

Authors: Bo Huang, Victor Junqiu Wei, Raymond Chi-Wing Wong, Bo Tang.

Due to the advancement of geo-positioning technology, the terrain data has become increasingly popular and has drawn a lot of research effort from both academia and industry. The distance computation on the terrain surface is a fundamental and important problem that is widely applied in geographical information systems and 3D modeling. As could be observed from the existing studies, online computation of the distance on the terrain surface is very expensive. All existing index-based methods are only efficient under the case where the distance query must be performed among a small set of predefined points-of-interest known apriori. But, in general cases, they could not scale up to sizable datasets due to their intolerable oracle building time and space consumption. In this paper, we studied the arbitrary point-to-arbitrary point distance query on the terrain surface in which no assumption is imposed on the query points, and the distance query could be performed between any two arbitrary points. We propose an indexing structure, namely Efficient Arbitrary Point-to-Arbitrary Point Distance Oracle (EAR-Oracle), with theoretical guarantee on the accuracy, oracle building time, oracle size and query time. Our experiments demonstrate that our oracle enjoys excellent scalability and it scales up to enormous terrain surfaces but none of the existing index-based methods could be able to. Besides, it significantly outperforms all existing online computation methods by orders of magnitude in terms of the query time.

Effective and Efficient PageRank-based Positioning for Graph Visualization

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2023

Authors: Shiqi Zhang, Renchi Yang, Xiaokui Xiao, Xiao Yan, Bo Tang.

Graph visualization is a vital component in many real-world applications (e.g., social network analysis, web mining, and bioinformatics) that enables users to unearth crucial insights from complex data. Lying in the core of graph visualization is the node distance measure, which determines how the nodes are placed on the screen. A favorable node distance measure should be informative in reflecting the full structural information between nodes and effective in optimizing visual aesthetics. However, existing node distance measures yield sub-par visualization quality as they fall short of these requirements. Moreover, most existing measures are computationally inefficient, incurring a long response time when visualizing large graphs. To overcome such deficiencies, we propose a new node distance measure, PDist, geared towards graph visualization by exploiting a well-known node proximity measure,personalized PageRank. Moreover, we propose an efficient algorithm Tau-Push for estimating PDist under both single- and multi-level visualization settings. With several carefully-designed techniques, TauPush offers non-trivial theoretical guarantees for estimation accuracy and computation complexity. Extensive experiments show that our proposal significantly outperforms 13 state-of-the-art graph visualization solutions on 12 real-world graphs in terms of both efficiency and effectiveness (including aesthetic criteria and user feedback). In particular, our proposal can interactively produce satisfactory visualizations within one second for billion-edge graphs.

QEVIS: Multi-grained Visualizing of Distributed Query Execution.

IEEE Transactions on Visualization and Computer Graphics (TVCG, CCF-A), 2023

Authors: Qiaomu Shen, Zhengxin You, Xiao Yan, Chaozu Zhang, Ke Xu, Jianbin Qin, Dan Zeng, Bo Tang.

Distributed query processing systems such as Apache Hive and Spark are widely-used in many organizations for large-scale data analytics. Analyzing and understanding the query execution process of these systems are daily routines for engineers and crucial for identifying performance problems, optimizing system configurations, and rectifying errors. However, existing visualization tools for distributed query execution are insufficient because (i) most of them (if not all) do not provide fine-grained visualization (i.e., the atomic task level), which can be crucial for understanding query performance and reasoning about the underlying execution anomalies, and (ii) they do not support proper linkages between system status and query execution, which makes it difficult to identify the causes of execution problems. To tackle these limitations, we propose QEVIS, which visualizes distributed query execution process with multiple views that focus on different granularities and complement each other. Specifically, we first devise a query logical plan layout algorithm to visualize the overall query execution progress compactly and clearly. We then propose two novel scoring methods to summarize the anomaly degrees of the jobs and machines during query execution, and visualize the anomaly scores intuitively, which allow users to easily identify the components that are worth paying attention to. Moreover, we devise a scatter plot-based task view to show a massive number of atomic tasks, where task distribution patterns are informative for execution problems. We also equip QEVIS with a suite of auxiliary views and interaction methods to support easy and effective cross-view exploration, which makes it convenient to track the causes of execution problems. QEVIS has been used in the production environment of our industry partner, and we present three use cases from real-world applications and user interview to demonstrate its effectiveness. QEVIS is open-source at https://github.com/...

Towards Building The Next Generation Computation Engine

Proceedings of the ACM Turing Award Celebration Conference-China (TURC, CCF-A), 2023

Authors: Bo Tang

In this poster, I first briefly introduce several system work in the Database Research Group at Southern University of Science and Technology, (i.e., DBGroup@SUSTech); I next present the key ideas of our on-going project, i.e., architecting the next generation computation engine, which is designed for data-intensive applications on heterogeneous computing environment.

Quantifying the Competitiveness of a Dataset in Relation to General Preferences

The VLDB Journal (VLDBJ, CCF-A), 2023

Authors: Kyriakos Mouratidis, Keming Li, Bo Tang.

Typically, a specific market (e.g., of hotels, restaurants, laptops, etc.) is represented as a multi-attribute dataset of the available products. The topic of identifying and shortlisting the products of most interest to a user has been well-explored. In contrast, in this work we focus on the dataset, and aim to assess its competitiveness with regard to different possible preferences. We define measures of competitiveness, and represent them in the form of a heat-map in the domain of preferences. Our work finds application in market analysis and in business development. These applications are further enhanced when the competitiveness heat-map is used in tandem with information on user preferences (which can be readily derived by existing methods). Interestingly, our study also finds side-applications with strong practical relevance in the area of multi-objective querying. We propose a suite of algorithms to efficiently produce the heat-map, and conduct case studies and an empirical evaluation to demonstrate the practicality of our work.

Capacity Constrained Influence Maximization in Social Networks.

ACM International Conference on Knowledge Discovery and Data Mining (KDD, CCF-A), 2023

Authors: Shiqi Zhang, Yiqian Huang, Jiachen Sun, Wenqing Lin, Xiaokui Xiao, Bo Tang.

Influence maximization (IM) aims to identify a small number of influential individuals to maximize the information spread and finds applications in various fields. It was first introduced in the context of viral marketing, where a company pays a few influencers to promote the product. However, apart from the cost factor, the capacity of individuals to consume content poses challenges for implementing IM in real-world scenarios. For example, players on online gaming platforms can only interact with a limited number of friends. In addition, we observe that in these scenarios, (i) the initial adopters of promotion are likely to be the friends of influencers rather than the influencers themselves, and (ii) existing IM solutions produce sub-par results with high computational demands. Motivated by these observations, we propose a new IM variant called capacity constrained influence maximization (CIM), which aims to select a limited number of influential friends for each initial adopter such that the promotion can reach more users. To solve CIM effectively, we design two greedy algorithms, MG-Greedy and RR-Greedy, ensuring the 1/2-approximation ratio. To improve the efficiency, we devise the scalable implementation named RR-OPIM+ with (1/2-ε)-approximation and near-linear running time. We extensively evaluate the performance of 9 approaches on 6 real-world networks, and our solutions outperform all competitors in terms of result quality and running time. Additionally, we deploy RR-OPIM+ to online game scenarios, which improves the baseline considerably.

Efficient Approximation Algorithms for Spanning Centrality

ACM International Conference on Knowledge Discovery and Data Mining (KDD, CCF-A), 2023

Authors: Shiqi Zhang, Renchi Yang, Jing Tang, Xiaokui Xiao, Bo Tang.

Given a graph \mathcalG, the spanning centrality (SC) of an edge e measures the importance of e for \mathcalG to be connected. In practice, SC has seen extensive applications in computational biology, electrical networks, and combinatorial optimization. However, it is highly challenging to compute the SC of all edges (AESC) on large graphs. Existing techniques fail to deal with such graphs, as they either suffer from expensive matrix operations or require sampling numerous long random walks. To circumvent these issues, this paper proposes TGT and its enhanced version TGT+, two algorithms for AESC computation that offers rigorous theoretical approximation guarantees. In particular, TGT remedies the deficiencies of previous solutions by conducting deterministic graph traversals with carefully-crafted truncated lengths. TGT+ further advances TGT in terms of both empirical efficiency and asymptotic performance while retaining result quality, based on the combination of TGT with random walks and several additional heuristic optimizations. We experimentally evaluate TGT+ against recent competitors for AESC using a variety of real datasets. The experimental outcomes authenticate that TGT+ outperforms state of the arts often by over one order of magnitude speedup without degrading the accuracy.

DGI: An Easy and Efficient Framework for GNN Model Evaluation

ACM International Conference on Knowledge Discovery and Data Mining (KDD, CCF-A), 2023

Authors: Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, Minjie Wang.

While many systems have been developed to train graph neural networks (GNNs), efficient model evaluation, which computes node embedding according to a given model, remains to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for over 90% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. The layer-wise approach avoids neighbor explosion by conducting computation layer by layer in GNN models. However, layer-wise model evaluation takes considerable implementation efforts because users need to manually decompose the GNN model into layers, and different implementations are required for GNN models with different structures. In this paper, we present DGI -a framework for easy and efficient GNN model evaluation, which automatically translates the training code of a GNN model for layer-wise evaluation to minimize user effort. DGI is general for different GNN models and evaluation requests (e.g., computing embedding for all or some of the nodes), and supports out-of-core execution on large graphs that cannot fit in CPU memory. Under the hood, DGI traces the computation graph of GNN model, partitions the computation graph into layers that are suitable for layer-wise evaluation according to tailored rules, and executes each layer efficiently by reordering the computation tasks and managing device memory consumption. Experiment results show that DGI matches hand-written implementations of layer-wise evaluation in efficiency and consistently outperforms node-wise evaluation across different datasets and hardware settings, and the speedup can be over 1,000x.

EEPH: An Efficient Extendible Perfect Hashing for Hybrid PMem-DRAM.

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2023

Authors: Qi Chen, Hao Hu, Cai Deng, Dingbang Liu, Shiyi Li, Bo Tang, Ting Yao, Wen Xia.

In recent years, the performance of hash indexes has been significantly improved by exploiting emerging persistent memory (PMem). However, the performance improvement of hash indexes mainly comes from exploiting the hardware features of PMem. Only a few studies optimize the hash index itself to fully exploit the potential of PMem. Interestingly, many of these studies improve the performance of write, but disregard the performance of read, of hash indexes on PMem. With extensive experimental evaluation, we find the major reason for inefficient read in the hash index on PMem is that the overhead of hash collision processing is expensive.To address that, we propose a novel Efficient Extendible Perfect Hashing (EEPH) on PMem-DRAM hybrid data layout to improve read performance of hash indexes. Specifically, we reduce the overhead of dynamic perfect hashing extension on PMem by combing extendible hashing. We then design a hybrid data layout to unlock the inherent read strengths of perfect hashing (i.e., zero collision). Last, we devise a complement move algorithm to efficiently guarantee the zero collision of perfect hashing when data move is conducted on PMem. We compare EEPH with the state-of-the-art hash indexes on PMem by conducting comprehensive experiments on several real-world read-intensive and read-skew workloads. The experimental results confirm the superiority of our EEPH as it achieves up to 2.21× higher throughput and about 1/3 of the 99th percentile latency than state-of-the-art hash indexes.

DHive: Query Execution Performance Analysis via Dataflow in Apache Hive.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2023

Authors: Chaozu Zhang, Qiaomu Shen, Bo Tang

Nowadays, Apache Hive has been widely used for large-scale data analysis applications in many organizations. Various visual analytical tools are developed to help Hive users quickly analyze the query execution process and identify the performance bottleneck of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) but fail to provide enough evidence to analyze the root reasons for the slow execution progress. To tackle this problem, we develop a visual analytical system DHive to visualize and analyze the query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels: query level, job level and task level, which enable users to identify the key jobs/tasks and explain their time usage by linking them to the auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive by two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.

Analyzing and Combating Attribute Bias for Face Restoration

International Joint Conference on Artificial Intelligence (IJCAI, CCF-A), 2023

Authors: Zelin Li, Dan Zeng, Xiao Yan, Qiaomu Shen, Bo Tang.

Face restoration (FR) recovers high resolution (HR) faces from low resolution (LR) faces and is challenging due to its ill-posed nature. With years of development, existing methods can produce quality HR faces with realistic details. However, we observe that key facial attributes (e.g., age and gender) of the restored faces could be dramatically different from the LR faces and call this phenomenon attribute bias, which is fatal when using FR for applications such as surveillance and security. Thus, we argue that FR should consider not only image quality as in existing works but also attribute bias. To this end, we thoroughly analyze attribute bias with extensive experiments and find that two major causes are the lack of attribute information in LR faces and bias in the training data. Moreover, we propose the DebiasFR framework to produce HR faces with high image quality and accurate facial attributes. The key design is to explicitly model the facial attributes, which also allows to adjust facial attributes for the output HR faces. Experiment results show that DebiasFR has comparable image quality but significantly smaller attribute bias when compared with state-of-the-art FR methods.

Extracting Top-k Frequent and Diversified Patterns in Knowledge Graphs

IEEE Transactions on Knowledge and Data Engineering (TKDE, CCF-A), 2023

Authors: Jian Zeng, Leong Hou U, Xiao Yan, Yan Li, Mingji Han, Bo Tang.

A knowledge graph contains many real-world facts that can be used to support various analytical tasks, e.g., exceptional fact discovery and the check of claims. In this work, we attempt to extract top-k frequent and diversified patterns from knowledge graph by well capturing user interest. Specifically, we first formalize the core-based top-k frequent pattern discovery problem, which finds the top-k frequent patterns that are extended from a core pattern specified by user query and have the highest frequency. In addition, to diversify the top-k frequent patterns, we define a distance function to measure the dissimilarity between two patterns, and return top-k patterns in which the pairwise diversity of any two resultant patterns exceeds a given threshold. As the search space of candidate patterns is exponential w.r.t. the number of nodes and edges in the knowledge graph, discovering frequent and diversified patterns is computationally challenging. To achieve high efficiency, we propose a suite of techniques, including (1) We devise a meta-index to avoid generating invalid candidate patterns; (2) We propose an upper bound of the frequency score (i.e., MNI) of the candidate pattern, which is used to prune unqualified candidates earlier and prioritize the enumeration order of patterns; (3) We design an advanced join-based approach to compute the MNI of candidate patterns efficiently; and (4) We develop a lower bound for distance function and incrementally compute the pairwise diversity among the patterns. Using real-world knowledge graphs, we experimentally verify the efficiency and effectiveness of our proposed techniques. We also demonstrate the utility of the extracted patterns by case studies.

Data-Scarce Animal Face Alignment via Bi-Directional Cross-Species Knowledge Transfer.

Proceedings of ACM International Conference on Multimedia (ACM MM, CCF-A), 2023

Authors: Dan Zeng, Shanchuan Hong, Shuiwang Li, Qiaomu Shen, Bo Tang.

Animal face alignment is challenging due to large intra- and inter-species variations and a scarcity of labeled data. Existing studies circumvent this problem by directly finetuning a human face alignment model or focusing on animal-specific face alignment(e.g., horse, sheep). In this paper, we propose Cross-Species Knowledge Transfer, Meta-CSKT, for animal face alignment, which consists of a base network and an adaptation network. Two networks continuously complement each other through the bi-directional cross-species knowledge transfer. This is motivated by observing knowledge sharing among animals. Meta-CSKT uses a circuit feedback mechanism to improve the base network with the cognitive differences of the adaptation network between few-shot labeled and large-scale unlabeled data. In addition, we propose a positive example mining method to identify positives, semi-hard positives, and hard negatives in unlabeled data to mitigate the scarcity of labeled data and facilitate Meta-CSKT learning. Experiments show that Meta-CSKT outperforms state-of-the-art methods by a large margin on the horse facial keypoint dataset and Japanese Macaque Species dataset, while achieving comparable results to state-of-the-art methods on large-scale labeled AnimalWeb(e.g., 18K), using only a few labeled images~(e.g., 40)1.

Multi-domain Recommendation with Embedding Disentangling and Domain Alignment.

Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM, CCF-B), 2023

Authors: Wentao Ning, Xiao Yan, W. Liu, Reynold Cheng, R. Zhang, Bo Tang.

Multi-domain recommendation (MDR) aims to provide recommendations for different domains (e.g., types of products) with overlapping users/items and is common for platforms such as Amazon, Facebook, and LinkedIn that host multiple services. Existing MDR models face two challenges: First, it is difficult to disentangle knowledge that generalizes across domains (e.g., a user likes cheap items) and knowledge specific to a single domain (e.g., a user likes blue clothing but not blue cars). Second, they have limited ability to transfer knowledge across domains with small overlaps. We propose a new MDR method named EDDA with two key components, i.e., embedding disentangling recommender and domain alignment, to tackle the two challenges respectively. In particular, the embedding disentangling recommender separates both the model and embedding for the inter-domain part and the intra-domain part, while most existing MDR methods only focus on model-level disentangling. The domain alignment leverages random walks from graph processing to identify similar user/item pairs from different domains and encourages similar user/item pairs to have similar embeddings, enhancing knowledge transfer. We compare EDDA with 12 state-of-the-art baselines on 3 real datasets. The results show that EDDA consistently outperforms the baselines on all datasets and domains. All datasets and codes are available at https://github.com/Stevenn9981/EDDA.

向量数据库:关键技术、系统架构与未来挑战

CCCF (CCCF, None), 2023

Authors: 唐博, 秦建斌, 毛睿.

向量(vector)是线性代数学科中的基本对象, 常用于表达无法通过单个数值(即标量)表示的多 维数值。在数据库领域,面向向量数据的存储与查 询处理在早年间多出现于空间数据库(spatial database)的 Top-k 查询和图片相似度检索等典型应用中。

时间敏感网络下的一种新型灵活门控机制

计算机工程与科学 (计算机工程与科学, None), 2023

Authors: 林佳烁, 李伟超, 成剑, 詹双平, 冯景斌, 王涛, 黄倩怡, 唐博, 汪漪.

Time-sensitive networking flow scheduling algorithms often generate a large number of gate control events, exceeding the capabilities of network devices and making it difficult to deploy sche- duling algorithms in practical networks. To address this issue, a novel flexible gate control-based flow scheduling strategy is proposed, which relaxes the strict isolation constraints between real-time traffic flows and best-effort traffic flows, allowing some nodes to not enable the gate control mechanism for real-time flows. This strategy can flexibly select to enable the gate control for real-time business at various network device ports, reducing the gate control events required for scheduling real-time traffic flows by up to 91.6%. It even allows the existence of network devices on the transmission path of real-time traffic that do not support gate control scheduling, achieving a mixed deployment with ordinary networks.

2022

𝜏-LevelIndex: Towards Efficient Query Processing in Continuous Preference Space.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2022

Authors: Jiahao Zhang, Bo Tang, Man Lung Yiu, Xiao Yan, Keming Li*.

Top-k related queries in continuous preference space (e.g., k-shortlist preference query kSPR, uncertain top-k query UTK, output-size specified utility-based query ORU) have numerous applications but are expensive to process. Existing algorithms process each query via specialized optimizations, which are difficult to generalize. In this work, we propose a novel and general index structure T-LevelIndex, which can be used to process various queries in continuous preference space efficiently. We devise efficient approaches to build the T-LevelIndex by fully exploiting the properties of continuous preference space. We conduct extensive experimental studies on both real- and synthetic- benchmarks. The results show that (i) our proposed index building approaches have low costs in terms of both space and time, and (ii) T-LevelIndex significantly outperforms specialized solutions for processing a spectrum of queries in continuous preference space, and the speedup can be two to three orders of magnitude.

Spatial Data Quality in the IoT Era: Management and Exploitation

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2022

Authors: Huan Li, Bo Tang , Hua Lu, Christian S. Jensen, Muhammad Aamir Cheema.

Within the rapidly expanding Internet of Things (IoT), growing amounts of spatially referenced data are being generated. Due to the dynamic, decentralized, and heterogeneous nature of the IoT, spatial IoT data (SID) quality has attracted considerable attention in academia and industry. How to invent and use technologies for managing spatial data quality and exploiting low-quality spatial data are key challenges in the IoT. In this tutorial, we highlight the SID consumption requirements in applications and offer an overview of spatial data quality in the IoT setting. In addition, we review pertinent technologies for quality management and low-quality data exploitation, and we identify trends and future directions for quality-aware SID management and utilization. The tutorial aims to not only help researchers and practitioners to better comprehend SID quality challenges and solutions, but also offer insights that may enable innovative research and applications.

GHive: A Demonstration of GPU-Accelerated Query Processing in Apache Hive.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2022

Authors: Haotian Liu, Bo Tang, Jiashu Zhang, Yangshen Deng, Xinying Zheng, Qiaomu Shen, Xiao Yan, Dan Zeng, Zunyao Mao, Chaozu Zhang, Zhengxin You, Zhihao Wang, Runzhe Jiang, Fang Wang, Man Lung Yiu, Huan Li, Mingji Han, Qian Li, Zhenghai Luo.

As a distributed, fault-tolerant data warehouse system for large-scale data analytics, Apache Hive has been used for various applications in many organizations (e.g., Facebook, Amazon, and Huawei). Exploiting the large degrees of parallelism of GPU to improve the performance of online analytical processing (OLAP) in database system is a common practice in the industry. Meanwhile, it is a common practice to exploit the large degrees of parallelism of GPU to improve the performance of online analytical processing (OLAP) in database systems. This demo presents GHive, which enables Apache Hive to accelerate OLAP queries by jointly utilizing CPU and GPU in intelligent and efficient ways. The takeaways for SIGMOD attendees include: (1) the superior performance of GHive compared with vanilla Hive that only uses CPU; (2) intuitive visualizations of execution statistics for Hive and GHive to understand where the acceleration of GHive comes from; (3) detailed profiling of the time taken by each operator on CPU and GPU to show the advantages of GPU execution

Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2022

Authors: Huan Li, Lanjing Yi, Bo Tang, Hua Lu, Christian Jensen.

Underlying many types of data analytics, a spatiotemporal quantile monitoring (SQM) query continuously returns the quantiles of a dataset observed in a spatiotemporal range. In this paper, we study SQM in an Internet of Things (IoT) based edge computing environment, where concurrent SQM queries share the same infrastructure asynchronously. To minimize query latency while providing result accuracy guarantees, we design a processing framework that virtualizes edge-resident data sketches for quantile computing. In the framework, a coordinator edge node manages edge sketches and synchronizes edge sketch processing and query executions. The co-ordinator also controls the processed data fractions of edge sketches, which helps to achieve the optimal latency with error-bounded results for each single query. To support concurrent queries, we employ a grid to decompose queries into subqueries and process them efficiently using shared edge sketches. We also devise a relaxation algorithm to converge to optimal latencies for those subqueries whose result errors are still bounded. We evaluate our proposals using two high-speed streaming datasets in a simulated IoT setting with edge nodes. The results show that our proposals achieve efficient, scalable, and error-bounded SQM.

Manu: A Cloud Native Vector Database Management System.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2022

Authors: R. Guo, Long Xiang, X. Luan, Xiao Yan, X. Yi, J. Luo, Q. Cheng, W. Xu, Jiarui Luo, F. Liu, Z. Cao, Y. Qiao, T. Wang, Bo Tang, C. Xie.

With the development of learning-based embedding models, embedding vectors are widely used for analyzing and searching unstructured data. As vector collections exceed billion-scale, fully managed and horizontally scalable vector databases are necessary. In the past three years, through interaction with our 1200+ industry users, we have sketched a vision for the features that next-generation vector databases should have, which include long-term evolvability, tunable consistency, good elasticity, and high performance. We present Manu, a cloud native vector database that implements these features. It is difficult to integrate all these features if we follow traditional DBMS design rules. As most vector data applications do not require complex data models and strong data consistency, our design philosophy is to relax the data model and consistency constraints in exchange for the aforementioned features. Specifically, Manu firstly exposes the write-ahead log (WAL) and binlog as backbone services. Secondly, write components are designed as log publishers while all read-only analytic and search components are designed as independent subscribers to the log services. Finally, we utilize multi-version concurrency control (MVCC) and a delta consistency model to simplify the communication and cooperation among the system components. These designs achieve a low coupling among the system components, which is essential for elasticity and evolution. We also extensively optimize Manu for performance and usability with hardware-aware implementations and support for complex search semantics. Manu has been used for many applications, including, but not limited to, recommendation, multimedia, language, medicine and security. We evaluated Manu in three typical application scenarios to demonstrate its efficiency, elasticity, and scalability.

Constructing Compact Time Series Index for Efficient Window Query Processing.

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022

Authors: Jing Zhao, Peng Wang, Bo Tang, Lu Liu, Chen Wang, Wei Wang, Jianmin Wang.

Analyzing and mining of time series have been widely studied in both academia and industry in recent years. Given a set of long time series, data analysts can utilize the window-based similarity search to explore subsequences in arbitrary time windows. Existing techniques are not efficient for window-based query processing. In particular, the whole matching index approach needs to build an individual index for each window, which incurs huge space cost. The existing window-based approach can only cluster neighboring windows, which leads to loose bounds of each group, and thus degrades the query processing efficiency. In this paper, we propose a compact time series index (WinIdx) for efficient window query processing. Specifically, i) we propose a novel distance measurement to capture the similarity between windows, ii) WinIdx provides a compact index structure for windows within a cluster by exploiting the similarity among subsequences relationships, and iii) several optimizations (e.g., sortable summarization, summarization envelop) are equipped in WinIdx to improve the efficiency of index construction, query processing and index footprints. We conduct extensive experiments on both real and synthetic time series to demonstrate the superiority of WinIdx against state-of-the-art approaches.

imDedup: a Lossless Deduplication Scheme to Eliminate Fine-grained Redundancy among Images.

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022

Authors: Cai Deng, Qi Chen, Xiangyu Zou, Eric Xu, Bo Tang, Wen Xia.

Images occupy a large amount of storage in data centers. To cope with the explosive growth of the image storage requirement, image compression techniques are devised to shrink the size of every single image at first. Furthermore, image deduplication methods are proposed to reduce the storage cost as they could be used to eliminate redundancy among images. However, state-of-the-art image deduplication methods either can only eliminate file-level coarse-grained redundancy or cannot guarantee lossless deduplication. In this work, we propose a new lossless image deduplication framework to eliminate fine-grained redundancy among images. It first decodes images to expose similarity, then eliminates fine-grained redundancy on the decoded data by delta compres-sion, and finally re-compresses the remaining data by image compression encoding. Based on this framework, we propose a novel lossless similarity-based deduplication (SBD) scheme for decoded image data (called imDedup). Specifically, imDedup uses a novel and fast sampling method (called Feature Map) to detect similar images in a two-dimensional way, which greatly reduces computation overhead. Meanwhile, it uses a novel delta encoder (called Idelta) which incorporates image compression encoding characteristics into deduplication to guarantee the remaining deduplicated image data to be friendly re-compressed via image encoding, which significantly improves the compression ratio. We implement a prototype of imDedup for JPEG images, and demonstrate its superiority on four datasets: Compared with exact image deduplication, imDedup achieves a 19%-38% higher compression ratio by efficiently eliminating fine-grained redundancy. Compared with the similarity detector and delta encoder of state-of-the-art SBD schemes running on the decoded image data, imDedup achieves a 1.8×-3.4× higher throughput and a 1.3 ×-1. 6 × higher compression ratio, respectively.

CheetahKG: A Demonstration for Core-based Top-k Frequent Pattern Discovery on Knowledge Graphs.

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022

Authors: Bo Tang, Zeng Jian, Qiandong Tang, Chuan Yang, Qiaomu Shen, Leong Hou U, Xiao Yan, Dan Zeng

Knowledge graphs capture the complex relationships among various entities, which can be found in various real world applications, e.g., Amazon product graph, Freebase, and COVID-19. To facilitate the knowledge graph analytical tasks, a system that supports interactive and efficient query processing is always in demand. In this demonstration, we develop a prototype system, CheetahKG, that embeds with our state-of-the-art query processing engine for the top-k frequent pattern discovery. Such discovered patterns can be used for two purposes, (i) identifying related patterns and (ii) guiding knowledge exploration. In the demonstration sessions, the attendees will be invited to test the efficiency and effectiveness of the query engine and use the discovered patterns to analyze knowledge graphs on CheetahKG.

Fast Error-Bounded Distance Distribution Computation(Extended Abstract).

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022

Authors: Jiahao Zhang, Man Lung Yiu, Bo Tang, Qing Li.

Distance distributions have been widely applied in many real-world applications, e.g., graph analysis. Unfortunately, due to the large data volume and expensive distance computation, the exact distance distribution computation is excessively slow. Motivated by this, we present a novel approximate solution in this paper that (i) achieves error-bound guarantees and (ii) is generic to various distance measures. Our proposed method outperforms the baseline in terms of accuracy and efficiency when evaluating on three widely used distance measures with real-world datasets.

Face2Exp: Combating Data Biases for Facial Expression Recognition .

IEEE Conference on Computer Vision and Pattern Recognition (CVPR, CCF-A), 2022

Authors: Dan Zeng, Z. Lin, Xiao Yan, Fei Wang, Bo Tang.

Facial expression recognition (FER) is challenging due to the class imbalance caused by data collection. Existing studies tackle the data bias problem using only labeled facial expression dataset. Orthogonal to existing FER methods, we propose to utilize large unlabeled face recognition (FR) datasets to enhance FER. However, this raises another data bias problem—the distribution mismatch between FR and FER data. To combat the mismatch, we propose the Meta-Face2Exp framework, which consists of a base network and an adaptation network. The base network learns prior expression knowledge on class-balanced FER data while the adaptation network is trained to fit the pseudo labels of FR data generated by the base model. To combat the mismatch between FR and FER data, Meta-Face2Exp uses a circuit feedback mechanism, which improves the base network with the feedback from the adaptation network. Experiments show that our MetaFace2Exp achieves comparable accuracy to state-of-the-art FER methods with 10% of the labeled FER data utilized by the baselines. We also demonstrate that the circuit feedback mechanism successfully eliminates data bias11Code is available at link: https://github.com/danzeng1990/Face2Exp..

Spatial Data Quality in the Internet of Things: Management, Exploitation, and Prospects.

ACM Computing Surveys, 55, 3, Article 57 (February 2022). (ACM Computing Surveys, CCF-A), 2022

Authors: Huan Li, Hua Lu, Christian S. Jensen, Bo Tang, Muhammad Aamir Cheema.

With the continued deployment of the Internet of Things (IoT), increasing volumes of devices are being deployed that emit massive spatially referenced data. Due in part to the dynamic, decentralized, and heterogeneous architecture of the IoT, the varying and often low quality of spatial IoT data (SID) presents challenges to applications built on top of this data. This survey aims to provide unique insight to practitioners who intend to develop IoT-enabled applications and to researchers who wish to conduct research that relates to data quality in the IoT setting. The survey offers an inventory analysis of major data quality dimensions in SID and covers significant data characteristics and associated quality considerations. The survey summarizes data quality related technologies from both task and technique perspectives. Organizing the technologies from the task perspective, it covers recent progress in SID quality management, encompassing location refinement, uncertainty elimination, outlier removal, fault correction, data integration, and data reduction; and it covers low-quality SID exploitation, encompassing querying, analysis, and decision-making techniques. Finally, the survey covers emerging trends and open issues concerning the quality of SID.

GHive: Accelerating Analytical Query Processing in Apache Hive via CPU-GPU Heterogeneous Computing .

Proceedings of Symposium on Cloud Computing (SoCC, CCF-B), 2022

Authors: Haotian Liu, Bo Tang, Jiashu Zhang, Yangshen Deng, Xiao Yan, Xinying Zheng, Qiaomu Shen, Dan Zeng, Zunyao Mao, Chaozu Zhang, Zhengxin You, Zhihao Wang, Runzhe Jiang, Fang Wang, Man Lung Yiu, Huan Li, Mingji Han, Qian Li, Zhenghai Luo.

As a popular distributed data warehouse system, Apache Hive has been widely used for big data analytics in many organizations. Meanwhile, exploiting the massive parallelism of GPU to accelerate online analytical processing (OLAP) has been extensively explored in the database community. In this paper, we present GHive, which enhances CPU-based Hive via CPU-GPU heterogeneous computing. GHive is designed for the business intelligence applications and provides the same API as Hive for compatibility. To run SQL queries jointly on both CPU and GPU, GHive comes with three key techniques: (i) a novel data model gTable, which is column-based and enables efficient data movement between CPU memory and GPU memory; (ii) a GPU-based operator library Panda, which provides a complete set of SQL operators with extensively optimized GPU implementations; (iii) a hardware-aware MapReduce job placement scheme, which puts jobs judiciously on either GPU or CPU via a cost-based approach. In the experiments, we observe that GHive outperforms Hive in both query processing speed and operating expense on the Star Schema Benchmark (SSB).

Measuring Friendship Closeness: A Perspective of Social Identity Theory.

Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM, CCF-B), 2022

Authors: Shiqi Zhang, Jiachen Sun, Wenqing Lin, Xiaokui Xiao, Bo Tang.

Measuring the closeness of friendships is an important problem that finds numerous applications in practice. For example, online gaming platforms often host friendship-enhancing events in which a user (called the source) only invites his/her friend (called the target) to play together. In this scenario, the measure of friendship closeness is the backbone for understanding source invitation and target adoption behaviors, and underpins the recommendation of promising targets for the sources. However, most existing measures for friendship closeness only consider the information between the source and target but ignore the information of groups where they are located, which renders inferior results. To address this issue, we present new measures for friendship closeness based on the social identity theory (SIT), which describes the inclination that a target endorses behaviors of users inside the same group. The core of SIT is the process that a target assesses groups of users as them or us. Unfortunately, this process is difficult to be captured due to perceptual factors. To this end, we seamlessly reify the factors of SIT into quantitative measures, which consider local and global information of a target's group. We conduct extensive experiments to evaluate the effectiveness of our proposal against 8 state-of-the-art methods on 3 online gaming datasets. In particular, we demonstrate that our solution can outperform the best competitor on the behavior prediction (resp. online target recommendation) by up to 23.2% (resp. 34.2%) in the corresponding evaluation metric.

Automatic Meta-Path Discovery for Effective Graph-Based Recommendation.

Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM, CCF-B), 2022

Authors: Wentao Ning, Reynold Cheng, Jiajun Shen, Nur Al Hasan Haldar, Ben Kao, Xiao Yan, Nan Huo, Tian Li, Wai Kit Lam, Bo Tang.

Heterogeneous Information Networks (HINs) are labeled graphs that depict relationships among different types of entities (e.g., users, movies and directors). For HINs, meta-path-based recommenders (MPRs) utilize meta-paths (i.e., abstract paths consisting of node and link types) to predict user preference, and have attracted a lot of attention due to their explainability and performance. We observe that the performance of MPRs is highly sensitive to the meta-paths they use, but existing works manually select the meta-paths from many possible ones. Thus, to discover effective meta-paths automatically, we propose the Reinforcement learning-based Meta-path Selection (RMS) framework. Specifically, we define a vector encoding for meta-paths and design a policy network to extend meta-paths. The policy network is trained based on the results of downstream recommendation tasks and an early stopping approximation strategy is proposed to speed up training. RMS is a general model, and it can work with all existing MPRs. We also propose a new MPR called RMS-HRec, which uses an attention mechanism to aggregate information from the meta-paths. We conduct extensive experiments on real datasets. Compared with the manually selected meta-paths, the meta-paths identified by RMS consistently improve recommendation quality. Moreover, RMS-HRec outperforms state-of-the-art recommender systems by an average of 7% in hit ratio. The codes and datasets are available on https://github.com/Stevenn9981/RMS-HRec.

Rethinking the Use of Network Cycle in Time-Sensitive Networking (TSN) Flow Scheduling.

IFIP International Workshop on QoS (IWQoS, CCF-B), 2022

Authors: Jiashuo Lin, Weichao Li, Xingbo Feng, Shuangping Zhan, Jingbin Feng, Jian Cheng, Tao Wang, Qing Li, Yi Wang, Fuliang Li, Bo Tang.

Time-Sensitive Networking (TSN) is an emerging network architecture that provides bounded latency and reliable network services for time-sensitive applications. Since time-triggered flows in TSN are typically periodic, a concept of network cycle is widely used in both standards and academic researches. However, although network cycle has gained popularity, its rationale has not yet been analyzed systematically.In this paper, we mathematically evaluate the performance of several flow scheduling algorithms in terms of flow schedulability with and without employing network cycle. We observe that only when the network cycle is set to a proper value can the performance of flow scheduling be significantly improved. To better evaluate the scheduling effect, a novel assessment metric and a goal-based optimization algorithm are introduced. Our experiment results show that the network cycle-based algorithm can achieve a considerable improvement (40% - 170% improvement in the number of scheduled flows) compared to the ones with network cycle disabled.

2021

On m-Impact Regions and Standing Top-k Influence Problems.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2021

Authors: Bo Tang, Kyriakos Mouratidis, Mingji Han.

In this paper, we study the m-impact region problem (mIR). In a context where users look for available products with top-k queries, mIR identifies the part of the product space that attracts the most user attention. Specifically, mIR determines the kind of attribute values that lead a (new or existing) product to the top-k result for at least a fraction of the user population. mIR has several applications, ranging from effective marketing to product improvement. Importantly, it also leads to (exact and efficient) solutions for standing top-k impact problems, which were previously solved heuristically only, or whose current solutions face serious scalability limitations. We experiment, among others, on data mined from actual user reviews for real products, and demonstrate the practicality and efficiency of our algorithms, both for mIR and for standing top-k impact problems.

Marrying Top-k with Skyline Queries: Relaxing the Preference Input while Producing Output of Controllable Size .

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2021

Authors: Kyriakos Mouratidis, Keming Li, Bo Tang.

The two most common paradigms to identify records of preference in a multi-objective setting rely either on dominance (e.g., the skyline operator) or on a utility function defined over the records' attributes (typically, using a top-k query). Despite their proliferation, each of them has its own palpable drawbacks. Motivated by these drawbacks, we identify three hard requirements for practical decision support, namely, personalization, controllable output size, and flexibility in preference specification. With these requirements as a guide, we combine elements from both paradigms and propose two new operators, ORD and ORU. We perform a qualitative study to demonstrate how they work, and evaluate their performance against adaptations of previous work that mimic their output.

Vertex-Centric Visual Programming for Graph Neural Networks.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2021

Authors: Yidi Wu, Yuntao Gui, Tatiana Jin, James Cheng, Xiao Yan, Peiqi Yin, Yufei Cai, Bo Tang, Fan Yu.

Graph neural networks (GNNs) have achieved remarkable performance in many graph analytics tasks such as node classification, link prediction and graph clustering. Existing GNN systems (e.g., PyG and DGL) adopt a tensor-centric programming model and train GNNs with manually written operators. Such design results in poor usability due to the large semantic gap between the API and the GNN models, and suffers from inferior efficiency because of high memory consumption and massive data movement. We demonstrateSeastar, a novel GNN training framework that adopts avertex-centric programming paradigm and supportsautomatic kernel generation, to simplify model development and improve training efficiency. We will (i) show how to express GNN models succinctly using a visual "drag-and-drop'' interface or Seastar's vertex-centric python API; (ii) demonstrate the performance advantage of Seastar over existing GNN systems in convergence speed, training throughput and memory consumption; and (iii) illustrate how Seastar's optimizations (e.g., operator fusion and constant folding) improve training efficiency by profiling the run-time performance.

Fast Error-Bounded Distance Distribution Computation .

IEEE Transactions on Knowledge and Data Engineering (TKDE, CCF-A), 2021

Authors: Jiahao Zhang, Man Lung Yiu, Bo Tang, Qing Li.

In this work we study the distance distribution computation problem. It has been widely used in many real-world applications, e.g., human genome clustering, cosmological model analysis, and parameter tuning. The straightforward solution for the exact distance distribution computation problem is unacceptably slow due to (i) massive data size, and (ii) expensive distance computation. In this paper, we propose a novel method to compute approximate distance distributions with error bound guarantees. Furthermore, our method is generic to different distance measures. We conduct extensive experimental studies on three widely used distance measures with real-world datasets. The experimental results demonstrate that our proposed method outperforms the sampling-based solution (without error guarantees) by up to three orders of magnitude.

GAIPS: Accelerating Maximum Inner Product Search with GPU.

ACM International Conference on Research and Development in Information Retrieval (SIGIR, CCF-A), 2021

Authors: Long Xiang, Xiao Yan, Lan Lu, Bo Tang.

In this paper, we propose the GAIPS framework for efficient maximum inner product search (MIPS) on GPU. We observe that a query can usually find a good lower bound of its maximum inner product in some large norm items that take up only a small portion of the dataset and utilize this fact to facilitate pruning. In addition, we design norm-based, residue-based and hash-based pruning techniques to avoid computation for items that are unlikely to be the MIPS results. Experiment results show that compared with FAISS, the state-of-the-art GPU-based similarity search framework, GAIPS has significantly shorter query processing time at the same recall.

Fast Core-based Top-k Frequent Pattern Discovery in Knowledge Graphs.

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021

Authors: Jian Zeng, Leong Hou U, Xiao Yan, Mingji Han, Bo Tang.

Knowledge graph is a way of structuring information in graph form, by representing entities as nodes and relationships between entities as edges. A knowledge graph often consists of large amount of facts in real-world which can be used in supporting many analytical tasks, e.g., exceptional facts discovery and fact check of claims. In this work, we study a core-based top-k frequent pattern discovery problem which is frequently used as a subroutine in analyzing knowledge graphs. The main challenge of the problem is search space of the candidate patterns is exponential to the combinations of the nodes and edges in the knowledge graph.To reduce the search space, we devise a novel computation framework FastPat with a suite of optimizations. First, we devise a meta-index, which can be used to avoid generating invalid candidate patterns. Second, we propose an upper bound of the frequency score (i.e., MNI) of the candidate pattern that prunes unqualified candidates earlier and prioritize the enumeration order of the patterns. Lastly, we design a join-based approach to compute the MNI of candidate pattern efficiently. We conduct extensive experimental studies in real-world datasets to verify the superiority of our proposed method over the baselines. We also demonstrate the utility of the discovered frequent patterns by a case study in COVID-19 knowledge graph.

Towards Efficient MaxBRNN Computation for Streaming Updates .

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021

Authors: Wentao Ning, Xiao Yan, Bo Tang.

In this paper, we propose the streamingMaxBRNNquery, which finds the optimal region to deploy a new service point when both the service points and client points are under continuous updates. The streaming MaxBRNNquery has many applications such as taxi scheduling, shared bike placements, etc. Existing MaxBRNNsolutions are insufficient for streaming updates as they need to re-run from scratch even for a small amount of updates, resulting in long query processing time. To tackle this problem, we devise an efficient slot partitioning-based algorithm (SlotP), which divides the space into equal-sized slots and processes each slot independently. The superiorities of our proposal for streaming MaxBRNNquery are: (i) an update affects only a smaller number of slots and works done on the unaffected slots can be reused directly; (ii) the influence value upper bound of each slot can be derived efficiently and accurately, which facilitate pruning many slots from expensive computation. We conducted extensive experiments to validate the performance of the SlotPalgorithm. The results show that SlotPis 2-3 orders of magnitude faster than state-of-the-art baselines.

GRAB: Finding Time Series Natural Structures via A Novel Graph-Based Scheme .

IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021

Authors: Yi Lu, Peng Wang, Bo Tang, Shen Liang, Chen Wang, Wei Wang, Jianmin Wang.

In recent years, the widespread use of sensors has substantially stimulated researchers’ interest in time series data mining. Real-world time series often include natural structures. For example, a time series captured from a patient rehabilitation app can be divided into a series of movements, e.g., sitting, standing, and walking. Finding time series natural structures (i.e., latent semantic states) is one of the core subroutines in time series mining applications. However, this task is not trivial as it has two challenges: (1) how to determine the correct change points between consecutive segments, and (2) how to cluster segments into different states.In this paper, we propose a novel graph-based approach, GRAB, to discover time series natural structures. In particular, GRAB first partitions the time series into a set of non-overlapping fragments via the similarity between subsequences. Then, it constructs a fragment-based graph and employs a graph partition method to cluster the fragments into states. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our GRAB method. Specifically, GRAB finds high-quality latent states, and it outperforms state-of-the-art solutions by orders of magnitude.

On Discovering Motifs and Frequent Patterns in Spatial Trajectories with Discrete Frechet Distance.

GeoInformatica (GeoInformatica , CCF-B), 2021

Authors: Bo Tang, Man Lung Yiu, Kyriakos Mouratidis, Jiahao Zhang, Kai Wang.

The discrete Fréchet distance (DFD) captures perceptual and geographical similarity between two trajectories. It has been successfully adopted in a multitude of applications, such as signature and handwriting recognition, computer graphics, as well as geographic applications. Spatial applications, e.g., sports analysis, traffic analysis, etc. require discovering similar subtrajectories within a single trajectory or across multiple trajectories. In this paper, we adopt DFD as the similarity measure, and study two representative trajectory analysis problems, namely, motif discovery and frequent pattern discovery. Due to the time complexity of DFD, these tasks are computationally challenging. We address that challenge with a suite of novel lower bound functions and a grouping-based solution. Our techniques apply directly when the analysis tasks are defined within the same or across multiple trajectories. An extensive empirical study on real trajectory datasets reveals that our approaches are 3 orders of magnitude faster than baseline solutions.

RCELF: A Residual-based Approach for Influence Maximization Problem.

Information Systems (Inform Systems, None), 2021

Authors: Shiqi Zhang, Xinxun Zeng, Bo Tang.

Influence Maximization Problem (IMP) is selecting a seed set of nodes in the social network to spread the influence as widely as possible. It has many applications in multiple domains, e.g., viral marketing is frequently used for new products or activities advertisement. While it is a classic and well-studied problem in computer science, unfortunately, all those proposed techniques are compromising among time efficiency, memory consumption, and result quality. In this paper, we conduct comprehensive experimental studies on the state-of-the-art IMP approximate approaches to reveal the underlying trade-off strategies. Interestingly, we find that even the state-of-the-art approaches are impractical when the propagation probability of the network have been taken into consideration. With the findings of existing approaches, we propose a novel residual-based approach (i.e., RCELF) for IMP, which (i) overcomes the deficiencies of existing approximate approaches, and (ii) provides theoretical guaranteed results with high efficiency in both time- and space-perspectives. We demonstrate the superiority of our proposal by extensive experimental evaluation on real datasets.

Drug repurposing for the treatment of COVID-19: a knowledge graph approach.

Advanced Therapeutics (Advanced Therapeutics, None), 2021

Authors: Vincent K. C. Yan, Xiaodong Li, Xuxiao Ye, Min Ou, Ruibang Luo, Qingpeng Zhang, Bo Tang, Benjamin J. Cowling, Ivan Hung, Chung Wah Siu, Ian C. K. Wong, Reynold C. K. Cheng, Esther W. Chan.

Identifying effective drug treatments for COVID-19 is essential to reduce morbidity and mortality. Although a number of existing drugs have been proposed as potential COVID-19 treatments, effective data platforms and algorithms to prioritize drug candidates for evaluation and application of knowledge graph for drug repurposing have not been adequately explored. A COVID-19 knowledge graph by integrating 14 public bioinformatic databases containing information on drugs, genes, proteins, viruses, diseases, symptoms and their linkages is developed. An algorithm is developed to extract hidden linkages connecting drugs and COVID-19 from the knowledge graph, to generate and rank proposed drug candidates for repurposing as treatments for COVID-19 by integrating three scores for each drug: motif scores, knowledge graph PageRank scores, and knowledge graph embedding scores. The knowledge graph contains over 48 000 nodes and 13 37 000 edges, including 13 563 molecules in the DrugBank database. From the 5624 molecules identified by the motif-discovery algorithms, ranking results show that 112 drug molecules had the top 2% scores, of which 50 existing drugs with other indications approved by health administrations reported. The proposed drug candidates serve to generate hypotheses for future evaluation in clinical trials and observational studies.

Effective and Efficient Summarization for Non-hierarchical Data

Ivannikov Ispras Open Conference (ISPRAS) (ISPRAS, None), 2021

Authors: Xiang Ji, Xiao Yan, Kaijun Ren, Xiang Wang, Bo Tang

Data cubes are ubiquitous in domains including meteorology, sales and demography, and data summarization is an important service that can be used to compress data cubes and extract insight. Existing data summarization methods require presupposed hierarchies for the data cube dimensions, which do not exist for many types of data (e.g., rainfall and temperature). To tackle this problem, we first define the non-hierarchical data summarization (NHDS) problem, which covers data cube using rectangle regions with an error bound and minimizes the summary size. We then show that the NHDS problem is NP-hard and design the Mark and Select (MS) algorithm to find an approximate solution. MS first identifies the qualified rectangles and then selects among the rectangles to cover the data cube. To improve efficiency, we show that it suffices to find only some of the qualified rectangles, devise a procedure to avoid checking rectangles that do not contribute to the result, save unnecessary computation during rectangle selection using sub-modularity. We conducted experiments on both real and synthetic datasets. The results show that MS significantly outperforms a state-of-the-art baseline in summary size, error and running time

2020

CheetahVIS: A Visual Analytical System for Large Urban Bus Data.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2020

Authors: Wentao Ning,Qiandong Tang, Yi Zhao, Chuan Yang, Xiaofeng Wang, Teng Wang, Haotian Liu, Chaozu Zhang, Zhiyuan Zhou, Qiaomu Shen, Bo Tang.

Recently, the spatial-temporal data of urban moving objects, e.g., cars and buses, are collected and widely used in urban trajectory exploratory analysis. Urban bus service is one of the most common public transportation services. Urban bus data analysis plays an important role in smart city applications. For example, data analysts in bus companies use the urban bus data to optimize their bus scheduling plan. Map services providers, e.g., Google map, Ten-cent map, take urban bus data into account to improve their service quality (e.g., broadcast road update instantly). Unlike urban moving cars or pedestrians, urban buses travel on known bus routes. The operating buses form the "bus flows" in a city. Efficient analyzing urban bus flows has many challenges, e.g., how to analyze the dynamics of given bus routes? How to help users to identify traffic flow of interests easily? In this work, we present CheetahVIS, a visual analytical system for efficient massive urban bus data analysis. CheetahVIS builds upon Spark and provides a visual analytical platform for the stakeholders (e.g., city planner, data analysts in bus company) to conduct effective and efficient analytical tasks. In the demonstration, demo visitors will be invited to experience our proposed CheetahVIS system with different urban bus data analytical functions, e.g., bus route analysis, public bus flow overview, multiple region analysis, in a real-world dataset. We also will present a case study, which compares different regions in a city, to demonstrate the effectiveness of CheetahVIS.

Draft and Edit: Automatic Storytelling Through Multi-Pass Hierarchical Conditional Variational Autoencoder .

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2020

Authors: Meng-Hsuan Yu, Juntao Li, Danyang Liu, Bo Tang, Haisong Zhang, Dongyan Zhao, Rui Yan.

Automatic Storytelling has consistently been a challenging area in the field of natural language processing. Despite considerable achievements have been made, the gap between automatically generated stories and human-written stories is still significant. Moreover, the limitations of existing automatic storytelling methods are obvious, e.g., the consistency of content, wording diversity. In this paper, we proposed a multi-pass hierarchical conditional variational autoencoder model to overcome the challenges and limitations in existing automatic storytelling models. While the conditional variational autoencoder (CVAE) model has been employed to generate diversified content, the hierarchical structure and multi-pass editing scheme allow the story to create more consistent content. We conduct extensive experiments on the ROCStories Dataset. The results verified the validity and effectiveness of our proposed model and yields substantial improvement over the existing state-of-the-art approaches.

Towards Self-Tuning Parameter Servers

IEEE International Conference on Big Data (IEEE BigData, CCF-C), 2020

Authors: Chris Liu, Pengfei Zhang, Bo Tang, Hang Shen, Ziliang Lai, Eric Lo, Korris Chung.

Recent years, many applications have been driven advances by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training. Good efficiency, i.e., fast completion time of running a specific ML training job, therefore, is a key feature of a successful ML system. While the completion time of a long-running ML job is determined by the time required to reach model convergence, that is also largely influenced by the values of various system settings. In this paper, we contribute techniques towards building self-tuning parameter servers. Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; and by self-tuning we mean while a long-running ML job is iteratively training the expert-suggested model, the system is also iteratively learning which system setting is more efficient for that job and applies it online. Our techniques are general enough to various PS-style ML systems. Experiments on TensorFlow show that our techniques can reduce the completion times of a variety of long-running TensorFlow jobs from 1.4× to 18×.

CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data .

SIGMOD2020 Programming Contest Finalist Paper (DI2KG, None), 2020

Authors: Nan Deng, Wendi Luan, Haotian Liu, Bo Tang

The SIGMOD Programming Contest 2020 raises a real-world entity resolution problem, which requires to identify product specifica- tions from multiple e-commerce websites that represent the same real-world cameras. Entity resolution has been extensively studied and the general solution framework consists of two phases: blocking and matching. Most existing works focus on the matching phase, which trains (complex) models on large volumes of data and uses the models to decide whether a pair of descriptions refers to the same real-world object. However, training a high-quality model is difficult for the SIGMOD contest because there is only a limited amount of labeled data and the product specifications can be dirty and incomplete. In this paper, we propose CheetahER, an accurate and efficient entity resolution system. Different from existing works, we focus on improving the effectiveness of the blocking phase, which is overlooked in both academia researches and industry systems, and propose a two-phase blocking framework to group the product specifications according to brand and model. The pre-processing and data cleaning procedures are also carefully designed to improve data quality. CheetahER ranks the 1st in accuracy among 53 teams and completes the task within 20 seconds. Even though some de- signs of CheetahER are specialized for camera datasets, its novel two-phase blocking framework and operators (i.e., merging and splitting) may generalize to other entity resolution tasks.

2019

Creating Top Ranking Options in the Continuous Option and Preference Space.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2019

Authors: Bo Tang, Kyriakos Mouratidis, Man Lung Yiu, Zhenyu Chen.

Top-k queries are extensively used to retrieve the k most relevant options (e.g., products, services, accommodation alternatives, etc) based on a weighted scoring function that captures user preferences. In this paper, we take the viewpoint of a business owner who plans to introduce a new option to the market, with a certain type of clientele in mind. Given a target region in the consumer spectrum, we determine what attribute values the new option should have, so that it ranks among the top-k for any user in that region. Our methodology can also be used to improve an existing option, at the minimum modification cost, so that it ranks consistently high for an intended type of customers. This is the first work on competitive option placement where no distinct user(s) are targeted, but a general clientele type, i.e., a continuum of possible preferences. Here also lies our main challenge (and contribution), i.e., dealing with the interplay between two continuous spaces: the targeted region in the preference spectrum, and the option domain (where the new option will be placed). At the core of our methodology lies a novel and powerful interlinking between the two spaces. Our algorithms offer exact answers in practical response times, even for the largest of the standard benchmark datasets.

Accelerating Exact Inner Product Retrieval by CPU-GPU System.

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, CCF-A), 2019

Authors: Long Xiang, Bo Tang, Chuan Yang

Recommender systems are widely used in many applications, e.g., social network, e-commerce. Inner product retrieval IPR is the core subroutine in Matrix Factorization (MF) based recommender systems. It consists of two phases: i) inner product computation and ii) top-k items retrieval. The performance bottleneck of existing solutions is inner product computation phase. Exploiting Graphics Processing Units (GPUs) to accelerate the computation intensive workloads is the gold standard in data mining and machine learning communities. However, it is not trivial to apply CPU-GPU systems to boost the performance of IPR solutions due to the nature complex of the IPR problem. In this work, we analyze the time cost of each phase in IPR solutions at first. Second, we exploit the characteristics of CPU-GPU systems to improve performance. Specifically, the computation tasks of IPR solution are heterogeneously processed in CPU-GPU systems. Third, we demonstrate the efficiency of our proposal on four standard real datasets.

Vaite: a Visualization-Assisted Interactive Big Urban Trajectory Data Exploration System.

International Conference on Data Engineering (ICDE, CCF-A), 2019

Authors: Chuang Yang, Yilan Zhang, Bo Tang, Min Zhu.

Big urban trajectory exploration extracts insights from trajectories. It has many smart-city applications, e.g., traffic jam detection, taxi movement pattern analysis. The challenges of big urban trajectory data exploration are: (i) the data analysts probably do not have experience or knowledge on issuing their analysis tasks by SQL-like queries or analysis operations accurately; and (ii) big urban trajectory data is naturally complex, e.g., unpredictability, interrelation, etc. In this work, we architect and implement a visualization-assisted big urban trajectory data exploration system (Vaiet) to address these chanllenges. Vaiet includes three layers, from data collection to results visualization. We devise novel visualization views in Vaiet to support interactive big urban trajectory exploratory analysis. We demonstrate the effectiveness of Vaiet by the real world applications.

Insufficient Data Can Also Rock!Learning to Converse Using Smaller Data with Augmentation.

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2019

Authors: Juntao Li, Lisong Qiu, Bo Tang, Dongmin Chen, Dongyan Zhao, Rui Yan.

Recent successes of open-domain dialogue generation mainly rely on the advances of deep neural networks. The effectiveness of deep neural network models depends on the amount of training data. As it is laboursome and expensive to acquire a huge amount of data in most scenarios, how to effectively utilize existing data is the crux of this issue. In this paper, we use data augmentation techniques to improve the performance of neural dialogue models on the condition of insufficient data. Specifically, we propose a novel generative model to augment existing data, where the conditional variational autoencoder (CVAE) is employed as the generator to output more training data with diversified expressions. To improve the correlation of each augmented training pair, we design a discriminator with adversarial training to supervise the augmentation process. Moreover, we thoroughly investigate various data augmentation schemes for neural dialogue system with generative models, both GAN and CVAE. Experimental results on two open corpora, Weibo and Twitter, demonstrate the superiority of our proposed data augmentation model.

Find a Reasonable Ending for Stories: Does Logic Relation Help the Story Cloze Test?

Proceedings of the AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2019

Authors: Mingyue Shang, Zhenxin Fu, Hongzhi Yin, Bo Tang, Dongyan Zhao and Rui Yan.

Natural language understanding is a challenging problem that covers a wide range of tasks. While previous methods generally train each task separately, we consider combining the cross-task features to enhance the task performance. In this paper, we incorporate the logic information with the help of the Natural Language Inference (NLI) task to the Story Cloze Test (SCT). Previous work on SCT considered various semantic information, such as sentiment and topic, but lack the logic information between sentences which is an essential element of stories. Thus we propose to extract the logic information during the course of the story to improve the understanding of the whole story. The logic information is modeled with the help of the NLI task. Experimental results prove the strength of the logic information.

Fast Trajectory Range Query with Discrete Frechet Distance.

Advances in Database Technology (EDBT, CCF-B), 2019

Authors: Jiahao Zhang, Bo Tang, Man Lung Yiu.

The discrete Fréchet distance (DFD) is widely used to measure the similarity between two trajectories. Trajectory range query has been extensively studied in trajectory analytical applications, e.g., outlier detection, movement pattern analysis. With the discrete Fréchet distance, the above applications are computation bound rather than disk I/O bound. In this work, we propose new lower and upper bound functions to speedup the evaluation of trajectory range queries. Experimental studies on three real datasets demonstrate the superiority of our proposal.

Mining Heterogeneous Urban Data for Retail Store Placement .

Proceedings of the ACM Turing Celebration Conference-China (SIGMOD China, None), 2019

Authors: Jian Zeng, Bo Tang.

Retail store placement problem has been extensively studied in both academic and industry as it decides the retail success of a business. Existing methods exploited either consumer studies (e.g., consultant-based solutions) or geographic features (e.g., points of interests) to settle it. However, due to the limitations of data sources (i.e., costly in time and labor), none of these methods could provide an accurate and timely solution. In this paper, we rethink retail store placement problem by mining heterogeneous urban data. In particular, unlike existing works which only used geographic features or consumer studies solely, we extract three categories of features (i.e., human movement features, commercial area features and geographic features) from heterogeneous urban data, and integrate them into various machine learning models to predict the popularity of a prospective retail store in the candidate area. We conduct a case study with real data in Shenzhen to demonstrate the predictive power of our proposal.

2018

Exact Processing of Uncertain Top-k Queries in Multi-criteria Settings.

Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2018

Authors: Kyriakos Mouratidis, Bo Tang.

Traditional rank-aware processing assumes a dataset that contains available options to cover a specific need (e.g., restaurants, hotels, etc) and users who browse that dataset via top-k queries with linear scoring functions, i.e., by ranking the options according to the weighted sum of their attributes, for a set of given weights. In practice, however, user preferences (weights) may only be estimated with bounded accuracy, or may be inherently uncertain due to the inability of a human user to specify exact weight values with absolute accuracy. Motivated by this, we introduce the uncertain top-k query (UTK). Given uncertain preferences, that is, an approximate description of the weight values, the UTK query reports all options that may belong to the top-k set. A second version of the problem additionally reports the exact top-k set for each of the possible weight settings. We develop a scalable processing framework for both UTK versions, and demonstrate its efficiency using standard benchmark datasets.

Deriving Real-time City Crowd Flows by Heterogeneous Big Urban Data.

IEEE International Conference on Big Data (IEEE BigData, CCF-C), 2018

Authors: Bo Tang, Chuan Yang, Long Xiang, Jian Zeng.

Real-time city crowd flows are extremely important for facility placement, transportation management, and public safety. In this paper, we show how to derive the real-time city crowd flows with heterogeneous urban data. Unlike existing prediction-based approaches, our proposal does not rely on the training data and learning models. We propose a computation framework for it by exploiting the massive heterogeneous urban data, which includes both immutable data (i.e., bus and subway stations, commercial-based regions) and mutable data (i.e., real-time taxi, bus, subway transactions and trajectories). In addition, our solution provides accurate and timely city crowd-flows.To provide a practical solution for it, we first partition the city into commercial-based regions with geographic information (e.g., road network, administrative regions). Then, we devise three major components (i.e., urban data fusion model, heterogeneous urban data integration model, and effective crowd flows computation model) in the computation framework to process massive heterogeneous urban data effectively and derive city crowd flows accurately. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed solution with real heterogeneous urban data in Shenzhen, China.

Efficient Longest Streak Discovery in Multidimensional Sequence Data .

Web and Big Data: Second International Joint Conference (APWeb-WAIM, CCF-C), 2018

Authors: Wentao Wang, Bo Tang, Min Zhu.

This paper studies the problem of discovering longest streak in multidimensional sequence dataset. Given a multidimensional sequence dataset, the contextual longest streak is the longest consecutive tuples in a context subspace which match with a specific measure constraint. It has various applications in social network analysis, computational journalism, etc. The challenges of the longest streak discovery problem are (i) huge search space, and (ii) non-monotonicity property of streak lengths. In this paper, we propose a novel computation framework with a suite of optimization techniques for it. Our solutions outperform the baseline solution by two orders of magnitude in both real and synthetic datasets. In addition, we validate the effectiveness of our proposal by a real-world case study.

Joint Face Alignment and 3D Face Reconstruction with Application to Face Recognition

IEEE transactions on pattern analysis and machine intelligence (PAMI, None), 2018

Authors: F Liu, Q Zhao, X Liu, D Zeng

Face alignment and 3D face reconstruction are traditionally accomplished as separated tasks. By exploring the strong correlation between 2D landmarks and 3D shapes, in contrast, we propose a joint face alignment and 3D face reconstruction method to simultaneously solve these two problems for 2D face images of arbitrary poses and expressions. This method, based on a summation model of 3D faces and cascaded regression in 2D and 3D shape spaces, iteratively and alternately applies two cascaded regressors, one for updating 2D landmarks and the other for 3D shape. The 3D shape and the landmarks are correlated via a 3D-to-2D mapping matrix, which is updated in each iteration to refine the location and visibility of 2D landmarks. Unlike existing methods, the proposed method can fully automatically generate both pose-and-expression-normalized (PEN) and expressive 3D faces and localize both visible and invisible 2D landmarks. Based on the PEN 3D faces, we devise a method to enhance face recognition accuracy across poses and expressions. Both linear and nonlinear implementations of the proposed method are presented and evaluated in this paper. Extensive experiments show that the proposed method can achieve the state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefit face recognition owing to its reconstructed PEN 3D face.

A Five-layer Architecture for Big Data Processing and Analytics.

international Journal of big Data intelligence (IJBDI, None), 2018

Authors: Julie Yixuan Zhu, Bo Tang, Victor Li.

Big data technologies have attracted much attention in recent years. The academia and industry have reached a consensus, that is, the ultimate goal of big data is about transforming 'big data' to 'real value'. In this article, we discuss how to achieve this goal and propose five-layer architecture for big data processing and analytics (BDPA), including a collection layer, a storage layer, a processing layer, an analytics layer, and an application layer. The five-layer architecture targets to set up a de facto standard for current BDPA solutions, to collect, manage, process, and analyse the vast volume of both static data and online data streams, and make valuable decisions for all types of industries. Functionalities and challenges of the five-layers are illustrated, with the most recent technologies and solutions discussed accordingly. We conclude with the requirements for the future BDPA solutions, which may serve as a foundation for the future big data ecosystem.

2017

Determining the Impact Regions of Competing Options in Preference Space.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2017

Authors: Bo Tang, Kyriakos Mouratidis, Man Lung Yiu.

In rank-aware processing, user preferences are typically represented by a numeric weight per data attribute, collectively forming a weight vector. The score of an option (data record) is defined as the weighted sum of its individual attributes. The highest-scoring options across a set of alternatives (dataset) are shortlisted for the user as the recommended ones. In that setting, the user input is a vector (equivalently, a point) in a d-dimensional preference space, where d is the number of data attributes. In this paper we study the problem of determining in which regions of the preference space the weight vector should lie so that a given option (focal record) is among the top-k score-wise. In effect, these regions capture all possible user profiles for which the focal record is highly preferable, and are therefore essential in market impact analysis, potential customer identification, profile-based marketing, targeted advertising, etc. We refer to our problem as k-Shortlist Preference Region identification (kSPR), and exploit its computational geometric nature to develop a framework for its efficient (and exact) processing. Using real and synthetic benchmarks, we show that our most optimized algorithm outperforms by three orders of magnitude a competitor we constructed from previous work on a different problem.

Extracting Top-K Insights from Multi-dimensional Data.

Proceedings of the ACM on Management of Data (SIGMOD, CCF-A), 2017

Authors: Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, Dongmei Zhang.

OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper takes the first attempt towards automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but also reduces the manual effort of data analysts. In particular, we propose the concept of insight which captures interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of measure by a dimension). An example insight is: ``Brand B's rank (across brands) falls along the year, in terms of the increase in sales''. Our problem is to compute the top-k insights by a score function. It poses challenges on (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.

Efficient Motif Discovery in Spatial Trajectories Using Discrete Fréchet Distance.

Extending Database Technology (EDBT, CCF-B), 2017

Authors: Bo Tang, Man Lung Yiu, Kyriakos Mouratidis, Kai Wang.

The discrete Fréchet distance (DFD) captures perceptual and geographical similarity between discrete trajectories. It has been successfully adopted in a multitude of applications, such as signature and handwriting recognition, computer graphics, as well as geographic applications. Spatial applications, e.g., sports analysis, traffic analysis, etc. require discovering the pair of most similar subtrajectories, be them parts of the same or of different input trajectories. The identified pair of subtrajectories is called a motif. The adoption of DFD as the similarity measure in motif discovery, although semantically ideal, is hindered by the high computational complexity of DFD calculation. In this paper, we propose a suite of novel lower bound functions and a grouping-based solution with multi-level pruning in order to compute motifs with DFD efficiently. Our techniques apply directly to motif discovery within the same or between different trajectories. An extensive empirical study on three real trajectory datasets reveals that our approach is 3 orders of magnitude faster than a baseline solution.

Fast Subsequence Search on Time Series Data.

Extending Database Technology (EDBT, CCF-B), 2017

Authors: Bo Tang#, Yuhong Li#, Leong Hou U, Man Lung Yiu, Zhiguo Gong.

We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequences into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.

2016

Exploit Every Cycle: Vectorized Time Series Algorithms on Modern Commodity CPUs.

Data Management on New Hardware (ADMS, CCF-A), 2016

Authors: Bo Tang, Man Lung Yiu, Yuhong Li, Leong Hou U.

Many time series algorithms reduce the computation cost by pruning unpromising candidates with lower-bound distance functions. In this paper, we focus on an orthogonal research direction that further boosts the performance by unlocking the potentials of modern commodity CPUs. First, we conduct a performance profiling on existing algorithms to understand where does time go. Second, we design vectorized implementations for lower-bound and distance functions that can enjoy characteristics (e.g., data parallelism, caching, branch prediction) provided by CPU. Third, our vectorized methods are general and applicable to many time series problems such as subsequence search, motif discovery and kNN classification. Our experimental study on real datasets shows that our proposal can achieve up to 6 times of speedup.

Exploit Every Bit: Effective Caching for High-Dimensional Nearest Neighbor Search.

IEEE Transactions on Knowledge and Data Engineering (TKDE, CCF-A), 2016

Authors: Bo Tang, Man Lung Yiu, Kien A Hua.

High-dimensional k nearest neighbor (kNN) search has a wide range of applications in multimedia information retrieval. Existing disk-based k NN search methods incur significant I/O costs in the candidate refinement phase. In this paper, we propose to cache compact approximate representations of data points in main memory in order to reduce the candidate refinement time during kNN search. This problem raises two challenging issues: (i) which is the most effective encoding scheme for data points to support k NN search? and (ii) what is the optimal number of bits for encoding a data point? For (i), we formulate and solve a novel histogram optimization problem that decides the most effective encoding scheme. For (ii), we develop a cost model for automatically tuning the optimal number of bits for encoding points. In addition, our approach is generic and applicable to exact / approximate k NN search methods. Extensive experimental results on real datasets demonstrate that our proposal can accelerate the candidate refinement time of kNN search by at least an order of magnitude.

2015

Diversified caching for replicated web search engines.

International Conference on Data Engineering (ICDE, CCF-A), 2015

Authors: Chuanfei Xu, Bo Tang, Man Lung Yiu.

Commercial web search engines adopt parallel and replicated architecture in order to support high query throughput. In this paper, we investigate the effect of caching on the throughput in such a setting. A simple scheme, called uniform caching, would replicate the cache content to all servers. Unfortunately, it does not exploit the variations among queries, thus wasting memory space on caching the same cache content redundantly on multiple servers. To tackle this limitation, we propose a diversified caching problem, which aims to diversify the types of queries served by different servers, and maximize the sharing of terms among queries assigned to the same server. We show that it is NP-hard to find the optimal diversified caching scheme, and identify intuitive properties to seek good solutions. Then we present a framework with a suite of techniques and heuristics for diversified caching. Finally, we evaluate the proposed solution with competitors by using a real dataset and a real query log