Face2Exp: Combating Data Biases for Facial Expression Recognition
IEEE Conference on Computer Vision and Pattern Recognition (CVPR, CCF-A), 2022
𝜏-LevelIndex: Towards Efficient Query Processing in Continuous Preference Space
ACM Conference on Management of Data (SIGMOD, CCF-A), 2022
Spatial Data Quality in the IoT Era: Management and Exploitation
ACM Conference on Management of Data (SIGMOD, CCF-A), 2022
GHive: A Demonstration of GPU-Accelerated Query Processing in Apache Hive
ACM Conference on Management of Data (SIGMOD, CCF-A), 2022
Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments
Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2022
Constructing Compact Time Series Index for Efficient Window Query Processing
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022
imDedup: A Lossless Deduplication Scheme to Eliminate Fine-grained Redundancy among Images
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022
Images occupy a large amount of storage in data centers. To cope with the explosive growth of image storage requirements, image compression techniques were first devised to shrink the size of each individual image. Image deduplication methods were then proposed to further reduce storage cost by eliminating redundancy among images. However, state-of-the-art image deduplication methods either eliminate only file-level, coarse-grained redundancy or cannot guarantee lossless deduplication. In this work, we propose a new lossless image deduplication framework that eliminates fine-grained redundancy among images. It first decodes images to expose similarity, then eliminates fine-grained redundancy in the decoded data by delta compression, and finally re-compresses the remaining data with image compression encoding. Based on this framework, we propose a novel lossless similarity-based deduplication (SBD) scheme for decoded image data, called imDedup. Specifically, imDedup uses a novel and fast sampling method (called Feature Map) to detect similar images in a two-dimensional way, which greatly reduces computation overhead. Meanwhile, it uses a novel delta encoder (called Idelta) that incorporates the characteristics of image compression encoding into deduplication, ensuring that the deduplicated image data remains friendly to re-compression via image encoding and thereby significantly improving the compression ratio. We implement a prototype of imDedup for JPEG images and demonstrate its superiority on four datasets: compared with exact image deduplication, imDedup achieves a 19%-38% higher compression ratio by efficiently eliminating fine-grained redundancy; compared with the similarity detector and delta encoder of state-of-the-art SBD schemes running on decoded image data, imDedup achieves 1.8×-3.4× higher throughput and a 1.3×-1.6× higher compression ratio, respectively.
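For intuition, the sketch below walks through the decode-detect-delta pipeline in miniature: block hashes stand in for the Feature Map similarity features and Python's difflib supplies a generic delta encoder in place of Idelta, so it illustrates the framework only, not imDedup's actual encoders.

```python
# A minimal, illustrative sketch of a decode -> detect-similar -> delta-encode
# pipeline (NOT the paper's Feature Map / Idelta implementation). Block hashes
# stand in for the similarity features, and difflib supplies a generic delta.
import difflib
import hashlib

def block_features(decoded: bytes, block: int = 4096) -> set:
    """Hash fixed-size blocks of decoded pixel data as a crude similarity sketch."""
    return {hashlib.md5(decoded[i:i + block]).hexdigest()
            for i in range(0, len(decoded), block)}

def similarity(a: set, b: set) -> float:
    return len(a & b) / max(1, len(a | b))        # Jaccard over block hashes

def delta_encode(base: bytes, target: bytes) -> list:
    """Encode `target` as copy/insert operations against `base`."""
    ops = []
    sm = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes from the base image
        else:
            ops.append(("insert", target[j1:j2]))   # store literal bytes
    return ops

def delta_decode(base: bytes, ops: list) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)

# Usage: deduplicate img_b against its most similar stored image (img_a).
img_a = bytes(range(256)) * 64                        # stand-ins for decoded image data
img_b = img_a[:8192] + b"X" * 13 + img_a[8205:]       # a small in-place edit
if similarity(block_features(img_a), block_features(img_b)) > 0.5:
    ops = delta_encode(img_a, img_b)
    assert delta_decode(img_a, ops) == img_b          # lossless round trip
```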
CheetahKG: A Demonstration for Core-based Top-k Frequent Pattern Discovery on Knowledge Graphs
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022
Fast Error-Bounded Distance Distribution Computation (Extended Abstract)
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2022
In this work, we study the distance distribution computation problem, which is widely used in many real-world applications, e.g., human genome clustering, cosmological model analysis, and parameter tuning. The straightforward solution for exact distance distribution computation is unacceptably slow due to (i) the massive data size and (ii) expensive distance computations. In this paper, we propose a novel method to compute approximate distance distributions with error-bound guarantees. Furthermore, our method is generic to different distance measures. We conduct extensive experimental studies on three widely used distance measures with real-world datasets. The experimental results demonstrate that our proposed method outperforms the sampling-based solution (which offers no error guarantees) by up to three orders of magnitude.
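As a point of reference, the following is a minimal sketch of the sampling baseline the method is compared against: uniformly sampled point pairs yield an empirical distance CDF with no error guarantee. The error-bounded method itself is not reproduced here.

```python
# Sampling baseline for approximating a pairwise-distance distribution (the
# simple estimator the paper compares against, not its error-bounded method).
import numpy as np

def sampled_distance_cdf(points: np.ndarray, thresholds: np.ndarray,
                         n_pairs: int = 100_000, seed: int = 0) -> np.ndarray:
    """Estimate P(dist(X, Y) <= t) for each threshold t from random point pairs."""
    rng = np.random.default_rng(seed)
    n = len(points)
    i = rng.integers(0, n, n_pairs)            # pairs with i == j are kept for simplicity
    j = rng.integers(0, n, n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)   # Euclidean; swap in any measure
    return np.array([(d <= t).mean() for t in thresholds])

pts = np.random.default_rng(1).normal(size=(50_000, 8))
print(sampled_distance_cdf(pts, thresholds=np.linspace(0.0, 8.0, 9)))
```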
Rank-based filter pruning for real-time UAV tracking
IEEE International Conference on Multimedia & Expo (ICME, CCF-B), 2022
On m-Impact Regions and Standing Top-k Influence Problems
ACM Conference on Management of Data (SIGMOD, CCF-A), 2021
In this paper, we study the m-impact region problem (mIR). In a context where users look for available products with top-k queries, mIR identifies the part of the product space that attracts the most user attention. Specifically, mIR determines the kind of attribute values that lead a (new or existing) product to the top-k result for at least a fraction of the user population. mIR has several applications, ranging from effective marketing to product improvement. Importantly, it also leads to (exact and efficient) solutions for standing top-k impact problems, which were previously solved heuristically only, or whose current solutions face serious scalability limitations. We experiment, among others, on data mined from actual user reviews for real products, and demonstrate the practicality and efficiency of our algorithms, both for mIR and for standing top-k impact problems.
Marrying Top-k with Skyline Queries: Relaxing the Preference Input while Producing Output of Controllable Size
ACM Conference on Management of Data (SIGMOD, CCF-A), 2021
The two most common paradigms to identify records of preference in a multi-objective setting rely either on dominance (e.g., the skyline operator) or on a utility function defined over the records' attributes (typically, using a top-k query). Despite their proliferation, each of them has its own palpable drawbacks. Motivated by these drawbacks, we identify three hard requirements for practical decision support, namely, personalization, controllable output size, and flexibility in preference specification. With these requirements as a guide, we combine elements from both paradigms and propose two new operators, ORD and ORU. We perform a qualitative study to demonstrate how they work, and evaluate their performance against adaptations of previous work that mimic their output.
Vertex-Centric Visual Programming for Graph Neural Networks
ACM Conference on Management of Data (SIGMOD, CCF-A), 2021
Graph neural networks (GNNs) have achieved remarkable performance in many graph analytics tasks such as node classification, link prediction and graph clustering. Existing GNN systems (e.g., PyG and DGL) adopt a tensor-centric programming model and train GNNs with manually written operators. Such a design results in poor usability due to the large semantic gap between the API and the GNN models, and suffers from inferior efficiency because of high memory consumption and massive data movement. We demonstrate Seastar, a novel GNN training framework that adopts a vertex-centric programming paradigm and supports automatic kernel generation, to simplify model development and improve training efficiency. We will (i) show how to express GNN models succinctly using a visual "drag-and-drop" interface or Seastar's vertex-centric Python API; (ii) demonstrate the performance advantage of Seastar over existing GNN systems in convergence speed, training throughput and memory consumption; and (iii) illustrate how Seastar's optimizations (e.g., operator fusion and constant folding) improve training efficiency by profiling the run-time performance.
GAIPS: Accelerating Maximum Inner Product Search with GPU
International Conference on Research on Development in Information Retrieval (SIGIR, CCF-A), 2021
In this paper, we propose the GAIPS framework for efficient maximum inner product search (MIPS) on GPU. We observe that a query can usually find a good lower bound of its maximum inner product among some large-norm items that take up only a small portion of the dataset, and we utilize this fact to facilitate pruning. In addition, we design norm-based, residue-based, and hash-based pruning techniques to avoid computation for items that are unlikely to be MIPS results. Experimental results show that, compared with FAISS, the state-of-the-art GPU-based similarity search framework, GAIPS has significantly shorter query processing time at the same recall.
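The norm-based pruning idea can be illustrated on CPU with a few lines of numpy: items are visited in decreasing norm order, and the Cauchy-Schwarz bound lets the scan stop early. This is a simplified sketch; GAIPS runs on GPU and adds residue- and hash-based pruning.

```python
# CPU sketch of norm-based pruning for exact MIPS (illustrating the pruning
# idea only; GAIPS itself runs on GPU with additional residue/hash pruning).
import numpy as np

def mips_with_norm_pruning(items: np.ndarray, q: np.ndarray, k: int = 10):
    norms = np.linalg.norm(items, axis=1)
    order = np.argsort(-norms)                 # visit large-norm items first
    qnorm = np.linalg.norm(q)
    topk_ids, topk_scores = [], []
    kth_best = -np.inf
    for idx in order:
        # Cauchy-Schwarz: q.x <= ||q||*||x||; once this bound cannot beat the
        # current k-th best, no remaining (smaller-norm) item can either.
        if qnorm * norms[idx] <= kth_best and len(topk_ids) >= k:
            break
        score = float(items[idx] @ q)
        if len(topk_ids) < k:
            topk_ids.append(idx); topk_scores.append(score)
            kth_best = min(topk_scores)
        elif score > kth_best:
            worst = int(np.argmin(topk_scores))
            topk_ids[worst], topk_scores[worst] = idx, score
            kth_best = min(topk_scores)
    return sorted(zip(topk_scores, topk_ids), reverse=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 64)) * rng.uniform(0.05, 1.0, size=(100_000, 1))
q = rng.normal(size=64)
print(mips_with_norm_pruning(X, q, k=5)[:3])
```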
Fast Core-based Top-k Frequent Pattern Discovery in Knowledge Graphs
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021
A knowledge graph structures information in graph form, representing entities as nodes and relationships between entities as edges. A knowledge graph often consists of a large number of real-world facts, which can be used to support many analytical tasks, e.g., exceptional fact discovery and fact checking of claims. In this work, we study the core-based top-k frequent pattern discovery problem, which is frequently used as a subroutine in analyzing knowledge graphs. The main challenge of the problem is that the search space of candidate patterns is exponential in the combinations of nodes and edges of the knowledge graph. To reduce the search space, we devise a novel computation framework, FastPat, with a suite of optimizations. First, we devise a meta-index, which can be used to avoid generating invalid candidate patterns. Second, we propose an upper bound on the frequency score (i.e., MNI) of a candidate pattern, which prunes unqualified candidates earlier and prioritizes the enumeration order of the patterns. Lastly, we design a join-based approach to compute the MNI of a candidate pattern efficiently. We conduct extensive experimental studies on real-world datasets to verify the superiority of our proposed method over the baselines. We also demonstrate the utility of the discovered frequent patterns through a case study on a COVID-19 knowledge graph.
Towards Efficient MaxBRNN Computation for Streaming Updates
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021
In this paper, we propose the streaming MaxBRNN query, which finds the optimal region to deploy a new service point when both the service points and the client points are under continuous updates. The streaming MaxBRNN query has many applications, such as taxi scheduling and shared bike placement. Existing MaxBRNN solutions are insufficient for streaming updates as they need to re-run from scratch even for a small number of updates, resulting in long query processing times. To tackle this problem, we devise an efficient slot partitioning-based algorithm (SlotP), which divides the space into equal-sized slots and processes each slot independently. The advantages of our proposal for the streaming MaxBRNN query are: (i) an update affects only a small number of slots, and the work done on unaffected slots can be reused directly; (ii) the influence value upper bound of each slot can be derived efficiently and accurately, which facilitates pruning many slots from expensive computation. We conducted extensive experiments to validate the performance of the SlotP algorithm. The results show that SlotP is 2-3 orders of magnitude faster than state-of-the-art baselines.
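The incremental principle behind SlotP can be sketched with simple slot bookkeeping: an update marks only its slot dirty, and only dirty slots are re-evaluated. The per-slot scoring function below is a placeholder (the real algorithm computes MaxBRNN influence with upper bounds), so this shows the maintenance skeleton only.

```python
# Skeleton of slot-based incremental maintenance: the space is cut into equal
# slots, an update dirties only its slot, and only dirty slots are recomputed.
# `evaluate_slot` is a placeholder; the real SlotP computes MaxBRNN influence.
from collections import defaultdict

class SlotIndex:
    def __init__(self, cell: float, evaluate_slot):
        self.cell = cell
        self.evaluate_slot = evaluate_slot      # slot points -> score
        self.points = defaultdict(set)          # slot id -> set of points
        self.scores = {}                        # cached per-slot results
        self.dirty = set()

    def _slot(self, p):
        return (int(p[0] // self.cell), int(p[1] // self.cell))

    def insert(self, p):
        s = self._slot(p)
        self.points[s].add(p)
        self.dirty.add(s)

    def delete(self, p):
        s = self._slot(p)
        self.points[s].discard(p)
        self.dirty.add(s)

    def best_slot(self):
        for s in self.dirty:                    # recompute only affected slots
            self.scores[s] = self.evaluate_slot(self.points[s])
        self.dirty.clear()
        return max(self.scores, key=self.scores.get)

# Toy usage: score a slot by how many client points it holds.
idx = SlotIndex(cell=10.0, evaluate_slot=len)
for p in [(1.0, 2.0), (3.0, 4.0), (55.0, 7.0)]:
    idx.insert(p)
print(idx.best_slot())   # streaming updates dirty one slot at a time
```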
GRAB: Finding Time Series Natural Structures via A Novel Graph-Based Scheme
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2021
In recent years, the widespread use of sensors has substantially stimulated researchers' interest in time series data mining. Real-world time series often exhibit natural structures. For example, a time series captured from a patient rehabilitation app can be divided into a series of movements, e.g., sitting, standing, and walking. Finding the natural structures of a time series (i.e., its latent semantic states) is one of the core subroutines in time series mining applications. However, this task is not trivial as it poses two challenges: (1) how to determine the correct change points between consecutive segments, and (2) how to cluster segments into different states. In this paper, we propose a novel graph-based approach, GRAB, to discover time series natural structures. In particular, GRAB first partitions the time series into a set of non-overlapping fragments via the similarity between subsequences. Then, it constructs a fragment-based graph and employs a graph partition method to cluster the fragments into states. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our GRAB method. Specifically, GRAB finds high-quality latent states, and it outperforms state-of-the-art solutions by orders of magnitude.
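A toy pipeline in the spirit of GRAB's fragment-then-cluster idea is sketched below; the fixed-length fragmentation, the distance threshold, and the use of networkx modularity communities are all illustrative stand-ins for the paper's own fragmentation and graph-partitioning steps.

```python
# Illustrative fragment-then-cluster pipeline: cut a series into fixed-length
# fragments, link fragments whose z-normalized distance is small, and cluster
# the resulting graph. The clustering call is an assumption for illustration;
# the paper uses its own graph-partitioning step.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def fragment_states(series: np.ndarray, frag_len: int, sim_threshold: float):
    frags = [znorm(series[i:i + frag_len])
             for i in range(0, len(series) - frag_len + 1, frag_len)]
    g = nx.Graph()
    g.add_nodes_from(range(len(frags)))
    for i in range(len(frags)):
        for j in range(i + 1, len(frags)):
            if np.linalg.norm(frags[i] - frags[j]) < sim_threshold:
                g.add_edge(i, j)               # similar fragments likely share a state
    communities = greedy_modularity_communities(g)
    labels = np.empty(len(frags), dtype=int)
    for state, members in enumerate(communities):
        labels[list(members)] = state
    return labels                               # one latent-state label per fragment

# Synthetic series alternating between a sine regime and a square-wave regime.
t = np.arange(3000)
series = np.where((t // 500) % 2 == 0,
                  np.sin(2 * np.pi * t / 100),
                  np.sign(np.sin(2 * np.pi * t / 100)))
print(fragment_states(series, frag_len=100, sim_threshold=2.0))
```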
Combating Ambiguity for hash-code Learning in Medical Instance Retrieval
IEEE Journal of Biomedical and Health Informatics (JCR-Q1), 2021
When encountering a dubious diagnostic case, medical instance retrieval can help radiologists make evidence-based diagnoses by finding images containing instances similar to a query case from a large image database. The similarity between the query case and retrieved similar cases is determined by visual features extracted from pathologically abnormal regions. However, the manifestation of these regions often lacks specificity, i.e., different diseases can have the same manifestation, and different manifestations may occur at different stages of the same disease. To combat this manifestation ambiguity in medical instance retrieval, we propose a novel deep framework called Y-Net, which encodes images into compact hash codes generated from convolutional features by feature aggregation. Y-Net learns highly discriminative convolutional features by unifying a pixel-wise segmentation loss and a classification loss. The segmentation loss allows exploring subtle spatial differences for good spatial discriminability, while the classification loss utilizes class-aware semantic information for good semantic separability. As a result, Y-Net can enhance the visual features of pathologically abnormal regions and suppress interference from the background during model training, which effectively embeds discriminative features into the hash codes in the retrieval stage. Extensive experiments on two medical image datasets demonstrate that Y-Net can alleviate the ambiguity of pathologically abnormal regions, and its retrieval performance outperforms the state-of-the-art method by an average of 9.27% on the top-10 returned list.
Spatial Data Quality in the Internet of Things: Management, Exploitation, and Prospects
ACM Computing Surveys (CSUR, JCR-Q1), 2021
With the continued rollout of the Internet of Things (IoT), ever more devices are being deployed that emit massive volumes of spatially referenced data. Due in part to the dynamic, decentralized, and heterogeneous architecture of the IoT, the varying and often low quality of spatial IoT data (SID) presents challenges to applications built on top of this data. This survey aims to provide unique insight to practitioners who intend to develop IoT-enabled applications and to researchers who wish to conduct research related to data quality in the IoT setting. The survey offers an inventory analysis of major data quality dimensions in SID and covers significant data characteristics and associated quality considerations. The survey summarizes data quality related technologies from both task and technique perspectives. Organizing the technologies from the task perspective, it covers recent progress in SID quality management, encompassing location refinement, uncertainty elimination, outlier removal, fault correction, data integration, and data reduction; and it covers low-quality SID exploitation, encompassing querying, analysis, and decision-making techniques. Finally, the survey covers emerging trends and open issues concerning the quality of SID.
RCELF: A Residual-based Approach for Influence Maximization Problem
Information Systems (IS, CCF-B), 2021
The Influence Maximization Problem (IMP) asks for a seed set of nodes in a social network that spreads influence as widely as possible. It has many applications in multiple domains; e.g., viral marketing is frequently used to advertise new products or activities. While it is a classic and well-studied problem in computer science, unfortunately, all proposed techniques compromise among time efficiency, memory consumption, and result quality. In this paper, we conduct comprehensive experimental studies on state-of-the-art IMP approximation approaches to reveal the underlying trade-off strategies. Interestingly, we find that even the state-of-the-art approaches become impractical when the propagation probabilities of the network are taken into consideration. Based on these findings, we propose a novel residual-based approach (i.e., RCELF) for IMP, which (i) overcomes the deficiencies of existing approximation approaches, and (ii) provides theoretically guaranteed results with high efficiency in both time and space. We demonstrate the superiority of our proposal by extensive experimental evaluation on real datasets.
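For context, the classic CELF lazy-greedy skeleton with Monte-Carlo spread estimation under the independent-cascade model looks as follows; RCELF's residual-based refinement is not reproduced here.

```python
# Classic CELF lazy-greedy seed selection with Monte-Carlo spread estimation
# under the independent-cascade model -- shown as background only.
import heapq
import random

def ic_spread(graph, seeds, p=0.1, mc=200, seed=0):
    """Average number of activated nodes over `mc` independent-cascade runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(mc):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v); nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / mc

def celf(graph, k, p=0.1, mc=200):
    # Lazy evaluation: marginal gains only shrink as the seed set grows
    # (submodularity), so a stale entry needs at most one re-evaluation per round.
    heap = [(-ic_spread(graph, [v], p, mc), v, 0) for v in graph]
    heapq.heapify(heap)
    seeds, spread, round_no = [], 0.0, 0
    while len(seeds) < k and heap:
        neg_gain, v, evaluated_at = heapq.heappop(heap)
        if evaluated_at == round_no:
            seeds.append(v); spread += -neg_gain; round_no += 1
        else:
            gain = ic_spread(graph, seeds + [v], p, mc) - spread
            heapq.heappush(heap, (-gain, v, round_no))
    return seeds

toy = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
print(celf(toy, k=2))
```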
On Discovering Motifs and Frequent Patterns in Spatial Trajectories with Discrete Frechet Distance
GeoInformatica (GeoInformatica, CCF-B), 2021
The discrete Fréchet distance (DFD) captures perceptual and geographical similarity between two trajectories. It has been successfully adopted in a multitude of applications, such as signature and handwriting recognition, computer graphics, as well as geographic applications. Spatial applications, e.g., sports analysis and traffic analysis, require discovering similar subtrajectories within a single trajectory or across multiple trajectories. In this paper, we adopt DFD as the similarity measure and study two representative trajectory analysis problems, namely motif discovery and frequent pattern discovery. Due to the time complexity of DFD, these tasks are computationally challenging. We address that challenge with a suite of novel lower bound functions and a grouping-based solution. Our techniques apply directly when the analysis tasks are defined within the same or across multiple trajectories. An extensive empirical study on real trajectory datasets reveals that our approaches are three orders of magnitude faster than baseline solutions.
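For reference, the textbook O(nm) dynamic program that defines DFD is shown below; the paper's lower bounds and grouping exist precisely to avoid running this computation for most candidate pairs.

```python
# Standard O(n*m) dynamic program for the discrete Frechet distance between
# two point sequences P and Q (the distance both analysis tasks rely on).
import numpy as np

def dfd(P: np.ndarray, Q: np.ndarray) -> float:
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)   # pairwise point distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(d[i, j], min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]))
    return float(ca[n - 1, m - 1])

P = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0]])
Q = np.array([[0.0, 0.5], [1.0, 0.6], [2.0, 0.4]])
print(dfd(P, Q))   # ~0.5
```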
DGCL: An efficient communication library for distributed GNN training
European Conference on Computer Systems (EuroSys, CCF-B), 2021
Graph neural networks (GNNs) have gained increasing popularity in many areas such as e-commerce, social networks and bio-informatics. Distributed GNN training is essential for handling large graphs and reducing the execution time. However, for distributed GNN training, a peer-to-peer communication strategy suffers from high communication overheads. Also, different GPUs require different remote vertex embeddings, which leads to an irregular communication pattern and renders existing communication planning solutions unsuitable. We propose the distributed graph communication library (DGCL) for efficient GNN training on multiple GPUs. At the heart of DGCL is a communication planning algorithm tailored for GNN training, which jointly considers fully utilizing fast links, fusing communication, avoiding contention and balancing loads on different links. DGCL can be easily adopted to extend existing single-GPU GNN systems to distributed training. We conducted extensive experiments on different datasets and network configurations to compare DGCL with alternative communication schemes. In our experiments, DGCL reduces the communication time of the peer-to-peer communication by 77.5% on average and the training time for an epoch by up to 47%.
Self-enhanced GNN: Improving graph neural networks using model outputs
International Joint Conference on Neural Networks (IJCNN, CCF-C), 2021
Graph neural networks (GNNs) have received much attention recently because of their excellent performance on graph-based tasks. However, existing research on GNNs focuses on designing more effective models without considering much about the quality of the input data. In this paper, we propose self-enhanced GNN (SEG), which improves the quality of the input data using the outputs of existing GNN models for better performance on semi-supervised node classification. As graph data consist of both topology and node labels, we improve input data quality from both perspectives. For topology, we observe that higher classification accuracy can be achieved when the ratio of inter-class edges (connecting nodes from different classes) is low and propose topology update to remove inter-class edges and add intra-class edges. For node labels, we propose training node augmentation, which enlarges the training set using the labels predicted by existing GNN models. SEG is a general framework that can be easily combined with existing GNN models. Experimental results validate that SEG consistently improves the performance of well-known GNN models such as GCN, GAT and SGC across different datasets.
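A minimal version of the topology-update step might look like the sketch below: edges between differently-predicted nodes are dropped and edges between confidently same-labeled nodes are added. The confidence threshold and edge-addition rule are illustrative choices, not the exact criteria used by SEG.

```python
# Toy topology update: drop likely inter-class edges and add intra-class edges
# between high-confidence nodes. Thresholds here are illustrative only.
import numpy as np

def topology_update(edges, probs, add_threshold=0.95, max_new_edges=1000):
    """edges: list of (u, v); probs: (n_nodes, n_classes) predicted class probabilities."""
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)

    # 1) Remove edges whose endpoints receive different predicted labels.
    kept = [(u, v) for u, v in edges if pred[u] == pred[v]]

    # 2) Add edges between high-confidence nodes of the same predicted class.
    added = []
    confident = np.where(conf >= add_threshold)[0]
    by_class = {}
    for u in confident:
        by_class.setdefault(int(pred[u]), []).append(int(u))
    existing = set(map(frozenset, edges))
    for nodes in by_class.values():
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                if len(added) >= max_new_edges:
                    break
                if frozenset((nodes[i], nodes[j])) not in existing:
                    added.append((nodes[i], nodes[j]))
    return kept + added

probs = np.array([[0.97, 0.03], [0.96, 0.04], [0.10, 0.90], [0.45, 0.55]])
print(topology_update([(0, 2), (0, 1), (2, 3)], probs))
```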
Occlusion Invariant Face Recognition Using Simultaneous Segmentation
IET Biometrics (JCR-Q3), 2021
When using convolutional neural network (CNN) models to extract features of an occluded face, the occluded part will inevitably be embedded into the representation just as other facial regions are. Existing methods deal with occluded face recognition either by augmenting the training dataset with synthesized occluded faces or by segmenting occlusions first and subsequently recognizing the face based on the unoccluded facial regions. Instead, simultaneous occlusion segmentation and face recognition is developed to make the most of these two correlated tasks. This is inspired by the phenomenon that features corrupted by occlusion are traceable within a CNN trained to segment occluded parts in face images. Specifically, a simultaneous occlusion invariant deep network (SOIDN) is proposed that contains simultaneously operating face recognition and occlusion segmentation networks, coupled with an occlusion mask adaptor module as their bridge to learn occlusion invariant features. The training of SOIDN is jointly supervised by classification and segmentation losses, aiming to obtain (1) occlusion invariant features, (2) occlusion segmentation, and (3) an occlusion feature mask that weighs the reliability of features. Experiments on a synthesized occluded dataset (e.g., LFW-occ) and a real occluded face dataset (e.g., AR) demonstrate that SOIDN outperforms state-of-the-art methods in face verification and identification.
A survey of face recognition techniques under occlusion
IET Biometrics (JCR-Q3), 2021
The limited capacity to recognize faces under occlusion is a long-standing problem that presents a unique challenge for face recognition systems and even for humans. Occlusion has received less research attention than other challenges such as pose variation and differing expressions. Nevertheless, occluded face recognition is imperative to exploit the full potential of face recognition in real-world applications. In this paper, we restrict the scope to occluded face recognition. First, we explore what the occlusion problem is and what inherent difficulties can arise. As part of this review, we introduce face detection under occlusion, a preliminary step in face recognition. Second, we present how existing face recognition methods cope with the occlusion problem and classify them into three categories: 1) occlusion robust feature extraction approaches, 2) occlusion aware face recognition approaches, and 3) occlusion recovery based face recognition approaches. Furthermore, we analyze the motivations, innovations, pros and cons, and performance of representative approaches for comparison. Finally, future challenges and methodological trends of occluded face recognition are thoroughly discussed.
Drug repurposing for the treatment of COVID-19: a knowledge graph approach
Advanced Therapeutics (None), 2021
Identifying effective drug treatments for COVID-19 is essential to reduce morbidity and mortality. Although a number of existing drugs have been proposed as potential COVID-19 treatments, effective data platforms and algorithms to prioritize drug candidates for evaluation, as well as the application of knowledge graphs to drug repurposing, have not been adequately explored. We develop a COVID-19 knowledge graph by integrating 14 public bioinformatic databases containing information on drugs, genes, proteins, viruses, diseases, symptoms and their linkages. An algorithm is developed to extract hidden linkages connecting drugs and COVID-19 from the knowledge graph, and to generate and rank proposed drug candidates for repurposing as treatments for COVID-19 by integrating three scores for each drug: motif scores, knowledge graph PageRank scores, and knowledge graph embedding scores. The knowledge graph contains over 48,000 nodes and 1,337,000 edges, including 13,563 molecules in the DrugBank database. From the 5,624 molecules identified by the motif-discovery algorithms, ranking results show that 112 drug molecules had scores in the top 2%, of which 50 are existing drugs with other indications approved by health administrations. The proposed drug candidates serve to generate hypotheses for future evaluation in clinical trials and observational studies.
Effective and Efficient Summarization for Non-hierarchical Data
Ivannikov Ispras Open Conference (ISPRAS, None), 2021
Data cubes are ubiquitous in domains including meteorology, sales, and demography, and data summarization is an important service that can be used to compress data cubes and extract insight. Existing data summarization methods require presupposed hierarchies for the data cube dimensions, which do not exist for many types of data (e.g., rainfall and temperature). To tackle this problem, we first define the non-hierarchical data summarization (NHDS) problem, which covers a data cube using rectangular regions within an error bound while minimizing the summary size. We then show that the NHDS problem is NP-hard and design the Mark and Select (MS) algorithm to find an approximate solution. MS first identifies the qualified rectangles and then selects among the rectangles to cover the data cube. To improve efficiency, we show that it suffices to find only some of the qualified rectangles, devise a procedure to avoid checking rectangles that do not contribute to the result, and save unnecessary computation during rectangle selection using submodularity. We conducted experiments on both real and synthetic datasets. The results show that MS significantly outperforms a state-of-the-art baseline in summary size, error, and running time.
Joint Face Alignment and 3D Face Reconstruction with Application to Face Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, CCF-A, JCR-Q1), 2020
Face alignment and 3D face reconstruction are traditionally accomplished as separate tasks. By exploring the strong correlation between 2D landmarks and 3D shapes, in contrast, we propose a joint face alignment and 3D face reconstruction method to simultaneously solve these two problems for 2D face images of arbitrary poses and expressions. This method, based on a summation model of 3D faces and cascaded regression in 2D and 3D shape spaces, iteratively and alternately applies two cascaded regressors, one for updating 2D landmarks and the other for the 3D shape. The 3D shape and the landmarks are correlated via a 3D-to-2D mapping matrix, which is updated in each iteration to refine the location and visibility of 2D landmarks. Unlike existing methods, the proposed method can fully automatically generate both pose-and-expression-normalized (PEN) and expressive 3D faces and localize both visible and invisible 2D landmarks. Based on the PEN 3D faces, we devise a method to enhance face recognition accuracy across poses and expressions. Both linear and nonlinear implementations of the proposed method are presented and evaluated in this paper. Extensive experiments show that the proposed method can achieve state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefits face recognition owing to its reconstructed PEN 3D faces.
CheetahVIS: A Visual Analytical System for Large Urban Bus Data
International Conference on Very Large Data Bases (VLDB, CCF-A), 2020
Recently, the spatio-temporal data of urban moving objects, e.g., cars and buses, have been collected and widely used in urban trajectory exploratory analysis. Urban bus service is one of the most common public transportation services, and urban bus data analysis plays an important role in smart city applications. For example, data analysts in bus companies use urban bus data to optimize their bus scheduling plans. Map service providers, e.g., Google Maps and Tencent Maps, take urban bus data into account to improve their service quality (e.g., broadcasting road updates instantly). Unlike urban cars or pedestrians, urban buses travel on known bus routes, and the operating buses form the "bus flows" of a city. Analyzing urban bus flows efficiently raises many challenges, e.g., how to analyze the dynamics of given bus routes, and how to help users easily identify traffic flows of interest.
In this work, we present CheetahVIS, a visual analytical system for efficient massive urban bus data analysis. CheetahVIS builds upon Spark and provides a visual analytical platform for stakeholders (e.g., city planners, data analysts in bus companies) to conduct effective and efficient analytical tasks. In the demonstration, visitors will be invited to experience the proposed CheetahVIS system with different urban bus data analytical functions, e.g., bus route analysis, public bus flow overview, and multiple-region analysis, on a real-world dataset. We will also present a case study, which compares different regions in a city, to demonstrate the effectiveness of CheetahVIS.
Draft and Edit: Automatic Storytelling Through Multi-Pass Hierarchical Conditional Variational Autoencoder
AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2020
Automatic storytelling has consistently been a challenging area in the field of natural language processing. Although considerable achievements have been made, the gap between automatically generated stories and human-written stories remains significant. Moreover, the limitations of existing automatic storytelling methods are obvious, e.g., poor content consistency and limited wording diversity. In this paper, we propose a multi-pass hierarchical conditional variational autoencoder model to overcome these challenges and limitations. While the conditional variational autoencoder (CVAE) is employed to generate diversified content, the hierarchical structure and multi-pass editing scheme allow the model to generate more consistent content. We conduct extensive experiments on the ROCStories dataset. The results verify the validity and effectiveness of our proposed model, which yields substantial improvements over existing state-of-the-art approaches.
Towards Self-Tuning Parameter Servers
IEEE International Conference on Big Data (IEEE BigData, CCF-C), 2020
In recent years, advances in many applications have been driven by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training. Good efficiency, i.e., fast completion time of a specific ML training job, is therefore a key feature of a successful ML system. While the completion time of a long-running ML job is determined by the time required to reach model convergence, it is also largely influenced by the values of various system settings. In this paper, we contribute techniques towards building self-tuning parameter servers. Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; by self-tuning we mean that while a long-running ML job is iteratively training the expert-suggested model, the system is also iteratively learning which system setting is more efficient for that job and applying it online. Our techniques are general enough to apply to various PS-style ML systems. Experiments on TensorFlow show that our techniques can reduce the completion times of a variety of long-running TensorFlow jobs by 1.4× to 18×.
CheetahER: A Fast Entity Resolution System for Heterogeneous Camera Data
International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (DI2KG, None), 2020
The SIGMOD Programming Contest 2020 posed a real-world entity resolution problem, which requires identifying product specifications from multiple e-commerce websites that represent the same real-world cameras. Entity resolution has been extensively studied, and the general solution framework consists of two phases: blocking and matching. Most existing works focus on the matching phase, training (complex) models on large volumes of data and using the models to decide whether a pair of descriptions refers to the same real-world object. However, training a high-quality model is difficult for the SIGMOD contest because there is only a limited amount of labeled data and the product specifications can be dirty and incomplete. In this paper, we propose CheetahER, an accurate and efficient entity resolution system. Different from existing works, we focus on improving the effectiveness of the blocking phase, which is often overlooked in both academic research and industrial systems, and propose a two-phase blocking framework to group the product specifications according to brand and model. The pre-processing and data cleaning procedures are also carefully designed to improve data quality. CheetahER ranked 1st in accuracy among 53 teams and completes the task within 20 seconds. Although some designs of CheetahER are specialized for camera datasets, its novel two-phase blocking framework and operators (i.e., merging and splitting) may generalize to other entity resolution tasks.
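A stripped-down two-phase blocking sketch is shown below: specifications are first grouped by a brand keyword and then split by an alphanumeric model-code token. The brand list and the model regex are illustrative stand-ins for CheetahER's curated rules and its merging/splitting operators.

```python
# Two-phase blocking sketch: phase 1 groups by brand keyword, phase 2 splits
# each brand group by a digit-bearing model token. Brand list and regex are
# illustrative stand-ins, not CheetahER's actual rules.
import re
from collections import defaultdict

BRANDS = ["canon", "nikon", "sony", "fujifilm", "olympus", "panasonic"]
MODEL_TOKEN = re.compile(r"\b(?=\w*\d)[a-z0-9][a-z0-9\-]{1,14}\b")  # token containing a digit

def block(specs):
    """specs: list of (spec_id, title). Returns {(brand, model): [spec_id, ...]}."""
    blocks = defaultdict(list)
    for spec_id, title in specs:
        text = title.lower()
        brand = next((b for b in BRANDS if b in text), "unknown")      # phase 1: brand
        m = MODEL_TOKEN.search(text.replace(brand, " "))               # phase 2: model code
        model = m.group(0) if m else "unknown"
        blocks[(brand, model)].append(spec_id)
    return blocks

specs = [("w1/0", "Canon EOS 70D DSLR body"),
         ("w2/3", "canon eos 70d with 18-55mm kit"),
         ("w3/9", "Nikon D750 FX-format camera")]
print(dict(block(specs)))
```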
Creating Top Ranking Options in the Continuous Option and Preference Space
Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2019
Top-k queries are extensively used to retrieve the k most relevant options (e.g., products, services, accommodation alternatives, etc) based on a weighted scoring function that captures user preferences. In this paper, we take the viewpoint of a business owner who plans to introduce a new option to the market, with a certain type of clientele in mind. Given a target region in the consumer spectrum, we determine what attribute values the new option should have, so that it ranks among the top-k for any user in that region. Our methodology can also be used to improve an existing option, at the minimum modification cost, so that it ranks consistently high for an intended type of customers. This is the first work on competitive option placement where no distinct user(s) are targeted, but a general clientele type, i.e., a continuum of possible preferences. Here also lies our main challenge (and contribution), i.e., dealing with the interplay between two continuous spaces: the targeted region in the preference spectrum, and the option domain (where the new option will be placed). At the core of our methodology lies a novel and powerful interlinking between the two spaces. Our algorithms offer exact answers in practical response times, even for the largest of the standard benchmark datasets.
Accelerating Exact Inner Product Retrieval by CPU-GPU System
International Conference on Research on Development in Information Retrieval (SIGIR, CCF-A), 2019
Recommender systems are widely used in many applications, e.g., social networks and e-commerce. Inner product retrieval (IPR) is the core subroutine in Matrix Factorization (MF) based recommender systems. It consists of two phases: (i) inner product computation and (ii) top-k item retrieval. The performance bottleneck of existing solutions is the inner product computation phase. Exploiting Graphics Processing Units (GPUs) to accelerate computation-intensive workloads is the gold standard in the data mining and machine learning communities. However, it is not trivial to apply CPU-GPU systems to boost the performance of IPR solutions due to the complex nature of the IPR problem. In this work, we first analyze the time cost of each phase in existing IPR solutions. Second, we exploit the characteristics of CPU-GPU systems to improve performance; specifically, the computation tasks of the IPR solution are processed heterogeneously across the CPU and GPU. Third, we demonstrate the efficiency of our proposal on four standard real datasets.
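The two IPR phases can be written down in plain numpy as below; this CPU-only sketch just marks where the heavy inner-product phase sits (the part that is offloaded to the GPU in the paper), it is not the CPU-GPU scheduling itself.

```python
# The two IPR phases in plain numpy: a dense inner-product computation (the
# heavy phase) followed by per-user top-k selection.
import numpy as np

def inner_product_topk(user_vecs: np.ndarray, item_vecs: np.ndarray, k: int):
    scores = user_vecs @ item_vecs.T                         # phase 1: inner products
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]    # phase 2: unordered top-k
    row = np.arange(len(user_vecs))[:, None]
    order = np.argsort(-scores[row, part], axis=1)
    return part[row, order]                                  # item ids, best first

rng = np.random.default_rng(0)
users, items = rng.normal(size=(200, 32)), rng.normal(size=(20_000, 32))
print(inner_product_topk(users, items, k=10)[0])
```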
Vaite: a Visualization-Assisted Interactive Big Urban Trajectory Data Exploration System
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2019
Big urban trajectory exploration extracts insights from trajectories. It has many smart-city applications, e.g., traffic jam detection and taxi movement pattern analysis. The challenges of big urban trajectory data exploration are: (i) data analysts may not have the experience or knowledge to accurately express their analysis tasks as SQL-like queries or analysis operations; and (ii) big urban trajectory data is naturally complex, e.g., unpredictable and interrelated. In this work, we architect and implement a visualization-assisted big urban trajectory data exploration system (Vaite) to address these challenges. Vaite consists of three layers, from data collection to result visualization. We devise novel visualization views in Vaite to support interactive exploratory analysis of big urban trajectories. We demonstrate the effectiveness of Vaite through real-world applications.
Insufficient Data Can Also Rock! Learning to Converse Using Smaller Data with Augmentation
AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2019
Recent successes of open-domain dialogue generation mainly rely on the advances of deep neural networks. The effectiveness of deep neural network models depends on the amount of training data. As it is laborious and expensive to acquire a huge amount of data in most scenarios, how to effectively utilize existing data is the crux of this issue. In this paper, we use data augmentation techniques to improve the performance of neural dialogue models under the condition of insufficient data. Specifically, we propose a novel generative model to augment existing data, where the conditional variational autoencoder (CVAE) is employed as the generator to output more training data with diversified expressions. To improve the correlation of each augmented training pair, we design a discriminator with adversarial training to supervise the augmentation process. Moreover, we thoroughly investigate various data augmentation schemes for neural dialogue systems with generative models, including both GAN and CVAE. Experimental results on two open corpora, Weibo and Twitter, demonstrate the superiority of our proposed data augmentation model.
Find a Reasonable Ending for Stories: Does Logic Relation Help the Story Cloze Test?
AAAI Conference on Artificial Intelligence (AAAI, CCF-A), 2019
Natural language understanding is a challenging problem that covers a wide range of tasks. While previous methods generally train each task separately, we consider combining cross-task features to enhance task performance. In this paper, we incorporate logic information, with the help of the Natural Language Inference (NLI) task, into the Story Cloze Test (SCT). Previous work on SCT considered various semantic information, such as sentiment and topic, but lacked the logic information between sentences, which is an essential element of stories. Thus, we propose to extract the logic information over the course of the story to improve the understanding of the whole story. The logic information is modeled with the help of the NLI task. Experimental results demonstrate the strength of the logic information.
Fast Trajectory Range Query with Discrete Frechet Distance
International Conference on Extending DB Technology (EDBT, CCF-B), 2019
The discrete Fréchet distance (DFD) is widely used to measure the similarity between two trajectories. Trajectory range queries have been extensively studied in trajectory analytical applications, e.g., outlier detection and movement pattern analysis. With the discrete Fréchet distance, these applications are computation bound rather than disk I/O bound. In this work, we propose new lower and upper bound functions to speed up the evaluation of trajectory range queries. Experimental studies on three real datasets demonstrate the superiority of our proposal.
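One simple, valid lower bound follows directly from the DFD definition: every coupling pairs the two start points and the two end points, so max(d(P[0], Q[0]), d(P[-1], Q[-1])) <= DFD(P, Q). The sketch below uses this generic bound (not the paper's tighter ones) to prune a range query before falling back to the full dynamic program.

```python
# Range-query pruning with an endpoint-based lower bound on DFD, followed by
# the standard O(n*m) dynamic program only for the surviving candidates.
import numpy as np

def dfd(P, Q):
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(d[i, j], min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]))
    return ca[-1, -1]

def range_query(query, trajectories, eps):
    hits = []
    for tid, T in enumerate(trajectories):
        lb = max(np.linalg.norm(query[0] - T[0]), np.linalg.norm(query[-1] - T[-1]))
        if lb > eps:
            continue                    # pruned without the O(n*m) computation
        if dfd(query, T) <= eps:
            hits.append(tid)
    return hits

rng = np.random.default_rng(0)
trajs = [np.cumsum(rng.normal(size=(50, 2)), axis=0) for _ in range(200)]
print(range_query(trajs[0], trajs, eps=5.0))
```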
Mining Heterogeneous Urban Data for Retail Store Placement
Proceedings of the ACM Turing Celebration Conference-China (SIGMOD China, None), 2019
The retail store placement problem has been extensively studied in both academia and industry, as it determines the retail success of a business. Existing methods exploit either consumer studies (e.g., consultant-based solutions) or geographic features (e.g., points of interest) to address it. However, due to the limitations of these data sources (i.e., they are costly in time and labor), none of these methods can provide an accurate and timely solution.
In this paper, we rethink the retail store placement problem by mining heterogeneous urban data. In particular, unlike existing works that used geographic features or consumer studies alone, we extract three categories of features (i.e., human movement features, commercial area features, and geographic features) from heterogeneous urban data, and integrate them into various machine learning models to predict the popularity of a prospective retail store in a candidate area. We conduct a case study with real data in Shenzhen to demonstrate the predictive power of our proposal.
Exact Processing of Uncertain Top-k Queries in Multi-criteria Settings
Proceedings of the VLDB Endowment (PVLDB, CCF-A), 2018
Traditional rank-aware processing assumes a dataset that contains available options to cover a specific need (e.g., restaurants, hotels, etc) and users who browse that dataset via top-k queries with linear scoring functions, i.e., by ranking the options according to the weighted sum of their attributes, for a set of given weights. In practice, however, user preferences (weights) may only be estimated with bounded accuracy, or may be inherently uncertain due to the inability of a human user to specify exact weight values with absolute accuracy. Motivated by this, we introduce the uncertain top-k query (UTK). Given uncertain preferences, that is, an approximate description of the weight values, the UTK query reports all options that may belong to the top-k set. A second version of the problem additionally reports the exact top-k set for each of the possible weight settings. We develop a scalable processing framework for both UTK versions, and demonstrate its efficiency using standard benchmark datasets.
Deriving Real-time City Crowd Flows by Heterogeneous Big Urban Data
IEEE International Conference on Big Data (IEEE BigData, CCF-C), 2018
Real-time city crowd flows are extremely important for facility placement, transportation management, and public safety. In this paper, we show how to derive real-time city crowd flows from heterogeneous urban data. Unlike existing prediction-based approaches, our proposal does not rely on training data or learning models. We propose a computation framework that exploits massive heterogeneous urban data, including both immutable data (i.e., bus and subway stations, commercial-based regions) and mutable data (i.e., real-time taxi, bus, and subway transactions and trajectories). In addition, our solution provides accurate and timely city crowd flows. To provide a practical solution, we first partition the city into commercial-based regions using geographic information (e.g., road network, administrative regions). Then, we devise three major components (i.e., an urban data fusion model, a heterogeneous urban data integration model, and an effective crowd flow computation model) in the computation framework to process massive heterogeneous urban data effectively and derive city crowd flows accurately. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed solution with real heterogeneous urban data from Shenzhen, China.
Efficient Longest Streak Discovery in Multidimensional Sequence Data
Proceedings of the APWeb-WAIM joint Conference (APWeb-WAIM, CCF-C), 2018
This paper studies the problem of discovering the longest streak in multidimensional sequence datasets. Given a multidimensional sequence dataset, the contextual longest streak is the longest run of consecutive tuples in a context subspace that satisfies a specific measure constraint. It has various applications in social network analysis, computational journalism, etc. The challenges of the longest streak discovery problem are (i) the huge search space and (ii) the non-monotonicity of streak lengths. In this paper, we propose a novel computation framework with a suite of optimization techniques for the problem. Our solutions outperform the baseline solution by two orders of magnitude on both real and synthetic datasets. In addition, we validate the effectiveness of our proposal with a real-world case study.
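Once a context subspace is fixed, finding its longest streak is a single linear scan, as the sketch below shows; the hard part the paper addresses, enumerating and pruning the exponentially many context subspaces, is omitted.

```python
# Linear scan for the longest streak within one fixed context subspace.
def longest_streak(rows, context, constraint):
    """rows: ordered list of dicts; context: attribute -> value filter;
    constraint: row -> bool measure predicate."""
    in_context = [r for r in rows if all(r.get(a) == v for a, v in context.items())]
    best_len = run_len = 0
    best_start = run_start = None
    for i, row in enumerate(in_context):
        if constraint(row):
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0                      # streak broken by a non-matching tuple
    return best_start, best_len

games = [{"team": "A", "points": p} for p in [31, 28, 35, 22, 40, 41, 38, 19]]
print(longest_streak(games, context={"team": "A"},
                     constraint=lambda r: r["points"] >= 30))   # (4, 3)
```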
A Five-layer Architecture for Big Data Processing and Analytics
International Journal of Big Data Intelligence (IJBDI, None), 2018
Big data technologies have attracted much attention in recent years. Academia and industry have reached a consensus: the ultimate goal of big data is to transform ‘big data’ into ‘real value’. In this article, we discuss how to achieve this goal and propose a five-layer architecture for big data processing and analytics (BDPA), comprising a collection layer, a storage layer, a processing layer, an analytics layer, and an application layer. The five-layer architecture aims to set up a de facto standard for current BDPA solutions, to collect, manage, process, and analyse the vast volume of both static data and online data streams, and to make valuable decisions for all types of industries. Functionalities and challenges of the five layers are illustrated, with the most recent technologies and solutions discussed accordingly. We conclude with the requirements for future BDPA solutions, which may serve as a foundation for the future big data ecosystem.
Determining the Impact Regions of Competing Options in Preference Space
Proceedings of the ACM Conference on Management of Data (SIGMOD, CCF-A), 2017
In rank-aware processing, user preferences are typically represented by a numeric weight per data attribute, collectively forming a weight vector. The score of an option (data record) is defined as the weighted sum of its individual attributes. The highest-scoring options across a set of alternatives (dataset) are shortlisted for the user as the recommended ones. In that setting, the user input is a vector (equivalently, a point) in a d-dimensional preference space, where d is the number of data attributes. In this paper we study the problem of determining in which regions of the preference space the weight vector should lie so that a given option (focal record) is among the top-k score-wise. In effect, these regions capture all possible user profiles for which the focal record is highly preferable, and are therefore essential in market impact analysis, potential customer identification, profile-based marketing, targeted advertising, etc. We refer to our problem as k-Shortlist Preference Region identification (kSPR), and exploit its computational geometric nature to develop a framework for its efficient (and exact) processing. Using real and synthetic benchmarks, we show that our most optimized algorithm outperforms by three orders of magnitude a competitor we constructed from previous work on a different problem.
Extracting Top-K Insights from Multi-dimensional Data
Proceedings of the ACM Conference on Management of Data (SIGMOD, CCF-A), 2017
OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper takes a first attempt towards automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but it also reduces the manual effort of data analysts. In particular, we propose the concept of an insight, which captures an interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of a measure by a dimension). An example insight is: "Brand B's rank (across brands) falls over the years, in terms of the increase in sales." Our problem is to compute the top-k insights by a score function. It poses challenges on (i) the effectiveness of the result and (ii) the efficiency of the computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.
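A toy version of one composite extractor ("rank by a dimension") can be put together with pandas: aggregate a measure, rank within each year, and score how steadily one brand's rank falls. The data and the scoring rule below are illustrative stand-ins, not the paper's scoring function.

```python
# Toy "rank by a dimension" extractor: aggregate sales, rank brands per year,
# and score how often brand B's rank worsens. The score is a simple stand-in.
import pandas as pd

sales = pd.DataFrame({
    "brand": ["A", "B", "C"] * 4,
    "year":  [2014] * 3 + [2015] * 3 + [2016] * 3 + [2017] * 3,
    "sales": [50, 60, 40, 55, 52, 45, 58, 47, 49, 62, 40, 52],
})

agg = sales.groupby(["year", "brand"], as_index=False)["sales"].sum()
agg["rank"] = agg.groupby("year")["sales"].rank(ascending=False).astype(int)

trend = agg[agg["brand"] == "B"].sort_values("year")
drops = (trend["rank"].diff().dropna() > 0).mean()   # share of years where B's rank worsens
print(trend[["year", "rank"]].to_string(index=False), f"\ninsight score ~ {drops:.2f}")
```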
Exploit Every Bit: Effective Caching for High-Dimensional Nearest Neighbor Search (Extended Abstract)
International Conference on Data Engineering (ICDE, CCF-A), 2017
High-dimensional k nearest neighbor (kNN) search has a wide range of applications in multimedia information retrieval. Existing disk-based kNN search methods incur significant I/O costs in the candidate refinement phase. In this paper, we propose to cache compact approximate representations of data points in main memory in order to reduce the candidate refinement time during kNN search. This problem raises two challenging issues: (i) which is the most effective encoding scheme for data points to support kNN search? and (ii) what is the optimal number of bits for encoding a data point? For (i), we formulate and solve a novel histogram optimization problem that decides the most effective encoding scheme. For (ii), we develop a cost model for automatically tuning the optimal number of bits for encoding points. In addition, our approach is generic and applicable to exact/approximate kNN search methods. Extensive experimental results on real datasets demonstrate that our proposal can accelerate the candidate refinement time of kNN search by at least an order of magnitude.
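The caching idea can be sketched with a uniform b-bit-per-dimension quantizer held in memory and used to shortlist candidates before exact refinement; uniform quantization and the fixed candidate multiplier are simplifications of the paper's histogram-optimized encoding and cost model.

```python
# Sketch of the caching idea: keep a b-bit-per-dimension quantization of every
# point in memory and use it to filter candidates before exact refinement.
# Uniform quantization is a simplification of the histogram-optimized encoding.
import numpy as np

class QuantizedCache:
    def __init__(self, data: np.ndarray, bits: int = 4):
        self.lo, self.hi = data.min(axis=0), data.max(axis=0)
        self.levels = (1 << bits) - 1
        self.step = (self.hi - self.lo) / self.levels + 1e-12
        self.codes = np.round((data - self.lo) / self.step).astype(np.uint8)

    def approx_dist(self, q: np.ndarray) -> np.ndarray:
        recon = self.codes * self.step + self.lo          # decoded approximations
        return np.linalg.norm(recon - q, axis=1)

def knn_with_cache(data, cache, q, k=10, refine_factor=4):
    approx = cache.approx_dist(q)
    cand = np.argpartition(approx, k * refine_factor)[:k * refine_factor]
    exact = np.linalg.norm(data[cand] - q, axis=1)        # refinement on few candidates
    return cand[np.argsort(exact)[:k]]

rng = np.random.default_rng(0)
X, q = rng.random((100_000, 16)), rng.random(16)
cache = QuantizedCache(X, bits=4)
print(knn_with_cache(X, cache, q, k=5))
```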
Efficient Motif Discovery in Spatial Trajectories Using Discrete Fréchet Distance
Proceedings of the International Conference on Extending Database Technology (EDBT, CCF-B), 2017
The discrete Fréchet distance (DFD) captures perceptual and geographical similarity between discrete trajectories. It has been successfully adopted in a multitude of applications, such as signature and handwriting recognition, computer graphics, as well as geographic applications. Spatial applications, e.g., sports analysis and traffic analysis, require discovering the pair of most similar subtrajectories, whether they are parts of the same or of different input trajectories. The identified pair of subtrajectories is called a motif. The adoption of DFD as the similarity measure in motif discovery, although semantically ideal, is hindered by the high computational complexity of DFD calculation. In this paper, we propose a suite of novel lower bound functions and a grouping-based solution with multi-level pruning in order to compute motifs with DFD efficiently. Our techniques apply directly to motif discovery within the same or between different trajectories. An extensive empirical study on three real trajectory datasets reveals that our approach is three orders of magnitude faster than a baseline solution.
Fast Subsequence Search on Time Series Data
Proceedings of the International Conference on Extending Database Technology (EDBT, CCF-B), 2017
Many applications on time series data require solving the subsequence search problem, which has been extensively studied in the database and data mining communities. These applications are computation bound rather than disk I/O bound. In this work, we propose effective and cheap lower bounds to reduce the computation cost of the subsequence search problem. Experimental studies show that the proposed lower bounds can boost the performance of the state-of-the-art solution by up to an order of magnitude.
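A baseline subsequence search with the classic early-abandoning trick is sketched below for context; early abandoning is a generic optimization, whereas the paper's cheap lower bounds prune candidates before any per-point accumulation starts.

```python
# Baseline subsequence search under Euclidean distance with the classic
# early-abandoning trick: stop accumulating a candidate's distance once it
# already exceeds the best-so-far.
import numpy as np

def best_match(series: np.ndarray, query: np.ndarray):
    m, best_d2, best_pos = len(query), np.inf, -1
    for start in range(len(series) - m + 1):
        d2 = 0.0
        for j in range(m):
            d2 += (series[start + j] - query[j]) ** 2
            if d2 >= best_d2:
                break                     # early abandon: cannot beat best-so-far
        else:
            best_d2, best_pos = d2, start
    return best_pos, best_d2 ** 0.5

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
q = x[4000:4100] + rng.normal(scale=0.01, size=100)
print(best_match(x, q))
```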
Exploit Every Cycle: Vectorized Time Series Algorithms on Modern Commodity CPUs
Proceedings of International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS), affiliated with PVLDB (CCF-A), 2016
Many time series algorithms reduce the computation cost by pruning unpromising candidates with lower-bound distance functions. In this paper, we focus on an orthogonal research direction that further boosts performance by unlocking the potential of modern commodity CPUs. First, we conduct a performance profiling of existing algorithms to understand where the time goes. Second, we design vectorized implementations of lower-bound and distance functions that exploit the characteristics (e.g., data parallelism, caching, branch prediction) provided by modern CPUs. Third, our vectorized methods are general and applicable to many time series problems such as subsequence search, motif discovery, and kNN classification. Our experimental study on real datasets shows that our proposal can achieve up to a 6× speedup.
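The flavor of such a rewrite can be shown in numpy: sliding-window Euclidean distances computed loop-free via one convolution for the cross terms and prefix sums for the window energies. This only illustrates the data-parallel style; the paper's implementations apply SIMD to the actual lower-bound and distance kernels in native code.

```python
# Loop-free sliding-window squared Euclidean distances:
# dist^2(i) = sum(win_i^2) - 2*dot(win_i, q) + sum(q^2),
# with the dot terms from one convolution and window energies from prefix sums.
import numpy as np

def sliding_sq_dists(series: np.ndarray, query: np.ndarray) -> np.ndarray:
    m = len(query)
    cross = np.convolve(series, query[::-1], mode="valid")       # sum_j x[i+j]*q[j]
    csum2 = np.concatenate(([0.0], np.cumsum(series ** 2)))
    win_energy = csum2[m:] - csum2[:-m]                          # sum_j x[i+j]^2
    return win_energy - 2.0 * cross + np.sum(query ** 2)

x = np.random.default_rng(0).normal(size=100_000)
q = x[500:628]
d2 = sliding_sq_dists(x, q)
print(int(np.argmin(d2)))   # 500: the query's own location
```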
Exploit Every Bit: Effective Caching for High-Dimensional Nearest Neighbor Search
IEEE Transactions on Knowledge and Data Engineering (TKDE, CCF-A), 2016
High-dimensional k nearest neighbor (kNN) search has a wide range of applications in multimedia information retrieval. Existing disk-based kNN search methods incur significant I/O costs in the candidate refinement phase. In this paper, we propose to cache compact approximate representations of data points in main memory in order to reduce the candidate refinement time during kNN search. This problem raises two challenging issues: (i) which is the most effective encoding scheme for data points to support kNN search? and (ii) what is the optimal number of bits for encoding a data point? For (i), we formulate and solve a novel histogram optimization problem that decides the most effective encoding scheme. For (ii), we develop a cost model for automatically tuning the optimal number of bits for encoding points. In addition, our approach is generic and applicable to exact/approximate kNN search methods. Extensive experimental results on real datasets demonstrate that our proposal can accelerate the candidate refinement time of kNN search by at least an order of magnitude.
Diversified caching for replicated web search engines
IEEE International Conference on Data Engineering (ICDE, CCF-A), 2015
Commercial web search engines adopt a parallel and replicated architecture in order to support high query throughput. In this paper, we investigate the effect of caching on throughput in such a setting. A simple scheme, called uniform caching, would replicate the cache content to all servers. Unfortunately, it does not exploit the variations among queries, thus wasting memory space by caching the same content redundantly on multiple servers. To tackle this limitation, we propose the diversified caching problem, which aims to diversify the types of queries served by different servers and to maximize the sharing of terms among queries assigned to the same server. We show that it is NP-hard to find the optimal diversified caching scheme, and identify intuitive properties for seeking good solutions. We then present a framework with a suite of techniques and heuristics for diversified caching. Finally, we evaluate the proposed solution against competitors using a real dataset and a real query log.
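A greedy sketch of the diversification idea follows: assign each popular query to the server whose cached queries share the most terms with it, breaking ties by load. The scoring rule is an illustrative heuristic, not the paper's formulation.

```python
# Greedy query-to-server assignment that favors term sharing within a server
# and balances load on ties -- an illustrative heuristic only.
from collections import Counter

def assign_queries(queries, n_servers):
    """queries: list of term lists, most popular first. Returns per-server query lists."""
    servers = [{"terms": Counter(), "queries": []} for _ in range(n_servers)]
    for q in queries:
        overlap = [sum(s["terms"][t] > 0 for t in set(q)) for s in servers]
        load = [len(s["queries"]) for s in servers]
        best = max(range(n_servers), key=lambda i: (overlap[i], -load[i]))
        servers[best]["queries"].append(q)
        servers[best]["terms"].update(set(q))
    return [s["queries"] for s in servers]

queries = [["cheap", "flights", "paris"], ["paris", "hotels"],
           ["python", "tutorial"], ["python", "pandas", "tutorial"],
           ["flights", "london"]]
print(assign_queries(queries, n_servers=2))
```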