Final ACO Doctoral Examination and Defense of Dissertation of Osman Dai: December 5, 2023

Title: Fundamental Limits and Algorithms for Database and Graph Alignment

Date: December 5, 2023
Time: 1:00 PM
Location: ISyE Groseclose 226

Co-advisors:
Dr. Negar Kiyavash, Business Analytics (Ecole Polytechnique Federale de Lausanne)
Dr. Mohit Singh, School of Indsutrial and Systems Engineering (Georgia Tech)

Committee Members:
Dr. Daniel Cullina, School of Electrical Engineering and Computer Science (Pennsylvania State University)
Dr. Cheng Mao, School of Mathematics (Georgia Tech)
Dr. Ashwin Pananjady, Electrical Engineering and Computer Science (Georgia Tech)

Reader:
Dr. Cheng Mao, School of Mathematics (Georgia Tech)

Thesis: https://drive.google.com/file/d/1bHCgBMWUEbXuTxFz6e3DIdQWJQOTIYMr/view?u...

Abstract: Data alignment refers to a class of problems where given two sets of anonymized data pertaining to overlapping sets of users, the goal is to identify the correspondences between the two sets. If the data of a user is contained in both sets, the correlation between the two data points associated with the user might make it possible to determine that both belong to the same user and hence link the data points. Alignment problems are of practical interest in applications such as privacy and data junction. Data alignment can be used to de-anonymize data, therefore, studying the feasibility of alignment allows for a more reliable understanding of the limitations of anonymization schemes put in place to protect against privacy breaches. Additionally, data alignment can aid in finding the correspondence between data from different sources, e.g. different sensors. The data fusion performed through data alignment in turn can help with a variety of inference problems that arise in scientific and engineering applications.

This thesis considers two types of data alignment problems: database and graph alignment. Database alignment refers to the setting where all features (i.e. data points) in each data set are associated with a single user. Graph alignment refers to the setting where data points in each data set are associated with pairs of users. For both problems, we are particularly interested in the asymptotic case where n, the number of users with data in both sets, goes to infinity, although our analyses often yield results applicable to the finite n case. To develop a preliminary understanding of the database alignment problem, we first study the closely related problem of planted matching with Gaussian weights of unit variance, and derive tight achievability bounds that match our converse bounds: Specifically we identify different inequalities between the signal strength (which corresponds to the square of the difference between the mean weights of planted and non-planted edges) and log n that guarantee upper bounds on the log of the expected number of errors. Then, we study the database alignment problem with Gaussian features in the low per-feature correlation setting where the number of dimensions of each feature scales as ω(log n): We derive inequalities between signal strength (which, for database alignment, corresponds to the mutual information between correlated features) that guarantee error bounds that match those of the planted matching setting, supporting the claimed connection between the two problems. Then, relaxing the restriction on the number of dimensions of features, we derive conditions on signal strength and dimensionality that guarantee smaller upper bounds on the log of the expected number of errors. The stronger results in the O(log n)-dimensional-feature setting for Gaussian databases show how planted matching, while useful, is not a perfect substitute to understand the dynamics of the more complex problem of database alignment. For graph alignment, we focus on the correlated Erdős–Rényi graph model where the data point (i.e. edge) associated with each pair users in a graph is a Bernoulli random variable that is correlated with the data point associated with the same pair in the other graph. We study a canonical labeling algorithm for alignment and identify conditions on the density of the graphs and correlation between edges across graphs that guarantees the recovery of the true alignment with high probability.