publications | Yangdi Jiang

2024

Analysis of Differentially Private Synthetic Data: A Measurement Error Approach

Yangdi Jiang, Yi Liu, Xiaodong Yan, and 3 more authors

Proceedings of the AAAI Conference on Artificial Intelligence, Mar 2024

Differentially private (DP) synthetic datasets have been receiving significant attention from academia, industry, and government. However, little is known about how to perform statistical inference using DP synthetic datasets. Naive approaches that do not take into account the induced uncertainty due to the DP mechanism will result in biased estimators and invalid inferences. In this paper, we present a class of maximum likelihood estimator (MLE)-based easy-to-implement bias-corrected DP estimators with valid asymptotic confidence intervals (CI) for parameters in regression settings, by establishing the connection between additive DP mechanisms and measurement error models. Our simulation shows that our estimator has comparable performance to the widely used sufficient statistic perturbation (SSP) algorithm in some scenarios but with the advantage of releasing a synthetic dataset and obtaining statistically valid asymptotic CIs, which can achieve better coverage when compared to the naive CIs obtained by ignoring the DP mechanism.

Modelling impacts of climate change on snow drought, groundwater drought, and their feedback mechanism in a snow-dominated watershed in western Canada

Yinlong Huang, Yangdi Jiang, Bei Jiang, and 4 more authors

Journal of Hydrology, Jun 2024

Snow accumulation and its melt are key hydrological processes in cold watersheds, which can affect groundwater (GW). With climate change projected to alter snow processes in these regions, understanding their impacts on the development of droughts is vital. A deficit in snow precipitation or accelerated snowmelt due to warming can trigger snow drought, potentially leading to GW drought. To investigate this relationship at a watershed scale, we coupled the Soil and Water Assessment Tool (SWAT) and MODFLOW to simulate various surface and groundwater processes under historical (1980–2013) and future (2040–2073) warming scenarios in western Canada. We calibrated, validated, and verified our models using streamflow, GW heads, and snow depth data from multiple hydrometric stations, observation wells, and a grid product. Using simulated data, we analyzed snow and GW drought characteristics, their dominant physical processes, and GW response time across eco-hydro(geo)logical regions like Mountains, Foothills, and Plains. Historical data show that Mountains are experiencing more snow droughts while Plains are facing greater GW droughts. However, future scenarios suggest increased snow droughts in all regions and a shift towards more severe GW droughts in Plains. Historical response time of GW to changes in snow processes spans 4–6 months from Mountains to Plains, with projected reductions in Mountains and Foothills, and a slight increase in Plains in the future. The dominant physical processes controlling GW response across all regions are soil moisture and percolation, with curve number displaying more significance in Mountains, and water yield exerts more control in Foothills and Plains. During cold seasons, SWE and snowmelt had minimal impact on GW response in Plains, while they presented a major role in Mountains. 
This study lays the basis for further research on snow-groundwater interactions in cold watersheds, aiding water resource management in mid-to-high latitude regions and providing a unified framework for analyzing snow and GW drought relationships.

2023

Gaussian differential privacy on Riemannian manifolds

Yangdi Jiang, Xiaotian Chang, Yi Liu, and 3 more authors

In Proceedings of the 37th International Conference on Neural Information Processing Systems, Dec 2023

We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distribution that integrates the Riemannian distance, allowing us to achieve GDP in Riemannian manifolds with bounded Ricci curvature. To the best of our knowledge, this work marks the first instance of extending the GDP framework to accommodate general Riemannian manifolds, encompassing curved spaces, and circumventing the reliance on tangent space summaries. We provide a simple algorithm to evaluate the privacy budget μ on any one-dimensional manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based algorithm to calculate μ on any Riemannian manifold with constant curvature. Through simulations on one of the most prevalent manifolds in statistics, the unit sphere S^d, we demonstrate the superior utility of our Riemannian Gaussian mechanism in comparison to the previously proposed Riemannian Laplace mechanism for implementing GDP.

2022

Measuring re-identification risk using a synthetic estimator to enable data sharing

Yangdi Jiang, Lucy Mosquera, Bei Jiang, and 2 more authors

PLOS ONE, Jun 2022

Background: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Objectives: Develop an accurate risk estimator for the sample-to-population attack.
Methods: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
Results: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
Conclusions: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.