Abstract
Differential privacy provides a strong form of privacy while preserving most of the original characteristics of the dataset. Utilizing these benefits requires one to design specific differentially private data analysis algorithms. In this work, we present three tree-based algorithms for mining redescriptions while preserving differential privacy. Redescription mining is an exploratory data analysis method for finding connections between two views over the same entities, for example the phenotypes and genotypes of medical patients. It has applications in many fields, including some, like health care informatics, where privacy-preserving access to data is desired. Our algorithms are the first tree-based differentially private redescription mining algorithms, and we show via experiments that, despite the noise inherent to differential privacy, they can return trustworthy results even on smaller datasets, where noise typically has a stronger effect.
1 Introduction
Modern data analysis methods have had a huge impact on analysing health and bioinformatics data, with personalised medicine (Fröhlich et al. 2018), computer-aided diagnoses (Van Ginneken et al. 2001), and new drugs (Napolitano et al. 2013) being just some examples. Exploratory data analysis still has much more to contribute to these fields, but access to the data is often restricted due to the sensitivity of health information. The standard procedure for requesting access to such data typically involves submitting a detailed research plan, which is often required to take the form of a confirmatory study, meaning that the researchers must present a hypothesis to be tested. This standard procedure is ill-suited for exploratory analysis, which relies on trying out a number of different approaches in order to see whether the data reveals some novel knowledge. As a consequence, requests for permission to access data for exploratory research often get rejected.
The particular method of exploratory data analysis that we focus on in this paper is redescription mining. It aims at finding pairs of queries over the data attributes such that the queries describe approximately the same sets of entities. It can be used, among other things, to find connections between patient diagnoses and lab test results; for instance, finding that patients with the following laboratory results
1. normal total iron binding capacity, total \(\textrm{CO}_2\) and hemoglobin not tested; or
2. abnormal total \(\textrm{CO}_2\) and normal eosinophils levels with triglycerides not tested; or
3. normal \(\textrm{CO}_2\) level and blood potassium level, polys (neutrophils) levels not tested
suffer from either chronic pulmonary heart diseases, hypovolemia (decreased blood volume), or hypertension (and vice versa). Information like this—obtained from real-world data with a method proposed in this paper—can be used to guide doctors to best practices (which lab tests should be ordered), to identify particularly strong or weak combinations of tests, or to improve diagnoses.
The problem with the above data analysis is that it is often impossible to do, as the data are confidential. And even if it were done by a trusted party on the hospital data, sharing the results would still be a problem, as it could lead to leaking private patient information.
The solution to these problems lies in the well-known method of differential privacy (Dwork et al. 2006). By giving researchers differentially private access to the dataset, the data owner can rest assured that they will not be able to infer information about any individual in the dataset, providing much stronger privacy guarantees than standard anonymisation methods. Consequently, the data owner can give researchers differentially private access to the data under a more lenient application process, thus allowing them to experiment with new methods more easily. Also, the results can be shared without fear of leaking sensitive information.
In addition to providing strong theoretical privacy guarantees, differential privacy has the advantage of allowing the use of the unchanged, original data via different mechanisms that guarantee the privacy of the subjects in all information obtained and potentially publicly released using these mechanisms. Much of this information can be lost when using other techniques for preserving privacy. For example, perturbation-based techniques affect the range and values of the attributes (Agrawal and Aggarwal 2001), projection-based perturbation techniques in addition lose information about the meaning and number of the original attributes (Kenthapadi et al. 2013; Liu et al. 2006), and fuzzy logic-based perturbation techniques potentially obfuscate the numerical boundaries and ranges of the original attributes (Jahan et al. 2014).
On the other hand, developing a differentially private algorithm for a pattern-based data mining task, such as redescription mining, is perhaps even harder than for many other applications of differential privacy, such as classification. A small perturbation can break a pattern, while it would typically have no major impact on classification.
In this paper, we propose three tree-based differentially private redescription mining algorithms and study their properties. Two algorithms closely follow the alternation scheme of the well-known CARTwheels algorithm (Ramakrishnan et al. 2004) using differentially private decision trees: the AltExpM algorithm is based on the exponential mechanism (Friedman and Schuster 2010) and the AltMCMC algorithm embeds differential privacy into the decision tree construction (Bai et al. 2017). The third algorithm, denoted MCMCTreePair, is a differentially private redescription mining algorithm that uses a completely novel technique for embedding differential privacy into the construction of pairs of decision trees.
Concretely, we present the following contributions:
- A novel methodology for the construction of differentially private redescriptions (extractReds) using a pair of differentially private decision trees. This technique allows using much lower noise levels for differential privacy and, consequently, produces more accurate results.
- A novel algorithm (sampleTreePair) for the creation of a differentially private pair of decision trees using MCMC sampling. This extends the ideas of Bai et al. (2017) to simultaneously create pairs of differentially private trees with matching leaves so that they are suitable for redescription construction.
- An adaptation of the alternating procedure for redescription construction (Ramakrishnan et al. 2004) to the setting where differential privacy needs to be preserved (createRedsAlt).
- An adaptation of an existing differentially private decision tree algorithm based on MCMC sampling (Bai et al. 2017) for creating differentially private redescriptions (createDTMCMC). Our approach utilizes a fixed tree depth and avoids unnecessary computations.
- A similar adaptation of an existing differentially private decision tree algorithm based on the exponential mechanism (Friedman and Schuster 2010) for creating differentially private redescriptions (createDTExpMech).
We performed a thorough experimental evaluation of the three developed approaches for differentially private redescription mining. The obtained results show that the MCMCTreePair algorithm outperforms the AltMCMC and AltExpM algorithms on smaller datasets (although not always significantly), but AltMCMC and AltExpM significantly outperform the MCMCTreePair algorithm on medium-sized datasets. As one allows more and more redescriptions to be generated during the experiments (which necessitates a stricter privacy setting), MCMCTreePair outperforms AltMCMC and AltExpM on an increasing number of medium-sized datasets as well. Additionally, AltMCMC and AltExpM may fail to produce any redescriptions in some runs. This occurs with relatively small probability (\(\le 0.1\) on our test datasets), but the probability depends entirely on the quality of the input data, which can only be influenced by the data providers. For this reason, we propose a way to significantly reduce the probability of such failure (to 0 in our experiments). This procedure reduces the overall performance of the AltMCMC and AltExpM algorithms, reducing their advantage over the MCMCTreePair algorithm.
To conclude, all three approaches have their advantages and there are scenarios in which they excel. The MCMCTreePair algorithm excels on small, potentially unprepared data, or where a strict privacy setting is required. The robust versions of the AltMCMC and AltExpM algorithms excel on medium-sized or large data, potentially not preprocessed, where a medium or small number of repeated differentially private experiments needs to be performed. The non-robust AltMCMC and AltExpM algorithms excel on well-preprocessed medium-sized or large datasets where very few differentially private experiments, or just one, need to be performed.
The rest of the paper is organized as follows. As the paper combines two topics, we start with a brief primer and review of related work for each of them: Sect. 2 is about redescription mining and Sect. 3 about differential privacy. The tree-based differentially private algorithms are presented in Sect. 4, the experimental evaluation in Sect. 5, and conclusions in Sect. 6.
2 Primer on redescription mining
Redescription mining aims at finding two different ways to describe almost the same observations. In this paper, the dataset consists of a pair of tables, denoted \(\textbf{D}_{L}\) and \(\textbf{D}_{R}\), i.e. \(\textbf{D}= (\textbf{D}_{L}, \textbf{D}_{R})\). Both tables are over the same entities (rows) but have disjoint collections of attributes (columns). The set of attributes of a table \(\textbf{D}_X\) is denoted \(\texttt {attrs} (\textbf{D}_X)\) while \(\texttt {view} (a)\) indicates which table a given attribute a belongs to, so that for any \(a \in \texttt {attrs} (\textbf{D}_{X})\), \(\texttt {view} (a) = X\).
The pair of data tables, together with user-defined constraints such as, in particular, minimum support and minimum accuracy, constitute the input of redescription mining. In this context, a redescription consists of a pair of logical queries, one over either data table, respectively denoted \(q_{L}\) and \(q_{R}\). More specifically, a query consists of literals over Boolean, categorical or numerical attributes, combined with logical conjunction and disjunction operators, possibly involving negations. For instance, the following query
\[ \bigl (\lnot \,\text {hypertension} \wedge \text {state} = \text {California}\bigr ) \vee \bigl (\text {birth year} < 1950\bigr ) \]
describes entities, in this case individuals, who either do not suffer from hypertension and reside in California, or are born before 1950. The collection of entities described by a query is called the support of the query, denoted \({{\,\textrm{supp}\,}}(q)\). In other words, \({{\,\textrm{supp}\,}}(q)\) is the set of entities that satisfy the query \(q\).
The main quality of a redescription is that the queries have similar supports. The similarity of supports is typically measured using the Jaccard similarity index. Hence, the accuracy of a redescription \((q_{L}, q_{R})\) is defined as
\[ {{\,\textrm{J}\,}}(q_{L}, q_{R}) = \frac{\vert {{\,\textrm{supp}\,}}(q_{L}) \cap {{\,\textrm{supp}\,}}(q_{R})\vert }{\vert {{\,\textrm{supp}\,}}(q_{L}) \cup {{\,\textrm{supp}\,}}(q_{R})\vert }. \]
In addition to having a high accuracy, to be of interest a redescription should cover neither too few nor too many entities. This is typically ensured by setting minimum and maximum support thresholds. Finally, the size of the intersection of the queries' supports should not be a direct consequence of their respective supports. For instance, if both queries cover almost the entire dataset, it is not at all surprising, and hence not interesting, that they have a large overlap. This is controlled by computing \(p\text {-value}\) s representing the probability that the intersection of the supports of two randomly chosen queries is as large as or larger than observed in the studied redescription. The random queries are chosen with a priori probabilities equal to the fractions of entities described by the redescription's queries, under the assumption that all entities can be sampled with equal probability. Redescriptions with \(p\text {-value}\) s over a chosen significance threshold (typically 0.01) are then deemed uninteresting. For details about the computation of \(p\text {-value}\) s and a more extensive account of redescription mining, we direct the reader to the work of Galbrun and Miettinen (2018b) and references therein.
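As an illustration, one common instantiation of this test (used, e.g., in the work of Galbrun and Miettinen cited above) is the binomial tail probability
\[ {{\,\textrm{pval}\,}}(q_{L}, q_{R}) = \sum _{i=o}^{m} \left( {\begin{array}{c}m\\ i\end{array}}\right) (p_{L}\, p_{R})^i (1 - p_{L}\, p_{R})^{m-i}, \]
where \(m\) is the total number of entities, \(p_{X} = \vert {{\,\textrm{supp}\,}}(q_{X})\vert /m\), and \(o = \vert {{\,\textrm{supp}\,}}(q_{L}) \cap {{\,\textrm{supp}\,}}(q_{R})\vert \); the exact formulation used in this paper follows the cited work.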
2.1 Related work
Since redescription mining was introduced by Ramakrishnan et al. (2004), various algorithms have been proposed for the task. Some algorithms are based, for instance, on iteratively growing the queries in a greedy fashion (Gallo et al. 2008; Galbrun and Miettinen 2012a), or on mining and combining frequent itemsets (Gallo et al. 2008).
Several algorithms, including the pioneering one (Ramakrishnan et al. 2004), are based on decision trees, be it classification trees (Zinchenko et al. 2015), predictive clustering trees (PCT) (Mihelčić et al. 2017) or random forests of PCTs (Mihelčić et al. 2018). These algorithms build the queries by inducing a decision tree over one side of the data, using the current query on the other side or an initialization vector as the target, and alternating between the sides. Different algorithms are obtained depending on precisely how the trees are induced, growing entire trees to full depth at once (Mihelčić et al. 2017; Ramakrishnan et al. 2004), progressively increasing the depth of the trees as in the SplitT algorithm (Zinchenko et al. 2015), or growing subtrees to expand existing branches as in the LayeredT algorithm (Zinchenko et al. 2015). For reasons that will become clear, methods that rely on tree induction are the most relevant to us, and the approach we propose also belongs to this family.
Initially limited to Boolean data, the task was later extended to numerical attributes by Galbrun and Miettinen (2012a) and Zinchenko et al. (2015) (following the greedy and tree-based approaches, respectively).
Redescription mining has been used in various application domains, including political analysis (Galbrun and Miettinen 2016), econometrics (Galbrun et al. 2018), bio-informatics (Ramakrishnan and Zaki 2009) and medicine (Mihelčić et al. 2017).
3 Primer on differential privacy
Differential privacy ensures that it cannot be inferred, within a fixed probability, whether or not a particular individual is included in a dataset—irrespective of what other information the adversary might have. This is done by withholding direct access to the data and only returning randomized answers.
Formally, if \(\textbf{D}\) and \(\textbf{D}'\) are two datasets that differ in exactly one row, A is a randomized algorithm operating on these tables, S is any set of potential results A might return, and \(\varepsilon >0\) is the privacy parameter (or budget), then A is \(\varepsilon \)-differentially private (later also simply \(\varepsilon \)-private) if
\[ \Pr [A(\textbf{D}) \in S] \le e^{\varepsilon }\, \Pr [A(\textbf{D}') \in S], \]
where the probability is over the randomness of A (see Dwork and Roth 2014). This property ensures that a change in one row of a dataset has a limited impact on the output of A. Thus, very little information about the individual can be obtained if a sufficiently strict privacy level is enforced. Two important features of differential privacy are composability and resistance to post-processing. Composability means that if the user runs multiple algorithms on the data (or the same algorithm multiple times), then assuming the source of randomness is independent, the results are still differentially private: if the user runs k algorithms that are \(\varepsilon \)-private, the results are \(k\varepsilon \)-private (McSherry 2009) and conversely, if one wants to preserve \(\varepsilon \)-privacy over k runs, each run must be \(\varepsilon /k\)-private. Resistance to post-processing means that the results of an \(\varepsilon \)-differentially private algorithm will stay (at least) \(\varepsilon \)-differentially private after any post-processing that does not have access to the original data.
Composability has been a topic of intense study in differential privacy. The aforementioned \(k\varepsilon \)-privacy bound cannot be improved in “pure” differential privacy, but if one weakens the privacy definition, one can obtain different trade-offs, using, e.g. the composition theorem of Dwork et al. (2010) or concentrated differential privacy (Dwork and Rothblum 2016). These methods are straightforward to apply to \(\varepsilon \)-differentially private mechanisms, but they do require knowledge of the acceptable privacy trade-off in the application domain. Therefore, further analysis using these weaker forms of privacy is out of scope for this manuscript.
When considering differentially private algorithms, an important concept is the sensitivity of the underlying functions. Assume that the differentially private algorithm A computes some deterministic function f of the data (e.g. counts a particular type of observations). The sensitivity of f is defined as \(\Delta f = \max _{\textbf{D}, \textbf{D}'}\Vert f(\textbf{D}) - f(\textbf{D}')\Vert _1\), where the maximum is taken over all pairs of datasets \(\textbf{D}\) and \(\textbf{D}'\) that differ in exactly one row.
When A is an algorithm that returns a number (e.g. a counting query), a typical way to ensure differential privacy is to use the Laplace mechanism (Dwork et al. 2006): noise drawn from the Laplace distribution \(\texttt {Laplace}(0; \Delta f/\varepsilon )\) is added to the correct answer.
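As a concrete illustration, the following sketch (ours, not part of the paper's implementation; all names are illustrative) draws Laplace noise via inverse transform sampling and adds it to a numerical query result:

```java
import java.util.Random;

public class LaplaceMechanism {
    private static final Random RNG = new Random();

    // Draw a sample from Laplace(0, scale) via inverse transform sampling.
    static double laplaceNoise(double scale) {
        double u = RNG.nextDouble() - 0.5; // uniform on [-0.5, 0.5)
        return -scale * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
    }

    // Epsilon-differentially private version of a numerical query result:
    // the true answer plus Laplace(0, deltaF / epsilon) noise.
    static double privatize(double trueAnswer, double deltaF, double epsilon) {
        return trueAnswer + laplaceNoise(deltaF / epsilon);
    }

    public static void main(String[] args) {
        // A counting query has sensitivity 1: one row changes the count by at most 1.
        System.out.println(privatize(1234.0, 1.0, 0.1));
    }
}
```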
For more complex queries, the exponential mechanism (McSherry and Talwar 2007) can be used. Here, instead of adding noise to the answer, the algorithm samples the final answer from the set of all answers. The probability of sampling a particular answer r is proportional to
\[ \exp \Bigl (\frac{\varepsilon \, score _\textbf{D}(r)}{2\,\Delta score }\Bigr )\, \mu (r), \quad (2) \]
where \( score _\textbf{D}(r)\) is the quality of answer r on data \(\textbf{D}\) and \(\mu \) is a measure assigning a base probability to r (\(\mu \) is often uniform). That is, the exponential mechanism requires a way to assign a quality to each answer. The level of privacy, on the other hand, depends on the sensitivity of the quality measure, that is, on how much the quality can change with a change to a single record in the dataset.
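The following minimal sketch illustrates the exponential mechanism for a finite answer set with a uniform base measure \(\mu \); the quality scores and their sensitivity are assumed to be supplied by the caller:

```java
import java.util.Random;

public class ExponentialMechanism {
    private static final Random RNG = new Random();

    // Sample an index i with probability proportional to
    // exp(epsilon * scores[i] / (2 * sensitivity)).
    static int sample(double[] scores, double epsilon, double sensitivity) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double[] weights = new double[scores.length];
        double total = 0;
        for (int i = 0; i < scores.length; i++) {
            // Shifting by max stabilizes the exponentials without
            // changing the resulting distribution.
            weights[i] = Math.exp(epsilon * (scores[i] - max) / (2 * sensitivity));
            total += weights[i];
        }
        double u = RNG.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            u -= weights[i];
            if (u <= 0) return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```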
3.1 Related work
After the initial formulation by Dwork et al. (2006), building on their earlier work on the SuLQ framework (Blum et al. 2005), and the introduction of the exponential mechanism by McSherry and Talwar (2007), differential privacy has seen significant interest in academia, and recent years have brought uptake by the industry as well (e.g. Ding et al. 2017). McSherry’s (2009) PINQ is a framework for implementing differentially private data analysis using smaller building blocks. For example, Friedman and Schuster (2010) presented a differentially private ID3 decision tree algorithm using PINQ operations.
Since then, differential privacy has been integrated in many fields of data mining and machine learning. Many classification algorithms, such as decision trees (Friedman and Schuster 2010; Gong et al. 2020; Jagannathan et al. 2012; Bai et al. 2017), random forests (Rana et al. 2015), naive Bayes (Gong et al. 2020; Vaidya et al. 2013), SVM (Gong et al. 2020; Li et al. 2014; Rubinstein et al. 2012), k nearest neighbours (Gursoy et al. 2017), artificial bee colonies, and differential evolution (Zorarpacı and Özel 2020) have been adapted to preserve differential privacy. Differential privacy has also been used in a semi-supervised setting (Jagannathan et al. 2013) where private data is augmented by publicly available data to create classifiers of improved accuracy. It has been integrated into (distributed) online learning (Gong et al. 2020) and regression algorithms (linear and logistic regression) (Gong et al. 2020).
Various types of neural and deep neural networks have been privatized using differential privacy (autoencoders, convolutional deep belief networks, long short-term memory networks, and generative neural networks) (Fan 2020; Gong et al. 2020). Clustering (Gong et al. 2020; Nissim et al. 2007) and dimensionality reduction (Blum et al. 2005; Gong et al. 2020) are the most prominent unsupervised techniques privatized using differential privacy. In addition to PCA, various matrix factorization approaches (Berlioz et al. 2015; Balu and Furon 2016) and more complex tensor decompositions (Imtiaz and Sarwate 2018; Wang and Anandkumar 2016) have been modified to preserve privacy using methods and techniques developed in the field of differential privacy. The creation of differentially private matrix factorization approaches is largely motivated by their application to the construction of differentially private recommender systems (McSherry and Mironov 2009), where it can be very important to preserve the full anonymity of users. Differentially private association rule mining (Tsou et al. 2020), frequent itemset mining (Zeng et al. 2012), and frequent subgraph mining (Xu et al. 2016) are examples of the integration of differential privacy with data mining techniques.
After the submission of this work, Karjalainen et al. (2023) published a greedy differentially private redescription mining method. Their method follows the ReReMi approach of Galbrun and Miettinen (2012a) and hence, the queries in their redescriptions are not based on decision trees and have a very different structure to the ones presented here. Evaluating the relative benefits of these two approaches is left for future work.
4 Algorithms
Fully differentially private redescription mining algorithms require parts of the computation to be performed on the secured server (where all data is stored). We use the notation \( object ^{\mathrm {[cl]}}\) to denote an object that is stored in the memory of the client computer and \( object ^{\mathrm {[ser]}}\) to denote an object that is stored in the memory of the secured server (thus not directly available to the client). Similarly, we use \(\texttt{function}^{\mathrm {[cl]}}\) for functions that are completely computed on the client computer, \(\texttt{function}^{\mathrm {[ser]}}\) for functions completely computed on the server, and \(\texttt{function}^{\mathrm {[inv]}}\) for functions that are started on the client but initiate (invoke) some computations on the server, potentially using data from both server and client (client data is sent to the server) and potentially obtaining some results that are differentially private and stored in the memory of the client computer.
4.1 Motivation
From the various approaches used to build redescriptions, in this work we focus on decision-tree based methods. Decision trees are easy to interpret, but can express complex relations. Also, building differentially private decision trees is a well-studied problem (Fletcher and Islam 2019), which allowed us to build on existing approaches. However, mining redescriptions (especially differentially private ones) imposes specific requirements on the decision tree induction algorithms, preventing any existing approach from being used directly in an efficient manner.
Of the different existing approaches, we decided to base one approach on the work of Friedman and Schuster (2010), which presents the creation of differentially private decision trees using a standard exponential mechanism to select the appropriate split at each step of the decision tree induction, and to base the other approach on the work of Bai et al. (2017), which implements the exponential mechanism on the level of decision trees of a predefined depth using MCMC sampling. Using MCMC sampling, we do not need to instantiate all trees in the sampling space, but can still benefit from the efficient use of the privacy budget provided by the exponential mechanism.
Next, we will explain how the alternating redescription mining approach of Ramakrishnan et al. (2004) must be adapted to work in a differentially private manner. This approach can use arbitrary differentially private decision tree creation procedures in a direct way to compute differentially private redescriptions. We use modified versions of two differentially private decision tree algorithms (Friedman and Schuster 2010; Bai et al. 2017) as the building blocks of our algorithms called AltExpM and AltMCMC, respectively. In Sect. 4.4, we will present our third algorithm, MCMCTreePair, that extends the ideas of Bai et al. (2017) to allow sampling of tree-pairs instead of individual trees. In Sect. 4.5, we explain the process of creating redescriptions using pairs of trees and computing the statistics (support and \(p\text {-value}\)) for them in a differentially private manner. This part is shared by all tree-based redescription mining algorithms. Section 4.6 explains our redescription pruning strategy, an important step for obtaining accurate differentially private redescriptions in the presence of potentially high levels of noise.
The asymptotic time complexity analysis is presented in the related technical report (Mihelčić and Miettinen 2022).
4.2 Main building blocks
All redescription mining algorithms that will be presented in the following sections consist of three main parts:
1. Creation of an initial target variable.
2. Creation of differentially private trees.
3. Creation of differentially private redescriptions.
Being supervised algorithms, decision trees need a target variable. As redescription mining is unsupervised, we need to create an initial target variable. It is created from an attribute of one of the data tables. For Boolean and categorical attributes, each distinct value forms a target class. Numerical attributes, on the other hand, are discretized using a standard equal-width binning approach, where the number of bins is derived using the Freedman–Diaconis rule (Freedman and Diaconis 1981). Each bin forms one target class.
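As an illustration, a minimal sketch of this discretization step could look as follows; the quartile estimate is deliberately simplified and the names are only illustrative:

```java
import java.util.Arrays;

public class TargetBinning {
    // Number of equal-width bins for a numerical attribute, derived with
    // the Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
    static int numberOfBins(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        int n = v.length;
        double iqr = v[(3 * n) / 4] - v[n / 4]; // crude quartile estimate
        double width = 2 * iqr / Math.cbrt(n);
        if (width <= 0) return 1;               // degenerate attribute
        return (int) Math.ceil((v[n - 1] - v[0]) / width);
    }

    // Map each value to its bin index; each bin forms one target class.
    static int[] makeTarget(double[] values) {
        int bins = numberOfBins(values);
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double step = (max - min) / bins;
        int[] target = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = step > 0 ? (int) ((values[i] - min) / step) : 0;
            target[i] = Math.min(b, bins - 1);  // clamp the maximum value
        }
        return target;
    }
}
```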
After the target variable is created, a process of differentially private tree construction begins. Different algorithms use different mechanisms for building trees; however, the main goal of this step is to create a sequence of tree-pairs with matching leaves (see Fig. 1).
Pairs of differentially private trees are utilized to obtain differentially private redescriptions, with a specifically developed procedure (see Sect. 4.5 and Fig. 1) that preserves differential privacy. The branch leading to a leaf in a decision tree can straightforwardly be turned into a logical conjunctive query, by combining the tests encountered along the branch (see Fig. 1). A simple redescription is obtained by pairing the queries that correspond to a pair of leaves from opposite trees. Such simple redescriptions can be extended with logical conjunctive queries, obtained in either tree from the pair, using disjunctions and negations, to produce more complex, hopefully more accurate, redescriptions.
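The following sketch illustrates this conversion with a hypothetical, simplified node type (it is not the implementation used in the paper): each leaf's branch becomes a conjunctive query, and two such queries form a simple redescription:

```java
import java.util.ArrayList;
import java.util.List;

public class BranchToQuery {
    // A simplified tree node: each non-root node records the test of its
    // parent's split and on which side of that split it lies.
    static class Node {
        Node parent;
        String attribute;   // split attribute of the parent node
        double threshold;   // split threshold of the parent node
        boolean leftChild;  // true if this is the parent's "<= threshold" child
    }

    // Combine the tests encountered along the branch from the root
    // to a leaf into one conjunctive query.
    static String branchQuery(Node leaf) {
        List<String> literals = new ArrayList<>();
        for (Node n = leaf; n.parent != null; n = n.parent) {
            String op = n.leftChild ? "<=" : ">";
            literals.add(0, "[" + n.attribute + " " + op + " " + n.threshold + "]");
        }
        return String.join(" AND ", literals);
    }

    // A simple redescription pairs one branch query from either tree.
    static String simpleRedescription(Node leafLeft, Node leafRight) {
        return "(" + branchQuery(leafLeft) + ", " + branchQuery(leafRight) + ")";
    }
}
```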
In the following sections we provide a detailed description of all three main parts of differentially private redescription creation.
4.3 Alternation-based algorithms
In this section, we describe two differentially private redescription mining algorithms: the first, AltExpM, uses individually created differentially private decision trees based on the exponential mechanism (Friedman and Schuster 2010) and the second, AltMCMC, embeds the sampling into the construction of the decision trees (Bai et al. 2017). We do not consider differentially private decision trees using noisy counts, since prior work shows that these are outperformed by trees constructed using the exponential mechanism (Friedman and Schuster 2010).
The algorithms presented in this section follow a well-known alternation scheme used in the CARTwheels algorithm (Ramakrishnan et al. 2004). In this scheme, the initial tree is created on one table (usually \(\textbf{D}_{L}\)) using a set of initial target labels (derived from data). Consecutive steps use information about entity membership in the leaves of the tree from the previous iteration to create target labels which allow the construction of a matching tree on the other table (\(\textbf{D}_{R}\)). Redescriptions are formed based on the paths from the root of one tree to the root of a second tree. Since decision tree construction cannot match created target labels exactly, each alternation creates slightly different trees, allowing the creation of new redescriptions.
We first present the general framework before explaining how individual trees are built.
4.3.1 Alternation-based framework for differentially private redescription mining
Algorithm 1 builds differentially private redescriptions using differentially private decision trees provided by the createDPTree algorithm. When createDPTree uses Algorithm 2 to build the decision trees, the underlying redescription mining algorithm is called AltExpM; when it uses Algorithm 3 instead, it is called AltMCMC. The next two subsections describe these two different algorithms to build decision trees. All tree-specific parameters are explained in “Appendix C.3”.
Algorithm 1 iterates for \( InTr \) iterations (line 3), chooses a random attribute at each iteration and creates an initial target variable using values of the selected attribute (line 4). The target variable creation step is entirely performed on the secured server (the one hosting the data), and thus it does not require protection (the end-user does not have direct access to this data). This operation is conceptually similar to the Partition operation, commonly used in differential privacy. This operation groups the data on the server based on some criterion (usually the value of a target label). Since this grouping is performed entirely on the secured server, it does not reveal any information to the end user. Then, the secured server returns the corresponding noisy counts of the number of entities contained in these groups to the end-user (thus privacy is preserved).
Once an initial target variable is constructed, it is used either to construct a differentially private decision tree using the exponential mechanism (see Sect. 4.3.2) or to perform decision tree construction by sampling the space of decision trees (see Sect. 4.3.2). The resulting tree is differentially private and returned to the client. Which protection mechanism is chosen depends on the choice of a differentially private decision tree algorithm used during the process of differentially private redescription creation (lines 6 and 8).
At each consecutive step of the loop starting in line 10, the successive targets are created directly from the leaves of a tree constructed in the previous iteration (lines 11 to 17). Each leaf represents one class and entities are assigned a class label based on their membership in a leaf; the labels in turn are used to create a new tree using attributes from the other view. The main idea is to obtain pairs of trees having very similar or identical sets of entities in their leaves.
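A minimal sketch of this server-side target-creation step, assuming a hypothetical tree interface that can route an entity to its leaf, could look as follows:

```java
public class LeafTargets {
    // Hypothetical interface: a fitted decision tree that returns, for one
    // entity (a row of the data table), the index of the leaf it reaches.
    interface Tree {
        int leafIndex(double[] row);
    }

    // Performed entirely on the secured server: every entity is labelled
    // with the leaf of the previous tree that contains it, and these labels
    // serve as target classes when growing the matching tree on the other view.
    static int[] makeTargetFromTree(Tree previousTree, double[][] data) {
        int[] target = new int[data.length];
        for (int e = 0; e < data.length; e++) {
            target[e] = previousTree.leafIndex(data[e]);
        }
        return target;
    }
}
```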
Target creation is again performed entirely on the secure server and the corresponding data is accessed by the differentially private algorithms using the exponential mechanism or by sampling. The fact that MCMC sampling, upon convergence, produces trees as would be selected by applying the exponential mechanism to the space of trees of a predefined depth has been proven by Bai et al. (2017). The obtained pairs of trees (without information about their node sizes) are used to create redescriptions in a differentially private manner using noisy counts, obtained by making two passes over the dataset, to compute noisy redescription support, accuracy and significance (line 19; see also Sect. 4.5).
To summarize, after creating a target variable (class attribute, see Sect. 4.2) in the first of \( RMIter \) iterations, redescriptions are obtained by iteratively creating pairs of differentially private decision trees with matching leaves and then combining their leaves into redescriptions in a differentially private manner described in Sect. 4.5.
The total budget of Algorithm 1 is \(\varepsilon \). In each of the \( InTr \) iterations, Algorithm 1 calls createDPTree, which is \(\varepsilon /( InTr (2\cdot RMIter +1))\) differentially private (see Sect. 4.3.2). Then, in each of the \( RMIter \) iterations, it calls createDPTree and extractReds, both of which are \(\varepsilon /( InTr (2\cdot RMIter +1))\) differentially private (see Sects. 4.3.2 and 4.5). Hence, the overall algorithm is \( InTr \bigl (\varepsilon /\bigl ( InTr (2\cdot RMIter +1)\bigr ) + RMIter \cdot 2\cdot \varepsilon /\bigl ( InTr (2\cdot RMIter +1)\bigr )\bigr ) = \varepsilon \) differentially private.
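For concreteness, with \( InTr = 2\), \( RMIter = 3\), and a total budget \(\varepsilon = 1\), each call to createDPTree or extractReds receives \(1/(2\,(2\cdot 3+1)) = 1/14\) of the budget; each initialization makes \(1 + 2\cdot RMIter = 7\) such calls, consuming \(7/14 = 1/2\), and the two initializations together consume the full budget of 1.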
4.3.2 Types of differentially private decision trees
We use two different types of differentially private decision trees: differentially private decision trees using the exponential mechanism (Friedman and Schuster 2010), presented in Algorithm 2, and differentially private decision trees obtained using MCMC sampling (Bai et al. 2017), presented in Algorithm 3.
Algorithm 2 first computes the available budget for every level of the tree (line 2). It uses the exponential mechanism (line 7) to choose the split point out of all possible split points (attribute-value pairs) (see Friedman and Schuster 2010). Next, it partitions the dataset based on the values of the split point attribute using the Partition operation (line 8) and then recursively calls itself to create the next level of the tree (line 9). Finally, it connects subtrees into the final tree and returns it.
At each level, calls to the exponential mechanism are \(\varepsilon /d\) differentially private. Applying the exponential mechanism on siblings does not necessitate increasing the privacy level thanks to the property of parallel composition (McSherry 2009). The Partition function does not spend any budget, as no result is returned to the user. The function calls itself recursively at most d times, thus by the property of sequential composition (McSherry 2009), Algorithm 2 is \(d\cdot \varepsilon /d = \varepsilon \) differentially private. The parameter q in a call of the exponential mechanism in line 7 represents the Gini index, which has sensitivity 2 (Friedman and Schuster 2010); the larger sensitivity is taken care of in the exponential mechanism calculations. The exponential mechanism is executed on a secured server by calculating all splitting choices and returning a random solution sampled from a distribution depending on the desired privacy level and the corresponding score function.
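As an illustration, one way to score a candidate binary split with the Gini index could look as follows (the exact scoring function of Friedman and Schuster (2010) may differ in details; higher scores are better and would be fed to the exponential mechanism):

```java
public class GiniSplitScore {
    // Gini impurity of a set of class counts: 1 - sum of squared class frequencies.
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0;
        double sumSq = 0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1 - sumSq;
    }

    // Quality of a binary split: the negated, size-weighted Gini impurity of
    // the two resulting partitions (higher is better). The text above notes
    // a sensitivity of 2 for the Gini-based quality measure.
    static double splitScore(int[] leftCounts, int[] rightCounts) {
        int nLeft = 0, nRight = 0;
        for (int c : leftCounts) nLeft += c;
        for (int c : rightCounts) nRight += c;
        return -(nLeft * gini(leftCounts) + nRight * gini(rightCounts));
    }
}
```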
Algorithm 2 is based on the algorithm developed by Friedman and Schuster (2010) with the following modifications: a) computations used to determine the noisy counts of entities at each node of a tree are removed and b) trees are built until a predefined depth is reached. The main reason for these modifications is to save budget, and they are justified by the fact that redescription mining algorithms usually use shallow trees (depth \(\le 8\)). For various reasons (budget savings, reduction of false positive redescriptions), this depth is even smaller in differentially private redescription mining algorithms. This allows using \(\varepsilon /d\) of the budget for the exponential mechanism. Noisy count computations are not required because target variables in the alternation process are created on the secured server (see Algorithm 1) and pairs of differentially private trees without information on target value distribution in their nodes are sufficient to compute differentially private redescriptions (see Sect. 4.5).
The second approach uses a deep embedding of differential privacy into the decision tree construction (Algorithm 3). The embedding is achieved by sampling the tree space using Markov Chain Monte Carlo (MCMC) (Hastings 1970). This process has been shown to generate, upon convergence, trees as would be obtained if selected by the exponential mechanism (Bai et al. 2017). The advantage of this approach compared to Algorithm 2 is that it does not have to divide the available budget between the levels of a tree. Instead, it can use the whole budget for sampling. The construction of a differentially private object of this complexity cannot be achieved easily by applying the exponential mechanism once to the vast space of all trees of depth d.
Algorithm 3 is a version of the approach presented by Bai et al. (2017), modified by removing the noisy count computation. This allows using the whole available budget \(\varepsilon \) for tree sampling. The reason why noisy counts are not required is the same as for the createDTExpMech: the initial targets are created on the secure server (see Algorithm 1) and noisy counts are not required as input to the construction of differentially private redescriptions (see Sect. 4.5).
The work of Bai et al. (2017) demonstrates that trees sampled using MCMC outperform trees based on the selection of split points using the exponential mechanism, which strongly motivates the use of the former method in an algorithm for differentially private redescription creation.
The score function for the tree construction is as defined by Bai et al. (2017):
\[ g(n) = {\left\{ \begin{array}{ll} \max _{c}\, \tau _{n,c}, &{} \text {if } n \in N_l,\\ \sum _{n' \in C(n)} g(n'), &{} \text {if } n \in N_i, \end{array}\right. } \quad (3) \]
where C(n) denotes the set of children of the tree node n, while \(\tau \), \(\tau _n\), and \(\tau _{n,c}\) denote respectively the total number of entities in the dataset, the number of entities in n and the number of entities of class c in n. \(N_l\) denotes the set of leaf nodes, whereas \(N_i\) denotes the set of internal (non-leaf) nodes. The quality of a tree, \(g(T)\), is computed as the quality of its root node.
Algorithm 3 is completely performed on the secured server. It first generates a completely random tree of depth d (line 1). Then, for up to \( MCIter \) iterations, it selects a random node of the tree T and computes the score of a newly constructed tree in which all other nodes are equal to the nodes of T but the selected node is replaced with a node corresponding to a randomly chosen split point (lines 5 to 7). The quality of the new tree is compared to the old one in a similar fashion as in the exponential mechanism, and if the change is accepted, the old tree is replaced (line 10). If changes in the tree score during the iterations drop below a predefined threshold (determined by the function stabilizedVar), the iterations are terminated and the resulting tree is returned to the client.
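A minimal sketch of the Metropolis-style acceptance test, assuming a symmetric node-replacement proposal, could look as follows; upon convergence, such a chain samples trees with the exponential-mechanism probabilities:

```java
import java.util.Random;

public class McmcAcceptStep {
    private static final Random RNG = new Random();

    // Metropolis acceptance for a chain whose stationary distribution is
    // proportional to exp(epsilon * score(T) / (2 * sensitivity)), assuming
    // the node-replacement proposal is symmetric.
    static boolean acceptReplacement(double oldScore, double newScore,
                                     double epsilon, double sensitivity) {
        double logAlpha = epsilon * (newScore - oldScore) / (2 * sensitivity);
        // Always accept improvements; accept deteriorations with
        // probability exp(logAlpha) < 1.
        return Math.log(RNG.nextDouble()) < Math.min(0.0, logAlpha);
    }
}
```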
4.4 The MCMCTreePair algorithm
In this section, we present an algorithm whose main ingredient is MCMC sampling of tree-pairs instead of individual trees.
This algorithm uses the privacy budget more efficiently than the alternation-based algorithms. Assuming one has a budget \(\varepsilon _t\) to construct one pair of trees in a differentially private manner, applying the algorithm described here (that samples pairs of trees) allows using this whole budget \(\varepsilon _t\) to sample tree-pairs. Constructing a tree-pair by first creating single trees necessarily requires using \(\varepsilon _t/2\) budget to create each tree (regardless of the technique used to obtain individual trees). However, sampling pairs of trees is more complicated than sampling decision trees or split points, thus there is a trade-off between the budget spent and the complexity of the sampling.
The algorithm creates the initial target variable (line 3 in Algorithm 4) as described in Sects. 4.2 and 4.3.1. Next, a random pair of decision trees of a predefined depth is created (line 4 in Algorithm 4, lines 1 and 2 in Algorithm 5). This pair is iteratively improved by selecting a random node in either tree and replacing it with a node created by making a split on a randomly selected split point (attribute–category or attribute–numerical value pair) on the data contained in this node (while-loop starting in line 4 in Algorithm 5). Any newly gained branches are completed with random subtrees down to the predefined depth. As a result, we obtain a pair of decision trees, respectively over the variables from either data table, which are matched at the leaves through the entities they contain (line 5 of Algorithm 4).
This process is repeated, starting with different attributes from either data table, to produce a collection of redescriptions (for-loop starting in line 2 of Algorithm 4). These redescriptions—the pairs of queries together with their corresponding support statistics (support sizes, Jaccard index, and \(p\text {-value} {}\))—constitute the output of the algorithm.
Thus, two important steps are: a) building pairs of matching trees (line 4 of Algorithm 4) and b) evaluating and combining queries corresponding to the branches of the trees to form redescriptions and compute their support statistics (line 5 of Algorithm 4). The main challenge we need to tackle is to perform these steps in a differentially private manner while still being able to produce high-quality redescriptions.
Specifically, in the first step we use an MCMC method similar to the one used in AltMCMC. However, we have to extend the method to simultaneously sample pairs of matching trees. We show that this process, upon convergence, produces a tree-pair as would be produced if the exponential mechanism was applied to the very large space of all possible pairs of trees of a fixed depth d. Thus, upon convergence, we may consider the produced pairs of trees to be differentially private. In the second step we use noisy counts, i.e. apply the Laplace mechanism, in a targeted manner to compute the support statistics of obtained redescriptions. This guarantees that the returned redescriptions are differentially private. The methodology allows balancing budget expenditures between the two steps.
In each of the \( InTr \) iterations, procedure sampleTreePair is \(\omega \varepsilon / InTr \) differentially private (see Sect. 4.4.1) whereas procedure extractReds is \((1-\omega )\varepsilon / InTr \) differentially private (see Sect. 4.5). Thus, by the sequential composition property of differential privacy (McSherry 2009), each iteration of MCMCTreePair is \(\varepsilon / InTr \) differentially private. As a result, the MCMCTreePair algorithm is \( InTr \cdot \varepsilon / InTr = \varepsilon \) differentially private.
4.4.1 Sampling a pair of trees
The MCMC-based procedure for sampling a pair of trees is described in Algorithm 5. It is an extension of the MCMC sampling procedure for trees developed by Bai et al. (2017) and used in Algorithm 3. In addition to the extensions in the MCMC sampling iterations, the tree-pair sampling procedure required a novel tree-pair evaluation function. To this end, we created a normalized tree evaluation score (5), a modified version of the score from (3), and incorporated this modified score into the novel tree-pair evaluation function (6).
For simplicity of exposition, we assume that the initialization attribute a comes from table \(\textbf{D}_{R}\). If attribute a comes from \(\textbf{D}_{L}\) instead, the roles of the two views are simply inverted. We let \(N_i(T)\) and \(N_l(T)\) denote the sets of inner nodes and of leaf nodes of the tree T, respectively. The procedure relies on a score function to evaluate the quality of pairs of trees, denoted \( score (T_L,T_R)\), to which we will come back later.
The procedure in Algorithm 5 starts by randomly generating trees \(T_L\) and \(T_R\) of depth d (lines 1 and 2). Function makeTarget is used to create a target variable from a given attribute or tree as explained above, and function randomSplitTree is used to build a tree of chosen depth by selecting splits at random. Then, in each iteration, a node n is selected at random from the current tree-pair (line 6). A new node \(n'\) is created, by selecting at random an attribute from the same view as the attribute in n, and a random split value in its range (line 7).
The score for the tree-pair obtained by replacing node n with node \(n'\) (line 8) is computed and compared to the score of the original tree-pair to decide whether to accept the replacement (line 9). Bad choices for replacement (e.g. internal nodes which affect several leaves negatively) will significantly reduce the score of a newly constructed tree, thus the change will be accepted with smaller probability.
If the replacement is accepted and the modified node is in \(T_L\), the statistics of \(T_R\) must be updated to account for the modified target (line 13). Such node replacement attempts are performed for a predefined number of iterations (\( MCIter \)) or until the variance of the scores in the k previous iterations drops below a predefined threshold, as checked by function stabilizedVar (line 4). The last obtained tree-pair is returned (line 18).
Clearly, the crux of this procedure is the score function to evaluate the quality of pairs of trees, \( score (T_L,T_R)\), and the replacement probability \(\alpha \) that is derived from it. The replacement probability should be such that the procedure is able to traverse the whole space of tree-pairs and, upon convergence, samples a tree-pair \((T_L, T_R)\) with the probability assigned to it by the exponential mechanism
\[ \Pr \bigl [(T_L, T_R)\bigr ] = \frac{\exp \bigl (\frac{\varepsilon \, score (T_L, T_R)}{2\,\Delta score }\bigr )}{\sum _{(T_L', T_R') \in AllTreePairs _d} \exp \bigl (\frac{\varepsilon \, score (T_L', T_R')}{2\,\Delta score }\bigr )}, \quad (4) \]
where \( AllTreePairs _d\) is the set of all pairs of trees of depth at most d and \(\Delta score \) is the sensitivity of the score function. Assuming the tree-pairs have uniform base probability, the above probability is clearly in the form of (2). We show that the sampling follows the above probability in “Appendix A”.
Given a decision tree T and a target variable \({\textbf{c}}\), we measure the quality of the subtree rooted in node n as the quality of the associated split with respect to the target, using the recursive formula
\[ g(n) = {\left\{ \begin{array}{ll} \frac{1}{\tau _n}\, \max _{c}\, \tau _{n,c}, &{} \text {if } n \in N_l,\\ \sum _{n' \in C(n)} \frac{\tau _{n'}}{\tau _n}\, g(n'), &{} \text {if } n \in N_i, \end{array}\right. } \quad (5) \]
where, as in (3), C(n) denotes the set of children of n, while \(\tau \), \(\tau _n\), and \(\tau _{n,c}\) denote respectively the total number of entities in the dataset, the number of entities in n, and the number of entities of class c in n. Unlike Bai et al. (2017) and in (3), here we use a normalized quality measure that does not depend on the data size. A value of 1 indicates a pure split, i.e. a perfect match between the (sub)tree and the target. Computing \(g( root (T))\) gives the quality of tree T.
The score of a pair of trees is measured by combining their respective scores:
\[ score (T_L, T_R) = g\bigl ( root (T_L)\bigr ) \cdot g\bigl ( root (T_R)\bigr ). \quad (6) \]
The quality score of tree \(T_R\) is weighted by the quality score of tree \(T_L\), since a fairly accurate tree \(T_L\) must be built first, to ensure that the resulting tree-pair adequately models the initial target. Then, as tree \(T_L\) becomes more accurate, the overall score depends increasingly on the accuracy of tree \(T_R\), forcing both trees to match and to model the initial target.
Lemma 1
The function \( score (T_L, T_R)\) from (6) has sensitivity 1.
The proof of the above lemma is presented in “Appendix B”. Note that the score used by Bai et al. (2017) has sensitivity 2 instead. Thus, our score reduces the level of noise needed in the differentially-private mechanism.
To conclude, Algorithm 5 uses the whole initial budget \(\varepsilon \) for tree-pair sampling. The produced tree-pair, upon convergence of the MCMC sampling procedure, is as if produced by the exponential mechanism applied to the space of all tree-pairs of depth d and can thus be considered \(\varepsilon \)-differentially private. Sampling tree-pairs is much more complex than sampling individual trees. For this procedure to work, the underlying quality score must be normalized to the [0, 1] interval. Although this allows reducing sensitivity, it also changes the acceptance probabilities of the exponential mechanism (random choices can be accepted with substantially larger probability compared to the non-normalized version).
4.5 Generating redescriptions
Having obtained a pair of matching trees, the next step consists in generating redescriptions from it by evaluating and combining queries corresponding to the branches of the trees. In particular, doing this in a differentially private way requires obtaining noisy counts of the entities shared by the leaves from either tree, from which we can then compute all necessary support statistics. A naïve way to compute the support cardinalities would be to use the Laplace mechanism to obtain the size of the support \(\vert {{\,\textrm{supp}\,}}(q)\vert \) for both conjunctive queries of every simple redescription, and then use the mechanism again for every attempted combination of these queries via disjunctions or negations. This, however, would use too much of the privacy budget, and we show how we can compute the required counts at a significantly lower privacy budget.
Recall that the leaves of a given tree contain disjoint sets of entities. All necessary information to evaluate redescriptions generated from the pair of trees \(T_L\) and \(T_R\) with sets of leaf nodes \(N_l(T_L)\) and \(N_l(T_R)\), respectively, can be computed using two passes over the data: first computing \(\vert {{\,\textrm{supp}\,}}(n_L) \cap {{\,\textrm{supp}\,}}(n_R)\vert \) for all \((n_L, n_R) \in N_l(T_L) \times N_l(T_R)\) in the first pass over the dataset (line 2 in Algorithm 6), and then computing \(\vert {{\,\textrm{supp}\,}}(n_L)\vert \) for all \(n_L \in N_l(T_L)\) in the second pass (line 3). As the entities in \({{\,\textrm{supp}\,}}(n_L) \cap {{\,\textrm{supp}\,}}(n_R)\) are disjoint over the leaves, learning the (noisy) cardinality of one of these reveals nothing about the rest of the data, and hence we can use the same \(\varepsilon \) for computing each of these cardinalities. The same holds true for computing \(\vert {{\,\textrm{supp}\,}}(n_L)\vert \).
Using the computed noisy cardinalities, we can also compute
\[ \vert {{\,\textrm{supp}\,}}(n_R)\vert = \sum _{n_L \in N_l(T_L)} \vert {{\,\textrm{supp}\,}}(n_L) \cap {{\,\textrm{supp}\,}}(n_R)\vert \quad \text {for all } n_R \in N_l(T_R), \]
since the leaves of \(T_L\) partition the entities.
The procedures described above are used in lines 4 and 5 of Algorithm 6. For \(I_L \subseteq N_l(T_L)\) and \(I_R \subseteq N_l(T_R)\), we have that
\[ \Bigl \vert {{\,\textrm{supp}\,}}\Bigl (\bigvee _{n_L \in I_L} n_L\Bigr )\Bigr \vert = \sum _{n_L \in I_L} \vert {{\,\textrm{supp}\,}}(n_L)\vert \]
and
\[ \Bigl \vert {{\,\textrm{supp}\,}}\Bigl (\bigvee _{n_L \in I_L} n_L\Bigr ) \cap {{\,\textrm{supp}\,}}\Bigl (\bigvee _{n_R \in I_R} n_R\Bigr )\Bigr \vert = \sum _{n_L \in I_L} \sum _{n_R \in I_R} \vert {{\,\textrm{supp}\,}}(n_L) \cap {{\,\textrm{supp}\,}}(n_R)\vert , \]
because the leaves of a given tree contain disjoint sets of entities.
The formulae to compute noisy counts described above are used inside the inner for-loop of Algorithm 6.
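To illustrate, a client-side sketch combining the noisy leaf statistics could look as follows; interCounts[i][j] and leftCounts[i] are hypothetical names for the noisy cardinalities \(\vert {{\,\textrm{supp}\,}}(n_{L_i}) \cap {{\,\textrm{supp}\,}}(n_{R_j})\vert \) and \(\vert {{\,\textrm{supp}\,}}(n_{L_i})\vert \) obtained with the Laplace mechanism:

```java
import java.util.Set;

public class RedescriptionStats {
    // Noisy support of a disjunction over a set of left leaves:
    // leaves of one tree are disjoint, so their supports simply add up.
    static double supportLeft(Set<Integer> iL, double[] leftCounts) {
        double s = 0;
        for (int i : iL) s += leftCounts[i];
        return s;
    }

    // Noisy size of the intersection of two disjunctive queries, summed
    // over all selected (left leaf, right leaf) pairs.
    static double supportIntersection(Set<Integer> iL, Set<Integer> iR,
                                      double[][] interCounts) {
        double s = 0;
        for (int i : iL)
            for (int j : iR)
                s += interCounts[i][j];
        return s;
    }

    // Noisy Jaccard index of the redescription built from the two leaf sets;
    // the right-hand support is derived by summing intersections over all left leaves.
    static double jaccard(Set<Integer> iL, Set<Integer> iR,
                          double[] leftCounts, double[][] interCounts) {
        double inter = supportIntersection(iL, iR, interCounts);
        double right = 0;
        for (int j : iR)
            for (double[] row : interCounts) right += row[j];
        double left = supportLeft(iL, leftCounts);
        double union = left + right - inter;
        return union > 0 ? inter / union : 0;
    }
}
```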
Based on these differentially private values, we can compute the support sizes of the simple queries corresponding to single branches from the trees, as well as of the extended queries, that combine different branches, or their negations, from either tree. We can also compute the accuracy and the \(p\text {-value}\) of every query pair, as this only requires knowledge of the support sizes of the queries, their intersection, and the size of the data.
Simple redescriptions, generated by function createSimpleReds, consist of query-pairs \(({\bar{q}}_L, {\bar{q}}_R)\) obtained from \((n_L, n_R)\in N_l(T_L) \times N_l(T_R)\) (line 6 in Algorithm 6), where \({\bar{q}}_X\) denotes the simple query associated to \(n_X\) or its negation. These simple redescriptions are iteratively extended by at most three other simple queries (or their negations) obtained from each view, using the disjunction operator (the use of disjunctions and their properties are described by Mihelčić et al. (2018)). An extension is accepted if it increases the Jaccard index compared to the simple redescription (checked by function evaluate) and if it satisfies conditions from \(\Gamma ^{\mathrm {[cl]}}\) (checked by function sat). The resulting redescriptions after this step can be, for example, \((n_{L_1}\vee n_{L_2}, n_{R_1})\), \((n_{L_1} \vee n_{L_2} \vee n_{L_3}, n_{R_1})\), \((n_{L_1} \vee n_{L_2}, n_{R_1} \vee \lnot n_{R_2})\), \((n_{L_1} \vee n_{L_2} \vee n_{L_3}, n_{R_1} \vee n_{R_2} \vee n_{R_3})\). Extended redescriptions are often more accurate than simple redescriptions. Such redescriptions also have larger support set size compared to corresponding simple redescriptions. All redescriptions satisfying the constraints from \(\Gamma ^{\mathrm {[cl]}}\) are added to the redescription set (line 20) that is returned to the user.
When the data contains missing values, we have
\[ \vert {{\,\textrm{supp}\,}}(n_R)\vert \ge \sum _{n_L \in N_l(T_L)} \vert {{\,\textrm{supp}\,}}(n_L) \cap {{\,\textrm{supp}\,}}(n_R)\vert . \]
This is because entities for which the value of the split attribute(s) is missing cannot be sorted into the leaves. In such cases, we propose to estimate the support of negated queries with a heuristic count \(\vert {{\,\textrm{supp}\,}}(\lnot \,q_m)\vert \approx \sum _{l \in I_L{\setminus } \{m\}}\vert {{\,\textrm{supp}\,}}(n_l)\vert \). Detailed justification for this choice is provided in the related technical report (Mihelčić and Miettinen 2022).
Procedures computeNLeafInterCards and computeNLLeafCards from Algorithm 6 are \(\varepsilon /2\) differentially private, since both procedures apply the Laplace mechanism with budget \(\varepsilon /2\), thus Algorithm 6 is \(\varepsilon /2 + \varepsilon /2 = \varepsilon \) differentially private.
4.6 Redescription pruning strategy
The redescription mining process on data where differential privacy must be preserved is significantly harder than the regular task. Redescription mining algorithms have traditionally had a number of parameters that require tuning, usually through experimentation. However, in a setting where differential privacy needs to be preserved, such experimentation is not an option. Thus, the constraints applied to filter redescriptions during the mining process need to be loose, so that some usable redescriptions can be obtained regardless of the data characteristics. This necessitates using a relatively low noisy Jaccard threshold, as setting this parameter to very high values might remove many useful redescriptions while not guaranteeing the removal of inaccurate artefact redescriptions.
The support of redescriptions is an important quality measure. In the differentially private setting, we do not know the exact size of the data, and hence cannot set the minimum support threshold exactly. In our experiments, we set a coarse minimum support threshold (see “Appendix C.3”). Noisy support calculation has the biggest relative effect on redescriptions with small support, having also a noticeable impact on the accuracy of such redescriptions. As the real support of redescriptions increases, the noisy support size and accuracy of redescriptions get relatively closer to the real values.
To keep the mining process general but also allow obtaining usable sets of redescriptions, we apply a redescription pruning strategy after the redescription mining step. The aim of this post-processing step is to use the information about noisy support and noisy accuracy to determine a noisy support threshold that is used to filter out redescriptions that are probably artefacts of the introduced noise. The easiest way to determine such a threshold is to study a scatter plot of noisy support versus noisy redescription accuracy. In a differentially private setting, there will always be a very dense block in the upper left corner of such a scatter plot (low noisy support, very high accuracy). Redescriptions making up such blocks should be pruned, as most of them are inaccurate.
Notice that this step can be applied only once the redescription mining step is complete. Although this step drastically reduces the number of noise-induced redescriptions, it is still possible to obtain redescriptions with strongly overestimated accuracy or insignificant redescriptions. However, as we demonstrate in Sect. 5, the redescription sets obtained after pruning are usable in practice.
5 Experimental evaluation
5.1 The setup
All presented algorithms are implemented in Java. The experiments were conducted on a server with two 64-core AMD EPYC 2 processors at \({2}\,\textrm{GHz}\) and with \(1\,\textrm{TB}\) of main memory. All experiments used less than \({25}\,\textrm{GB}\) of main memory.
Datasets
We conducted experiments on seven real-world datasets based on four different types of data. The datasets are selected to cover a variety of data sizes and types, with an emphasis on the kind of data where confidential access would be important. CMS is a family of datasets based on synthetic health insurance claims; NPAS is based on online psychological surveys; MAMMAL contains mammal occurrences and climate information; and MIMIC-III is based on hospital records. CMS, NPAS, and MIMIC-III are examples of sensitive data with different properties, while MAMMAL is a standard dataset in the redescription mining literature. All datasets are publicly available. Basic properties of the datasets are presented in Table 1; further information is available in “Appendix C.1”.
Baseline Methods
We compared all three differentially private algorithms to two existing general (not privacy-preserving) tree-based redescription mining algorithms, SplitT and LayeredT (Galbrun and Miettinen 2018a; Zinchenko et al. 2015), using the implementations from the python-clired package. We used only tree-based methods for the comparison to be fair; the purpose of these experiments is to show the effects of differential privacy, not to compare different types of redescriptions.
Parameters
Setting the parameters for differentially private algorithms requires some care. Unlike in a normal setting, with differential privacy the user cannot tune the parameters by trial and error, as every run consumes privacy budget. We provide the parameters used in the experiments, as well as some general guidelines on how to set the parameters when the data is not accessible, in “Appendix C.3”.
Evaluation setup
We conducted five types of experiments. First, we studied the convergence properties of the MCMC simulations used by the MCMCTreePair algorithm to create tree-pairs. We studied this because sampling tree-pairs in this manner has not been previously reported in the literature. These results are deferred to the related technical report (Mihelčić and Miettinen 2022). Second, we studied the differences between the quality of the redescriptions as reported by the differentially private algorithms (i.e. the noisy Jaccard and \(p\text {-value}\)) and the true values. Here, we used \( InTr = 1\) to have the largest budget for tree creation in AltExpM and AltMCMC. Noise is required to preserve privacy, but it obviously also means that the results the user obtains will not be entirely accurate. The aim of this experiment was to assess the error introduced by adding noise.
In the next experiment, we compared the redescriptions found with the differentially private algorithms to redescriptions obtained by the standard (non-differentially private) tree-based algorithms SplitT and LayeredT. This experiment measured the overall quality of differentially private redescriptions against standard redescriptions. Then, we compared the performance of the differentially private algorithms across 10 runs with budget 1.0 each (we perform repeated runs to measure the variation between executions), 10 runs with budget 0.1 each, and 100 runs with budget 0.01 each. We compared the differentially private algorithms based on whether any redescriptions were obtained in a run, the percentage of significant redescriptions discovered, the distribution of true Jaccard index values among the obtained redescriptions, and the distribution of the absolute difference between the true and the reported, noisy Jaccard index values of the redescriptions. Finally, we tested the scalability of the proposed differentially private algorithms.
5.2 Effects of differential privacy on redescription quality
The effects of the added noise on the reported Jaccard indices for the three developed algorithms can be observed in the scatter plots in Figs. 2 and 3. Here, the y-axis shows the differentially private Jaccard indices reported by the differentially private algorithms, while the x-axis shows the true Jaccard indices computed with full access to the data.
Because of the noisy counting, redescriptions with low (noisy) support can have very misleading quality measures. Hence, we recommend that the user prune out all redescriptions with reported support below a pre-defined threshold (see Sect. 4.6 for an explanation of the pruning strategy and Appendix C.3 for the thresholds we used). The redescriptions pruned this way are depicted as light grey diamonds; the redescriptions that remain are the larger blue dots. The colour of a dot indicates whether the true (i.e. not noisy) \(p\text {-value}\) is above 0.01 (dark blue) or below it (light blue). The user, having no access to the true data, can separate the diamonds from the dots, but cannot separate the dark from the light blue dots.
We can see that, for all approaches, the scatter plot slowly turns into a straight diagonal line as the data size increases. This is because noisy counts become more accurate as more data is available. Redescription accuracy on the NPAS dataset is overestimated for all approaches, which is expected for such a small dataset. As can be seen from Table 2, MCMCTreePair has the largest and most significant correlation between the noisy and true Jaccard indices on the NPAS and MAMMAL datasets, AltExpM has the highest correlation on the MIMIC-III and CMS-2-16 datasets, and AltMCMC on the CMS-2-1, CMS-2-4, and CMS-2-8 datasets. MCMCTreePair produced a smaller number of significant redescriptions on MAMMAL and on the larger datasets.
A very important observation is that the pruning strategy works. The grey diamonds to the left of the scatter plots correspond to false positives, that is, to redescriptions that differentially private algorithms return as valid redescriptions, but that have very low true Jaccard index value. On the other hand, there are very few diamonds towards the right, indicating that there are very few false negatives, that is, good redescriptions that are unnecessarily pruned. As the user does not have access to the true data and cannot compute the true Jaccard indices, it is of vital importance that this filtering works.
The first three rows of plots in Figs. 4 and 5 show the distribution of the true Jaccard indices, with the number of redescriptions having true Jaccard index essentially zero indicated in the top-left corner. These plots make it clearer that the pruned redescriptions (light grey) are those with lower true Jaccard index values, while the non-pruned ones have higher Jaccard index values.
5.3 Comparison to other algorithms
The comparison of MCMCTreePair, AltExpM, and AltMCMC to SplitT and LayeredT can be seen by comparing the first three rows of plots of Figs. 4 and 5 to the last row. To make the comparison fair, for the differentially private algorithms we use only the redescriptions with high enough support (i.e. those that were not filtered out). Filtering for SplitT and LayeredT was performed by removing all redescriptions with true support smaller than the minimum support among the redescriptions retained for the MCMCTreePair algorithm.
As can be seen from Figs. 4 and 5, the differentially private algorithms mostly return a smaller number of redescriptions with the highest accuracy (Jaccard index in the interval [0.8, 1.0]), and MCMCTreePair returns fewer redescriptions overall (as noted earlier). Besides the obvious reason, namely the introduced noise, one reason why differentially private algorithms might find fewer highly accurate redescriptions is the maximum support constraint, which prevents the algorithms from creating redescriptions with (noisy) support larger than 80% of the dataset size. This bound is meant to prevent the creation of tautological statements and redescriptions describing almost all entities contained in the dataset. While the same bound is used in the non-private algorithms, the added noise can lead the private algorithms to erroneously prune some redescriptions.
These experiments demonstrate that when the proposed differentially private algorithms report good redescriptions with high quality and high enough support, the user can be rather confident that she will find good results in the true data as well, should she get permission to study it.
5.4 Evaluation of the properties of the different algorithms
Here we provide an in-depth comparative analysis of the different algorithms. We will see that AltMCMC and AltExpM perform very well and significantly outperform MCMCTreePair on medium-sized and large datasets when a very large budget is available, and on large datasets even with a smaller budget. The MCMCTreePair algorithm performs best on small datasets, and on small and medium-sized datasets when only a small budget is available. The performance of AltMCMC and AltExpM improves as the data size grows.
The AltExpM and AltMCMC algorithms depend more heavily on the initialization procedure than MCMCTreePair. When used in such a way that only one initial attribute is selected to start the alternations (\( InTr = 1\), the setting giving the largest budget for tree construction), a poor choice of the initial attribute might produce an extremely bad or even empty redescription set. Since an analysis of the distribution of the attribute values is required to reduce the risk of this happening, these algorithms require either significantly pre-processed data (rarely available in real-world scenarios) or some more involved strategy for choosing the initial attribute. Our experimental results show that such critical failures of AltExpM and AltMCMC mostly happen with probability less than 0.1 (for more details, see the related technical report Mihelčić and Miettinen 2022).
In the absence of a good strategy to choose the initial attribute, one solution is to randomly select a larger number of initial attributes and reduce the number of alternations. Thus, we also test the performance of these two algorithms in a setting at the opposite end of the spectrum, where \( InTr = k\) and \( RMIter = 1\), to maximize the chances of choosing one good initial attribute while keeping the tree-construction budget similar to the previous setting. In this setting, k different initial attributes are chosen, and for each one a pair of trees is created from which redescriptions are generated. The probability that algorithms with this parameter setting will choose a good initial candidate is now equal to that of MCMCTreePair (thus we call this parameter setting stable with respect to the choice of initial attributes). AltMCMC and AltExpM allow different tradeoffs between the \( InTr \) and \( RMIter \) parameters; further study of these tradeoffs is left for future work.
We test MCMCTreePair with a budget redistribution weight \(\omega = 0.1\) (standard) and \(\omega = 0.5\) (balanced setting, denoted by “b”).
The statistical significance of the differences between the pooled distributions of the approaches is measured using the Mann–Whitney U test (Mann and Whitney 1947). As can be seen from Table 3, when an initial budget of 1.0 is used, the AltMCMC and AltExpM algorithms have the largest percentage of significant redescriptions on all datasets. This changes when 10 runs with budget 0.1 are used, where MCMCTreePair has the best results on NPAS, MAMMAL, MIMIC-III, and CMS-2-1, but AltMCMC and AltExpM produce the largest percentage of significant redescriptions on CMS-2-4, CMS-2-8, and CMS-2-16. When 100 runs with an initial budget of 0.01 are used, the AltExpM algorithm produces the highest percentage of significant redescriptions out of all filtered redescriptions on all datasets.
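As an aside for readers reproducing this comparison, the snippet below shows how such a test could be run; it assumes the Apache Commons Math 3 library (our illustrative choice, not necessarily what was used in the evaluation) and made-up input samples.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

public class PooledComparison {
    public static void main(String[] args) {
        // Made-up pooled true-Jaccard samples for two algorithms; in the
        // actual evaluation these would be the pooled filtered redescriptions.
        double[] pooledA = {0.61, 0.72, 0.55, 0.80, 0.67, 0.74};
        double[] pooledB = {0.41, 0.52, 0.48, 0.60, 0.39, 0.45};

        MannWhitneyUTest test = new MannWhitneyUTest();
        double p = test.mannWhitneyUTest(pooledA, pooledB); // two-sided p-value
        System.out.printf("Mann-Whitney U test p-value: %.4f%n", p);
    }
}
```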
Better insight into the overall performance of the presented methodologies can be obtained from Table 4. The methodology with the highest median true accuracy of filtered redescriptions in a pooled set is compared to all other approaches. The pooled sets contain the redescriptions produced in all runs by a given algorithm with a chosen initial budget.
These results demonstrate that MCMCTreePair outperforms all other methodologies, with all values of initial budget, on the NPAS dataset. Due to the small size of this dataset, the difference is not always significant when a smaller initial budget is used. As the data size increases, AltMCMC and AltExpM have significantly better accuracy with a budget of 1.0, but already with a budget of 0.1, MCMCTreePair outperforms the other approaches on all datasets up to CMS-2-8. On relatively large datasets, namely CMS-2-8 and CMS-2-16, AltMCMC and AltExpM significantly outperform MCMCTreePair if they are not stabilized, whereas MCMCTreePair outperforms the other algorithms with lower initial budgets (0.1 or 0.01) when the other approaches are stabilized.
Results presented in Table 4 show that every presented approach has its merits and can be effectively used under different circumstances. The AltMCMC and AltExpM algorithms are best used with high initial budgets and on medium-sized and large datasets, whereas the MCMCTreePair algorithm should be used on smaller datasets and with smaller initial budget. The stable parameter setting often significantly reduces the accuracy of redescriptions produced by the AltMCMC and AltExpM algorithms, but it averts the undesirable outcome where no redescriptions are produced.
Table 5 shows the statistical significance of the absolute difference between noisy and true redescription accuracy. It demonstrates that MCMCTreePair has the lowest difference in the majority of cases on the NPAS, MAMMAL, MIMIC-III, CMS-2-1, and CMS-2-4 datasets, whereas AltMCMC as well as, in most cases, AltExpM have the lowest difference on the larger CMS-2-8 and CMS-2-16 datasets.
The distributions of redescription accuracy and absolute difference between noisy and real redescription accuracy corresponding to the results presented in Tables 4 and 5 are provided in the related technical report (Mihelčić and Miettinen 2022).
5.5 Scalability
To test scalability, we ran all proposed differentially private algorithms on the CMS-2-k datasets for \(k=\{1,2,4,8,16\}\). We ran each algorithm 10 times on each dataset. AltMCMC and AltExpM were run with both tested parameter settings (\( InTr = 1\), \( RMIter = k\) and \( InTr = k\), \( RMIter = 1\)). We report the total CPU time in seconds required to perform redescription mining in Fig. 6. The time required to load the input data is excluded to eliminate the effects of I/O and caching.
The algorithms scaled relatively well with data size (even single-threaded) and never used more than \({25}\,\textrm{GB}\) of main memory. Our experiments show that stabilization increases the execution times of the algorithms. This is a consequence of the fact that a larger number of trees need to be built and that the initial binning of an attribute needs to be repeated multiple times. AltMCMC has the lowest execution times, whereas the stabilized version of AltExpM has the largest. The variation in execution times increases for all approaches as the data size grows. As expected, the increase is much larger for the approaches based on sampling, since the involved random choices of attributes can drastically change the structure of the tree or tree-pair being built, potentially leading to rebuilding larger parts of these trees (for more details, see the related technical report Mihelčić and Miettinen 2022).
As Fig. 6 shows, all algorithms finish in less than 10 hours on the largest dataset, containing 115,200 entities.
5.6 Discussion
The performed experimental evaluation shows that the presented differentially private redescription mining algorithms discover useful redescriptions on datasets of various sizes. Due to the introduced noise, all presented differentially private approaches produce a number of redescriptions with very inaccurate estimates of the Jaccard index and \(p\text {-value}\). Yet, as demonstrated, these can be detected and removed with a simple filter on the noisy support size.
The differentially private redescription mining algorithms discover a comparable number of redescriptions to the non-differentially private approaches on medium-sized and large datasets, whereas on smaller datasets this number is reduced. The overall accuracy of the differentially private redescription mining approaches is also lower than that of the non-differentially private approaches, due to the noise required to ensure the privacy of individual records.
The AltExpM and AltMCMC differentially private redescription mining algorithms perform very well on datasets of medium and large size if: a) a maximal budget is used to perform the experiment (thus only one experiment overall can be performed) and b) the data is carefully filtered and pre-processed to ensure successful initialization. With suitably adjusted parameters, requirement b) can be alleviated or completely removed at the cost of a significant drop in the overall accuracy. With the adjusted parameters, AltExpM and AltMCMC perform very similarly to the MCMCTreePair algorithm.
If the data size is relatively small, or if several experiments must be allowed on the data, the experimental results show increasing benefits of using the MCMCTreePair algorithm.
6 Conclusions
Differential privacy offers strong guarantees to protect the privacy of the data of every individual covered by a database. Unfortunately, these guarantees come at the cost of limited availability of information about the dataset and reduced accuracy of the obtained results.
Despite this, such systems allow performing exploratory data analysis on privacy-sensitive data with the aim of generating new hypotheses. Preliminary results can be obtained quickly, safely, and cost-effectively, which can inform the decision to request full access to the data (involving privacy agreements, potential data-use fees, etc.).
Pattern-based data mining methods, such as redescription mining, are especially sensitive to the noise required by differential privacy. Designing effective differentially private pattern mining algorithms is even harder than designing, say, differentially private classification methods. Nonetheless, our experiments show that the proposed differentially private redescription mining algorithms work: they can produce high-quality redescriptions, many of which are statistically significant.
The proposed methods cannot be used on arbitrary data without care, however. The input data might require pre-processing (by the data owner), algorithmic parameters need to be set, and, most importantly, noisy support filtering must be performed to achieve reasonably good performance.
The approaches based on the MCMC sampling of decision trees and the exponential mechanism (AltMCMC and AltExpM) have the potential to return very accurate sets of redescriptions, with over 90% of the reported redescriptions (after noisy support filtering) being in fact significant. However, they are highly dependent on the initialization strategy or data preprocessing (which entirely depends on the data provider). Stabilizing these approaches by choosing more initial attributes has a negative impact on the accuracy and significance of the produced redescriptions, as well as on the absolute difference between their noisy and true accuracies.
The MCMCTreePair algorithm produces overall fewer redescriptions and these have lower accuracy compared to the results of AltExpM and AltMCMC when a budget of 1.0 is used. On the other hand, it is more stable, makes smaller errors in estimating true redescription accuracy, and is especially suited for use on smaller datasets or with smaller budgets, both common constraints in many real-world applications.
Tree-based algorithms are only one family of redescription mining algorithms. An interesting topic for future research is to study whether other types of algorithms could work equally well—or even better—than the proposed tree-based differentially private redescription mining algorithms.
Notes
The source code is available at https://doi.org/10.5281/zenodo.7599160.
https://pypi.org/project/python-clired/ v6.0.4, accessed 18 September 2020.
https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp, accessed 9 June 2020.
https://openpsychometrics.org/_rawdata/, accessed 9 June 2020, data from 19 February 2016.
https://mimic.physionet.org/gettingstarted/overview/, accessed 18 September 2020.
References
Agrawal D, Aggarwal CC (2001) On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS), pp 247–255. https://doi.org/10.1145/375551.375602
Bai X, Yao J, Yuan M et al (2017) Embedding differential privacy in decision tree algorithm with different depths. Sci China Inf Sci 60(082):104. https://doi.org/10.1007/s11432-016-0442-1
Balu R, Furon T (2016) Differentially private matrix factorization using sketching techniques. In: Proceedings of the ACM workshop on information hiding and multimedia security (IH &MMSec), pp 57–62. https://doi.org/10.1145/2909827.2930793
Berlioz A, Friedman A, Kaafar MA et al (2015) Applying differential privacy to matrix factorization. In: Proceedings of the ACM conference on recommender systems (RecSys), pp 107–114. https://doi.org/10.1145/2792838.2800173
Blum A, Dwork C, McSherry F et al (2005) Practical privacy: the SuLQ framework. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS), pp 128–138. https://doi.org/10.1145/1065167.1065184
Ding B, Kulkarni J, Yekhanin S (2017) Collecting telemetry data privately. In: Proceedings of the advances in neural information processing systems (NIPS), pp 3571–3580. https://proceedings.neurips.cc/paper/2017/file/253614bbac999b38b5b60cae531c4969-Paper.pdf
Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9(3–4):211–407. https://doi.org/10.1561/0400000042
Dwork C, McSherry F, Nissim K et al (2006) Calibrating noise to sensitivity in private data analysis. In: Theory of cryptography (TCC), pp 265–284. https://doi.org/10.1007/11681878_14
Dwork C, Rothblum GN (2016) Concentrated differential privacy. arXiv:1603.01887
Dwork C, Rothblum GN, Vadhan S (2010) Boosting and differential privacy. In: Proceedings of the IEEE annual symposium on foundations of computer science (FOCS), pp 51–60. https://doi.org/10.1109/FOCS.2010.12
Fan L (2020) A survey of differentially private generative adversarial networks. In: Proceedings of the AAAI workshop on privacy-preserving artificial intelligence
Fletcher S, Islam MZ (2019) Decision tree classification with differential privacy: a survey. ACM Comput Surv 52(4):83:1-83:33. https://doi.org/10.1145/3337064
Freedman D, Diaconis P (1981) On the histogram as a density estimator: \(l_2\) theory. Z Wahrscheinlichkeitstheorie verw Gebiete 57:453–476. https://doi.org/10.1007/BF01025868
Friedman A, Schuster A (2010) Data mining with differential privacy. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 493–502. https://doi.org/10.1145/1835804.1835868
Fröhlich H, Balling R, Beerenwinkel N et al (2018) From hype to reality: data science enabling personalized medicine. BMC Med 16:150. https://doi.org/10.1186/s12916-018-1122-7
Galbrun E, Miettinen P (2012) From black and white to full color: extending redescription mining outside the Boolean world. Stat Anal Data Min 5(4):284–303. https://doi.org/10.1002/sam.11145
Galbrun E, Miettinen P (2018) Mining Redescriptions with Siren. ACM Trans Knowl Discov Data 12(1):6. https://doi.org/10.1145/3007212
Galbrun E, Miettinen P (2018) Redescription mining. Springer, New York. https://doi.org/10.1007/978-3-319-72889-6
Galbrun E, Miettinen P (2012b) Siren: an interactive tool for mining and visualizing geospatial redescriptions [demo]. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1544–1547. https://doi.org/10.1145/2339530.2339776
Galbrun E, Miettinen P (2016) Analysing political opinions using redescription mining. In: Proceedings of the IEEE international conference on data mining workshop (ICDMW), pp 422–427. https://doi.org/10.1109/ICDMW.2016.0066
Galbrun E, Tang H, Fortelius M et al (2018) Computational biomes: the ecometrics of large mammal teeth. Palaeontol Electron 21.1.3A. https://doi.org/10.26879/786
Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the SIAM international conference on data mining (SDM), pp 334–345. https://doi.org/10.1137/1.9781611972788.30
Gong M, Xie Y, Pan K et al (2020) A survey on differentially private machine learning [review article]. IEEE Comput Intell Mag 15(2):49–64. https://doi.org/10.1109/MCI.2020.2976185
Gursoy ME, Inan A, Nergiz ME et al (2017) Differentially private nearest neighbor classification. Data Min Knowl Discov 31(5):1544–1575. https://doi.org/10.1007/s10618-017-0532-z
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109. https://doi.org/10.1093/biomet/57.1.97
Hijmans RJ, Cameron SE, Parra LJ et al (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978
Imtiaz H, Sarwate AD (2018) Improved algorithms for differentially private orthogonal tensor decomposition. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2201–2205. https://doi.org/10.1109/ICASSP.2018.8461303
Jagannathan G, Pillaipakkamnatt K, Wright RN (2012) A practical differentially private random decision tree classifier. Trans Data Privacy 5(1):273–295. https://doi.org/10.1109/ICDMW.2009.93
Jagannathan G, Monteleoni C, Pillaipakkamnatt K (2013) A semi-supervised learning approach to differential privacy. In: Proceedings of the IEEE international conference on data mining workshop (ICDMW), pp 841–848. https://doi.org/10.1109/ICDMW.2013.131
Jahan T, Narasimha G, Rao CVG (2014) A comparative study of data perturbation using fuzzy logic to preserve privacy. In: Proceedings of the international conference on networks and communications (NetCom2013), pp 161–170. https://doi.org/10.1007/978-3-319-03692-2_13
Kalofolias J, Galbrun E, Miettinen P (2016) From sets of good redescriptions to good sets of redescriptions. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 211–220. https://doi.org/10.1109/ICDM.2016.0032
Karjalainen M, Galbrun E, Miettinen P (2023) Serenade: an approach for differentially private greedy redescription mining. In: Proceedings of the 20th anniversary workshop on knowledge discovery in inductive databases (KDID ’22), pp 31–46
Kenthapadi K, Korolova A, Mironov I et al (2013) Privacy via the Johnson–Lindenstrauss transform. J Priv Confid 5(1):39–71. https://doi.org/10.29012/jpc.v5i1.625
Li H, Xiong L, Ohno-Machado L et al (2014) Privacy preserving RBF kernel support vector machine. BioMed Res Int 2014:827371. https://doi.org/10.1155/2014/827371
Liu K, Kargupta H, Ryan J (2006) Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Tran Knowl Data Eng 18(1):92–106. https://doi.org/10.1109/TKDE.2006.14
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60. https://doi.org/10.1214/aoms/1177730491
McSherry F (2009) Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD), pp 19–30. https://doi.org/10.1145/1559845.1559850
McSherry F, Mironov I (2009) Differentially private recommender systems: building privacy into the Netflix prize contenders. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 627–636. https://doi.org/10.1145/1557019.1557090
McSherry F, Talwar K (2007) Mechanism design via differential privacy. In: Proceedings of the IEEE symposium on foundations of computer science (FOCS), pp 94–103. https://doi.org/10.1109/FOCS.2007.66
Metropolis N, Rosenbluth AW, Rosenbluth MN et al (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092. https://doi.org/10.1063/1.1699114
Mihelčić M, Džeroski S, Lavrač N et al (2017) A framework for redescription set construction. Expert Syst Appl 68:196–215. https://doi.org/10.1016/j.eswa.2016.10.012
Mihelčić M, Šimić G, Babić-Leko M et al (2017) Using redescription mining to relate clinical and biological characteristics of cognitively impaired and Alzheimer’s disease patients. PLoS ONE 12(10):e0187364. https://doi.org/10.1371/journal.pone.0187364
Mihelčić M, Džeroski S, Lavrač N et al (2018) Redescription mining augmented with random forest of multi-target predictive clustering trees. J Intell Inf Syst 50:63–96. https://doi.org/10.1007/s10844-017-0448-5
Mihelčić M, Miettinen P (2022) Differentially private tree-based redescription mining. arXiv:2212.06630
Mitchell-Jones AJ, Amori G, Bogdanowicz W et al (1999) The Atlas of European Mammals. Academic Press, Cambridge
Napolitano F, Zhao Y, Moreira VM et al (2013) Drug repositioning: a machine-learning approach through data integration. J Cheminform. https://doi.org/10.1186/1758-2946-5-30
Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: Proceedings of the ACM symposium on theory of computing (STOC), pp 75–84. https://doi.org/10.1145/1250790.1250803
Ramakrishnan N, Zaki MJ (2009) Redescription mining and applications in bioinformatics. In: Chen J, Lonardi S (eds) Biological data mining. Chapman and Hall/CRC, Boca Raton. https://doi.org/10.1201/9781420086850.ch22
Ramakrishnan N, Kumar D, Mishra B et al (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 266–275. https://doi.org/10.1145/1014052.1014083
Rana S, Gupta SK, Venkatesh S (2015) Differentially private random forest with high utility. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 955–960. https://doi.org/10.1109/ICDM.2015.76
Rubinstein BIP, Bartlett PL, Huang L et al (2012) Learning in a large function space: privacy-preserving mechanisms for svm learning. J Priv Confid 4(1):25. https://doi.org/10.29012/jpc.v4i1.612
Tsou YT, Zhen H, Jiang X et al (2020) DPARM: differentially private association rules mining. IEEE Access 8:142131–142147. https://doi.org/10.1109/ACCESS.2020.3013157
Vaidya J, Shafiq B, Basu A et al (2013) Differentially private Naive bayes classification. In: Proceedings of the IEEE/WIC/ACM international joint conferences on web intelligence and intelligent agent technologies (WI-IAT), pp 571–576. https://doi.org/10.1109/WI-IAT.2013.80
Van Ginneken B, Ter Haar Romeny BM, Viergever MA (2001) Computer-aided diagnosis in chest radiography: a survey. IEEE Trans Med Imaging 20:1228–1241. https://doi.org/10.1109/42.974918
Wang Y, Anandkumar A (2016) Online and differentially-private tensor decomposition. In: Proceedings of the advances in neural information processing systems (NIPS), pp 3531–3539. https://proceedings.neurips.cc/paper/2016/file/7eb7eabbe9bd03c2fc99881d04da9cbd-Paper.pdf
Xu S, Su S, Xiong L, et al (2016) Differentially private frequent subgraph mining. In: Proceedings of the IEEE international conference on data engineering (ICDE), pp 229–240. https://doi.org/10.1109/ICDE.2016.7498243
Zeng C, Naughton JF, Cai JY (2012) On differentially private frequent itemset mining. Proc VLDB Endow 6(1):25–36. https://doi.org/10.14778/2428536.2428539
Zinchenko T, Galbrun E, Miettinen P (2015) Mining predictive redescriptions with trees. In: Proceedings of the IEEE international conference on data mining workshops (ICDMW), pp 1672–1675. https://doi.org/10.1109/ICDMW.2015.123
Zorarpacı E, Özel SA (2020) Differentially private 1R classification algorithm using artificial bee colony and differential evolution. Eng Appl Artif Intell 94:103813. https://doi.org/10.1016/j.engappai.2020.103813
Acknowledgements
Part of this work was done while the first author was with University of Eastern Finland. The authors would like to thank Esther Galbrun for her help with experiments involving the python-clired package and for her helpful comments about the manuscript. We would also like to thank Pekka Hänninen for developing scripts to transform the data to the .arff format.
Funding
Open access funding provided by University of Eastern Finland (UEF) including Kuopio University Hospital. Partial financial support was received from Regional Council of Pohjois-Savo under the ERDF project Alueellinen SOTE AI hub (A74608).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Responsible editor: Johannes Fürnkranz.
Appendices
Correctness of transition probabilities
Given a set of objects \({\mathcal {X}}\) and a scoring function \(q:({\mathcal {X}}^n\times {\mathbb {R}})\rightarrow {\mathbb {R}}\), the main idea behind the simulation of the exponential mechanism based on MCMC sampling is to let the probability of a transition from object \(O_1\) to object \(O_2\) be \(\exp \big (\frac{\varepsilon \cdot q(O_2)}{2\cdot \Delta q}\big ) \big / \exp \big (\frac{\varepsilon \cdot q(O_1)}{2\cdot \Delta q}\big )\), where \(\Delta q\) denotes the sensitivity of the quality function q. Then, upon convergence, the object O is chosen with probability

$$\Pr [O] = \frac{\exp \big (\frac{\varepsilon \cdot q(O)}{2\cdot \Delta q}\big )}{\sum _{O' \in {\mathcal {X}}} \exp \big (\frac{\varepsilon \cdot q(O')}{2\cdot \Delta q}\big )},$$

which corresponds to the probabilities assigned by the exponential mechanism.
The proposed algorithm utilizes the Metropolis–Hastings algorithm (Hastings 1970; Metropolis et al. 1953) with a symmetric proposal distribution. This can be seen from the construction of the transitions. First, the algorithm chooses a node n uniformly at random among the nodes in the pair of trees, i.e. from \(N_i(T_L)\cup N_i(T_R)\). Then, depending on the tree from which the node was chosen, the new node \(n'\) is chosen uniformly at random among all possible attribute–split choices from the corresponding view, i.e. from \({\mathcal {S}}(\textbf{D}_{\texttt {view} (n)})\). Thus, the overall probability is \((\vert N_i(T_L)\cup N_i(T_R)\vert \cdot \vert {{\mathcal {S}}({\textbf {D}}_{\texttt {view} (n)})}\vert )^{-1}\). The original pair can be obtained with the same probability starting from the newly constructed tree pair, so the proposal is indeed symmetric.
It follows from the construction of the initial pairs of trees and the positive transition probabilities that every possible pair of trees of a predefined depth d can be reached with a positive probability. The algorithm accepts the MCMC transition from a pair of trees \((T_{L},\ T_{R})\) to a pair \((T'_{L},\ T'_{R})\) with probability \(\min \{1, z'/z\}\), where \(z = \exp \big (\frac{\varepsilon \cdot score (T_{L},\ T_{R})}{2\cdot \Delta score }\big )\) and \(z' = \exp \big (\frac{\varepsilon \cdot score (T'_{L},\ T'_{R})}{2\cdot \Delta score }\big )\). This follows the idea described above; thus, upon convergence, the tree-pair \((T_L,\ T_R)\) is chosen with probability

$$\Pr [(T_L,\ T_R)] = \frac{\exp \big (\frac{\varepsilon \cdot score (T_L,\ T_R)}{2\cdot \Delta score }\big )}{\sum _{(T'_L,\ T'_R)} \exp \big (\frac{\varepsilon \cdot score (T'_L,\ T'_R)}{2\cdot \Delta score }\big )}.$$
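A minimal sketch of this acceptance step, assuming a symmetric proposal as argued above; the class and method names are hypothetical, and the state scores are passed in directly.

```java
import java.util.Random;

public class ExpMechAcceptance {
    private final Random rng = new Random();
    private final double epsilon;      // privacy budget for this selection step
    private final double sensitivity;  // Delta(score); 1 for the tree-pair score

    ExpMechAcceptance(double epsilon, double sensitivity) {
        this.epsilon = epsilon;
        this.sensitivity = sensitivity;
    }

    // Metropolis-Hastings acceptance under a symmetric proposal: move from a
    // state with score s to a proposed state with score sPrime with
    // probability min{1, exp(eps * (sPrime - s) / (2 * Delta))}.
    boolean accept(double s, double sPrime) {
        double logRatio = epsilon * (sPrime - s) / (2.0 * sensitivity);
        return Math.log(rng.nextDouble()) < Math.min(0.0, logRatio);
    }
}
```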
Proof of Lemma 1
First, note that the sensitivity of counting queries is 1 (see Dwork and Roth 2014). Recall from Eq. 5 that, given a decision tree T and a target variable \({\textbf{c}}\), the quality of the subtree rooted in node n is defined as

$$g(n) = \begin{cases} \frac{\tau _n}{\tau }\sum _{c \in {\textbf{c}}} \big (\frac{\tau _{n,c}}{\tau _n}\big )^2 & \text {if } C(n) = \emptyset ,\\ \sum _{m \in C(n)} g(m) & \text {otherwise,} \end{cases}$$
where C(n) denotes the set of children of n, while \(\tau \), \(\tau _n\), and \(\tau _{n,c}\) denote respectively the total number of entities in the dataset, the number of entities in n, and the number of entities of class c in n.
Since \(\tau _{n,c} \ge 0\) for all n and c, we have that \(\sum _{c \in {\textbf{c}}} \tau _{n,c}^2 \le \big (\sum _{c \in {\textbf{c}}} \tau _{n,c}\big )^2\). Furthermore, since \(\sum _{c \in {\textbf{c}}} \tau _{n,c} = \tau _n\), we have that

$$0 \le \sum _{c \in {\textbf{c}}} \Big (\frac{\tau _{n,c}}{\tau _n}\Big )^2 \le \frac{\big (\sum _{c \in {\textbf{c}}} \tau _{n,c}\big )^2}{\tau _n^2} = 1.$$
Since \(0 \le \tau _n/\tau \le 1\) and \(\sum _{n \in N_l(T)} \tau _{n}/\tau = 1\), the sum of scores of leaves is a convex sum, and hence takes value between 0 and 1, so that \(g(n) \in [0,1]\) and \(g( root (T)) = \sum _{n \in N_l(T)} g(n) \in [0,1]\), in particular.
Therefore, the measure \( score (T_L,\ T_R)\) used to evaluate a pair of trees, defined in Eq. 6 in terms of \(g( root (T_L))\) and \(g( root (T_R))\), also takes its values in the unit interval. Thus, the sensitivity of \( score (T_L,\ T_R)\) equals 1.
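To make the bound concrete, the sketch below computes \(g( root (T))\) from per-leaf class counts, mirroring the quantities in the proof: each leaf n contributes \((\tau _n/\tau )\sum _{c}(\tau _{n,c}/\tau _n)^2\), and the root score sums these over the leaves. It is an illustration of the proof's quantities, not the paper's implementation.

```java
public class TreeScore {
    // classCounts[n][c] holds tau_{n,c} for leaf n and class c; tau is the
    // total number of entities in the dataset.
    static double rootQuality(long[][] classCounts, long tau) {
        double g = 0.0;
        for (long[] leaf : classCounts) {
            long tauN = 0;
            for (long c : leaf) tauN += c;           // tau_n
            if (tauN == 0) continue;                 // an empty leaf contributes 0
            double purity = 0.0;
            for (long c : leaf) {
                double p = (double) c / tauN;        // tau_{n,c} / tau_n
                purity += p * p;
            }
            g += ((double) tauN / tau) * purity;     // convex weight tau_n / tau
        }
        return g;                                    // lies in [0, 1]
    }
}
```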
Details on experimental evaluation and datasets
1.1 Dataset properties and pre-processing
We prepared the CMS dataset(s) based on another publicly available dataset, namely the Data Entrepreneurs' Synthetic Public Use File (DE-SynPUF)Footnote 3 released by the Centers for Medicare and Medicaid Services (CMS). It consists of realistic synthetic records of beneficiaries and claims information for three consecutive years (2008–2010). An entity in the dataset is a (synthetic) beneficiary. As the left-hand side table, we collected the personal information (birth year, sex, race, etc.) and the diagnoses received by the beneficiaries over the three years. ICD-9 diagnosis codes were mapped to the second level of the categorization scheme from the Clinical Classifications Software.Footnote 4 The right-hand side table contains the yearly total amounts of different reimbursement, deductible, and co-insurance components relevant to inpatient, outpatient, carrier, and prescription drug event claims. As a result, one side of the data is mostly Boolean (indicators of different diagnoses) while the other is fully numeric. The original data is divided into 20 samples. We used Sample 2, which we split further into 16 segments based on the first digit of the hexadecimal identifier assigned to each beneficiary. Each segment contains roughly 7200 beneficiaries. We can construct datasets of different sizes by merging segments together. We denote by CMS-2-k the dataset containing the first k segments from Sample 2. For instance, CMS-2-4 contains the beneficiaries from Sample 2 whose identifier starts with ‘0’, ‘1’, ‘2’, or ‘3’.
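For illustration, the segment membership test can be written as follows; this is a hypothetical helper, not the actual preprocessing script.

```java
public class CmsSegments {
    // A beneficiary belongs to CMS-2-k when the first hexadecimal digit of
    // its identifier is one of 0, ..., k-1. Returns false for malformed ids.
    static boolean inFirstKSegments(String beneficiaryId, int k) {
        if (beneficiaryId == null || beneficiaryId.isEmpty()) return false;
        int digit = Character.digit(beneficiaryId.charAt(0), 16);
        return digit >= 0 && digit < k;
    }
}
```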
NPAS is a publicly availableFootnote 5 collection of answers to an online psychological assessment questionnaire. One side of the data contains background information about the respondents (age, education level, etc.) and the other their answers to the questions on a 5-point Likert scale. We removed all answers that contained missing values, since the SplitT and LayeredT baseline algorithms cannot handle them. NPAS is significantly smaller than the CMS datasets, and it also differs from the others in that it contains categorical variables. We use it in part to evaluate the impact of noise on smaller datasets.
The MIMIC-III datasetFootnote 6 is a prime example of health data where patient privacy is extremely important. It contains de-identified health information about patients who were treated at the Beth Israel Deaconess Medical Center (Boston, MA, US). Using this data, we constructed a dataset containing two views: the diagnoses view and the lab events view. The aim when mining this dataset is to find out what kind of lab events preceded and followed various diagnoses in the studied patients.
The dataset contains 808 Boolean attributes in the diagnoses view (1 indicating a positive diagnosis) and 260 ternary attributes in the lab events view. The ternary attributes distinguish between normal lab test results (value 0), abnormal lab test results (value 1), and lab tests that were not performed (value \(-1\)) for a given patient. With 46,065 entities, it is a medium-sized dataset.
The MIMIC-III, CMS, and NPAS datasets are examples of datasets that contain potentially sensitive information. Our last dataset, MAMMAL, is a collection of records about the distribution of land mammal species (Mitchell-Jones et al. 1999) and climate (Hijmans et al. 2005) in Europe. This information is hardly sensitive, but we include this dataset as a point of reference, because it has been used in experiments in several redescription mining articles (see, e.g. Galbrun and Miettinen 2012a, b, 2018a; Kalofolias et al. 2016; Mihelčić et al. 2017; Zinchenko et al. 2015).
1.2 Filtering NPAS data
NPAS is a relatively small dataset (just 17 and 47 attributes for 1221 entities) containing attributes of mixed type in the first view and numeric attributes in the second view. In order to obtain any useful redescriptions despite the noise introduced by the differential privacy mechanisms, we need to make sure the input data is properly prepared. Since this step can be performed on the side of the data provider, pre-processing and filtering the data does not break any privacy constraints, and the resulting redescriptions remain differentially private.
As a first filtering step, we removed all attributes containing only one value. Such attributes are useless in every data mining task, but are especially harmful in a scenario where differential privacy needs to be preserved. Using such attributes not only wastes budget but also causes the creation of many redescriptions with empty support, which are neither detected nor removed due to the introduced noise.
As a second step, we merged some values of categorical attributes. Similarly to useless attributes, categorical values that occur very rarely can cause the creation of many redescriptions of very poor quality. For a given attribute, we merged the categorical values occurring in fewer than 50 entities. As a consequence, the new category \(category_{new}\) must be interpreted as \(category_{old1}\ \vee \ category_{old2}\).
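A minimal sketch of this merging step, assuming a single categorical column stored as strings and an illustrative MERGED_RARE placeholder for the combined category.

```java
import java.util.HashMap;
import java.util.Map;

public class RareCategoryMerging {
    // Replace the values of a categorical column that occur in fewer than
    // minCount entities with a single merged category; the merged value is
    // read as the disjunction of the original ones. Names are illustrative.
    static String[] mergeRare(String[] column, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) counts.merge(v, 1, Integer::sum);

        String[] merged = new String[column.length];
        for (int i = 0; i < column.length; i++) {
            merged[i] = counts.get(column[i]) < minCount ? "MERGED_RARE" : column[i];
        }
        return merged;
    }
}
```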
Redescriptions with smaller support are more affected by the introduced noise. This is expected, since the level of noise introduced depends on the algorithm parameters and not on the data size. For this reason, it is substantially harder to obtain a good differentially private redescription set on datasets containing fewer than 2000 entities.
1.3 Parameters used in experiments
The maximum tree depth d affects mostly the complexity of the resulting redescriptions and depends on the application. We used a maximum depth of 4 in all experiments. This depth allows creating interpretable redescriptions (with queries containing \(\le 4\) attributes) and keeps the level of noise required to obtain negated queries at a reasonable level. All algorithms use the Gini coefficient as the splitting criterion.
The minimum support of the redescriptions should be set small enough to find all interesting results, and it depends somewhat on the data size. In the differentially private setting, where information about the data is unavailable, the preferable setting is \(\ge 10\), since this will remove redescriptions with very small or very inaccurate support. If one knows that the used data is large (regardless of the real number of entities), this threshold can be increased to 100, which will remove an even larger number of redescriptions that are very sensitive to noise. The noisy counting also means that the Jaccard index computation can be very inaccurate for redescriptions with very small support. Hence, we mined redescriptions using the same relatively low minimum support for all algorithms (the only knowledge we used is whether the data is small or medium/large, which does not give out any sensitive information) and then pruned away those redescriptions returned by MCMCTreePair that had too small (noisy) support. The minimum support for the redescriptions was 10 for MAMMAL and NPAS and 100 for MIMIC-III and CMS. In post-processing, we removed all redescriptions from MCMCTreePair that had noisy support smaller than 2000 for MIMIC-III, 1000 for CMS, 500 for MAMMAL, and 200 for NPAS. How to obtain a proper post-processing filtering threshold is explained in Sect. 4.6.
All algorithmic parameters used to perform the experiments are listed in Table 6. AltMCMC and AltExpM have an additional parameter, \( RMIter \). For these two algorithms we performed two experiments: the first uses \( InTr = 1\) and \( RMIter \) equal to the number of initial trials from Table 6, and the second uses \( InTr \) as reported in Table 6 and \( RMIter = 1\). The second setting ensures an equal probability of selecting a satisfactory initial attribute for all approaches.
One should use a high number of MCMC iterations (the \( MCIter \) parameter) to ensure the obtained tree-pairs appropriately mimic the probability distribution of the exponential mechanism. We used 10,000 MCMC iterations, a relatively high number for shallow trees (depth \(\le 4\)). Determining the exact number of iterations for general trees requires knowledge about the depth of the tree, which depends on the data size and on experimental evaluation (Bai et al. 2017), neither of which is possible in the differentially private setting. The MCMC iteration termination variance threshold \(\sigma \) terminates the MCMC iterations if the variance of the MCMC score over the previous k iterations is smaller than this threshold (there is little change in the score, so we may consider the process to have converged); \(k = 500\) is used in all experiments. The threshold \(\sigma \) should be set to a small fraction of the quality score's value range. Here, we set it to 0.5% of the value range of the tree or tree-pair quality evaluation function.
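The termination rule can be sketched as a sliding-window variance check; the class below is an illustration of the rule as described, not the implementation used in the experiments.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ConvergenceCheck {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int k;         // window length; 500 in the experiments
    private final double sigma;  // variance threshold

    ConvergenceCheck(int k, double sigma) {
        this.k = k;
        this.sigma = sigma;
    }

    // Record the score of the current MCMC iteration and report whether the
    // sample variance over the last k scores has dropped below sigma.
    boolean converged(double score) {
        window.addLast(score);
        if (window.size() > k) window.removeFirst();
        if (window.size() < k) return false;

        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = window.stream()
                           .mapToDouble(s -> (s - mean) * (s - mean))
                           .sum() / (k - 1);
        return var < sigma;
    }
}
```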
The \( InTr \) and \( RMIter \) parameters are very important, since more initial trials and iterations allow creating more redescriptions. A larger \( InTr \) allows using a larger number of different initial targets to start growing trees or tree-pairs, and a larger \( RMIter \) allows more alternations inside the AltMCMC and AltExpM algorithms. However, increasing either of these parameters inevitably reduces the budget for the construction of an individual tree or tree-pair. As it turns out, using 4 iterations with a high weight (\(1-\omega =0.9\)) for noisy counts in MCMCTreePair leads to fairly accurate redescriptions even on small datasets (such as MAMMAL and NPAS). On datasets with a larger number of entities, the \( InTr \) parameter can be increased. The weight parameter \(\omega \) balances the budget between tree-pair creation and noisy counts. Using a very low \(\omega \) causes tree-pairs to be of lower quality, which results in a smaller number of produced redescriptions. At the same time, since this means allocating a larger portion of the budget to noisy counts, the remaining redescription statistics are computed more accurately.
The privacy budget \(\varepsilon \) is set to 1 if the algorithm is to be run only once on the data; no more runs would be permitted after that. Using a smaller privacy budget per run allows multiple exploratory runs on the data.
The impact of the further algorithm parameters in \(\Gamma \) is very limited on datasets with a small number of entities and for redescriptions with small support set size. Among these parameters, the most useful one is \(\Gamma . minSupp \), which is used for redescription filtering (as explained earlier). On the other hand, \(\Gamma . maxSupp \) is typically set to 0.8, to prevent various potentially uninteresting results such as tautologies and insignificant redescriptions (which are hard to detect internally due to noise). Applying it requires first estimating the data size. Since noisy counts are relatively accurate at the higher end of the support size spectrum, setting the maximum support to 80% of the dataset size should remove the majority of redescriptions whose support is equal or very close to the set of all entities in the dataset. The p-value threshold (\(\Gamma .p_{ val }\)) is set to 0.01 by default, but even stricter values might make sense in this scenario. The parameter \(\Gamma . minJ \) is the most affected by the introduced noise, and setting it to a very high value comes with the risk of eliminating many true, accurate redescriptions.