This is the second post of my data privacy series. You may want to also check out this post for an introduction to data privacy and differential privacy. 🙂
For convenient, let’s repeat the definition of differential privacy from the previous post.
Definition: -Differential Privacy
The query is -differentially private if and only if for any subset of the range of ,
(1)
where and are any two data sets where they differ by at most one record and one data set is a subset of the other.
It is common in the literature to refer the “private” way to disclose the statistics of a data set to as privacy mechanisms. To ensure differential privacy, two common mechanisms are considered: Laplace mechanism and exponential mechanism.  But before discussing them, let’s define the sensitivity of a function applying on a data set as follows.
Definition: Sensitivity of a function
The sensitivity of a function on a data set, , is defined by
(2)
where and are any two data sets where they differ by at most one record and one data set is a subset of the other.
As one can see from the definition, sensitivity measure to a extreme how much a single record can change the function value of a query. Intuitively, a malicious user may have higher chance to infer private information from a more sensitive query than a less sensitive one. Therefore more protection may be needed for a more sensitve query.
Laplace Mechanism
Instead of returning the true computation, , the Laplace mechanism returns the query result plus a Laplacian noise with standard deviation (denoted as ). In other words, it returns .
It is not difficult to show that (1) (and hence differential privacy) will be satisfied. Note that (1) is true for all subset if it is true for all singleton set . Therefore we can restrict ourselves to only consider the latter case. For any , note that
(3)
where is due to triangle inequality () and exponential function being monotonically increasing, and is from the definition of sensitivity.
Now, if the system allows more queries, intuitively more protection (hence more distortions to queries) will be needed since there is a larger chance for a malicious user to identify a record owner via analyzing the additional information.
Consider the case when we have two query functions and . To ensure that a malicious user can’t take advantage the joint statistics of the queries to identify a record, we should modify the differential privacy definition in (1) to
(4)
where and are the modified query output functions for and , respectively. Again, and are any two data sets where they differ by at most one record and one data set is a subset of the other.
As in the single query case, we can construct and by simply adding Laplacian noises to the true results. However, we will require a larger noise as mentioned earlier. Actually, it turns out that Laplacian noise is sufficient. In other words, we may choose and .
Similar to the single query case, to show (4) to be true for all , it is sufficient to show that it is true for any singleton set . Then,
(5)
where is due to triangle inequality and is again from the definition of sensitivity.
Exponential Mechanism
When an attribute is categorical, we cannot simply attach a noise to it and so Laplace mechanism will not apply in that case. To get around this, we introduce exponential mechanism, which essentially adds noise to the distribution of the attribute rather than the attribute itself.
Recall that the core idea of privacy mechanism is to introduce randomness to query outcome. Therefore, even for a categorical query the output should not be deterministic. For a data set and a particular query, we may specify an utility function to control the preference of the output. Essentially, the larger , the more likely will be the output.
Definition: Exponential Mechanism
Given an utility function , we output with probability proportional to , where is the sensitivity of given by
(6)
and and are any two adjacent data sets.
Note that the definition of sensitivity differs from the previous one.
Exponential Mechanism is differentially private
The exponential mechanism described above is differentially private. Intuitively, one cannot gather information of an individual record out of the distribution of the observed outcomes.
Denote as the output of the exponential mechanism.
Therefore, the probability of getting for data set can be written as
(7)
Note that if we shift from to an adjacent data set . The change of numerator is bounded by and the change of denominator is also bounded by . So overall the change in probability is bounded by . Thus the mechanism is -differentially private.
What next?
In the next post, we will discuss the effect of compositions to privacy mechanism.