Criticism of differential privacy

This is the fourth post of my data privacy series. Please refer to the first post for details on differential privacy. 🙂

While differential privacy is probably the hottest research area in data privacy, it is not immune from criticism. The strongest criticism of differential privacy is that the noise amplitude has to increase as the number of queries grows. In essence, there is a limit to how many queries can be answered before a data set becomes completely drowned in noise. To be fair, this may be a fundamental price that has to be paid for privacy. However, differential privacy demands that the total number of queries be known beforehand. No simple retroactive corrective measure has been suggested for the scenario where a data set owner wants to change the allowed number of queries afterward.
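A minimal Python sketch of this budget problem, assuming a counting query and basic sequential composition (a total budget \epsilon split evenly across k queries); the budget of 1.0 and the true count of 250 are made up purely for illustration:

```python
import numpy as np

def laplace_count(true_count, epsilon_per_query):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    scale = 1.0 / epsilon_per_query  # Laplace scale b = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

total_epsilon = 1.0   # hypothetical total privacy budget
true_count = 250      # hypothetical true answer

# Under basic sequential composition, answering k queries means each query
# only gets epsilon/k of the budget, so the per-query noise scale grows with k.
for k in [1, 10, 100]:
    eps_per_query = total_epsilon / k
    noisy = laplace_count(true_count, eps_per_query)
    print(f"k={k:>3}  noise scale={1.0/eps_per_query:>6.1f}  one sample answer={noisy:8.1f}")
```

With the whole budget spent on a single query the noise scale is 1; spread over 100 queries, each answer is perturbed with a scale of 100, which quickly swamps small counts.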

Another potential problem with differential privacy is that it is overly pessimistic. The Laplace mechanism relies on the sensitivity of the query function, which is defined over the worst-case scenario and ignores the actual data one possesses. It is quite possible that the pair of “adjacent” data sets that maximizes the difference in the sensitivity definition will never occur in practice. If so, a much lower noise level could be sufficient to provide the same level of privacy. Indeed, the definition of differential privacy is essentially “blind” to the actual data, as it requires the chosen condition to be satisfied for all potential pairs of data sets. It rules out the possibility that one might be better off avoiding the worst-case scenarios altogether by trimming the records that are “sensitive” to the query.
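A small Python sketch of the gap being described, using a made-up sum-of-incomes query over a domain capped at 1,000,000. (Calibrating noise to the records actually at hand is not by itself differentially private; the sketch only shows how pessimistic the worst-case scale can be.)

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0

# Hypothetical data set: incomes on a declared domain of [0, 1,000,000]
domain_max = 1_000_000
incomes = rng.normal(50_000, 10_000, size=10_000).clip(0, domain_max)

# Global (worst-case) sensitivity of a sum query: the largest record anyone
# *could* contribute, regardless of what the data set actually contains.
global_sensitivity = domain_max

# The largest record actually present is far smaller, so noise calibrated
# to the worst case is much larger than what this particular data set "needs".
largest_actual_record = incomes.max()

print("Laplace scale from worst-case sensitivity:", global_sensitivity / epsilon)
print("Scale if noise were tuned to this data:   ", largest_actual_record / epsilon)
```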

Another problem is that the privacy parameter has no physical meaning to the data set owner. There is no intuition about what value of \epsilon is sufficient to ensure privacy. In contrast, the corresponding parameters in earlier privacy formulations such as k-anonymization and l-diversity can be interpreted intuitively. Once again, the definition itself is essentially “data blind”, to the extent that it is indifferent to the magnitude of the query outcome. This is because the inequality in the definition has to be satisfied not just for all adjacent pairs of data sets, but also for all subsets of the range of the query function. For example, for the counting query most commonly used to explain differential privacy, the degree of protection is exactly the same whether the count is 1 or 10,000 (see the sketch below). Intuitively, if the query counts the number of AIDS patients living in a particular zip code, we expect that stronger protection of patient privacy is needed when the number of patients is significantly smaller.

Finally, because of this data-blind nature, differential privacy completely ignores the possibility of quantifying the utility of a data record. It essentially adds noise evenly to all records, even though some records may be more useful than others. Moreover, the parameter \epsilon, lacking any realistic interpretation, is usually predetermined rather arbitrarily and acts as a hard privacy constraint. No systematic way is given to trade off privacy and utility (i.e., how one should adjust \epsilon if higher utility is desirable).
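To make the counting-query point above concrete, here is a minimal Python sketch (\epsilon = 0.5 is an arbitrary choice) showing that the Laplace mechanism perturbs a count of 1 and a count of 10,000 with exactly the same noise distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.5
scale = 1.0 / epsilon  # a counting query has sensitivity 1, so the scale is fixed

# The mechanism adds identically distributed noise whether the true count
# is 1 or 10,000: the definition offers exactly the same guarantee in both cases.
for true_count in [1, 10_000]:
    samples = true_count + rng.laplace(0.0, scale, size=5)
    print(f"true count {true_count:>6}: sample releases {np.round(samples, 1)}")
```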
