(Local) Differential Privacy for Dummies :D
TL;DR - This blog is for the very beginners to gain complete clarity on what is LDP.
Before we begin let's all pledge to unanimously agree that
"Data is invaluable".
Safe and effective use of data could potentially save lives, enhance the livelihood, to quote Michel Jackson - "Make it a better place - for you and for me and the entire human race" (and for other organisms too).
On that note, let's begin with a short story (sort of based on real-life incidents, sort of added elements for explanation purpose). One morning in the 1990s, The governor of Massachusetts - Mr. William Weld and the Group Insurance Commission had a long discussion on how much data they had, and how it could be useful if others could utilize it and make sensible judgments. So, towards the end of the discussion, they said something like,
“Let’s release the health data of the people to make sensible judgements”.
By 'sensible judgments' they were hoping something like,
Inferring what disease people have the most,
Hospitals can know what medical equipment should we procure more?
Anti-Pollution Activists can check if there is an environmental pollutant that’s causing this.
Or, If we know the answer to the question of how many people have diabetes,
Health care officials can find out what kind of medical awareness campaign they should conduct?
Food companies can check if they should promote a particular diet?
Good stuff right :D
At this point, the Governor and the Insurance company had some doubts regarding the privacy aspect of releasing the data. Hence they decided to do something like the following,
They remove the key identifiers in the data, like name and address of each individual participant.
A few months later, a then MIT Ph.D. student Latanya Sweeny found something was out of place with this dataset. She went to the Cambridge town hall and got a copy of the city's voters list (Fun fact: for 20$) and compared it with the data released by the Governor. She found that there was just one person with the specific gender who was born in that particular zip code on that day (given the zip code was covering a brief area). This way she found that the privacy of over 80% of the people in the data set was compromised including that of the governor's.
They tried doing something good with the data, but it ended up violating privacy and causing harm to a lot of people. How could we solve this?. If we notice closely, the information we wanted to extract from the data was,
What disease people have the most?
How many people have diabetes?
We did not want to know the health condition of an individual but of the collective population.
“We need an aggregate of the data not information about the individual in the data”
But there is a problem with aggregation too. You have most likely seen this histogram while browsing through google for popular places. This histogram displays the number of visitors to a place at a given point in time. Aka, it aggregates the number of visitors to a place. Say for example you're my friend. On Monday I look up the number of people who have visited a movie theater, the histogram says "6 people visited on Sunday". Later sometime on Monday, you tell me you just deleted your Google account, and accidentally I also tend to look up the visitor histogram for the same theater again and see that now there are just "5 people who have visited on Sunday". Based on my knowledge that you deleted your Google account and there is a drop in the number of visitors last Sunday - I know you have visited the theatre on Sunday. Long story short, I just ended up breaching your privacy !!
P.S. Google does not just add up and display the numbers so don't worry about your privacy :D
Removing identifiers did not help, merely aggregating also did not help, what else do we do? If we break down the problem, we want to have the same histogram irrespective of the presence of an individual in the data set.
We want a magic portion that would yield the same result even if one of the entries in it is removed/varied in one of the datasets.
TL;DR - The magic portion is called Noising
Let's see what is 'Noising' with the following (really good) analogy. Credits to Ananth Raghunathan.
This is the portrait of Mona Lisa, it's made up of Ms and dots.
Zoomed in version of the portrait at the red box looks like the following,
If we flip each element with a probability of 25 %, this is how the zoomed picture looks.
The complete image after flipping looks like the following,
The inference is that we still see Mona Lisa, even after flipping the individual data. In other words, we can sustain the big picture even after nosing out the individual data, Voila!. But, how do we do this with real data ?. There are plenty of algorithms and ways to noise data. One such algorithm which we would be discussing further is called Randomised Response. Apart from the wonky name, the algorithm is very simple to understand and is one of the most fundamental methods for noising. How simple you ask...? We just flip coins to noise the data.
Flip a coin to protect your data :D
Say we have a questionnaire for asking users if they smoke, something similar to the disease dataset we had earlier (just less gruesome). The user fills it out in their phone and the data undergoes some sort of noiser (aka magic) in the phone and the result from the noiser is sent to the respective server of the questionnaire owner.
Let's say the sample data a user enters looks like the following,
With the obtained data we flip a coin (an imaginary one) and follow the flow chart below.
Assuming the coin has a 0.5 probability for head and 0.5 for a tail, chances are that we say the truth twice, say the user smokes (irrespective of their input) once, and say that they don't smoke (irrespective of the input) once. Analyzing at the math of this,
Proportion of True = p = ¾ Proportion of False = (1-p) = ¼ P(Event) = ¼(1-p) + ¾(p) = ¼ + (p/2)
Let's see how this has worked. Assume the following table contains 10,000 rows of information of people, their smoking status, and that it is noised (it was passed through the coin flipper).
Let's see if we could interpret the following without inflicting an individual's privacy.
How many people smoke..? - We can answer this by counting the number of Yes in the Smoking column. We also know the answer we get here is not accurate and is biased by 1-p so we could adjust our interpretations accordingly.
Do you think Penny smokes..? - We don't know for sure, since she might have been a noised entry. Hence we just protected her privacy by making her data ambiguous.
Does removing Penny impact us..? - Well not really, we have noised out individual user's contribution so removing one contribution (give the data set is large enough), won't impact a lot.
Hence the moral of the story is,
If I see that someone smokes - they might not smoke.
The truth is biased by 1-p
We see an approximate result but a private result
The change in the dataset based on an individual’s contribution is less
Yay !! - We just made the data differentially private :D
But wait, is there a coin on my phone to flip? - would be cool to have one, but nope, there is none. Instead, we use epsilon - ɛ.
Epsilon-Differential Privacy can be mathematically described as done by Cynthia Dwork in her incredible book "The Algorithmic Foundations of Differential Privacy".
The case where noising occurs in the user's device is called Local Differential Privacy. If all of the raw user data is sent to the company's server and they perform noising before providing it to the product teams for analysis the method is called - Central Differential Privacy. There is a vast difference with utility while comparing various Central DP methods with Local DP ones. Ideally, Central DP offers better utility in comparison to Local DP. ON a very abstract level, Shuffled DP is a concept that could strike a balance between the two, offering privacy as good as LDP and utility as good as Central DP. If you seem intrigued by Local DP, do check out information about Shuffled DP as well :D
Some super-useful reference links to get a deeper insight on the nuances of LDP -
https://www.youtube.com/watch?v=tuOBz5AzivM (Ananth Raghunathan's speech)
All the opinions stated above are just my own not that of my employer. It took me months to get a good clarity on this concept and hence I thought this could probably help someone out there to ramp up faster :D.
If you have any feedback feel free to write to email@example.com