According to a study published in 2000, 87% of the American population has a unique combination of ZIP code, birth date and sex. In other words, these three pieces of information together are enough to single out each of those individuals. So which information counts as “identifying,” and how is identifiable information protected?
This article looks at de-identification, or anonymization, of personal information. It also examines how re-identification techniques can use supposedly anonymous information to identify individuals, underscoring the shortcomings of anonymization.
De-identification refers to the process of treating sensitive data in such a way that the individuals it describes cannot be identified. For instance, de-identification may involve removing specified identifiers from the data. The criteria for de-identification may differ from one piece of legislation to another.
There are numerous reasons for de-identifying data. The de-identified, or anonymized, data allows important information to be accessed and analyzed, while still protecting the privacy of individuals. This win-win compromise allows analysts to use this information, while preventing identity thieves and other unauthorized entities from identifying the individuals.
Data mining companies often gather personal information from dentists, doctors, nurses or pharmacists. For instance, prescription information may be collected and later sold to private interests, such as pharmaceutical companies or research organizations. The information is then analyzed for trends or patterns in the prescriptions. Data mining companies argue that since the information they collect is de-identified, there is no risk of compromising individuals’ sensitive information or revealing their identities.
Under the Health Insurance Portability and Accountability Act (HIPAA), there are two methods for de-identifying information. One option is to consult a qualified statistical expert. The expert applies statistical methods to assess the risk that the information could be used to identify an individual, whether alone or in combination with other information, and the data is treated until that risk is acceptably low.
The other method is to remove eighteen specified identifiers. The intent is that no directly identifying information remains in the data. These identifiers are as follows:
- Names
- Geographic categories smaller than a state
- Dates (except for the year) that are related to an individual. These may include dates of birth, death, admission, discharge, etc.
- Telephone numbers
- Fax numbers
- E-mail addresses
- Social security numbers (SSNs)
- Medical record number
- Health plan beneficiary number
- Account numbers
- Certificate or license numbers
- Vehicle identifiers (e.g. serial numbers, license plates)
- Device identifiers
- Web URLs
- IP addresses
- Biometric identifiers (e.g. fingerprints, voiceprints)
- Full-face photos
- Any other unique identifying number, code or characteristic
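As a rough illustration of the identifier-removal method, the sketch below drops a handful of direct identifiers from a record and reduces dates to the year. The field names (`name`, `zip`, `birth_date` and so on) and the helper `strip_identifiers` are hypothetical, and the sketch covers only part of the list above, so it should not be read as a complete HIPAA implementation.

```python
from datetime import date

# Hypothetical field names; a real data set would need its own mapping
# onto the identifier categories listed above.
DIRECT_IDENTIFIERS = {"name", "phone", "fax", "email", "ssn",
                      "medical_record_number", "ip_address", "zip"}
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date"}


def strip_identifiers(record):
    """Return a copy of record without direct identifiers, keeping only the year of dates."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop the field entirely
        if field in DATE_FIELDS and isinstance(value, date):
            # e.g. "birth_date" becomes "birth_year"
            cleaned[field.replace("_date", "_year")] = value.year
        else:
            cleaned[field] = value
    return cleaned


patient = {
    "name": "Jane Doe",
    "zip": "02139",
    "birth_date": date(1970, 7, 31),
    "sex": "F",
    "diagnosis": "asthma",
}
print(strip_identifiers(patient))
# {'birth_year': 1970, 'sex': 'F', 'diagnosis': 'asthma'}
```

Note that the remaining fields (year of birth, sex, diagnosis) may still narrow down who a record belongs to when combined with other data, which is the concern the rest of this article explores.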
The process of re-identification matches de-identified, or anonymized, personal information back to the individual. Re-identification brings to light some of the shortcomings of anonymization, since the goal of anonymization is to ensure that any personally identifying information is removed, without compromising the utility of the data.
Computer scientists have found that data, once de-identified, can often be re-identified with little effort. This raises a host of issues for organizations dealing with “anonymized” data. Currently, organizations do not have privacy obligations when working with anonymized data. However, if the data can easily be made personally identifiable again, critics argue that privacy safeguards should be put in place.
Originally, re-identification referred to using data held by a single entity. Recent research has looked at the concept of trail re-identification, which studies a trail of anonymous, homogeneous data left at a number of different locations. Looking at how the different data sets intersect can reveal personal, identifying information about the individual.
One example of trail re-identification involves online shoppers, who may visit a number of different websites before making a purchase. The IP addresses of their computers are recorded at each website they visit. Combining these visit logs with the websites’ customer lists may successfully identify individuals.
An identity thief may be able to reconstruct trails from data distributed over a number of locations. By pairing identified entities with their anonymized data, an adversary may be able to re-identify the individual through the process of trail matching. The success of this form of attack depends on how the data is collected and to what extent it is collected. For instance, it matters whether anonymous data is collected alongside identified data, and whether that data is complete or incomplete.
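The idea of trail matching can be sketched in a few lines. The example below is a simplified illustration under strong assumptions: every location releases both an identified list and a pseudonymous list, the data is complete, and a trail is simply the set of locations where an entity appears. The function names and the sample data are made up for the example.

```python
from collections import defaultdict


def build_trails(observations):
    """Map each entity to the set of locations where it was observed."""
    trails = defaultdict(set)
    for location, entity in observations:
        trails[entity].add(location)
    return trails


def match_trails(identified_obs, anonymous_obs):
    """Link a pseudonym to a name when exactly one named trail equals its trail."""
    named = build_trails(identified_obs)
    pseudonymous = build_trails(anonymous_obs)
    matches = {}
    for pseudonym, trail in pseudonymous.items():
        candidates = [name for name, t in named.items() if t == trail]
        if len(candidates) == 1:  # an ambiguous trail is left unmatched
            matches[pseudonym] = candidates[0]
    return matches


# Made-up data: (location, entity) pairs
identified = [("shop_a", "Alice"), ("shop_b", "Alice"),
              ("shop_b", "Bob"), ("shop_c", "Bob"),
              ("shop_a", "Carol")]
anonymous = [("shop_a", "u1"), ("shop_b", "u1"),
             ("shop_b", "u2"), ("shop_c", "u2"),
             ("shop_a", "u3")]
print(match_trails(identified, anonymous))
# {'u1': 'Alice', 'u2': 'Bob', 'u3': 'Carol'}
```

With incomplete data the trails would no longer match exactly, and an attacker would have to fall back on looser comparisons such as subset matching, which is why the completeness of the collected data matters.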
Examples of Re-identification
The following examples illustrate some failures of anonymization, in which individuals’ privacy could no longer be protected in light of the developments in re-identification processes. Although the harm done to the individuals was relatively limited, these examples point to how the process of re-identification can be used in more dangerous ways.
The study cited in the introduction of this article is Latanya Sweeney’s research, which found that 87% of people in the United States can be identified by combining their ZIP code, birth date and sex. Sweeney’s research found that coarser geographic information can re-identify people as well. For instance, 53% of US citizens can be identified by their city, birth date and sex, while 18% of citizens can be identified by their county, birth date and sex.
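The kind of measurement behind these percentages can be illustrated with a small sketch that counts how many records are unique for a given combination of attributes. The function `unique_fraction` and the four sample records are invented for illustration and have no connection to the census data used in the actual study.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose values for `keys` occur exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)


# Made-up sample records
people = [
    {"zip": "02139", "county": "Middlesex", "birth_date": "1970-07-31", "sex": "F"},
    {"zip": "02139", "county": "Middlesex", "birth_date": "1970-07-31", "sex": "M"},
    {"zip": "02140", "county": "Middlesex", "birth_date": "1982-01-15", "sex": "F"},
    {"zip": "02141", "county": "Middlesex", "birth_date": "1982-01-15", "sex": "F"},
]
print(unique_fraction(people, ["zip", "birth_date", "sex"]))     # 1.0
print(unique_fraction(people, ["county", "birth_date", "sex"]))  # 0.5
```

As in the study, the coarser the geographic attribute, the fewer records remain unique.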
In August of 2006, America Online (AOL) publicly posted 20 million search queries from 650,000 users of its search engine. These queries covered three months of activity. Before the data was released, AOL anonymized it by removing identifying information, such as the username and IP address. However, these identifiers were replaced with unique identification numbers, so that researchers could still make use of the data. Due to the nature of the personal information revealed, AOL was criticized for the move. Even though the data was anonymized before the release, within a relatively short time, journalists were able to trace user queries to specific individuals.
In October 2006, Netflix, an online movie rental service, publicly released 100 million records showing how its users had rated movies between December 1999 and December 2005. The records showed the movie, the rating the user had given and the date the user had rated the movie. While identifying usernames had been removed, each user was given a unique identifying number.
Two weeks after the data was released, researchers from the University of Texas found that it was relatively easy to re-identify the individuals in the Netflix database and find out all of the movies that the individual had rated. According to the research, using only six movie ratings, one could identify the individual 84% of the time. With the six movie ratings and the approximate date of the ratings, one could identify the individual 99% of the time.
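To see why a handful of ratings can be enough, consider the simplified sketch below. It uses exact matching on made-up data, which is a deliberate simplification and not the researchers’ actual algorithm; the function `candidates` and the sample records are hypothetical.

```python
def candidates(released, known_ratings):
    """Return pseudonymous user IDs whose ratings include every known (movie, rating) pair."""
    known = set(known_ratings)
    return [uid for uid, ratings in released.items()
            if known <= set(ratings.items())]


# Made-up "anonymized" release: user number -> {movie: rating}
released = {
    1001: {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    1002: {"Movie A": 5, "Movie B": 3, "Movie D": 1},
    1003: {"Movie A": 5, "Movie C": 4, "Movie E": 2},
}

# Each extra rating the attacker knows shrinks the candidate set.
print(candidates(released, [("Movie A", 5)]))
# [1001, 1002, 1003]
print(candidates(released, [("Movie A", 5), ("Movie C", 4)]))
# [1001, 1003]
print(candidates(released, [("Movie A", 5), ("Movie C", 4), ("Movie E", 2)]))
# [1003]
```

Once the candidate set contains a single user number, every other rating stored under that number is exposed as well.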
A 2009 study carried out by Alessandro Acquisti and Ralph Gross showed that SSNs can be predicted when combined with an individual’s date of birth and his/her geographic location. Using the publicly available records in the Social Security Administration’s Death Master File (DMF), Acquisti and Gross looked for trends in SSNs of reported deaths. The researchers found there was a strong correlation between an individual’s birth data and the digits in their SSN. This raises a serious concern for individuals whose birth dates are publicly known, whether through voter registration lists, social networking profiles or other sources.
A number of privacy models have been developed to address the issue of re-identification. Some methods include:
- Access control: This is the traditional model for safeguarding individuals’ privacy. It is also referred to as query restriction, which limits the data returned for a given request in a multi-level relational database.
- Statistical disclosure control: This method includes a wide variety of techniques, including suppression, noise addition and perturbation of the records in a collection. Statistical disclosure control aims to prevent the receiver of the data from inferring the identities of the individuals.
- Computational disclosure control: This model prevents the formation of direct connections from unidentified data to identifiable data. With computational disclosure control, records appear identical through generalization and suppression of attributes.
- Algorithms: This model has been promoted most by the data mining industry to preserve the privacy of individuals.
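The generalization and suppression described under computational disclosure control can be sketched as follows. This is a minimal illustration rather than any particular tool’s algorithm: the record fields, the `generalize` and `anonymize` helpers and the threshold `k` are all assumptions made for the example. The requirement that each released record share its generalized attributes with at least k - 1 other records is commonly called k-anonymity in the privacy literature.

```python
from collections import Counter


def generalize(record):
    """Coarsen the identifying attributes: 3-digit ZIP prefix and birth year only."""
    return {
        "zip": record["zip"][:3] + "**",
        "birth_year": record["birth_date"][:4],
        "sex": record["sex"],
        "diagnosis": record["diagnosis"],  # the data analysts want to keep
    }


def group_key(record):
    """Attributes an attacker could link to outside data."""
    return (record["zip"], record["birth_year"], record["sex"])


def anonymize(records, k):
    """Generalize records, then suppress any whose group has fewer than k members."""
    generalized = [generalize(r) for r in records]
    sizes = Counter(group_key(r) for r in generalized)
    return [r for r in generalized if sizes[group_key(r)] >= k]


# Made-up records
records = [
    {"zip": "02139", "birth_date": "1970-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_date": "1970-02-11", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth_date": "1982-01-15", "sex": "M", "diagnosis": "asthma"},
]
result = anonymize(records, k=2)
for row in result:
    print(row)
```

With `k=2`, the first two records become indistinguishable on ZIP prefix, birth year and sex, while the third has no look-alike and is suppressed. The trade-off is visible even in this toy case: stronger protection means coarser or fewer records for analysis.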
Legislation must be able to respond to the shortcomings of anonymization as well. Critics have pointed out that if the process of de-identification is ineffective, then so-called privacy-protecting laws may be severely eroded. Under HIPAA, the handling of health records that have been anonymized is not regulated, since such data is supposedly already protected. Given the many examples of unintended re-identification, critics have argued that de-identification alone is an insufficient privacy safeguard.
This article examines the reasons for and methods of de-identification, in which personal data is anonymized. It then looks at developments in re-identification of data, which links the personal information back to the individual. Several examples of re-identification of data are given, which raise the issue that anonymization is insufficient for safeguarding individuals’ privacy. Finally, the article presents privacy protection methods for addressing the issue of re-identification.
In preparation for the Certified Information Privacy Professional/Information Technology exam, a privacy professional should be comfortable with topics related to this post, including:
- Data de-identification and minimization (III.B.a.)
- Degrees of identification (III.B.a.iv.)
- De-identification and re-identification (III.B.iv.2.)