Linking Administrative Datasets using Hashing: A Privacy-Preserving Approach

Linking datasets without revealing personally identifiable information (PII)

Policy-relevant research often requires combining administrative datasets from different agencies, especially when research questions concern more than one service or program. The preferred and most accurate method is linking at the individual level using PII. However, a major consideration in linking individual-level administrative data is the need to protect individual privacy and comply with privacy constraints. Statutes may also restrict the types of data that agencies can share.

The California Policy Lab (CPL) has developed a process to link individual-level data from multiple agencies without receiving PII. Rather than sharing PII, such as full names and Social Security Numbers (SSNs), agencies first obscure the PII using a one-way hashing technique before transferring any data. Once the data is transferred, CPL links it using the hashed identifiers.

To date, CPL has used this approach to help agencies address vexing policy issues, including:

● A project with the California Franchise Tax Board (FTB) and the California Department of Social Services (CDSS) linking tax and social services data to understand whether Californians on food stamps are claiming important anti-poverty tax credits.

● A project with CDSS, the California Student Aid Commission, the California Community Colleges, and the University of California Office of the President, linking student, financial aid, and social services data to understand the extent to which college students are eligible for food stamps but are missing out on these benefits.

What is hashing and how is it applied to PII?

Hashing is a method of turning readable data into unreadable data that is, for practical purposes, unique to its input. Applied to PII, it protects the values while still allowing them to be used to link datasets. When agencies hash data, they apply a one-way transformation that turns the PII into indecipherable strings of letters and numbers.

The original datasets at each partner agency contain both PII (for example, names, addresses, and SSNs) and analytic data that does not include PII (for example, variables about the individuals such as college enrollment details and financial aid received). Once the hashing process is complete, CPL receives the hashed PII as well as the un-hashed analytic data.

Phase 1. The hashing process

A hash function performs a one-way transformation that converts a readable piece of data of any length into a different, unreadable piece of data of a fixed length. Hashing can turn PII (such as the name “John” or an SSN) into an indecipherable but uniform string of characters. For example, using the SHA-256 hashing algorithm, “John” hashed becomes “a8cfcd74832004951b4408cdb0a5dbcd8c7e52d43f7fe244bf720582e05241da”. If even one character in the input string changes, the hash is different; changing a single digit in an SSN, for example, changes the hash completely.
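The sensitivity of the hash to a single changed digit can be illustrated with a minimal sketch using Python's standard `hashlib` module. The helper name `sha256_hex` and both SSNs are made up for illustration, and the choice of UTF-8 encoding is an assumption. This is not CPL's own code.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two made-up SSNs that differ only in the final digit.
ssn_a = "123-45-6789"
ssn_b = "123-45-6780"

hash_a = sha256_hex(ssn_a)
hash_b = sha256_hex(ssn_b)

print(hash_a)
print(hash_b)
print(hash_a == hash_b)                       # False
print(len(hash_a), len(sha256_hex("John")))   # 64 64
```

Both digests have the same fixed length regardless of input length, and the two SSN hashes share no obvious relationship even though the inputs differ by one character.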


To prevent re-identification or deciphering of the hash, a ‘salt’ is appended to the original PII as an additional input to the hashing function to obfuscate the original value. A salt is a string of characters chosen by the agencies that own the data; it is agreed upon among the agencies whose data will be linked and is never shared with CPL. So, if the original PII were the SSN “123-45-6789” and the agreed-upon salt were “I love apple pie”, the obfuscated value would be the hash of “123-45-6789I love apple pie”. The hash of the SSN plus salt is completely different from the hash of the SSN alone. Because CPL never knows the salt, it cannot reverse-engineer the hashes by hashing candidate values and comparing the results, and therefore cannot work out the meaning of the hashed data.

CPL uses the cryptographic hash algorithm SHA-256, specified in Federal Information Processing Standards publication FIPS 180-4, Secure Hash Standard. SHA-256 is an industry standard and is available in most programming languages.

Agencies will not need to perform the hashing process on their own -- CPL provides support. CPL has written easy-to-follow code that conducts the hashing process, and can tweak or rewrite the code in the agency’s preferred programming language, if necessary.

What data will be hashed?

● Full identifier strings of PII (e.g., names, SSN, DOB)

● Substrings of PII (e.g., the first three letters or digits) and phonetic equivalents (e.g., John and Jon). This allows CPL to conduct the probabilistic matches described below.

Phase 2. Linking the data

Once PII variables have been hashed, agencies send the dataset to CPL to conduct the linkage process. In this step, CPL compares the hashed PII from each of the data contributors and identifies matches. As mentioned above, CPL compares not only hashes of the full PII strings but also hashes of substrings of PII.
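The salting, substring hashing, and comparison steps described above can be sketched together in Python. This is a rough illustration under stated assumptions, not CPL's code: the function names (`salted_hash`, `hash_record`, `link`), the record layout, the lowercasing of names before hashing, and the toy records are all hypothetical. The salt is the example value from the text. Phonetic equivalents are left out to keep the sketch short.

```python
import hashlib

# Example salt from the text. In practice the agencies agree on it among
# themselves and never send it to the party doing the linkage.
SALT = "I love apple pie"


def salted_hash(value: str, salt: str = SALT) -> str:
    """Append the salt to the value and return the SHA-256 hex digest."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


def hash_record(record: dict) -> dict:
    """Replace PII with salted hashes; pass analytic data through unchanged."""
    first = record["first_name"].lower()  # lowercasing is an assumed normalization
    return {
        "ssn_hash": salted_hash(record["ssn"]),
        "last_name_hash": salted_hash(record["last_name"].lower()),
        "first_name_hash": salted_hash(first),
        "first3_hash": salted_hash(first[:3]),  # substring: first three letters
        "analytic": record["analytic"],
    }


def link(left: list, right: list, keys: tuple) -> list:
    """Pair up rows whose hashed values agree on every key in `keys`."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [(l, r) for l in left for r in index.get(tuple(l[k] for k in keys), [])]


# Toy records held by two hypothetical agencies (all values are made up).
agency_a = [
    {"ssn": "123-45-6789", "first_name": "Jon", "last_name": "Smith",
     "analytic": {"enrolled": True}},
]
agency_b = [
    {"ssn": "123-45-6789", "first_name": "Jonathan", "last_name": "Smith",
     "analytic": {"aid_received": 1200}},
    {"ssn": "987-65-4321", "first_name": "Maria", "last_name": "Lopez",
     "analytic": {"aid_received": 0}},
]

# Each agency hashes its own data before anything is transferred.
hashed_a = [hash_record(r) for r in agency_a]
hashed_b = [hash_record(r) for r in agency_b]

# The linking party sees only hashes and analytic data.
exact = link(hashed_a, hashed_b, ("ssn_hash",))
partial = link(hashed_a, hashed_b, ("last_name_hash", "first3_hash"))

print(len(exact))    # 1
print(len(partial))  # 1
print(hashed_a[0]["first_name_hash"] == hashed_b[0]["first_name_hash"])  # False
```

In this toy data the full first names “Jon” and “Jonathan” hash to different values, but the hashes of their first three letters agree, so the pair is still found when linking on the last-name hash plus the substring hash.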
Comparing hashed substrings as well as full strings makes it possible to identify both exact matches between datasets and non-exact matches that allow for discrepancies in information between datasets (e.g., an individual named “Jon” in one dataset and “Jonathan” in the other). This allows CPL to identify cases where, for example, the last names match exactly, the first names are spelled differently in the two datasets but are phonetically
