

This document describes how the California Policy Lab (CPL) links individual-level administrative data from multiple agencies without receiving personally identifiable information (PII). Agencies first encrypt PII using a hashing technique before transferring data, and CPL then links the data using the hashed identifiers. The document details the hashing process, including the use of a salt and the SHA-256 algorithm. CPL has used this approach to address policy issues related to tax and social services, student financial aid, and college enrollment.
Typology: Lecture notes
Policy-relevant research often requires a combination of administrative datasets from different agencies, especially when research questions concern more than one service or program. The preferred and most accurate method is linking at the individual level using PII. However, a major consideration in linking individual-level administrative data is the need to protect individual privacy and comply with privacy constraints. Statutes may also restrict the types of data that agencies can share.

The California Policy Lab (CPL) has developed a process to link individual-level data from multiple agencies without receiving PII. Rather than sharing PII, such as full names and Social Security Numbers (SSNs), agencies first encrypt the PII using a hashing technique before transferring any data. Once the data is transferred, CPL links it using the hashed identifiers.

To date, CPL has used this approach to help agencies address vexing policy issues, including:

● A project with the California Franchise Tax Board (FTB) and the California Department of Social Services (CDSS) linking tax and social services data to understand whether Californians on food stamps are claiming important anti-poverty tax credits.
● A project with CDSS, the California Student Aid Commission, the California Community Colleges, and the University of California Office of the President, linking student, financial aid, and social services data to understand the extent to which college students are eligible for food stamps but are missing out on receiving these benefits.
Hashing is a method of turning readable data into unique but unreadable data. It can be applied to PII to protect the values in a way that still allows them to be used to link datasets. When agencies hash data, they apply a one-way encryption that turns the PII into indecipherable strings of letters and numbers. The original datasets at each partner agency contain both PII (for example, names, addresses, and SSNs) and other analytic data that does not include PII (for example, variables about the individuals such as college enrollment details and financial aid received). Once the hashing process is complete, CPL receives the hashed PII as well as the un-hashed analytic data.

Phase 1. The hashing process

A hash function performs a one-way transformation that converts a readable piece of data of any length into a different, unreadable piece of data of a fixed length. Hashing can turn PII (such as the name “John” or an SSN) into an indecipherable but uniform string of characters. For example, using the SHA-256 hashing algorithm, “John” hashed becomes “a8cfcd74832004951b4408cdb0a5dbcd8c7e52d43f7fe244bf720582e05241da”. If even one character in the input string changes, the hash is different: changing even a single digit in an SSN changes the hash completely.
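The fixed-length and one-character-change properties described above can be illustrated with a short Python sketch using the standard `hashlib` module. The helper name `sha256_hex`, the UTF-8 encoding, and the sample SSNs are assumptions made for illustration; this is not CPL's code.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters.

    Assumption: the string is encoded as UTF-8 with no trailing newline.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Inputs of different lengths all produce digests of the same fixed length.
for value in ["John", "123-45-6789", "a much longer piece of readable text"]:
    digest = sha256_hex(value)
    print(f"{value!r} -> {digest} ({len(digest)} hex characters)")

# Two made-up SSNs that differ only in the last digit.
ssn_a = "123-45-6789"
ssn_b = "123-45-6788"
print(sha256_hex(ssn_a))
print(sha256_hex(ssn_b))
```

Because the same input always yields the same digest, two agencies that hash the same value in the same way obtain the same string, which is what makes linking on hashed values possible.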
To prevent re-identification or deciphering of the hash, a ‘salt’ is appended to the original PII as an additional input to the hashing function to obfuscate the original value. A salt is a string of characters chosen by the agencies that own the data. It is agreed upon among the agencies whose data will be linked and is never shared with CPL. So, if the original PII were an SSN and the agreed-upon salt were “I love apple pie”, the obfuscated value would be the hash of “123-45-6789I love apple pie”. The hash of the SSN + salt would be completely different from the hash of the SSN alone. This means that the hash cannot be reverse-engineered, because CPL never knows the salt and therefore cannot work out the meaning of the hashed data.

CPL uses the cryptographic hash algorithm SHA-256, specified in the Federal Information Processing Standards publication FIPS 180-4, Secure Hash Standard. SHA-256 is the industry standard and is available in most programming languages. Agencies will not need to perform the hashing process on their own, because CPL provides support. CPL has written easy-to-follow code that conducts the hashing process, and can tweak or rewrite the code in the agency’s preferred programming language, if necessary.

What data will be hashed?

● Full identifier strings of PII (e.g., names, SSN, DOB)
● Substrings of PII (e.g., the first three letters or digits) and phonetic equivalents (e.g., John and Jon). This allows CPL to conduct the probabilistic matches described below.

Phase 2. Linking the data

Once PII variables have been hashed, agencies send the dataset to CPL to conduct the linkage process. In this step, CPL compares the hashed PII from each of the data contributors and identifies matches. As mentioned above, CPL compares not only hashes of the full PII strings but also hashes of substrings of PII.
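The salting and comparison steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CPL's actual code: the field names, the upper-casing of names before hashing, the helper functions (`salted_hash`, `hash_record`, `link_on`), and all sample records are hypothetical. Phonetic equivalents are omitted.

```python
import hashlib

# Hypothetical salt taken from the example in the text. In the process
# described above, the agencies choose it and never share it with CPL.
SALT = "I love apple pie"
PII_FIELDS = {"ssn", "first_name"}  # assumed field names


def salted_hash(value: str, salt: str = SALT) -> str:
    """SHA-256 of the value with the salt appended, as hex."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


def hash_record(record: dict) -> dict:
    """Agency side: replace PII with salted hashes, keep analytic fields."""
    # Normalizing case and whitespace is an assumption, made so that
    # both agencies hash identical strings.
    first = record["first_name"].strip().upper()
    hashed = {
        "ssn_hash": salted_hash(record["ssn"]),
        "first_name_hash": salted_hash(first),
        "first3_hash": salted_hash(first[:3]),  # substring of PII
    }
    analytic = {k: v for k, v in record.items() if k not in PII_FIELDS}
    return {**hashed, **analytic}


def link_on(key: str, left: list, right: list) -> list:
    """Linking side: pair up rows whose hashed value for `key` is equal."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[key], [])]


# Made-up records held by two agencies.
agency_a = [
    {"ssn": "123-45-6789", "first_name": "Jon", "enrolled": True},
    {"ssn": "987-65-4321", "first_name": "Maria", "enrolled": False},
]
agency_b = [
    {"ssn": "123-45-6789", "first_name": "Jonathan", "benefits": True},
    {"ssn": "555-55-5555", "first_name": "Ana", "benefits": False},
]

# Each agency hashes its own data before anything is transferred.
hashed_a = [hash_record(r) for r in agency_a]
hashed_b = [hash_record(r) for r in agency_b]

# The linking party sees only hashes and analytic fields.
ssn_links = link_on("ssn_hash", hashed_a, hashed_b)
full_name_links = link_on("first_name_hash", hashed_a, hashed_b)
prefix_links = link_on("first3_hash", hashed_a, hashed_b)
print(len(ssn_links), len(full_name_links), len(prefix_links))
```

With these made-up records, the SSN hash and the first-three-letters hash each link the first pair of rows, while the full first-name hash does not, because “Jon” and “Jonathan” hash to different values.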
This will make it possible to identify both exact matches between datasets and non-exact matches that allow for discrepancies in information between datasets (e.g., an individual named “Jon” in one dataset and “Jonathan” in the other). This allows CPL to identify cases where, for example, the last names match exactly, the first names are spelled differently in the two datasets but are phonetically