Why You Should Encrypt ALL Personally Identifiable Information (PII)

Bear Giles, August 25th, 2015 (Last Updated: August 24th, 2015)

Many critics have pointed out that Ashley Madison should have encrypted all personally identifiable information (PII). The database contained sensitive information that would cause harm to users if it was released.

We are probably not involved in dating websites based on infidelity, at least not as developers. But the nature of our business doesn’t matter after a breach. If financial information is involved you will need to notify users, offer credit monitoring, and face exposure to lawsuits, increased costs for PCI-DSS compliance inspections, etc. You need to consult a lawyer, but I think many of these costs can be mitigated if all PII is encrypted.

Encryption may also be required by law or industry standards. I have worked at several sites, including a federal contractor, where all uniquely identifiable information (UII, defined below) had to be encrypted. At least one site required all PII to be encrypted as well. It’s not the norm, yet, but it’s not unheard of either.

Bottom line: leaked information has a very real financial impact, preventing a leak isn’t entirely within your business’s control (e.g., the Target breach was due to a subcontractor being sloppy with its access credentials), and relatively inexpensive countermeasures can reduce this exposure. It will be very uncomfortable if something goes wrong and a former user’s attorney asks you why you didn’t take all reasonable steps to protect her client.

What is personally identifiable information (PII)?

There are two closely related concepts, personally identifiable information (PII) and uniquely identifiable information (UII). Personally identifiable information (PII) is enough to give someone a good idea who the individual is but not with 100% certainty. Uniquely identifiable information (UII) gives you nearly 100% certainty.

It’s important to note that a UII is like being pregnant. You can’t be a little pregnant – it’s all or nothing. Your UII might be nothing but a burner email address. That offers no protection if the user used the same burner email address at a different site where additional information is available, or if the user self-disclosed that information in the site’s content. Once the person has been identified on the second site the attacker can come back to your site and use that burner email address to tie information back to the individual. You can’t prevent that, so you should assume the UII is already compromised elsewhere.

Some things are intrinsically UII:

social security number
email address
license number (e.g., driver’s license, professional license)
account number

Other things are considered safe in isolation, but two or more of them combined become PII and, with enough of them, UII:

first name
last name
address (risky)
zip code
phone number area code
full phone number (risky)
birth year (or age)
date of birth
IP address (with timestamp)
name of employer
car make and model
and so on

A good conceptual model for the difference between PII and UII is that you can fit all of the PII matches onto a piece of paper the size of a postcard. A UII allows you to actually address a postcard. Another model could be that PII allows you to approach individuals and ask “why should I believe you are this person?” and expect them to make a solid case. UII allows you to approach an individual and ask “why should I believe you are not this person?” and expect them to be unable to make a solid case.

Obviously for some people (e.g., “John Smith”) you will need many pieces of information. For other people a last name and zip code may be sufficient, or full date of birth and zip code.
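To make the "combine enough fields and the match becomes unique" point concrete, here is a minimal sketch. The records are made up for illustration and the helper name unique_fraction is hypothetical; it only counts how many rows are singled out by a given combination of fields.

```python
from collections import Counter

# Made-up records, for illustration only.
records = [
    {"last_name": "Smith", "zip": "80301", "birth_year": 1970},
    {"last_name": "Smith", "zip": "80301", "birth_year": 1982},
    {"last_name": "Smith", "zip": "80302", "birth_year": 1970},
    {"last_name": "Zelazny", "zip": "80301", "birth_year": 1970},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


by_name = unique_fraction(records, ["last_name"])
by_name_zip = unique_fraction(records, ["last_name", "zip"])
by_all = unique_fraction(records, ["last_name", "zip", "birth_year"])
```

In this toy data the unusual last name is unique on its own, while the Smiths are only separated once zip code and birth year are added.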

The usual warnings about false matches still apply. Your “unique match” could still be two men with the same full name, the same address, and the same birth date (month and day) if they are father and son who coincidentally share the same birthday. The son could be a minor, a college student using his parents’ address to receive mail, or a boomerang kid.

A more unusual case happened in Australia a few years back. It seemed to be a clear case of fraud – a woman was getting student aid at two schools while working in a third city. It was actually three unrelated women who, by pure chance, had the same full name and date of birth.

These are still considered UII matches since this happens so rarely – literally less than one-in-a-million occurrences.

Sidenote: given someone’s full birthdate and the city where they were born you can often make a good guess at their social security number (SSN) if they’re under 35 or so. The IRS now requires an SSN for anyone claimed as a dependent so most people get them for their children immediately after birth. SSNs aren’t issued sequentially at the national level but they do (or did, until the Social Security Administration switched to randomized assignment in 2011) follow a predictable pattern if you know the state where they are issued and the person’s last name. The average person won’t know a good starting SSN for a search but an attacker might. This should drive home the fact that seemingly safe information can be combined to reveal much more sensitive information.

Objection: I use the information constantly

A common objection to encrypting PII is that the data is used constantly, so repeatedly decrypting it would be an unreasonable burden.

This is an objection that rarely stands up to careful examination. Take Amazon as a theoretical example. It’s heavily used by many tens or even hundreds of millions of users. Surely PII encryption would be too costly.

Except… when do they actually need my PII? They use my first name for personalization but they only use my full PII in two places:

When I pay for an order (billing address) and
When I ship an order (mailing address)

And that’s it. I’m sure it comes up elsewhere but the vast bulk of usage is related to either billing or mailing. Everything else can use an internal ID that reveals nothing about me.

Objection: my application still needs unencrypted PII data

That might have been true in the ’00s. But as businesses “move into the cloud” they’ll often switch to a microservices architecture where monolithic applications are broken down into tiny pieces that do just one thing. Even traditionally hosted applications may break down a monolithic application into separate components in order to spread the work across multiple servers.

What are two obvious candidates for a microservice? Billing (finances) and shipping (fulfillment).

Those two services need to have access to the unencrypted PII but they don’t have to share it with the overall application. They don’t even need to share their database with the overall application – this is a good place to put the service on a dedicated system behind your firewall.
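A rough sketch of that boundary, with everything in one process for brevity: the main application only ever holds internal IDs, while a stand-in for the shipping service keeps the addresses in its own store. The class names, fields, and the in-memory dictionary are all hypothetical, and a real service would keep the addresses encrypted at rest in its own database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """What the main application stores: internal IDs only, no PII."""
    order_id: int
    customer_id: int
    sku: str


class ShippingService:
    """Stand-in for a separate fulfillment service with its own datastore."""

    def __init__(self):
        self._addresses = {}      # customer_id -> mailing address
        self.printed_labels = []  # stands in for the label printer or carrier API

    def register_address(self, customer_id, address):
        self._addresses[customer_id] = address

    def ship(self, order):
        # The address is used inside the service and never returned to the caller.
        address = self._addresses[order.customer_id]
        self.printed_labels.append(f"{order.sku} -> {address}")
        return {"order_id": order.order_id, "status": "shipped"}
```

The point of the design is that ship() answers with a status, not with the address, so the calling application never needs to see the decrypted PII.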

Objection: my application identifies users with their email address and you said that is UII

This is a valid concern but easily handled. Do not use the email address itself for authentication – use the hash of the email address. If you’re worried about hash collisions you can use the hash to pull up the encrypted PII, decrypt it, and verify that the stored email address matches the one supplied.
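A minimal sketch of the lookup, under a few assumptions that are mine rather than the article's: the hash is a keyed HMAC-SHA256 (so an attacker with only the database cannot simply hash a list of known addresses), the address is normalized by trimming and lowercasing, and a dictionary stands in for the users table. The key and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the application, not stored alongside the data.
LOOKUP_KEY = b"replace-with-a-real-secret"

# Stand-in for the users table: lookup hash -> record holding encrypted PII.
users = {}


def email_lookup_hash(email, key=LOOKUP_KEY):
    """Keyed hash of the normalized email address, as a hex string."""
    normalized = email.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def register(email, encrypted_pii):
    users[email_lookup_hash(email)] = {"encrypted_pii": encrypted_pii}


def find_user(email):
    """Return the stored record, or None if no account uses this address."""
    return users.get(email_lookup_hash(email))
```

The final step described above, decrypting the record and comparing the stored address with the supplied one, is left out because it depends on whichever encryption scheme is in use.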

Objection: my application needs to be able to search for users

Again we can handle this with careful hashing. Store the hash of the search criteria, e.g., last name, city, state, etc. You can now hash the search criteria, perform the search using the hashed values, and then decrypt the PII for final comparisons.
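The hash-then-verify search can be sketched as a small in-memory index. The HashedIndex class, the key, and the field names are hypothetical, and the keyed hash is my assumption; the search returns candidate internal IDs, which the caller would then decrypt and compare.

```python
import hashlib
import hmac
from collections import defaultdict

INDEX_KEY = b"hypothetical-index-key"


def field_hash(field, value):
    """Keyed hash of one normalized search field."""
    message = f"{field}:{value.strip().lower()}".encode("utf-8")
    return hmac.new(INDEX_KEY, message, hashlib.sha256).hexdigest()


class HashedIndex:
    def __init__(self):
        self._index = defaultdict(set)  # field hash -> set of internal user IDs

    def add(self, user_id, **fields):
        for field, value in fields.items():
            self._index[field_hash(field, value)].add(user_id)

    def search(self, **criteria):
        """IDs matching every criterion: candidates to decrypt and verify."""
        matches = [self._index.get(field_hash(f, v), set())
                   for f, v in criteria.items()]
        return set.intersection(*matches) if matches else set()
```

Including the field name in the hashed message keeps a city called "Jackson" from colliding with a last name "Jackson".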

We can make this more flexible by using the soundex index of our search criteria. This will create larger buckets but we’ll be more likely to catch ‘near misses’.
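For reference, a sketch of the standard American Soundex code, which would be computed first and then hashed in place of the raw value. This is my own compact implementation of the usual rules (keep the first letter, code the remaining consonants, collapse adjacent equal codes, let H and W pass through, pad to four characters); it ignores any character outside A-Z.

```python
# Letter groups and their Soundex digits; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result.append(code)
        previous = code  # a vowel resets this, so a repeat after a vowel is coded again
    return "".join(result)[:4].ljust(4, "0")
```

With this function "Smith" and "Smyth" both map to S530, so a search for either spelling lands in the same bucket before the decrypted values are compared.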

Objection: won’t all of these hashes be subject to a salami attack?

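Short answer: yes. What is described here is really a frequency analysis attack: identical values produce identical hashes, so the distribution of the hashes mirrors the distribution of the data. You might not know whether the most frequent hash on last name is “Smith” or “Johnson” but you know it’s more likely to be that than “Roberts”. Likewise the most frequent hashed states are more likely to be California or Texas than Wyoming or Idaho.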
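A small demonstration of the problem, using a made-up column of last names and a plain unsalted SHA-256: the attacker counts the hashes, picks the most frequent one, and confirms a guess by hashing a short list of common names.

```python
import hashlib
from collections import Counter


def plain_hash(value):
    """Unsalted hash: identical inputs always give identical outputs."""
    return hashlib.sha256(value.lower().encode("utf-8")).hexdigest()


# Made-up column of last names, for illustration only.
last_names = ["Smith"] * 5 + ["Johnson"] * 3 + ["Roberts"]
hashed_column = [plain_hash(name) for name in last_names]

# The attacker sees only hashed_column, but frequencies survive hashing.
top_hash, top_count = Counter(hashed_column).most_common(1)[0]

# Confirm the guess against a short list of common names.
common_names = ["Johnson", "Williams", "Smith"]
guess = next((n for n in common_names if plain_hash(n) == top_hash), None)
```

Here guess comes back as "Smith" even though the attacker never saw a plaintext name in the data.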

There’s an easy workaround – salted hashes. We can start with a small number of salts that are partitioned by the data. (E.g., names from A-F use salt 12). Statistics are periodically collected and if a single salt is notably more frequently used than others then either the bin can be split (names A-C use salt 20, D-F use salt 21) or additional salts can be used on the existing bin. There isn’t a large performance hit as long as the number of possible salts is reasonable.
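The partitioned-salt idea might look something like the sketch below. The bin boundaries and salt IDs are invented for illustration, the function names are hypothetical, and real salts would be long random values rather than small integers. A bin with more than one salt spreads a frequent value across several hashes, at the cost of one extra hash per salt at search time.

```python
import hashlib
import secrets

# Hypothetical partition table: first-letter range -> salt IDs used for that bin.
SALT_BINS = {
    ("A", "F"): [12],
    ("G", "R"): [13],
    ("S", "Z"): [14, 15],  # a busy bin that has been given a second salt
}


def salts_for(value):
    """Salts that may have been used for this value, based on its first letter."""
    first = value.strip().upper()[:1]
    for (low, high), salts in SALT_BINS.items():
        if low <= first <= high:
            return salts
    return [0]  # fallback bin for values that do not start with A-Z


def salted_hash(value, salt):
    data = f"{salt}:{value.strip().lower()}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def hash_for_storage(value):
    """Pick one of the bin's salts at random when writing a record."""
    return salted_hash(value, secrets.choice(salts_for(value)))


def hashes_for_search(value):
    """Every hash the value could be stored under; match against any of them."""
    return [salted_hash(value, salt) for salt in salts_for(value)]
```

Splitting a bin, as described above, would mean replacing one entry in the table with two narrower ranges and rehashing the affected records.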