If you are building a service that stores sensitive data, your number one concern should be how to protect it. What IS sensitive data? There are some obvious examples, like medical data or bank account data. But would you consider a dating site database as sensitive data? Based on a recent leaks of a big dating site I’d say yes. Is a cloud turn-by-turn nagivation database sensitive? Most likely, as users journeys are stored there. Facebook messages, emails, etc – all of that can and should be considered sensitive. And therefore must be highly protected. If you’re not sure if the data you store is sensitive, assume it is, just in case. Or a subsequent breach can bring your business down easily.
Now, protecting data is no trivial feat. And certainly cannot be covered in a single blog post. I’ll start with outlining a few good practices:
- Don’t dump your production data anywhere else. If you want a “replica” for testing purposes, obfuscate the data – replace the real values with fakes ones.
- Make sure access to your servers is properly restricted. This includes using a “bastion” host, proper access control settings for your administrators, key-based SSH access.
- Encrypt your backups – if your system is “perfectly” secured, but your backups lie around unencrypted, they would be the weak spot. The decryption key should be as protected as possible (will discuss it below)
- Encrypt your storage – especially if using a cloud provider, assume you can’t trust it. AWS, for example, offers EBS encryption, which is quite good. There are other approaches as well, e.g. using LUKS with keys stored within your organization’s infrastructure. This and the previous point are about “Data at rest” encryption.
- Monitoring all access and auditing operations – there shouldn’t be an unaudited command issued on production.
- In some cases, you may even want to use split keys for logging into a production machine – meaning two administrators have to come together in order to gain access.
- Always be up-to-date with software packages and libraries (well, maybe wait a few days/weeks to make sure no new obvious vulnerability has been introduced)
- Encrypt internal communication between servers – the fact that your data is encrypted “at rest”, may not matter, if it’s in plain text “in transit”.
- in rare cases, when only the user has to be able to see their data and it’s very confidential, you may encrypt it with a key based (in part) on their password. The password alone doe not make a good encryption key, but there are key-derivation functions (e.g. PBKDF2) that are created to turn low-entropy passwords into fair keys. The key can be combined with another part, stored on the server side. Thus only the user can decrypt their content, as their password is not stored anywhere in plain text and can’t be accessed even in case of a breach.
You see there’s a lot of encryption happening, but with encryption there’s one key question – who holds the decryption key. If the key is stored in a configuration file on one of your servers, the attacker that has gained access to your infrastructure, will find that key as well, get it, get the whole db, and then happily wait for it to be fully decrypted on his own machines.
To store a key securely, it has to be on a tamper-proof storage. For example, an HSM (Hardware Security Module). If you don’t have HSMs, Amazon offers it as part of AWS. It also offers key management as a service, but the particular provider is not important – the concept is important. The you need to have a securely stored key on a device that doesn’t let the key out under no circumstances, even a breach (HSM vendors claim that’s the case).
Now, how to use these keys depends on the particular case. Normally, you wouldn’t use the HSM itself to decrypt data, but rather to decrypt the decryption key, which in turn is used to decrypt the data. If all the sensitive data in your database is encrypted, even if the attacker gains SSH access and thus gains access to the database (because your application needs unencrypted data to work with; homomorphic encryption is not yet here), he’ll have to get hold of the in-memory decryption key. And if you’re using envelope encryption, it will be even harder for an attacker to just dump your data and walk away.
Note that the ecnryption and decryption here are at the application level – so the encrypted data can be not simply “per storage” or “per database”, but also per column – usernames don’t have to be kept so secret, but the associated personal data (in the next 3 database columns) should. So you can plug the encryption mechanism in your pre-persist (and decryption – in post-load) hooks. If speed is an issue, i.e. you don’t want to do the decryption in real-time, you may have a (distributed) cache of decrypted data that you can refresh with a background job.
But if your application has to know the data, an attacker that gains full control for an unlimited amount of time, will also have the full data eventually. No amount of enveloping and layers of encryption can stop that, it can only make it harder and slower to obtain the dump (even if the master key, stored on HSM, is not extracted, the attacker will have an interface to that key to use it for decrypting the data). That’s why intrusion detection is key. All of the above steps combined with an early notification of intrusion can mean your data is protected.
As we are all well aware, there is never a 100% secure system. Our job is to make it nearly impossible for bulk data extraction. And that includes proper key management, proper system-level and application-level handling of encryption and proper monitoring and intrusion detection.