Data classification: start with the end in mind

My response to a HARO question: To be read by IT managers at SMBs: tips for setting up a data classification system that assigns levels of sensitivity to data. What do they need to know to get started and what tools will they need to invest in? What are the costs involved and what are the benefits and pitfalls?

The key to data classification is to start with the end in mind. I have seen organizations execute multi-year projects with hundreds of consultants and thousands of workshop hours to classify all their data, only to find it a complete waste of time. The whole exercise was a tickbox exercise to meet a regulatory or audit requirement, and success was measured as having classified the data. Even when such a project managed to complete, the classifications were stale almost as soon as the project finished. The processes put in place to sustain the effort, such as requiring every user to classify the data they create or an annual process for every business unit to inventory its data and update the classifications, were simply not effective.

Why? Because people do not want to classify data. As much as security people would like end users to classify everything as soon as they create it, and to think about the classification before they email something, it just does not happen. End users focus on their day jobs (and farming on Facebook), not on classifying data.

Instead, these organizations should have viewed classification as a means to an end: a starting point for a risk-based approach to protecting the data that matters most to the organization, so that scarce security resources are spent in the most effective manner and productivity and usability are not hampered unnecessarily. The classification process and associated security controls need to be automated and transparent to the end user if you want them to be sustainable and truly effective.

The whole point of classifying data is to closely align security controls with the potential impact to the organization if the data is compromised. Typically you will do one or more of the following to protect your most sensitive data:

  • Encryption - in transit and/or storage
  • Reduce data loss - stop it from leaving the organization when it should not
  • Stronger authentication - make it harder for the wrong people to get access through the front door
  • Improved network controls - protect the back door

So rather than run a data classification project in isolation, or even treat classification as a major phase to be completed first: identify, protect and iterate. Automate, so that future data of the same type is classified and protected without requiring human intervention.

Firstly, you do need a classification scheme, but this is not as important as you would think. There is very little point in having more than three classification levels:

  • Public - self-evident, data that has no impact if disclosed
  • Confidential - your most important data, which will have a major financial, reputational, legal or operational impact if it is disclosed
  • Internal - everything else in the middle, data that you do not want the general public to access but that will not hurt you as much as Confidential data if it is disclosed
Remember the end goal: you will get the most benefit at the line between Confidential and Internal. Unless you are starting a complete greenfield, your public data is obvious and will already be in places like your website front page and your marketing materials. Some of your Confidential data is not being sufficiently protected, exposing you to that impact, and some of your Internal data may be overprotected, providing an opportunity to improve performance or usability or to reduce costs.

Confidential data without sufficient controls

Being a security person I will of course start here, but you may want to skip ahead to Internal data with too many controls.

Some will say there is no point implementing encryption or data loss prevention unless you have first classified your data. I take almost the opposite view: the best time to identify your most sensitive data is when you are in a position to protect it.

The approach I take in Simple security risk assessment (SSRA) is that almost all companies will have data that fits into one of these categories, and within each there will be data that is extremely sensitive:
  • Personally Identifiable Information (PII) - health and medical information, criminal records, disability, religion, sexual orientation, government identifiers (e.g. SSN, NI), employee disputes, disciplinary matters, investigations, industrial relations, redundancy plans
  • Customer information - customer lists, sales strategies, unreleased products and promotions, projections
  • Financial information - unreleased results, forecasts, guidance
  • Corporate, IP, legal - legal cases, forensics, investigations, prospective acquisitions, mergers, due diligence, major risk assessments, regulatory issues and correspondence, IP, trade secrets, unapproved patents and trademarks, private keys and passwords to access other Confidential information, unreleased restructure and redundancy plans, board papers, senior/executive management correspondence
  • Transaction information - high value, low volume, special, exception transactions
  • Card data - card number, CVV2, expiry, magnetic strip, Track 2 info

If you want to think beyond confidentiality:
  • Data requiring very high integrity - where there will be an immediate impact from loss of integrity, e.g. automated decisions made on the data
  • Data requiring very high availability - where there will be an immediate impact from loss of availability: instant loss of revenue or increased cost
You can go through and manually identify this data, but that is not sustainable. The best approach is to automate the identification and protection processes. Data loss prevention (DLP) tools have become very good at this, and you do not have to invest millions to use them. Try MyDLP, an open source tool.

Start with the most structured items on the above list: credit card numbers or government identifiers such as national identity or social security numbers. These are ideal because you can define a precise regular expression to find them. Most DLP tools have rules for this out of the box, but even if they do not, instructions are not hard to find. Use the DLP software to find where this information currently lives, where it moves both internally and externally, and where and by whom it is used. If you are like most organizations, you will be surprised when data you thought was in a single database turns up in emails, spreadsheets and share drives. If you do not want to try MyDLP or similar, most DLP vendors like Symantec or RSA will come out and do this type of discovery exercise for you free of charge (obviously as an entry point for their tool). The sketch below shows the kind of rule these tools run.
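
To make this concrete, here is a minimal sketch in Python of the kind of rule a DLP tool applies for card numbers: a deliberately loose regular expression, followed by a Luhn checksum to discard most false positives. The scan paths are hypothetical placeholders for your own shares; a real DLP product adds file-format parsing and far more context.

    import os
    import re

    # Hypothetical paths - point these at your own share drives and home directories
    SCAN_ROOTS = ["/mnt/shares", "/home"]

    # Loose first pass: 13-16 digits, optionally separated by spaces or dashes
    CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

    def luhn_valid(number: str) -> bool:
        """Luhn checksum - filters out most random digit runs that are not card numbers."""
        digits = [int(d) for d in reversed(number)]
        total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
        return total % 10 == 0

    for root in SCAN_ROOTS:
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as f:
                        text = f.read()
                except OSError:
                    continue  # unreadable file - skip it rather than abort the sweep
                for match in CARD_PATTERN.finditer(text):
                    if luhn_valid(re.sub(r"[ -]", "", match.group())):
                        # Report the location only - never write card numbers to the report
                        print(f"Possible card number in {path}")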

Then put protections in place:
  • Encryption - an unencrypted database of credit card numbers or social security numbers is a great place to start. Hopefully you will not have too many of these, and identifying them for your DBAs to encrypt is a good first step. Going forward, set up an automated email for when this data is detected again - it will most likely be someone taking an extract or copy that they should not (there is a sketch of such an alert after this list). If you have email encryption software, you may be able to configure it to automatically encrypt emails that match the same signature or that are specifically flagged by your DLP software. If your disputes team or HR team need to regularly extract and work with credit card or employee data respectively, consider creating an encrypted share drive for them; they can then continue to work as normal and the data will be encrypted transparently
  • Reduce data loss - this is the primary purpose of DLP software. I have some practical lessons learnt on where and how to implement DLP
  • Stronger authentication - if you already have two factor authentication for remote access, extend it to the servers and applications that store your Confidential data. If you do not have anything in place, consider an open source option like WiKID. A YubiKey can be an easy way to add two factor authentication to just about anything thanks to its broad integration. Again, hopefully this is not something you have to do frequently - adding it to a project checklist, and alerting the 2FA admins when a system holding this type of data is detected, should be sufficient going forward
  • Improved network controls - create specific VLANs for systems that hold this data, and put access controls in place to limit network access to only the systems that should communicate with them. Not everyone on your LAN should be able to reach the database storing your card data or HR data, for example. If you have a firewall that supports dynamic rules, you may be able to automate this; otherwise it is again a project checklist item and an alert to your network admin on detection
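
Here is a minimal sketch of that alert hook, assuming your DLP tool can run a script on each finding. Every name in it - the mail relay, the admin addresses, the approved host list - is a hypothetical placeholder for your own environment.

    import smtplib
    from email.message import EmailMessage

    # All values below are hypothetical placeholders - substitute your own
    SMTP_RELAY = "mail.example.internal"
    ADMIN_ADDRESSES = ["dba-team@example.com", "netops@example.com", "iam-team@example.com"]

    # Hosts already approved (and protected) to hold this class of data
    APPROVED_HOSTS = {"payments-db-01"}

    def alert_on_detection(host: str, finding: str) -> None:
        """Email the admin groups when Confidential data turns up on an unapproved host."""
        if host in APPROVED_HOSTS:
            return  # expected location - nothing to do
        msg = EmailMessage()
        msg["Subject"] = f"Confidential data detected on unapproved host: {host}"
        msg["From"] = "dlp-alerts@example.internal"
        msg["To"] = ", ".join(ADMIN_ADDRESSES)
        msg.set_content(
            f"{finding}\n"
            "Check encryption, two factor authentication and VLAN placement for this host."
        )
        with smtplib.SMTP(SMTP_RELAY) as smtp:
            smtp.send_message(msg)

    # Example: a DLP hit on a laptop that should never hold card data
    alert_on_detection("finance-laptop-12", "Card numbers found in a local spreadsheet")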

Structured data is the best place to start with this process, but you can also protect less structured data by hashing or indexing data that you first manually identify as Confidential. Everything from your CEO's email, to the folder where your finance staff store the annual reports and projections, to your R&D database can be hashed, and then an exact or partial match of the hash found. The above protections and alerts can then also be applied to this less structured data; the sketch below illustrates the partial-matching idea.
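
A minimal sketch of that partial matching: hash overlapping word windows so that even a pasted excerpt of a seed document still triggers a match. The seed file name is a hypothetical example, and commercial fingerprinting is considerably more robust, but the principle is the same.

    import hashlib

    WINDOW = 8  # words per fingerprint window - smaller windows catch shorter excerpts

    def fingerprints(text: str) -> set:
        """Hash every overlapping window of WINDOW words in the text."""
        words = text.split()
        windows = [" ".join(words[i:i + WINDOW])
                   for i in range(max(1, len(words) - WINDOW + 1))]
        return {hashlib.sha256(w.encode()).hexdigest() for w in windows}

    # Build the index once, from documents you manually marked Confidential
    # ("annual_projections.txt" is a hypothetical example file)
    with open("annual_projections.txt") as f:
        confidential_index = fingerprints(f.read())

    def looks_confidential(outbound_text: str, threshold: int = 3) -> bool:
        """Flag outbound text when several of its windows match the Confidential index."""
        return len(fingerprints(outbound_text) & confidential_index) >= threshold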

Internal data with too many controls

Unfortunately this will tend to be rarer than you hoped. It will generally be found where you have applied a control to everything because it was too hard to apply it selectively. For some controls, the benefits of a uniform environment for troubleshooting and support mean this blanket approach actually makes sense. For others, the type of threat being mitigated means it is important to apply them everywhere rather than only to your most sensitive information: for example, anti-virus on all servers, a standard build with unnecessary or insecure services such as telnet or tftp disabled, or vulnerability scanning across all your systems. In these cases the benefits of classification are twofold:
  • Prioritization - once you know where your Confidential data is, you can focus on patching those systems first, in addition to your Internet-facing servers
  • Monitoring - alerts such as a virus detection or a large number of failed logins on systems holding your Confidential data can automatically receive a higher rating (see the sketch after this list)
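
A minimal sketch of that rating bump, assuming a hypothetical inventory of hosts that your DLP discovery found holding Confidential data; most monitoring and SIEM products let you express the same thing as a rule.

    # Hypothetical inventory built from your DLP discovery results
    CONFIDENTIAL_HOSTS = {"payments-db-01", "hr-fileshare"}

    def triage(alert: dict) -> dict:
        """Raise the severity of any alert from a host known to hold Confidential data."""
        if alert["host"] in CONFIDENTIAL_HOSTS and alert["severity"] not in ("high", "critical"):
            return {**alert, "severity": "high", "reason": "host holds Confidential data"}
        return alert

    # A low-severity alert becomes high because of where it happened
    print(triage({"host": "hr-fileshare", "severity": "low", "type": "failed logins"}))
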
Typically the overhead of a security control is not so great that you would remove something already in place. But you could also use your DLP software to identify data flows and stores that are not Confidential and remove some controls where appropriate, for example:
  • Very large files that you transfer frequently over SFTP that could be moved to standard FTP
  • Network zones you defined to protect your most sensitive information, where you may be able to remove some firewalls or at least simplify the rulesets
  • Systems storing only Internal data, to which any new employee could be given access immediately - meaning they are productive from day one and you have fewer access requests to approve and provision

Summary
Three takeaways:
  • Start with the end in mind - the whole point of classifying data is so you can protect what is most important and save money and time on the rest
  • Identify, protect, iterate - do not get stuck classifying
  • Automate - consider how your most sensitive data will continue to be identified and protected with minimal human involvement. At the very least, set up alerts that will trigger action
