Hadoop data security remains a top concern for data professionals. To help organizations put up the best defense, Reiner Kappenberger, a senior executive focused on big data and Hadoop at HPE Security – Data Security, offers five steps for securing data in the Hadoop environment.

1. Audit and understand your Hadoop data.

To get started, take an inventory of all the data you intend to store in your Hadoop environment. You’ll need to know what’s going in so you can identify and rank the sensitivity of that data. It may seem like a daunting task, but attackers can take your data quickly and sort it at their leisure. If they are willing to put in the time to find what you have, you should be too.
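To make the inventory concrete, a classification pass can tag each field headed into Hadoop with a sensitivity tier. The sketch below is a minimal Python illustration; the field names, tiers, and matching rules are hypothetical and would come from your own data catalog.

```python
# Minimal sketch of an inventory pass: tag each field in an incoming dataset with
# a sensitivity tier. Field names, tiers, and rules here are hypothetical examples.
SENSITIVITY_RULES = {
    "ssn": "high",
    "credit_card": "high",
    "date_of_birth": "medium",
    "zip_code": "medium",
    "email": "medium",
    "page_views": "low",
}

def classify_fields(field_names):
    """Map each field to a sensitivity tier; unknown fields are flagged for review."""
    return {name: SENSITIVITY_RULES.get(name.lower(), "review") for name in field_names}

print(classify_fields(["ssn", "zip_code", "page_views", "referrer_url"]))
# {'ssn': 'high', 'zip_code': 'medium', 'page_views': 'low', 'referrer_url': 'review'}
```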

2. Perform threat modeling on sensitive data.

The goal of threat modeling is to identify the potential vulnerabilities of at-risk data and to know how the data could be used against you if stolen. This step can be simple: for example, personally identifiable information always has a high black-market value. But assessing data vulnerability isn’t always so straightforward. A date of birth may not seem sensitive on its own, but combined with a ZIP code it gives criminals a lot more to go on. Be aware of how different data elements can be combined for malicious purposes.
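The date-of-birth and ZIP code point can be made concrete with a quick k-anonymity-style check: count how many records share the same values for a set of attributes and watch the smallest group shrink when quasi-identifiers are combined. The records below are fabricated purely for illustration.

```python
# Toy illustration of quasi-identifier risk: DOB or ZIP alone is shared by several
# records, but the combination can single out one person. Records are made up.
from collections import Counter

records = [
    {"dob": "1984-03-12", "zip": "94107"},
    {"dob": "1984-03-12", "zip": "10001"},
    {"dob": "1984-03-12", "zip": "94107"},
    {"dob": "1990-07-01", "zip": "94107"},
    {"dob": "1990-07-01", "zip": "10001"},
]

def smallest_group(rows, keys):
    """Size of the smallest group sharing the same values for `keys`
    (a group of 1 means someone is uniquely re-identifiable)."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(groups.values())

print(smallest_group(records, ["dob"]))         # 2 -- every birth date is shared
print(smallest_group(records, ["zip"]))         # 2 -- every ZIP code is shared
print(smallest_group(records, ["dob", "zip"]))  # 1 -- the pair singles someone out
```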

3. Identify the business-critical values within sensitive data.

It’s no good to secure data if the security tactic also neutralizes its business value. You’ll need to know whether the data has characteristics that are critical to downstream business processes. For example, certain digits in a credit card number are critical to identifying the issuing bank, while other digits have no value beyond the transaction. By identifying the digits you need to retain, you can choose data masking and encryption techniques that keep re-identification possible.
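As a concrete sketch, the leading digits of a card number identify the issuer and the last four are what downstream processes typically display, so a masking rule can retain those and obscure the rest. The function below is illustrative only; plain masking like this is irreversible, which is why the next step pairs it with tokenization or format-preserving encryption when re-identification is required.

```python
# Sketch of a masking rule that retains only the business-critical digits of a
# card number: the leading issuer digits and the trailing four. Illustrative only.
def mask_card_number(pan, keep_prefix=6, keep_suffix=4):
    digits = "".join(ch for ch in pan if ch.isdigit())
    hidden = len(digits) - keep_prefix - keep_suffix
    if hidden <= 0:
        raise ValueError("card number too short to mask")
    return digits[:keep_prefix] + "*" * hidden + digits[-keep_suffix:]

print(mask_card_number("4111 1111 1111 1111"))  # 411111******1111
```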

4. Apply tokenization and format-preserving encryption on data as it is ingested.

You’ll need to use one of these techniques to protect any data that requires re-identification. While there are other techniques for obscuring data, these are particularly suited for Hadoop because they do not result in collisions that prevent you from analyzing data. Each technique has different use cases; expect to use both, depending on the characteristics of the data being masked. Format-preserving technologies enable the majority of your analytics to be performed directly on the de-identified data, securing data-in-motion and data-in-use.
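To give a rough sense of why these techniques keep data analyzable: the token has the same length and character class as the original, the mapping is deterministic so joins and aggregations still line up, and an authorized lookup reverses it. The Python below is a toy sketch under those assumptions; production systems use a hardened token vault or NIST FF1 format-preserving encryption with keys from a key manager, not an in-memory dictionary and a hard-coded key.

```python
# Toy sketch of deterministic, format-preserving tokenization with a token vault.
# The key and vault are illustrative; real deployments use managed keys and a
# hardened vault or NIST FF1 format-preserving encryption.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-from-your-kms"  # assumption: sourced from a KMS
_vault = {}                                       # token -> original value

def tokenize(value):
    """Map a digit string to a same-length digit token, deterministically."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    token = "".join(str(int(ch, 16) % 10) for ch in digest)[: len(value)]
    _vault[token] = value                         # keep mapping for re-identification
    return token

def detokenize(token):
    """Authorized reversal via the vault."""
    return _vault[token]

pan = "4111111111111111"
tok = tokenize(pan)
print(tok, len(tok) == len(pan) and tok.isdigit())  # same format as the original
print(detokenize(tok) == pan)                       # re-identification still possible
```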

5. Provide data-at-rest encryption throughout the Hadoop cluster.

Hadoop replicates data across nodes as soon as it enters the environment, which means you won’t be able to trace every copy. When hard drives age out of the system and need replacing, data-at-rest encryption means you won’t have to worry about what could be recovered from a discarded drive once it has left your control. This step is often overlooked because it’s not a standard feature offered by Hadoop vendors.
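For the cluster itself, HDFS ships with transparent data-at-rest encryption built around encryption zones backed by the Hadoop KMS. The sketch below drives the standard `hadoop key create` and `hdfs crypto` commands from Python for illustration; the key name and path are placeholders, and a configured KMS is assumed.

```python
# Sketch: set up an HDFS encryption zone so files written under it are encrypted
# at rest. Key name and path are placeholders; a configured Hadoop KMS is assumed.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

key_name = "pii-zone-key"      # hypothetical key name
zone_path = "/data/secure"     # hypothetical HDFS path

run(["hadoop", "key", "create", key_name])              # register the key with the KMS
run(["hdfs", "dfs", "-mkdir", "-p", zone_path])         # zone directory must exist and be empty
run(["hdfs", "crypto", "-createZone",
     "-keyName", key_name, "-path", zone_path])         # files under the zone are now encrypted at rest
run(["hdfs", "crypto", "-listZones"])                   # confirm the zone was created
```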