Hadoop has had a security problem from the day it first appeared. Apache Knox and Cloudera Manager have provided authentication and authorization for basic database management functions, and the underlying Hadoop filesystem now incorporates Unix-like permissions. But the issue has not been solved, so the usual pattern is to "plunk the S-word after the name of a new technology and you have a 'BOLD IDEA FOR A NEW STARTUP!!!!'", as explained in Trust me: Big data is a huge security risk.
We have seen this before: SOA security, AJAX security, and open source security each spawned their own crop of security startups.
In Hadoop, and in big data in general, the real security problem is that when we aggregate a lot of data, we tend to lose context. Hadoop lets you store context, but checking all of that context against each piece of data is an expensive proposition.
Understanding context means knowing not only how to access a database as a particular user, but also how to aggregate additional data while preserving granular rights and permissions.
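As a rough illustration of what "aggregating while preserving granular rights" could look like, here is a minimal sketch (all names and data structures are hypothetical, not from any particular product): each record carries its own access-control labels, and aggregation filters against them instead of stripping them away.

```python
# Illustrative sketch (hypothetical names): merge records from several
# sources while carrying each record's access-control list along, so the
# aggregate can still be filtered per user rather than losing that context.

def aggregate_with_acls(sources, user_groups):
    """Merge records from multiple sources, keeping only rows the user may see.

    Each record is a dict with a 'value' and an 'acl' (the set of groups
    allowed to read it). Aggregation never drops the ACL; it checks it.
    """
    visible = []
    for source in sources:
        for record in source:
            # Context check: the user needs at least one group from the ACL.
            if record["acl"] & user_groups:
                visible.append(record)
    return visible

sales = [{"value": 100, "acl": {"finance"}},
         {"value": 250, "acl": {"finance", "audit"}}]
hr = [{"value": 80, "acl": {"hr"}}]

rows = aggregate_with_acls([sales, hr], user_groups={"audit"})
total = sum(r["value"] for r in rows)  # only the audit-visible row survives
```

The expensive part, of course, is doing this check at scale for every cell, which is exactly the performance problem described above.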
To put data ownership and data-context rules in place without killing performance, there are emerging technology solutions such as Accumulo, created by the big data community, including everyone's favorite member, the NSA.
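Accumulo's approach is cell-level security: every cell carries a visibility label, a boolean expression over authorization tokens, and a scan returns only cells whose expression the reader's tokens satisfy. The toy evaluator below sketches that idea (it is not Accumulo's actual parser; for instance, real Accumulo requires parentheses when mixing `&` and `|`, while this sketch simply gives `&` higher precedence):

```python
# Toy sketch of the idea behind Accumulo-style visibility labels: a cell is
# visible only if the reader's authorization tokens satisfy its boolean
# label expression, e.g. "admin&(audit|finance)". Not Accumulo's real parser.
import re

def _parse_or(tokens, pos, auths):
    val, pos = _parse_and(tokens, pos, auths)
    while pos < len(tokens) and tokens[pos] == "|":
        rhs, pos = _parse_and(tokens, pos + 1, auths)
        val = val or rhs
    return val, pos

def _parse_and(tokens, pos, auths):
    val, pos = _parse_atom(tokens, pos, auths)
    while pos < len(tokens) and tokens[pos] == "&":
        rhs, pos = _parse_atom(tokens, pos + 1, auths)
        val = val and rhs
    return val, pos

def _parse_atom(tokens, pos, auths):
    if tokens[pos] == "(":
        val, pos = _parse_or(tokens, pos + 1, auths)
        return val, pos + 1  # skip the closing ")"
    # A bare token is true iff the reader holds that authorization.
    return tokens[pos] in auths, pos + 1

def visible(expr, auths):
    """Return True if the auth tokens satisfy the visibility expression."""
    tokens = re.findall(r"\w+|[&|()]", expr)
    val, _ = _parse_or(tokens, 0, auths)
    return val
```

For example, `visible("admin&(audit|finance)", {"admin", "audit"})` is true, while `visible("admin&secret", {"admin"})` is false. Attaching the check to each cell is what lets the store enforce granular rights during a scan rather than after aggregation.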
Since the security problem has been a hot topic for almost a decade now, there has also been research. When building a big data project for data aggregation and wondering about security, search for "data warehouse security".
Though 70 percent of the results will be vendor pitches or complaints about RBAC, there will also be plenty of results that explain exactly how this was done before, describing neither technologies nor tools but methodologies, and those translate more or less directly to big data.