Data volumes are growing exponentially. Unstructured data from Twitter, LinkedIn, mailing lists, etc. has the potential to transform many industries if it can be combined with structured data. Machine learning, natural language processing, sentiment analysis: everybody talks about them, but hardly anybody is really using them at scale. Unfortunately, too many people who talk about Big Data start with the answer and then ask what the problem is. The answer seems to be Hadoop. News flash: Hadoop is not the answer, and if you start from the answer and go looking for problems, you are doing it wrong.
What are Common Data Problems?
Most Big Data problems are about storage and reporting. How do I store all the exponentially growing data in such a way that business managers can get to it in seconds when they need it? Ad-hoc reporting, adequate prediction, and making sense of the exponentially growing data stream are the key problems.
Big Data Storage?
Do you have relational data, unstructured data, graph data, etc.? How do you store different types of data and make them available inside an enterprise? The basis for Big Data storage is cloud storage technology. You want to store any type of data and be able to quickly scale up storage. Red Hat did not buy Inktank for $175M because traditional storage has solved all of today’s problems. Premium SAN and other storage technologies are old school. They are too expensive for Big Data. They were designed with the idea that each byte of data is critical for an enterprise. This is no longer the case. You mind losing transactional sales data. You don’t mind so much losing sample tweets you bought from Datasift or Apache log files from an internal low-impact server. This is where cloud storage solutions like Inktank’s Ceph come in: they allow commodity storage to be built that is reliable, scalable and extremely cost effective. Does this mean you don’t need SANs any more? Wrong again. TV did not kill radio. Same here.
Cloud storage technologies are needed because each type of data behaves differently. If you have append-only log data then HDFS is fine. If you have read-mostly data then a relational database is ideal. If you have write-mostly data then you need to look at NoSQL. If you need heavy read-and-write then you need strong Big Data architecture skills. What is more important: low latency, consistency, reliability, cheap storage, etc.? Each of these leads to a different solution. Low latency means in-memory or SSD. Consistency means transactional. Reliability means replication. You can now even find databases that deliberately trade exact answers for speed, like BlinkDB. There is no longer one size fits all. Oracle is no longer the answer to everybody’s data questions.
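The rules of thumb above can be written down as a small lookup. This is only a sketch of the reasoning in the paragraph, not a real sizing tool: the names `AccessPattern`, `STORAGE_BY_PATTERN`, `TECHNIQUE_BY_REQUIREMENT` and `suggest_storage` are made up for illustration, and a real decision would weigh cost and data volume as well.

```python
from enum import Enum


class AccessPattern(Enum):
    """Access patterns named in the text (illustrative enum, not a standard)."""
    APPEND_ONLY = "append-only"
    READ_MOSTLY = "read-mostly"
    WRITE_MOSTLY = "write-mostly"
    HEAVY_READ_WRITE = "heavy read-and-write"


# Storage family suggested in the text for each access pattern.
STORAGE_BY_PATTERN = {
    AccessPattern.APPEND_ONLY: "HDFS",
    AccessPattern.READ_MOSTLY: "relational database",
    AccessPattern.WRITE_MOSTLY: "NoSQL store",
    AccessPattern.HEAVY_READ_WRITE: "custom Big Data architecture",
}

# Technique implied by each non-functional requirement in the text.
TECHNIQUE_BY_REQUIREMENT = {
    "low latency": "in-memory or SSD",
    "consistency": "transactional storage",
    "reliability": "replication",
}


def suggest_storage(pattern, requirements=()):
    """Return (storage family, list of techniques) for a workload."""
    techniques = [TECHNIQUE_BY_REQUIREMENT[r] for r in requirements]
    return STORAGE_BY_PATTERN[pattern], techniques
```

For example, `suggest_storage(AccessPattern.WRITE_MOSTLY, ["reliability"])` pairs a NoSQL store with replication. The point is that the answer depends on the workload, not on a favourite product.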
What will companies need? Companies need cloud storage solutions that offer these different storage capabilities as a service. Amazon’s RDS, DynamoDB, S3 and Redshift are examples of what companies need. However, companies need more flexibility. They need to be able to migrate their data between public cloud providers to optimise their costs and improve security. They also need to be able to store data in private local clouds or nearby hosted private clouds for latency or regulatory reasons.
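One way to keep that flexibility is to code against a provider-neutral interface instead of one vendor's SDK. The sketch below assumes a hypothetical `BlobStore` interface and uses an in-memory stand-in so it runs without any cloud account. None of these names correspond to a real vendor API, and a real adapter for S3, Ceph or a private cloud would have to be written behind the same three methods.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical provider-neutral storage interface (not a vendor API)."""

    @abstractmethod
    def put(self, key, data):
        """Store the bytes in data under key."""

    @abstractmethod
    def get(self, key):
        """Return the bytes stored under key."""

    @abstractmethod
    def keys(self):
        """Return all stored keys in sorted order."""


class InMemoryStore(BlobStore):
    """Stand-in backend used only to make the sketch runnable."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

    def keys(self):
        return sorted(self._blobs)


def migrate(source, target):
    """Copy every blob from source to target and return the number copied."""
    copied = 0
    for key in source.keys():
        target.put(key, source.get(key))
        copied += 1
    return copied
```

With every backend behind the same interface, moving from one provider to another becomes a call to `migrate(old_store, new_store)` rather than a rewrite of every application.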
The future of ETL & BI
Traditional ETL will see a revolution. ETL never worked. Business managers don’t want to go and ask their IT department to change a star schema in order to import some extra data from the Internet, and then wait for updates to reports and dashboards. Business managers want an easy-to-use tool that can answer their ad-hoc queries. This is the reason why Tableau Software + Amazon Redshift are growing like crazy. However, if your organisation is starting to pump terabytes of data into Redshift, please be warned: the day will come when Amazon sends you a bill that your CxO will not want to pay, and he or she will want you to move off Amazon. What will you do then? Do you have an exit strategy?
The future of ETL and BI will be web tools that any business manager can use to create ad-hoc reports. The Office generation wants to see dynamic HTML5 GUIs that allow them to drag-and-drop data queries into ad-hoc reports and dashboards. If you need training then the tool is too difficult.
These next-generation BI tools will need dynamic back-office solutions that can store real-time, graph, blob, historical relational, unstructured and other data in a commonly accessible cloud storage solution. Each type of data may be hosted by a different cloud service, but they will all be an API away. Software will be packaged in such a way that it knows how to export its own data. Why do you need to know where Apache stores the access and error logs and in which format? Apache should be able to export whatever interesting information it contains in a standardised way into some deep storage. Machine learning should be used to decide how best to store that data for ad-hoc reporting afterwards. Humans should no longer be involved in this process.
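To make the Apache example concrete, here is a minimal sketch of what such a standardised export could look like today: one access-log line in Apache's Common Log Format is turned into a JSON record that any deep storage could ingest. The function name `export_log_line` and the choice of JSON are assumptions made for illustration. Apache does not ship this exporter, and custom `LogFormat` settings would need a different pattern.

```python
import json
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Sample line in Common Log Format.
SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)


def export_log_line(line):
    """Convert one access-log line to a JSON string, or None if it does not parse."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache logs "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return json.dumps(record, sort_keys=True)
```

Calling `export_log_line(SAMPLE)` gives a self-describing record with named fields, so whoever consumes it no longer needs to know where the log file lives or how its columns are laid out.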
Speaking of machine learning: with the volumes of data growing from gigabytes into petabytes, traditional data scientists will not scale. In many companies a data scientist is similar to a report monkey: “Find out why in region X we sold Y% less”, etc. Data scientists should not be synonymous with dynamic report generators. Data scientists should be machine learning experts. They should tell the computer what they want, not how they want it. Today’s data scientists pride themselves on knowing R, Python, etc. These tools are too low-level to be usable at scale. There are just not enough people in the world to learn R. Data is growing exponentially; the number of R experts can at best grow linearly. What we need are machine learning GUI solutions like RapidMiner Studio, but supported by petabyte-scale cloud solutions. A short-term solution could be an HTML5 GUI version of RapidMiner Studio that connects to a back-end set of cloud services that use some of the nice Apache Spark extensions for machine learning, streaming, Big Data warehousing/SQL, graph retrieval, etc., or solutions based on Druid.io. For sure there are other solutions possible.
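The scaling argument (exponential data, linear experts) is easy to check with a few lines of arithmetic. Every number in this sketch is a made-up assumption chosen only to show the shape of the curve: data doubling every year, and a fixed number of new experts trained per year.

```python
def data_per_expert(years, data_growth=2.0, data_start_tb=100.0,
                    experts_start=100, experts_added_per_year=100):
    """Return terabytes per expert for each year from 0 to years.

    Data grows exponentially (multiplied by data_growth each year),
    headcount grows linearly. All default values are hypothetical.
    """
    ratios = []
    for year in range(years + 1):
        data_tb = data_start_tb * data_growth ** year
        experts = experts_start + experts_added_per_year * year
        ratios.append(data_tb / experts)
    return ratios
```

Whatever starting values you pick, the ratio eventually grows without bound, because no linear hiring plan keeps up with a doubling data volume. That is why the tooling has to change, not just the headcount.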
What is important is that companies start realising that data is becoming a strategic weapon. Those companies that are able to collect more of it and convert it into valuable knowledge and wisdom will be tomorrow’s giants. Most average machine learning algorithms become substantially better just by throwing more and more data at them. This means that having a Big Data architecture is not as critical as having the best trained models in the industry and continuing to train them. There will be a data divide between the haves and have-nots. Google, Facebook, Microsoft and others have been buying any startup that smells like Deep Belief Networks. They have done this for good reason. They know that tomorrow’s algorithms and models will be more valuable than diamonds and gold. If you want to be one of the haves then you need to invest in cloud storage now. You need massive historical data volumes to train tomorrow’s algorithms, so start building the foundations today…