Your business is growing: you are onboarding more and more customers onto your system. In practice you are persisting more and more transactions, going from thousands of daily transactions to hundreds of thousands and eventually breaking the barrier of a million transactions per day.
On one hand you are still too small to call yourself “Big Data”, but on the other hand you are not that small either (whatever you may think, millions of transactions per day still deserve some thought).
For some companies data is the main business, but I want to share my experience with companies where data is not the main business yet is still the key to producing KPIs, reports and analytics for internal use and for their customers.
When data is not your main business, you start thinking about it at a later stage, when it actually starts to grow. Then questions about how to store and query it start to arise, together with budget and team resource concerns. Your business services send more and more data, and your customers start to ask for it. First things first, think about your data targets:
- What is this data for?
  - ETLs for customer sources?
  - Transactional backup?
  - Fast searching?
- Which queries am I going to run on it?
  - Aggregation functions?
  - Transactional reports? (by date? by time? other conditions?)
Once you have answered those questions, you have an idea of how you want to persist this data. The next step is to understand how to tackle the infrastructure and technology choices:
- Where are you going to store your data? Which datasource technology (relational, columnar store, file system, etc.)?
- How are you going to transfer your data (a.k.a. data pipelines)?
- What are your SLA concerns? (Is it OK not to have HA? How fast do I need this data? What data integrity ratio is acceptable?)
- What are my resources? (Dev team, DevOps team, etc.)
So the first questions arise from your business requirements, but the others are pretty common once you know what your data should look like. I want to discuss the infra and tech questions, as they usually repeat themselves and have important gotchas we must take into consideration.
Data warehouse or data lake?
I am not going to review all the datasources on the market; I’m sure there are enough resources for that. But I do want to help distinguish between two generic datasource strategies, which could give you a clue about which kind of datasource to aim for.
Suppose you know your data and it is structured, with a fixed set of repetitive queries. For example:
- Return all players with amount > X
- Return the avg session time for user K
- Return customers in location Z
That sounds more like a data warehouse (or OLAP), and technologies you can start looking at would be AWS Redshift, Snowflake, MemSQL and the like.
If your data comes from multiple, varied sources and takes multiple forms, it sounds like you need a data lake, which can store everything together in one place for later use. Technologies you can start looking at would be AWS S3, Presto, AWS Athena and the like.
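To make the "fixed, repetitive queries" idea concrete, here is a minimal sketch of the three example queries above. It uses Python's built-in sqlite3 only as a stand-in for a real warehouse; the `players` and `sessions` tables, their columns and the sample rows are all hypothetical and invented for illustration.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players  (id INTEGER PRIMARY KEY, name TEXT, amount REAL, location TEXT);
    CREATE TABLE sessions (user_id INTEGER, seconds INTEGER);
    INSERT INTO players  VALUES (1, 'ann', 120.0, 'Z'), (2, 'bob', 30.0, 'Y'), (3, 'cat', 75.5, 'Z');
    INSERT INTO sessions VALUES (1, 600), (1, 300), (2, 120);
""")

def players_with_amount_over(x):
    """Return all players with amount > X."""
    rows = conn.execute(
        "SELECT name FROM players WHERE amount > ? ORDER BY name", (x,))
    return [name for (name,) in rows]

def avg_session_seconds(user_id):
    """Return the average session time for user K."""
    (avg,) = conn.execute(
        "SELECT AVG(seconds) FROM sessions WHERE user_id = ?", (user_id,)).fetchone()
    return avg

def customers_in_location(location):
    """Return customers in location Z."""
    rows = conn.execute(
        "SELECT name FROM players WHERE location = ? ORDER BY name", (location,))
    return [name for (name,) in rows]
```

The point is not the engine but the shape: the schema is known up front and the same parameterized queries run again and again, which is the kind of workload a warehouse is built for.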
Data transfer and Pipelines
We usually don’t think about this one at first, but as I see it today, most datasource technologies don’t provide it out of the box. I think this subject should be one of your major considerations when you are choosing a datasource provider. We have data from different sources (business services, logs, files, etc.) and we need to transfer it into our beloved datasource. Transferring data looks easy at first sight with all the tools we have around, but it’s not:
Data transfer Gotchas:
- Handle batching. Data usually arrives in large chunks, and we need to batch our way to the final destination. You need to think about how to aggregate your data before persisting it.
- Failure handling. What happens if a batch fails? How do we take care of it? How do we track it? Recover? Retry? Send it to a DLQ (dead-letter queue)?
- Tracking. How do we know all our data has arrived? How do we track and log it?
- Duplicates. One of the most painful concerns. How do we handle them? How do we avoid them? Should we delete them? Which deduplication strategy should we pick?
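To show how these gotchas interact, here is a small in-memory sketch rather than a production pipeline. Everything in it is an assumption made for illustration: records are dicts with an `id` field, `sink` is any callable that persists a batch and raises on failure, deduplication is a plain in-memory set, and retries happen immediately with no backoff.

```python
from itertools import islice

def batches(records, size):
    """Yield lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def run_pipeline(records, sink, batch_size=3, max_attempts=3):
    """Move records to `sink` in batches; return (stats, dead_letter_queue)."""
    seen_ids = set()   # dedup by record id (in-memory only, an assumption)
    dlq = []           # batches that still fail after max_attempts
    stats = {"received": 0, "duplicates": 0, "delivered": 0, "dead_lettered": 0}

    def unique(stream):
        # Drop records whose id was already seen, counting as we go.
        for rec in stream:
            stats["received"] += 1
            if rec["id"] in seen_ids:
                stats["duplicates"] += 1
                continue
            seen_ids.add(rec["id"])
            yield rec

    for batch in batches(unique(records), batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                sink(batch)
                stats["delivered"] += len(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    # Give up on this batch but keep it for later inspection.
                    dlq.append(batch)
                    stats["dead_lettered"] += len(batch)
    return stats, dlq
```

The tracking part is the `stats` dict: after a run, `received` should equal `duplicates + delivered + dead_lettered`, and if it does not, data went missing somewhere. A real flow would also need persistent dedup state, backoff between retries and a way to replay the DLQ, which is where the dev effort really goes.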
As you can see, data transfer is a big topic. Unfortunately, it always takes dev resources to get a working flow.
At the end of the day we need to move data from multiple points into our datasource. That flow is called a pipeline. Pipelines can be built as an in-house flow or with 3rd party solutions. Either way, your budget concerns (if budget is a concern) are going to rise, especially if you are a small to mid-size business that has just entered the millions of transactions per day. Consider this before choosing any solution, or else you will discover that your scope is much bigger than you estimated.
For me budget and resources are always a concern, so I always look for providers that offer both hosting and out-of-the-box solutions. I will pick a few solutions that could give you a jump start towards your strategy:
AWS Redshift
- Typical for a data warehouse. A powerful managed columnar store solution.
- No pipeline strategy.
- Transferring your data into Redshift requires you to develop almost all of your pipelines in-house. With Redshift you typically use the COPY command to bulk-load your data, but you need to handle almost everything else yourself (see my notes above under Data transfer Gotchas).
- You can cover this with 3rd party providers like alooma.com, Stitch, etc., but if budget is on your mind you had better add it all up before you decide.
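To give a feel for what the COPY step looks like, here is a minimal sketch that only builds the statement text for loading gzipped JSON files from S3. The table name, bucket path and IAM role ARN in the usage are placeholders, and the `build_copy_statement` helper is hypothetical; actually running the statement would need a database connection, which is not shown.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, gzip=True):
    """Return a Redshift COPY statement that loads JSON files from an S3 prefix."""
    # Very rough guard against odd table names; not a full SQL-injection defence.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"suspicious table name: {table!r}")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS JSON 'auto'",   # map JSON keys to column names automatically
    ]
    if gzip:
        parts.append("GZIP")       # the source files are gzip-compressed
    return "\n".join(parts) + ";"

print(build_copy_statement(
    "transactions",
    "s3://example-bucket/batches/",
    "arn:aws:iam::123456789012:role/ExampleRole",
))
```

Notice what is missing: nothing here batches the files, retries a failed load, tracks what was loaded or prevents loading the same file twice. That is the in-house work mentioned above.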
AWS Athena
- Typical for a data lake. Also managed, as it works on top of AWS S3.
- Working with Athena requires you to transfer your data into S3. Creating pipelines into S3 is easier, as S3 is a well-known object store with a lot of integration tools around it. If you are using Kafka, for example, you can use tools like Kafka Connect, which can handle many of your Data transfer Gotchas.
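Whatever tool writes into S3, it helps to decide on a key layout early. Below is a small sketch of one common choice, Hive-style `key=value` path segments, which Athena can use as partitions. The `events` prefix, the `source` and `dt` partition names and the file naming are my own assumptions for illustration, not something S3 or Athena mandates.

```python
from datetime import datetime, timezone

def s3_object_key(source, event_time, batch_id, prefix="events"):
    """Build a Hive-style partitioned key, e.g.
    events/source=billing/dt=2024-01-02/batch-000042.json.gz
    """
    # Partition by UTC date so every producer agrees on the day boundary.
    dt = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}/source={source}/dt={dt}/batch-{batch_id:06d}.json.gz"

key = s3_object_key("billing", datetime(2024, 1, 2, 23, 30, tzinfo=timezone.utc), 42)
print(key)
```

With a layout like this, a report "by date" only has to scan the matching `dt=` folders instead of the whole bucket, which matters once you are in the millions of transactions per day.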
MemSQL
- Typical for a data warehouse. A very fast and powerful solution for analytics. Can be either managed or maintained in-house.
- What I love about MemSQL is its out-of-the-box pipelines. I think it’s the only solution (that I’m aware of as of today) that gives you full coverage for data transfer. Its built-in pipeline mechanism covers almost all of your Data transfer Gotchas.
- It’s a very expensive solution. Choose it carefully.
I have reviewed some common datasource technologies. I think this can give you an idea of what to expect and what work you need to do around them. In my next post I will cover columnar data warehouse concepts and design, to help you avoid your next performance issues :)
Published on Java Code Geeks with permission by Idan Fridman, partner at our JCG program. See the original article here: Our growing-up data challenges and how we tackle them
Opinions expressed by Java Code Geeks contributors are their own.