A Deep Dive into InfluxDB

By Gianluca Arbezzano, November 15th, 2017 (Last Updated: November 15th, 2017)

I gave you an overview of the TICK stack in a previous article, but here we'll focus on InfluxDB, a database optimized to store and query time series data.

I care about time series for a variety of reasons. With the adoption of agile methodologies, cloud, and DevOps, we are now able to develop and release new features faster. This is great, but it means that now more than ever we need to understand how our application behaves and how a specific code change impacts our systems (negatively or positively).

We also have the capability to use machine learning and other algorithms to predict and understand how a system will perform based on historical data. Because of this, collecting and using time series data is a good challenge for me.

InfluxDB is a database written in Go to help you work specifically with time series data. A time series is a set of points that each contain a timestamp. In InfluxDB, timestamps can be stored with up to nanosecond precision. In practice, we might see something like this (timestamps shown here in seconds):

cpu_usage value=49 1502043216
cpu_usage value=50 1502193042
cpu_usage value=5 1502196258

So why do we need a specific storage for time series? Why can't we use a traditional database like MySQL, Cassandra, MongoDB, or Elasticsearch? To answer these questions, you should consider your use case. There are various benchmarks comparing InfluxDB with other databases, and for the time series workloads they measure, InfluxDB generally comes out ahead.

But it isn’t only about performance; time series is a specific domain, and InfluxDB as a time series database provides different capabilities to work with time. This is probably the most important reason to use InfluxDB.

InfluxDB offers a powerful engine and two entry points to interact with it. It supports an HTTP API that runs by default on port 8086 with reading and writing capabilities. It also supports UDP as a write-only protocol.

The Data Model

The data model is the structure of the data manipulated and managed by InfluxDB. You can see a measurement as a table that contains a set of points that are usually under the same domain. Every point is labeled with tags and fields.

We call a set of tags the tagset. The main difference between tags and fields is the index: tags are indexed, and fields are not. Indexing tags allows you to make optimized queries with fewer resources consumed. Both tags and fields are key-value pairs, but tag values are always strings, whereas field values can be floats, integers, strings, or booleans.

Every point has a time; we call it a timestamp. We use a protocol called line protocol to describe this data model:

measurement,tag=value,tag1=value1 field=value,field1=value1 timestamp

In practice, a point looks like this:

h2o_feet,location=coyote_creek water_level=8.120,level\ description="between 6 and 9 feet" 1439856000

The timestamp is optional; if you omit it, InfluxDB assigns the server's current time to the point.
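To make the format more concrete, here is a minimal Python sketch that assembles a line protocol string from a measurement, tags, fields, and an optional timestamp. The helper names (`to_line`, `_escape`, `_format_field_value`) are hypothetical and not part of any InfluxDB client; the escaping covers only the common cases (commas, spaces, equals signs, and quotes in string fields).

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` found in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field_value(value):
    """Render a Python value as a line protocol field value."""
    if isinstance(value, bool):       # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"            # integers carry a trailing "i"
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp=None):
    """Build one line: measurement,tags fields [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(value, ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field_value(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp is not None:         # omitted: the server assigns the time
        line += " " + str(timestamp)
    return line


print(to_line(
    "h2o_feet",
    {"location": "coyote_creek"},
    {"water_level": 8.12, "level description": "between 6 and 9 feet"},
    1439856000,
))
```

Note that the space in the field key `level description` comes out escaped as `level\ description`, just as in the example point above.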

Understanding the InfluxDB model is important in order to design a fast structure around your dataset. A combination of measurement + tagset is called a series. To identify a specific point, the right combination is measurement + tagset + timestamp.

To keep cardinality low and to increase the performance of your InfluxDB instance, you should keep the number of series as low as possible.
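As a rough illustration of how tags drive the number of series, the Python sketch below counts distinct measurement + tagset combinations in a handful of points. The function names and the `request_id` tag are made up for the example; the point is only that a tag with many distinct values multiplies the series count.

```python
def series_key(measurement, tags):
    """A series is identified by the measurement plus its full tagset."""
    return (measurement, tuple(sorted(tags.items())))


def series_cardinality(points):
    """Count distinct series in an iterable of (measurement, tags) pairs."""
    return len({series_key(measurement, tags) for measurement, tags in points})


points = [
    ("cpu_usage", {"host": "server01", "region": "us-west"}),
    # Same series as above: tag order does not matter
    ("cpu_usage", {"region": "us-west", "host": "server01"}),
    ("cpu_usage", {"host": "server02", "region": "us-west"}),
    # A high-variability tag creates a new series for every value
    ("cpu_usage", {"host": "server01", "region": "us-west", "request_id": "a1"}),
    ("cpu_usage", {"host": "server01", "region": "us-west", "request_id": "b2"}),
]

print(series_cardinality(points))
```

Values that change on almost every point, such as the hypothetical `request_id` here, are usually better stored as fields than as tags.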

TCP Versus UDP

You know the difference between TCP and UDP, so why does InfluxDB support both? How do you choose the right one for your use case?

To answer this, think about the key difference between the two protocols: UDP doesn't guarantee delivery. Applied to InfluxDB, this means the client doesn't know whether the points were stored successfully. If you are storing critical data, you probably need to be 100 percent sure that all the points are there, so the HTTP API (which runs over TCP) is the right choice. On the other hand, if it's not important to have every point stored, you can use UDP. If, for example, you're tracking CPU usage, you can probably afford to miss a few points as long as the overall behavior remains readable and clear.

UDP is also useful when you don't want a problem with your InfluxDB server to affect your application. In this case, your application will continue to run because UDP is not going to send a failure back to it. You may lose all the points, but if you are happy to have the monitoring down and the application still working, it can be a useful design.
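The fire-and-forget behavior described above can be sketched in a few lines of Python using only the standard library. This assumes the UDP listener has been enabled in the InfluxDB configuration and is bound to port 8089; both the port and the `send_udp` helper name are assumptions to check against your own setup.

```python
import socket


def send_udp(line, host="127.0.0.1", port=8089):
    """Send one line protocol point over UDP and return the bytes sent.

    No reply is read: the caller never learns whether InfluxDB
    actually received or stored the point.
    """
    payload = line.encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Does not wait for any acknowledgement from the server
    send_udp("cpu_usage value=49")
```

Because nothing is read back, a slow or unreachable InfluxDB server does not block the application, which is exactly the trade-off discussed above.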

UDP is also faster and uses fewer resources than TCP. To give you an idea, we ran a benchmark with the php-sdk developed by Corley SRL:

Corley\Benchmarks\InfluxDB\AdapterEvent
    Method Name                Iterations    Average Time      Ops/second
    ------------------------  ------------  --------------    -------------
    sendDataUsingHttpAdapter: [1,000     ] [0.0026700308323] [374.52751]
    sendDataUsingUdpAdapter : [1,000     ] [0.0000436344147] [22,917.69026]

Influx CLI

InfluxDB ships with a CLI called influx; if you install InfluxDB via apt, yum, or Docker, you will find it on your system. The CLI is the default entry point. You can do everything from there: insert points, query, and manage database access. It uses the HTTP API to communicate with InfluxDB.

Query Engine

InfluxDB uses a SQL-like query language. It’s a bit controversial, and there are a lot of internal conversations about where to take this in the next major release.

The benefit of this query language is the onboarding process. It’s very simple since a lot of people know SQL, but for complex queries, sometimes it looks too complicated and hard to manipulate. That’s why the InfluxDB team is thinking about a different solution. I hope to have something to share with you about this new idea soon!

SELECT * FROM measurement WHERE time > now() - 1h LIMIT 10000
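Since queries ultimately travel over the HTTP API, a query like the one above can be turned into a request URL for the `/query` endpoint. The Python sketch below only builds the URL; the `build_query_url` helper, the database name `mydb`, and the host are illustrative assumptions, and actually sending the request requires a running InfluxDB instance.

```python
from urllib.parse import urlencode


def build_query_url(db, query, base_url="http://localhost:8086", epoch=None):
    """Build a GET URL for the /query endpoint of the HTTP API."""
    params = {"db": db, "q": query}
    if epoch is not None:
        # Ask for timestamps as Unix epochs in the given unit, e.g. "s"
        params["epoch"] = epoch
    return f"{base_url}/query?{urlencode(params)}"


url = build_query_url(
    "mydb",
    "SELECT * FROM measurement WHERE time > now() - 1h LIMIT 10000",
)
print(url)
```

The query text is percent-encoded by `urlencode`, so spaces and characters like `*` and `>` are safe to pass as they appear in the statement.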

Retention Policy

The number of series and points stored is rapidly growing, which is the nature of monitoring, where you are continuously collecting and storing data. At some point, read performance will become a problem.

On the other hand, you may not need to keep all your data in InfluxDB forever, and for that there is a feature called retention policy. By default, data is kept forever, but you can change that. If you set the retention policy to two weeks, all points stored under that policy will be removed once they are older than two weeks.

This helps you to automatically keep your InfluxDB clean and performing fast. You can have multiple retention policies working in the same database for various series. You just need to specify the one you want when writing:

curl -i -XPOST 'http://localhost:8086/write?db=mydb&rp=myretention' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'

With the rp query parameter, you can override the default retention policy.
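To tie the two steps together, the Python sketch below builds the InfluxQL statement that would create a two-week retention policy, plus the write URL that targets it through the write endpoint's `rp` query parameter. The helper names are hypothetical, the database and policy names are taken from the curl example, and nothing is sent to a server here.

```python
from urllib.parse import urlencode


def create_retention_policy_statement(name, db, duration, replication=1, default=False):
    """Return an InfluxQL CREATE RETENTION POLICY statement as a string."""
    statement = (
        f'CREATE RETENTION POLICY "{name}" ON "{db}" '
        f"DURATION {duration} REPLICATION {replication}"
    )
    if default:
        statement += " DEFAULT"   # make it the database's default policy
    return statement


def build_write_url(db, retention_policy=None, precision=None,
                    base_url="http://localhost:8086"):
    """Build the /write URL, optionally targeting a retention policy."""
    params = {"db": db}
    if retention_policy is not None:
        params["rp"] = retention_policy
    if precision is not None:
        params["precision"] = precision   # e.g. "s" for second timestamps
    return f"{base_url}/write?{urlencode(params)}"


print(create_retention_policy_statement("myretention", "mydb", "2w"))
print(build_write_url("mydb", retention_policy="myretention"))
```

The statement can be run from the influx CLI; points written to the resulting URL are stored under `myretention` and expire according to its duration.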

Published on Java Code Geeks with permission by Gianluca Arbezzano, partner at our JCG program. See the original article here: A Deep Dive into InfluxDB

Opinions expressed by Java Code Geeks contributors are their own.