In this article we will talk about parsing log files: how it is different from parsing your average programming language and how to accomplish it.
Parsing log files is a common need. Big companies like Microsoft release tools just for that purpose and there are even entire companies built around the task of parsing and analyzing log files. So if this is a topic of such large interest, why have we never talked about it?
Part of the reason is right in what we just wrote. There are already tools to do it at scale for the people that need log parsing: system administrators. Another reason is that log parsing is one of the few tasks in which regular expressions can actually work.
So, for one-off small tasks you can get away without professional parsing libraries and for real-life company needs you can just use ready-to-use services.
That is not always the case, though, nor is it always the best approach. Sometimes you need to parse custom log files created for ad-hoc purposes, like tracking statistics on the behavior of a program. In such cases, you need to write a bit of code that will be a crucial part of your system, so a bunch of regular expressions might not be a good idea. That is where this article comes in.
You Can Log Any Way You Want
The first issue in parsing a log file is that there are many log file formats. Sure, there is the syslog standard for applications that create logs for system administration, but there are many other use cases for logs. Furthermore, there are slight differences in the way people log timestamps, for instance in how they record timezone information.
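To make the timestamp point concrete, here is a small Python sketch using only the standard library. The two timestamp strings are made-up values for illustration, not taken from any real log.

```python
from datetime import datetime, timezone

# Two timestamp styles that might appear in different logs (illustrative values).
with_offset = datetime.strptime("2023-01-20 11:21:31 +0100", "%Y-%m-%d %H:%M:%S %z")
without_offset = datetime.strptime("2023-01-20 11:21:31", "%Y-%m-%d %H:%M:%S")

# The first value is timezone-aware, so it can be converted to UTC unambiguously.
as_utc = with_offset.astimezone(timezone.utc)

# The second value is naive: the log does not say which timezone it refers to.
print(as_utc.isoformat())
print(without_offset.tzinfo)
```

The two values describe the same wall-clock time, but only the first can be compared safely with timestamps coming from other machines.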
If you think about it, logging just means to record information about notable events. An application or library can choose whatever format might work best to convey information to a user.
Ultimately, logging is meant both for human consumption and computer handling. So it can be hard to define a stringent format.
Nobody expects a system administrator to read a million lines of log information directly, so some computer handling is expected. However, a person should be able to interpret any format you choose. That is because, in the end, the one that will make sense of the log information is a person.
That Looks Like Free Text
The second issue with parsing log files descends from this last fact. Since logging is meant for humans, log files often do not have any meaningful structure. It is not exactly free text, but it is not a language or a data format.
For example, a typical Apache error log is full of lines like this.
[Thu Mar 23 19:04:13 2024] [error] [client 192.168.0.1] File does not exist: /var/www/secret.txt
Structurally, it is just a series of sections between square brackets followed by some free text.
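As a rough illustration in Python, the line above can be split into its bracketed sections and the trailing message. The helper name `split_bracketed` is our own invention, and the sketch only handles this one line shape; it is not a general Apache log parser.

```python
import re

# Matches one "[...]" section, plus any whitespace after it.
BRACKETED = re.compile(r"\[([^\]]*)\]\s*")

def split_bracketed(line):
    """Return (sections, message): the leading [..] sections and the text after them."""
    sections = []
    pos = 0
    while True:
        match = BRACKETED.match(line, pos)
        if match is None:
            break
        sections.append(match.group(1))
        pos = match.end()
    return sections, line[pos:]

line = "[Thu Mar 23 19:04:13 2024] [error] [client 192.168.0.1] File does not exist: /var/www/secret.txt"
sections, message = split_bracketed(line)
print(sections)
print(message)
```

The sections come out cleanly, but the message is still an opaque string: nothing in the structure says that it contains a file path.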
You could find log files in JSON or XML formats, but even in that case there is no guarantee that the message itself will follow any structure. Each message is usually logged in its own line or record, so there is that. But there is often little more than that.
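A short Python sketch shows the limit of a structured envelope. The record and its field names (`timestamp`, `level`, `message`) are assumptions made up for this example.

```python
import json

# Hypothetical JSON log record; the field names are assumptions for illustration.
record_text = (
    '{"timestamp": "2023-01-20T11:21:31", "level": "error", '
    '"message": "File does not exist: /var/www/secret.txt"}'
)

record = json.loads(record_text)

# The envelope is structured, so this lookup is reliable.
level = record["level"]

# The message is still free text: extracting the path needs custom logic.
prefix = "File does not exist: "
message = record["message"]
missing_path = message[len(prefix):] if message.startswith(prefix) else None

print(level, missing_path)
```

The JSON parser does all the work for the outer record, and none of it for the message.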
This makes a log file a bad fit for parser tools and libraries, because parsing means analyzing the syntax, that is, the structure of some text. It is not well suited to cases where you need to understand semantic information.
Let’s see a brief example. You would probably say that the text that you are reading has some organization. If it was just random words or even a stream of consciousness without punctuation, it would be much harder to read. However, this organization is not in the structure of the text itself, it derives from the meaning of its individual components.
One word follows another without following any strict rule. You make sense of the words because they follow English rules, like having a subject, a verb, etc. In other words, if you did not know English, you could not tell whether a series of words was actually meaningful content or not.
Compare that situation with a programming language: you might not know Ruby, but you can still probably recognize if some piece of text is Ruby code or a random series of words. How is that possible? It happens because a programming language has a meaningful structure.
A Semi-structured Custom Log Format
Sometimes you see custom log files, which are a cross between a log file and a data format. For example, a log file that records API calls. It could be argued that this is probably a bad fit for a log file, since when there is a need to precisely record information for automatic data handling you probably need a data format. However, it is hard to draw a line for such a broad concept, so this still happens.
We are thinking about something like this.
Time: 2023-01-20 11:21:31
Name: Function_1
Arguments:
[1/2] 5
[2/2] text
Software: Magic Library v1.2.3
Device: Main Server
Now, this kind of log file does have a bit more structure. The problem is that it is still usually not a good fit for a parsing library, because the structure is often context-sensitive. For example, the number of lines containing arguments varies with the number of arguments the function has.
For Once ANTLR Is Not A Good Option
A modern parsing tool like ANTLR can handle context-sensitive input, but it is not a good fit: it would require some complications in the grammar and it is probably overkill.
So, for once we might say that ANTLR is not the solution to your problems. Does this mean that you are forced to use regular expressions?
Well, no. A pile of regular expressions still makes for brittle, unmaintainable code. The problem is not the power of regular expressions, but that they do not give your code any organization. Actually, I will go on record saying that even a single regular expression can be hard to maintain because of readability issues. If you disagree, I suggest you read the regex to parse an email address.
A tool like ANTLR requires you to write a grammar describing the language in its own format. Parser combinator libraries instead allow you to create a parser directly with code in your favorite language. You build a parser by combining different pattern-matching functions that are equivalent to grammar rules.
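To show the idea without depending on any particular library, here is a minimal hand-rolled sketch in Python. The helpers `regex`, `sequence` and `either` are hypothetical names written for this example; real parser combinator libraries offer richer versions of the same building blocks. Each parser is a function that takes the text and a position, and returns a value with the new position, or None on failure.

```python
import re

def regex(pattern):
    """Parser that matches a regular expression at the current position."""
    compiled = re.compile(pattern)
    def parse(text, pos):
        match = compiled.match(text, pos)
        if match is None:
            return None
        return match.group(0), match.end()
    return parse

def sequence(*parsers):
    """Parser that runs parsers one after another and collects their values."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def either(*parsers):
    """Parser that returns the result of the first alternative that succeeds."""
    def parse(text, pos):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result
        return None
    return parse

number = regex(r"\d+")
word = regex(r"[A-Za-z_]\w*")
# Plays the role of a grammar rule such as: field : word ':' (number | word)
field = sequence(word, regex(r":\s*"), either(number, word))

print(field("Name: Function_1", 0))
```

Because `field` is an ordinary value built from ordinary functions, you can name it, test it and reuse it like any other piece of code.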
They are integrated in your normal workflow and take full advantage of your IDE’s functionality. They are a good fit for this use case because they work well with context-sensitive parsing. It is quite easy to pick a parser combinator depending on the input you are reading. For example, imagine you are trying to parse this piece of our previous custom log example.
Arguments:
[1/2] 5
[2/2] text
Software: Magic Library v1.2.3
It is very easy to read the first line with an argument and then decide whether to parse the next line as an argument or as the Software line depending on the numbers between the square brackets.
With Sprache, you could create something like this parsing rule.
public static Parser<Argument> Argument =
    from start in Parse.Char('[')
    from index in Parse.Digit.AtLeastOnce().Text()
    from separator in Parse.Char('/')
    from total in Parse.Digit.AtLeastOnce().Text()
    from end in Parse.Char(']')
    from value in Parse.LetterOrDigit.Or(Parse.WhiteSpace).Many().Text()
    select new Argument(int.Parse(index), int.Parse(total), value);
Then, you would just need to compare the index of the argument you just read with the value of Argument.Total to decide whether to parse the next line as an argument or as the Software line.
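The same decision can be sketched in plain Python, mirroring the Sprache rule rather than using Sprache itself. The `Argument` class and the functions `parse_argument` and `parse_arguments` are hypothetical names for this example, and the sketch assumes the input has already been split into lines and that every call has at least one argument.

```python
import re
from dataclasses import dataclass

@dataclass
class Argument:
    index: int
    total: int
    value: str

# Mirrors the rule above: "[index/total] value" on a single line.
ARGUMENT = re.compile(r"\[(\d+)/(\d+)\]\s*(.*)")

def parse_argument(line):
    match = ARGUMENT.fullmatch(line)
    if match is None:
        raise ValueError(f"not an argument line: {line!r}")
    return Argument(int(match.group(1)), int(match.group(2)), match.group(3))

def parse_arguments(lines, start):
    """Read argument lines from lines[start:]; return (arguments, next_line_index)."""
    arguments = []
    pos = start
    while True:
        argument = parse_argument(lines[pos])
        arguments.append(argument)
        pos += 1
        # Context-sensitive step: the entry itself says whether more follow.
        if argument.index == argument.total:
            return arguments, pos

lines = ["Arguments:", "[1/2] 5", "[2/2] text", "Software: Magic Library v1.2.3"]
arguments, next_index = parse_arguments(lines, 1)
print(arguments)
print(lines[next_index])
```

Once the last argument has been read, the caller knows that the next line is the Software line and can hand it to a different rule.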
In this article, we have analyzed the issues in parsing log files. On one hand, they are too simple to make good use of parsing tools, so you get the overhead without the advantages. On the other hand, there are often already powerful services that can handle log files and support the needs of system administrators.
You might need to build your own parser for log files when you have a custom format for an application and need to extract some data from it. We have seen how this can be a great use case for parser combinators. Using one of them you can probably parse any log file with readable and efficient rules.