This post describes one important problem I ran into while implementing the IDE approach.
When working with logs, you usually need to investigate only one time interval. The available logs often span days, but the interval that must be investigated is one or two hours. The task is to select all log records within that time interval.
Basic Log Record Regex
In order to select a log record we need a regular expression that matches any log record. For the simple log4j format like
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
I found the following regex:
This regular expression matches both single-line and multiline log records. The time regex could be
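The original expressions are not reproduced in this text, so here is an illustrative sketch of the idea in Java. The `TIME` and `RECORD` patterns below are my assumptions, not necessarily the exact regexes from the post: a record starts with a timestamp and runs lazily up to the next timestamp or the end of input, with `(?s)` (DOTALL) making `.` cross line breaks so multiline records are covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRecordRegex {
    // Timestamp pattern for the log4j-style format shown above.
    static final String TIME = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";
    // A record: a timestamp, then everything (lazily) up to the next
    // timestamp or the end of the input.
    static final Pattern RECORD =
        Pattern.compile("(?s)" + TIME + ".*?(?=" + TIME + "|\\z)");

    public static void main(String[] args) {
        String logs =
            "2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message\n"
          + "with a second line of the same record\n"
          + "2018-08-10 11:00:56,278 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2\n";
        Matcher m = RECORD.matcher(logs);
        while (m.find()) {
            System.out.println("RECORD: " + m.group().trim());
        }
    }
}
```

Running this prints two records: the first one keeps its continuation line, showing that the multiline case is handled.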
So someone who wanted to load all the logs into a text window could open the log files one by one and use Matcher.find() with this regex to get all log records.
This regex relies on the time pattern never being repeated in the body of a log message, which is true in 99% of cases.
Datetime of the Log Record
In order to search within a specific time interval and to support other features, it makes sense to extract the datetime information from the log record. Thankfully, this task has been solved in the JDK by DateTimeFormatter: it is enough to specify the format for the log type and the date can be extracted. For example, for the log record above the format is yyyy-MM-dd HH:mm:ss,SSS.
As soon as we can extract the datetime information we can specify the interval as datetime values not Strings in some specific format.
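A minimal sketch of this step, with interval bounds made up for illustration:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogTime {
    // Pattern for the example record's timestamp; the comma before the
    // milliseconds is a literal character in the pattern.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.parse("2018-08-10 11:00:56,234", FORMAT);
        // The interval is now a pair of datetime values, not strings.
        LocalDateTime from = LocalDateTime.of(2018, 8, 10, 10, 0);
        LocalDateTime to   = LocalDateTime.of(2018, 8, 10, 12, 0);
        boolean inInterval = !t.isBefore(from) && !t.isAfter(to);
        System.out.println(t + " in interval: " + inInterval);
        // prints: 2018-08-10T11:00:56.234 in interval: true
    }
}
```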
Now that we have found a way to select any log record and extract the date information from it, the path forward seems clear:
- specify the interval,
- select the records one by one,
- extract the date information from each log record,
- compare the datetime with the interval,
- if the datetime is within the interval, add the record to the list of found records,
- after searching through all files, show the found records.
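The steps above can be sketched as follows, under the same assumed regex and date format as before (file iteration and display are omitted):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IntervalScan {
    static final String TIME = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";
    static final Pattern RECORD =
        Pattern.compile("(?s)" + TIME + ".*?(?=" + TIME + "|\\z)");
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Select every record whose timestamp falls within [from, to].
    static List<String> recordsInInterval(String text,
                                          LocalDateTime from, LocalDateTime to) {
        List<String> found = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            String record = m.group();
            // The timestamp occupies the first 23 characters of a record.
            LocalDateTime t = LocalDateTime.parse(record.substring(0, 23), FORMAT);
            if (!t.isBefore(from) && !t.isAfter(to)) {
                found.add(record);
            }
        }
        return found;
    }
}
```

This is the slow baseline: it parses every record in every file, which is exactly the cost the rest of the post sets out to avoid.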
There is one big issue with this approach: time. With 50 log files of 50 MB each, it will take hours to scan them all just to find the 10 MB of records in the interval.
We can use one trick to filter out the files that do not contain a single record in the interval. It relies on the fact that records are written to a log file one after another, so the time of the next record is equal to or later than the time of the current one. Only two situations are possible:
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2

2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,278 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2
In rare cases I have seen log records go in reverse order under high load, but the difference is a matter of milliseconds, which we can consider insignificant for our purpose.
This means that a file covers the time range from its first record to its last record: if that range does not overlap the search interval, no record in the file can be in the interval, and the file can be filtered out. Java regular expressions have special constructs to find the first and the last records.
The first record:
The last record:
\A means the beginning of the text, \Z means the end of the text. You can find more details in the javadoc for java.util.regex.Pattern.
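Since the original expressions are not shown here, the following sketch illustrates how such anchored patterns might be built from the time regex used earlier; the exact expressions are my assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreScanPatterns {
    static final String TIME = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";
    // First record: \A anchors the match at the very beginning of the text
    // (assuming the file starts with a timestamp).
    static final Pattern FIRST =
        Pattern.compile("(?s)\\A" + TIME + ".*?(?=" + TIME + "|\\z)");
    // Last record: a timestamp followed only by text containing no further
    // timestamp, running to the end of the input (\Z).
    static final Pattern LAST =
        Pattern.compile("(?s)" + TIME + "(?:(?!" + TIME + ").)*\\Z");

    public static void main(String[] args) {
        String logs = "2018-08-10 11:00:56,234 DEBUG first\n"
                    + "2018-08-10 11:59:59,999 DEBUG last\n";
        Matcher first = FIRST.matcher(logs);
        Matcher last = LAST.matcher(logs);
        if (first.find() && last.find()) {
            System.out.println("first: " + first.group().trim());
            System.out.println("last:  " + last.group().trim());
        }
    }
}
```

Note that the LAST pattern has to test the negative lookahead at every position between the final timestamp and the end of the input, which hints at why finding the last record is the expensive part.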
The solution is a prescanning technique: before scanning the whole text of a log file, find its first and last records, and if the time range they span does not overlap the interval, skip the file. Of the 50 files, maybe only 1-2 need to be scanned in full.
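The skip decision itself reduces to an interval-overlap check on the timestamps of the first and last records (a sketch; the method name is mine):

```java
import java.time.LocalDateTime;

public class PreScanDecision {
    // A file whose records span [first, last] can be skipped when that range
    // does not overlap the requested interval [from, to].
    static boolean canSkip(LocalDateTime first, LocalDateTime last,
                           LocalDateTime from, LocalDateTime to) {
        return last.isBefore(from) || first.isAfter(to);
    }

    public static void main(String[] args) {
        LocalDateTime first = LocalDateTime.of(2018, 8, 10, 9, 0);
        LocalDateTime last  = LocalDateTime.of(2018, 8, 10, 10, 30);
        LocalDateTime from  = LocalDateTime.of(2018, 8, 10, 11, 0);
        LocalDateTime to    = LocalDateTime.of(2018, 8, 10, 13, 0);
        // The file ends before the interval starts, so it can be skipped.
        System.out.println(canSkip(first, last, from, to)); // prints: true
    }
}
```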
REAL uses this technique to speed up searches for a datetime interval. I found that it takes approximately 5-10 seconds to decide whether a file must be skipped. Most of the time is spent executing Matcher.find() for the last record; the first record is found much faster. I think it could be sped up even further by searching only the last 5 MB of a 50 MB file for the last record, but even in its current state it is fast enough.