Python on the Job: Condensing Large Text Files

Python is an excellent tool for processing large text files. Say, for example, you have a software debugging log with about one million lines. Nearly all of the log entries, however, are identical except for the date and time stamp, and are normal info messages or harmless errors. Here is a very short excerpt:

MAR 3 2016 4:30:25 PM:
INFO: Application Started OK

APR 5 2016 9:00:15 AM:
INFO: Application Update

APR 5 2016 9:10:46 AM:
INFO: Application Started OK

JUL 30 2016 4:15:30 PM:
ERROR: User denied access

JUL 30 2016 4:15:40 PM:
INFO: Application Started OK

NOV 16 2016 12:00:00 AM:
ERROR: User data is corrupted!

Suppose you are looking through this log for something “not ordinary”, but are not sure exactly what to look for. It would be incredibly helpful to generate a copy of this log file, just like the original, but without all the repetitive info or non-critical error messages. This would be like finding a needle in a haystack by burning all the hay in order to search through a much smaller pile of ashes.

We can write a Python program to create a copy of this large log file, but skip any line which contains “INFO”, as well as the line above (the date and time for that INFO entry) and the line below (which in our excerpt is just an empty line). We will do the same thing with any line that contains “denied access”, since those kinds of errors are often caused by someone momentarily forgetting to log in with the correct username (notice the successful application launch just ten seconds later?).

This is easy to do in Python:

# Read the whole log into memory, one list item per line.
with open("log.txt", "r") as err_log:
    log_list = err_log.read().split("\n")

for i in range(len(log_list)):
    if ("INFO" in log_list[i]) or ("denied access" in log_list[i]):
        # Blank out the match, the line above it (time stamp) and the line below.
        for k in range(max(i - 1, 0), min(i + 2, len(log_list))):
            log_list[k] = " "

# Write out only the lines that were not blanked and are not empty.
with open("condensed.txt", "w") as output_file:
    for line in log_list:
        if line not in (" ", ""):
            output_file.write(line + "\n")

The above code opens the original log file (“log.txt”), reads its entire contents into memory, and splits them into a list in which each line of the log file is one item. The program then checks each item in this list for the word “INFO” or the phrase “denied access”. If there is a match, that list item, as well as the previous and next items, are each replaced with a single space (“ ”). After the program has checked every item in the list (i.e. every line in the log file), it goes through the list a second time and writes to our output file (“condensed.txt”) only the items which are not a single space. Lines which are completely empty are also omitted from the output file, so it contains only the information we are interested in.

Here is the result of running the above program on our example log file:

NOV 16 2016 12:00:00 AM:
ERROR: User data is corrupted!
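
The program above holds the whole log in memory, which is usually fine for a million short lines. For much larger files, the same idea can be sketched so that only one log entry is held at a time. This is only an illustrative variant, not the program above: the names condense and SKIP_MARKERS are made up for this example, and it assumes every entry is a run of non-empty lines (time stamp plus message) separated from the next entry by an empty line, as in the excerpt.

```python
import io

# Phrases that mark an entry as uninteresting (same ones filtered above).
SKIP_MARKERS = ("INFO", "denied access")


def is_interesting(entry):
    """Return True if no line of the entry contains a skip marker."""
    return not any(marker in line for line in entry for marker in SKIP_MARKERS)


def condense(lines):
    """Yield the lines of every entry worth keeping, one entry at a time.

    Assumption: entries are separated by empty lines.
    """
    entry = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            entry.append(line)
            continue
        # An empty line closes the current entry.
        if entry and is_interesting(entry):
            yield from entry
        entry = []
    # The final entry may not be followed by an empty line.
    if entry and is_interesting(entry):
        yield from entry


# Small in-memory demonstration; a real run would pass an open file instead.
sample = io.StringIO(
    "MAR 3 2016 4:30:25 PM:\n"
    "INFO: Application Started OK\n"
    "\n"
    "NOV 16 2016 12:00:00 AM:\n"
    "ERROR: User data is corrupted!\n"
)
for kept in condense(sample):
    print(kept)
```

Because a file object opened with open() can be iterated line by line, passing it to condense in place of the sample would process the log without loading it all at once. On the sample shown, only the two lines of the NOV 16 entry are printed.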

This much smaller copy of the log file can then be reviewed manually or even processed by another Python program. If the log file contains recorded measurements or other important numbers, however, it will probably be imported into a spreadsheet program such as Microsoft Excel or LibreOffice Calc for further analysis or graphing of the data. Both programs have powerful features for importing data from a text file so that the data can be easily plotted on a graph.
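
As a rough illustration of preparing such a file for a spreadsheet, the sketch below turns two-line log entries into comma-separated rows using Python's standard csv module. The function names entries_to_rows and write_csv and the column names are assumptions made for this example, and the code assumes the entry layout from the excerpt: a time stamp line ending in a colon, followed by a “LEVEL: message” line.

```python
import csv
import io


def entries_to_rows(lines):
    """Pair each time stamp line with the message line that follows it.

    Assumption: every entry is exactly two non-empty lines.
    """
    rows = []
    stamp = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip the empty lines between entries
        if stamp is None:
            stamp = line.rstrip(":")  # drop the trailing colon
        else:
            level, _, message = line.partition(": ")
            rows.append([stamp, level, message])
            stamp = None
    return rows


def write_csv(rows, out):
    """Write a header row and the data rows to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "level", "message"])
    writer.writerows(rows)


# In-memory demonstration using the condensed result from above.
condensed = [
    "NOV 16 2016 12:00:00 AM:\n",
    "ERROR: User data is corrupted!\n",
]
buffer = io.StringIO()
write_csv(entries_to_rows(condensed), buffer)
print(buffer.getvalue())
```

When writing to a real file rather than an in-memory buffer, the csv module documentation recommends opening the file with newline="" so that line endings are not translated twice.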

Mastery of Excel, however, requires memorizing where to click and which menu options to select in order to unlock the most useful features. Advanced spreadsheet skills are very valuable to a company, but not everyone who may analyze the data in text files will know where to click to get the answer he or she needs. It is more common to find programming skills among software developers and electrical engineers who write firmware, for example. Also, those with programming skills in one part of an organization can greatly help spreadsheet users in another part of the organization by reducing the size of the data to import, as we saw above with our example log file.

Python by itself is a powerful tool when it comes to processing large text files, but its greatest strength may be its ability to work together with other software tools.

Copyright © Python People