
How is it possible to join multiple lines of a log file into one dataframe row?

Example log file with four entries (some spanning several physical lines):

[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE  ((                            mr.cap_em >
 0 AND             mr.cap_em > 5
 ))  ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
    version = 2.0
    filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
    sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]

(See representative 8-line extract at https://pastebin.com/bsmWWCgw.)

The structure is clean:

[PRIOR][datetime][ClassName] - [Msg]

but the message is often multi-line, may itself contain additional brackets (even trailing ones), and sometimes contains ^M (carriage-return) line endings, though not always. That makes it difficult to parse, and I don't know where to begin.

So, in order to process such a file, and be able to read it with something like:

#!/usr/bin/env Rscript

df <- read.table('D:/logfile.log')

we really need to have that merge of lines happening first. How is that doable?

The goal is to load the whole log file for making graphics, analysis (grepping out stuff), and eventually writing it back into a file, so -- if possible -- newlines should be kept in order to respect the original formatting.

The expected dataframe would look like:

PRIOR   Datetime              ClassName             Msg
-----   -------------------   -------------------   ----------
WARN    2016-12-16 13:43:10   ConfigManagerLoader   Low max...
DEBUG   2016-05-26 10:10:22   DataSourceImpl        SELECT ...

And, ideally, this should be doable directly in R, so that we could even process a live log file (opened in write mode by the server app), à la `tail -f`.
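For what it's worth, here is a naive, untested sketch of the line-merging step in base R. The prefix regex is just a guess at the `[LEVEL][datetime]` format shown above, and the sample lines are abbreviated stand-ins for the real file:

```r
# Sketch: every physical line that does not start a new entry (no
# "[LEVEL][yyyy-mm-dd" prefix) is glued onto the previous entry,
# keeping its newline so the original formatting survives.
lines <- c(
  "[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory...]",
  "[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id FROM mr WHERE mr.cap_em >",
  " 0]",
  "[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean...]"
)

starts  <- grepl("^\\[[A-Z ]+\\]\\[\\d{4}-\\d{2}-\\d{2}", lines)
entries <- unname(tapply(lines, cumsum(starts), paste, collapse = "\n"))

length(entries)  # 3 merged entries from 4 physical lines
```

Each merged entry could then be split into the four columns in a second pass.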

user3341592
  • Based on what you've shown here, I think you'll have to first write a parser that 'demultilines' messages. Fitting stuff into a dataframe after that should be pretty easy. To "demultiline", you could perhaps take advantage of `[LEVEL]` and remove newlines after this tag, stopping just shy of the next tag. – Roman Luštrik Mar 15 '19 at 13:42
  • @Roman, how should that be done? In R? Losing newlines? – user3341592 Mar 15 '19 at 14:19

1 Answer


This is a pretty wicked regex problem. I'd recommend using the stringr package, but you could do all of this with base-R grep-style functions.

library(stringr)

str <- c(
  '[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
  [DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE  ((                            mr.cap_em >
   0 AND             mr.cap_em > 5
   ))  ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
  [ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
  workflow rule = BenCommonResources-getDataRecords
      version = 2.0
      filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
      sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
  Result code = ruleFailed
  Result message = Database error while processing request.
  Result details = null
  ]]'
)

Using regex we can split out each entry by checking for the pattern you mentioned. This regex checks for a [, followed by any characters (including line feeds and carriage returns), followed by a ], but in a lazy (non-greedy) way, using *?. Repeat that three times, then check for the - separator. Finally, check for a [, followed by any characters or a bracketed group, then a ]. That's a mouthful, so type it into a regex tester to see it at work. Just remember to adjust the backslashes (a regex tester uses \ where R string literals need \\).

# Split the text into each line without using \n or \r.
# pattern for each line is a lazy (non-greedy) [][][] - []
linesplit <- str %>%
  # str_remove_all("\n") %>%
  # str_extract_all('\\[(.|\\n|\\r)+\\]')
  str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
  unlist()

linesplit # Run this to view what happened

Now that we have each line separated, break them into columns. But we don't want to keep the [ or ], so we use a positive lookbehind and a positive lookahead in the regex to check that they are there without capturing them. Oh, and capture everything between them, of course.

# Split each line into columns
colsplit <- linesplit %>% 
  str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")

colsplit # Run this to view what happened

Now we have a list with one object per line. Each object holds four items, one per column. We need to convert those four items to a dataframe and then join the dataframes together.

# Convert each line to a dataframe, then join the dataframes together
df <- lapply(colsplit,
  function(x){
    data.frame(
      PRIOR = x[1],
      Datetime = x[2],
      ClassName = x[3],
      Msg = x[4],
      stringsAsFactors = FALSE
    )
    }
  ) %>%
  do.call(rbind,.)

df
#   PRIOR                Datetime           ClassName             Msg
# 1 WARN  2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
# 2 DEBUG 2016-05-26 10:10:22,185      DataSourceImpl SELECT mr.lb_id
# 3 ERROR 2016-12-21 13:51:04,710  DWRWorkflowService Update Wizard -

# Note: there are extra spaces that probably should be trimmed,
# and the dates are slightly messed up. I'll leave those for the
# questioner to fix using a mutate and the string functions.

I will leave it to you to fix the extra spaces and the date field.
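For instance, the trimming and the timestamp could be cleaned up along these lines. This is a base-R sketch operating on a hypothetical one-row `df` like the one produced above; note that the comma before the milliseconds has to become a decimal point for `%OS` to parse it:

```r
# Hypothetical one-row dataframe, shaped like the output of the code above.
df <- data.frame(
  PRIOR     = "WARN ",
  Datetime  = "2016-12-16 13:43:10,138",
  ClassName = "ConfigManagerLoader",
  Msg       = "Low max memory=477102080.",
  stringsAsFactors = FALSE
)

# Trim stray whitespace and parse the timestamp. %OS keeps fractional
# seconds but expects a dot, so swap the comma first.
df$PRIOR    <- trimws(df$PRIOR)
df$Datetime <- as.POSIXct(sub(",", ".", df$Datetime),
                          format = "%Y-%m-%d %H:%M:%OS")
```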

Adam Sampson
  • There are still many little things which I don't understand / know yet (such as the %>%), but that looks like a wonderful and very professional answer to my needs. I'll clearly learn a lot from this! Thanks... – user3341592 Mar 15 '19 at 15:12
  • If I want to test it on a real-life file, how can I correctly populate the `str`? – user3341592 Mar 15 '19 at 15:13
  • The problem is in `linesplit`: it seems to stop the line at the first `]` in the message? – user3341592 Mar 15 '19 at 16:48
  • I'll need to set aside some time to look closer unless someone else responds first. – Adam Sampson Mar 16 '19 at 17:59
  • Sorry about the %>%. That's a function available in `magrittr`, `stringr`, or any other `tidyverse` package. It's called a "pipe" and passes the output of whatever came before it to the next function. Example: `vector %>% mean(na.rm = TRUE)` is the same as writing `mean(vector, na.rm = TRUE)`. This is useful when you want to string a bunch of actions together and then discard all the intermediate results. – Adam Sampson Mar 16 '19 at 18:02
  • The code I wrote will take any character field (any length of string, but not multiple strings). You could read an entire file or the end of a file. If you read the end of the file it would only return full matches and would exclude any partial matches. – Adam Sampson Mar 16 '19 at 18:05
  • FYI, complicated multi-questions take a lot longer to get answered on stack overflow. – Adam Sampson Mar 16 '19 at 18:05
  • Thanks for the added explanation about the "R pipe"! – user3341592 Mar 18 '19 at 11:26
  • Not sure I understand your comment about full or partial matches when reading a full file. Ideally, though, the script should be able to consume the output of `tail -f`, hence read line by line and join lines when needed. Following @Roman's comment, I think the best approach could be to define the pattern of the beginning of a standard line (could be as short as `[`) and append non-matching data to the previous line as new lines are read. WDYT? – user3341592 Mar 18 '19 at 11:29
  • I realize I may be crazy about the `tail -f` stuff: I guess we can't make the df grow with every new line that's being read. So, this wish may be forgotten. Hence, a script that reads a static file would already be great! – user3341592 Mar 18 '19 at 11:31
  • I'm not very familiar with reading line by line. If you are grabbing the last n lines, you would simply combine them into a single string using a `paste(lines,collapse = "")`. – Adam Sampson Mar 18 '19 at 13:26
  • There may be other ways to do things, but because you might have extra new line characters my method assumes you re-combine everything into a single string and then uses pattern matching to separate them into corrected lines. – Adam Sampson Mar 18 '19 at 13:27
  • OK for the explanation, thx. Do you confirm, though, the bug? – user3341592 Mar 18 '19 at 16:36
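To make the `paste()` recombination from the comments concrete: using `collapse = "\n"` (rather than `""`) keeps the line breaks, which the question wanted preserved. The sample lines are hypothetical:

```r
# Hypothetical tail of a log file, as read line by line.
tail_lines <- c(
  "[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [partial",
  "message]"
)

# Re-join into a single string; collapse = "\n" preserves the original
# line breaks inside multi-line messages before pattern matching.
str <- paste(tail_lines, collapse = "\n")
```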