
I want to convert log files to a format which can be read in R for further analysis.

While looking for a solution I came across several options: regular expressions, RecordBreaker, OpenRefine (formerly Google Refine), and R packages such as stringr and dplyr.

I tried OpenRefine and it seemed useful, but I would still like more guidance, since log files are said to be the real big data.

The data looks like this:

M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026
 D                                         026 00000291          CORRELATION-ID=db2jcc_appli
 D                                         026 00000291          CONNECTION-ID=SERVER
 D                                         026 00000291          LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839
 D                                         026 00000291
 D                                         026 00000291  THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC
 D                                         026 00000291  :46835:*:*
 D                                         026 00000291          IS DEADLOCKED WITH PLAN=DISTSERV WITH
 D                                         026 00000291          CORRELATION-ID=db2jcc_appli
 D                                         026 00000291          CONNECTION-ID=SERVER
 D                                         026 00000291          LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799
 D                                         026 00000291
 D                                         026 00000291  THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:
 D                                         026 00000291  46835:*:*
 E                                         026 00000291          ON MEMBER HPN2
............................................................................

The underlying structure is as follows:

  1. Each record starts with an M line and ends with an E line.

  2. The D lines are the variables that give more information about a single record. So the first record in the log text above starts with M, ends with E, and the D lines in between provide information such as the correlation ID, connection ID, etc.

So the log excerpt above should become one row in a data table, with the D fields as the variables (columns).

  [1]: https://i.stack.imgur.com/hw9zY.png

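Possible solution (this only parses the M header line of each record):

# Read the log as a character vector, one element per line
data <- readLines("data1.txt")

# Pattern for the M header line; capture groups 1-8 become the columns.
# The dot in the time stamp is escaped so it matches a literal "."
pattern <- "(M\\s+\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\d+:\\d+:\\d+\\.\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\w+\\s+)(\\+\\w+\\s+\\w+(\\=|\\s+)\\w+\\s+\\w+\\s+\\d+)"

m <- regexec(pattern, data)
matches <- regmatches(data, m)

# Element 1 of each match is the full match; elements 2-9 are groups 1-8
parts <- do.call(rbind, lapply(matches, `[`, 2:9))
colnames(parts) <- c("ID1","ID2","Date","Time","ID3","ID4","ID5","description")
parts <- as.data.frame(parts)

# Lines that do not match (the D and E lines) come out as all-NA rows; drop them
parts1 <- na.omit(parts)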
vinay
  • Is the total set of D's predefined? Otherwise how do you imagine mapping the different D's to variables? – LauriK Feb 12 '15 at 12:22
  • It's not predefined; it varies. There is a maximum number of variables that can occur, and there will be many cases where only a subset of them occurs. E.g. correlation ID would be a column, but for a snippet where no correlation ID is written to the log, that column should be NA. – vinay Feb 12 '15 at 12:28

1 Answer


Well, you could do it one log row at a time. Pseudocode would be something like this:

FOR EACH logrow IN logfile
  IF logrow.record == 'M' THEN
    current.record = empty new record
    current.record$ID = logrow.value
  ELSE IF logrow.record == 'D' AND logrow.type == 'CORRELATION' THEN
    current.record$correlation = logrow.value
  ELSE IF logrow.record == 'E' THEN
    n = n + 1
    all.records[n] = current.record
  END
END

Basically if it's M, then you start a new record. If it's E then you end the current one. And if it's D, then add data to the current record based on the other information present.
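A minimal R sketch of that loop, under a few assumptions that are mine rather than anything confirmed about the log format: data on D lines always looks like KEY=VALUE, D lines without an "=" (blank or continuation lines) can be skipped, and the fifth and eighth fields of the M line are worth keeping as TIME and ID. `parse_log` is a hypothetical helper name. The second record in the sample is just a copy of the first with CONNECTION-ID dropped, to show the NA case.

```r
# Sketch: walk the log line by line and build one row per M...E record.
# Assumption: D lines carry data as KEY=VALUE; other D lines are ignored.
parse_log <- function(lines) {
  records <- list()
  current <- NULL
  for (line in lines) {
    type <- substr(trimws(line), 1, 1)   # first non-blank character
    if (type == "M") {
      # M starts a new record; which header fields to keep is an assumption
      fields <- strsplit(trimws(line), "\\s+")[[1]]
      current <- list(ID = fields[8], TIME = fields[5])
    } else if (type == "D" && !is.null(current)) {
      # D adds a variable to the current record if it holds KEY=VALUE
      kv <- regmatches(line, regexec("([A-Z][A-Z-]*)=(.*)$", line))[[1]]
      if (length(kv) == 3) current[[kv[2]]] <- trimws(kv[3])
    } else if (type == "E" && !is.null(current)) {
      # E closes the current record
      records[[length(records) + 1]] <- current
      current <- NULL
    }
  }
  # Columns are the union of all keys seen; keys missing in a record become NA
  keys <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(r) {
    vapply(keys, function(k) if (is.null(r[[k]])) NA_character_ else r[[k]],
           character(1))
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# First record: lines taken from the question (abridged).
# Second record: illustrative copy without CONNECTION-ID.
log_lines <- c(
  "M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026",
  " D                                         026 00000291          CORRELATION-ID=db2jcc_appli",
  " D                                         026 00000291          CONNECTION-ID=SERVER",
  " E                                         026 00000291          ON MEMBER HPN2",
  "M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026",
  " D                                         026 00000291          CORRELATION-ID=db2jcc_appli",
  " E                                         026 00000291          ON MEMBER HPN2"
)
records_df <- parse_log(log_lines)
```

Note that in this sketch a key that repeats inside one record overwrites the earlier value. The deadlock message in the question lists CORRELATION-ID and friends twice (once per thread), so for the real log you would probably want to suffix the second occurrence instead.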

It's not going to be too easy, but not too hard either. Start with one record, create plenty of intermediate variables so you can inspect each step, and take one step at a time.
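To illustrate the "one record, many intermediate variables" idea, here is a rough sketch in R that takes a single record (lines from the question, abridged) and breaks it down step by step. The variable names and the assumption that every D line kept here has the form KEY=VALUE are mine; lines without an "=" would need separate handling.

```r
# One record from the question, abridged to four lines
record <- c(
  "M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026",
  " D                                         026 00000291          CORRELATION-ID=db2jcc_appli",
  " D                                         026 00000291          LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839",
  " E                                         026 00000291          ON MEMBER HPN2"
)

# Step 1: record type of each line (first non-blank character)
line_type <- substr(trimws(record), 1, 1)

# Step 2: keep only the D lines
d_lines <- record[line_type == "D"]

# Step 3: strip the "D 026 00000291" prefix to get the payload
payload <- trimws(sub("^\\s*D\\s+\\d+\\s+\\d+", "", d_lines))

# Step 4: split each payload at the first "=" into key and value
eq_pos <- regexpr("=", payload, fixed = TRUE)
key    <- substr(payload, 1, eq_pos - 1)
value  <- substr(payload, eq_pos + 1, nchar(payload))

# Step 5: one named vector for this record
one_row <- setNames(value, key)
```

Printing `line_type`, `payload`, `key` and `value` after each step makes it easy to see where a pattern stops working before you wrap the whole thing in a loop over all records.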

LauriK