I want to convert log files into a format that can be read into R for further analysis.
While looking for a solution I came across regular expressions, RecordBreaker, OpenRefine (formerly Google Refine), and the R packages stringr and dplyr.
I tried OpenRefine and it seemed useful, but I would still like more guidance, since log files are said to be the real big data.
The data looks like this:
M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026
D 026 00000291 CORRELATION-ID=db2jcc_appli
D 026 00000291 CONNECTION-ID=SERVER
D 026 00000291 LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839
D 026 00000291
D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC
D 026 00000291 :46835:*:*
D 026 00000291 IS DEADLOCKED WITH PLAN=DISTSERV WITH
D 026 00000291 CORRELATION-ID=db2jcc_appli
D 026 00000291 CONNECTION-ID=SERVER
D 026 00000291 LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799
D 026 00000291
D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:
D 026 00000291 46835:*:*
E 026 00000291 ON MEMBER HPN2
............................................................................
The underlying structure is as follows:
Each record starts with an M line and ends with an E line.
The D lines are the variables that give more information about a single record. In the excerpt above, the record starts with M, ends with E, and the D lines in between provide information such as the correlation ID, connection ID, etc.
So the excerpt above should become one row in a data table, with the D entries as the columns.
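The M/D/E structure described above could be sketched in base R roughly as follows. This is only a minimal sketch under some assumptions: `parse_records` is a hypothetical helper name, every D line is assumed to carry a two-field prefix (`026 00000291` in the sample), only payloads of the form `KEY=VALUE` are kept (continuation lines such as `:46835:*:*` and the `IS DEADLOCKED WITH ...` line are ignored), and repeated keys within a record get a numeric suffix.

```r
# Sample record copied from the log excerpt above
log_lines <- c(
  "M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026",
  "D 026 00000291 CORRELATION-ID=db2jcc_appli",
  "D 026 00000291 CONNECTION-ID=SERVER",
  "D 026 00000291 LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839",
  "D 026 00000291",
  "D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC",
  "D 026 00000291 :46835:*:*",
  "D 026 00000291 IS DEADLOCKED WITH PLAN=DISTSERV WITH",
  "D 026 00000291 CORRELATION-ID=db2jcc_appli",
  "D 026 00000291 CONNECTION-ID=SERVER",
  "D 026 00000291 LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799",
  "D 026 00000291",
  "D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:",
  "D 026 00000291 46835:*:*",
  "E 026 00000291 ON MEMBER HPN2"
)

# Hypothetical helper: one row per M...E record, one column per KEY found on D lines
parse_records <- function(lines) {
  # A new record starts at every line that begins with "M "
  record_id <- cumsum(grepl("^M\\s", lines))

  # Keep only D lines and strip the assumed two-field prefix
  is_detail <- grepl("^D\\s", lines)
  payload <- sub("^D\\s+\\S+\\s+\\S+\\s*", "", lines[is_detail])
  rec <- record_id[is_detail]

  # Keep payloads that look like KEY=VALUE (split at the first "=")
  kv <- regmatches(payload, regexec("^([A-Z][A-Z-]*)=(.*)$", payload))
  ok <- lengths(kv) == 3
  long <- data.frame(
    record = rec[ok],
    key    = vapply(kv[ok], `[`, character(1), 2),
    value  = trimws(vapply(kv[ok], `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )

  # Keys repeated inside a record become KEY, KEY.1, ...
  long$key <- ave(long$key, long$record, FUN = make.unique)

  # Long to wide: one row per record, one column per key
  wide <- reshape(long, idvar = "record", timevar = "key", direction = "wide")
  names(wide) <- sub("^value\\.", "", names(wide))
  rownames(wide) <- NULL
  wide
}

records <- parse_records(log_lines)
print(records)
```

For a real file the same function could be called as `parse_records(readLines("data1.txt"))`. Appending the continuation lines to the preceding value would need an extra rule for what counts as a continuation, which the sample alone does not settle.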
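Possible solution (so far this only parses the M header lines):
# Read the log file, one element per line
data <- readLines("data1.txt")
# Capture the fields of each M header line (the dot in the time is escaped so it matches literally)
pattern <- "(M\\s+\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\d+:\\d+:\\d+\\.\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\w+\\s+)(\\+\\w+\\s+\\w+(=|\\s+)\\w+\\s+\\w+\\s+\\d+)"
m <- regexec(pattern, data)
matches <- regmatches(data, m)
# Element 1 is the full match, so elements 2 to 9 are capture groups 1 to 8;
# lines that do not match produce rows of NA
parts <- do.call(rbind, lapply(matches, `[`, 2:9))
colnames(parts) <- c("ID1", "ID2", "Date", "Time", "ID3", "ID4", "ID5", "description")
parts <- as.data.frame(parts)
# Drop the NA rows that come from the D and E lines
parts1 <- na.omit(parts)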
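To sanity-check the pattern without a file, it can be applied to the header line from the sample. This is just a sketch: the field names are the column names used above, the dot in the time is escaped, and trimming the trailing whitespace captured by the `\\s+` groups is an extra step that is my own assumption about what is wanted.

```r
# Header line copied from the log excerpt
header <- "M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026"

pattern <- "(M\\s+\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\d+:\\d+:\\d+\\.\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\w+\\s+)(\\+\\w+\\s+\\w+(=|\\s+)\\w+\\s+\\w+\\s+\\d+)"

# regmatches() returns the full match first, then one element per capture group
groups <- regmatches(header, regexec(pattern, header))[[1]]

# Keep capture groups 1 to 8 and strip the whitespace each group swallowed
fields <- trimws(groups[2:9])
names(fields) <- c("ID1", "ID2", "Date", "Time", "ID3", "ID4", "ID5", "description")
print(fields)
```

If `groups` comes back empty, the line did not match, which is what happens for the D and E lines and is why `na.omit()` is needed in the version that reads the whole file.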