0

I have big data.log file, a few line examples below. I want to convert it to data frame for EDA.

{"date":"2018-03-29T12:49:25.308+0000","level":"INFO","message":"User 
authenticated","action":"user_authenticated","username":"test@test.com"}
{"date":"2018-03-29T12:49:35.518+0000","level":"INFO","message":"User changed 
password with recovery (Web)","action":"recovery_password_changed","requestSource":"WEB","username":"test123@test.com"}

I tried json load from jsonlite library but I got error, parse error: trailing garbage. I checked out wd, everything is OK.

mydata <- fromJSON("data.log")

Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage ,"username":"test@test.com"} {"date":"2018-03-29T12:49:35.51 (right here) ------^

nick
  • 1,090
  • 1
  • 11
  • 24
eagle23
  • 35
  • 6

2 Answers2

0

You don't have a valid json here. You will need to preprocess it to something like this

x <- '[{"date":"2018-03-29T12:49:25.308+0000","level":"INFO","message":"User authenticated","action":"user_authenticated","username":"test@test.com"},
{"date":"2018-03-29T12:49:35.518+0000","level":"INFO","message":"User changed password with recovery (Web)","action":"recovery_password_changed","requestSource":"WEB","username":"test123@test.com"}
]'

library(jsonlite)

fromJSON(x)

                          date level                                   message
1 2018-03-29T12:49:25.308+0000  INFO                        User authenticated
2 2018-03-29T12:49:35.518+0000  INFO User changed password with recovery (Web)
                     action         username requestSource
1        user_authenticated    test@test.com          <NA>
2 recovery_password_changed test123@test.com           WEB

Or into one entry per row.

> y <- '{"date":"2018-03-29T12:49:25.308+0000","level":"INFO","message":"User authenticated","action":"user_authenticated","username":"test@test.com"}'
> fromJSON(y)
$`date`
[1] "2018-03-29T12:49:25.308+0000"

$level
[1] "INFO"

$message
[1] "User authenticated"

$action
[1] "user_authenticated"

$username
[1] "test@test.com"

If you have the log file with {...} entry in each line, you could traverse through each line and convert it to json. mylog.txt contains two entries.

xy <- readLines("mylog.txt")
sapply(xy, fromJSON, USE.NAMES = FALSE)

[[1]]
[[1]]$`date`
[1] "2018-03-29T12:49:25.308+0000"

[[1]]$level
[1] "INFO"

[[1]]$message
[1] "User authenticated"

[[1]]$action
[1] "user_authenticated"

[[1]]$username
[1] "test@test.com"


[[2]]
[[2]]$`date`
[1] "2018-03-29T12:49:35.518+0000"

[[2]]$level
[1] "INFO"

[[2]]$message
[1] "User changed password with recovery (Web)"

[[2]]$action
[1] "recovery_password_changed"

[[2]]$requestSource
[1] "WEB"

[[2]]$username
[1] "test123@test.com"

Or you could directly coerce it to a data.frame.

sapply(xy, FUN = function(x) {
  out <- fromJSON(x)
  as.data.frame(out)
}, USE.NAMES = FALSE)

[[1]]
                          date level            message             action
1 2018-03-29T12:49:25.308+0000  INFO User authenticated user_authenticated
       username
1 test@test.com

[[2]]
                          date level                                   message
1 2018-03-29T12:49:35.518+0000  INFO User changed password with recovery (Web)
                     action requestSource         username
1 recovery_password_changed           WEB test123@test.com
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
  • if I have more that 900,000 rows in .log file with { } each row, is jsonlite the easiest way or there is alternative? – eagle23 Aug 19 '18 at 19:37
  • OK. I tried with sapply, it created a list and try to convert it to data frame with myDf <- as.data.frame(df) but I got 1 observations with 25695 variables. How can i solve this please? – eagle23 Aug 30 '18 at 10:23
  • Please either edit your original question or post it as a separate question. I would need to see the full code you're using, incl. data. – Roman Luštrik Aug 30 '18 at 12:30
0

You can use ndjson::stream_in() or jsonlite::stream_in(). What you have is newline-delimited JSON. It's super common these days.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205