I need to parse a txt file like this:

2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s

I am interested in parsing the date and the number into an R data frame. I am trying a parser like this (only for the date):

df <- read_table("myFile.txt", col_names = FALSE, col_types = cols(X1 = col_datetime(format = "%Y %b %d %H:%M:%S")))

But it doesn't work:

Warning: 31502 parsing failures.
row col                    expected actual                                                file
  1  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  2  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  3  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  4  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  5  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
... ... ........................... ...... ...................................................
See problems(...) for more details.

The problem is clearly that read_table splits each line on whitespace, so the first column holds only the year (2021), yet it is being parsed with the format for the whole date-time.

Which is the correct way to parse this txt file in a data frame?

Regards, S.

zx8754
Stefano Bossi

2 Answers

1) read.zoo Read it into a zoo object, z, and then convert that to a data frame (or just leave it as a zoo object). This makes use of the fact that junk at the end of the index column will be ignored when converting to POSIXct.

We have used Lines in the Note at the end for reproducibility but text = Lines can be replaced with "myFile.txt".

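library(zoo)

# Split each line at "="; comment.char = "s" discards the trailing "s" unit.
z <- read.zoo(text = Lines, sep = "=",
  format = "%Y %b %d %H:%M:%S", tz = "", comment.char = "s")
fortify.zoo(z)  # convert the zoo object to a data frame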

giving this data frame having POSIXct and numeric columns:

                Index     z
1 2021-09-27 15:54:50 0.321
2 2021-09-27 15:54:52 0.036
3 2021-09-27 15:54:54 0.350
4 2021-09-27 15:54:56 0.317

2) Base R Read it into a data frame dd and then convert the first column to POSIXct.

dd <- read.table(text = Lines, sep = "=", comment.char = "s")
dd$V1 <- as.POSIXct(dd$V1, format = "%Y %b %d %H:%M:%S")

Note

Lines <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"
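
To check what the base R approach in 2) produces, here is a minimal self-contained sketch. It assumes month abbreviations are matched in an English or C locale (set explicitly below), and the column names time and avg_dur are illustrative choices, not part of the answer above.

```r
# Assumption: force the C locale so that %b matches "Sep" on any machine.
invisible(Sys.setlocale("LC_TIME", "C"))

Lines <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

# Split at "="; everything from the first lowercase "s" onward is a comment.
dd <- read.table(text = Lines, sep = "=", comment.char = "s")

# Text after the timestamp ("avg_dur" and spaces) is ignored by the format.
dd$V1 <- as.POSIXct(dd$V1, format = "%Y %b %d %H:%M:%S")

# Illustrative column names (hypothetical, chosen for readability).
names(dd) <- c("time", "avg_dur")
str(dd)
```

Note that comment.char = "s" only works here because no lowercase "s" occurs before the unit on any line; a label containing one would truncate the line early.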
G. Grothendieck
This should get you started: Read the text file and replace the spaces (or whatever string separates the columns) with a comma (or semicolon etc). Then pass this to read.csv using the text= argument. Then use any of the many date parsers to convert the strings to date datatypes.

1. Create example data

txt <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

2. Read the data using read_lines. In your case, txt is the path to the text file.

library(readr)  # provides read_lines
read.csv(text=gsub("     ",  ", ", read_lines(txt)), sep=",", header = FALSE)

Returns:

                    V1       V2 V3        V4
1 2021 Sep 27 15:54:50  avg_dur  =   0.321 s
2 2021 Sep 27 15:54:52  avg_dur  =   0.036 s
3 2021 Sep 27 15:54:54  avg_dur  =   0.350 s
4 2021 Sep 27 15:54:56  avg_dur  =   0.317 s
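
The remaining conversion step ("use any of the many date parsers") could look like the sketch below. It is an illustration under stated assumptions rather than part of the answer: strsplit stands in for read_lines so the block runs without readr, the C locale is forced so that %b matches "Sep", and the column names time and avg_dur are hypothetical.

```r
# Assumption: force the C locale so that %b matches "Sep" on any machine.
invisible(Sys.setlocale("LC_TIME", "C"))

txt <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

# Stand-in for read_lines(txt): one element per line of text.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Same idea as above: turn each run of five spaces into a comma separator.
df <- read.csv(text = gsub("     ", ", ", lines), sep = ",",
               header = FALSE, stringsAsFactors = FALSE)

# Convert the timestamp strings to POSIXct.
df$time <- as.POSIXct(df$V1, format = "%Y %b %d %H:%M:%S")

# Strip surrounding whitespace and the trailing "s" unit, then convert.
df$avg_dur <- as.numeric(sub("\\s*s$", "", trimws(df$V4)))

df[, c("time", "avg_dur")]
```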
dario
  • This would be a good workaround, but I guess they want to use readr::read_table with parsing while reading. – zx8754 Sep 28 '21 at 12:12
  • Why replace white space with commas before reading the table? In any case, this doesn't parse the date. – Stefano Bossi Sep 28 '21 at 12:20
  • Because that way we can use `read.csv`. Date parsing can then be done as usual... (Answering to the question *Which is the correct way to parse this txt file in a data frame?*) – dario Sep 28 '21 at 12:21
  • OK, I get the point... but isn't there a way to do the parsing directly via readr in one pass? – Stefano Bossi Sep 28 '21 at 12:39
  • Unfortunately I highly doubt it. If we check `?readr::read_table` we find that *read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.* -> the text in your example uses spaces for column AND for word separation... – dario Sep 28 '21 at 12:45
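
Since whitespace serves as both column and word separator here, one alternative that avoids delimiter guessing is to match each line against a single regular expression. The following is a base R sketch, not a readr solution; the pattern, the variable names, and the forced C locale are all assumptions made for illustration.

```r
# Assumption: force the C locale so that %b matches "Sep" on any machine.
invisible(Sys.setlocale("LC_TIME", "C"))

txt <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

# For a real file, readLines("myFile.txt") would replace this line.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Group 1: the timestamp; group 2: the number before the "s" unit.
pattern <- "^(\\d{4} \\w{3} \\d{1,2} [0-9:]{8})\\s+\\w+\\s*=\\s*([0-9.]+)\\s*s$"

# strcapture (utils) fills a data frame whose column types follow proto.
parsed <- strcapture(
  pattern, lines,
  proto = data.frame(time = character(), avg_dur = numeric(),
                     stringsAsFactors = FALSE),
  perl = TRUE
)

parsed$time <- as.POSIXct(parsed$time, format = "%Y %b %d %H:%M:%S")
parsed
```

Lines that do not match the pattern come back as NA rows, which makes malformed log entries easy to spot.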