3

I have text file of csv data running into 100s of thousands of what should be separate records but they forgot to put new lines into it. There is a repeated pattern to pick out where the start of a new line should be though, before a time, a comma, and a name, e.g. from below "07:04:08.401,Buzzard". But because the string goes on for 1000s of lines in the file I cannot use the start ^ or end $ to anchor the string.

My plan has been to regex from the start of each of these points backwards until the next comma so that I can then str_replace() itself back in but with "\n" on the end, thereby inserting the new lines where I want them.

I need help with both parts.

library(stringr)
library(data.table)

Data_raw <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.0007:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.3107:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")

look_x <- function(rx) str_view_all(Data_raw, rx) 
look_x("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

Gets me the four preceding characters. But the characters before the time back to the next comma are variable. e.g. above they range from "0.00" to "-401.31" and "Obj 2 N.A.". So the comma it is. So I've been trying along the lines of:

look_xy("(?<=,).(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

..And failing to get every character preceded by "," and followed by whatever hh:mm:ss.sss,Buzz comes next.

I also need help with the next bit would go on to do, I have tried:

Data_st_rep_all_2 <- data.frame(str_replace_all("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)",
                                              paste0(str_extract(Data_raw, "[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)"),"\n"), Data_raw))

Though I am now wondering whether this will work because all the regex pieces are different.

I am stuck. Can anyone help?!

There will no doubt be a very neat solution that I have totally missed!

Thank you.

The end result should look like this:

Data_1 <- data.frame(Records = c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00",
                                 "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31",
                                 "07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.",
                                 "07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."))
mysplits <- max(lengths(strsplit(Data_1$Records, ",")))
Data_2 <- setDT(Data_1)[, paste0("column", 1:mysplits) := tstrsplit(Records, ",", fixed=T)]
Data_2[, Records := NULL]

Or rather:

Data_raw_2 <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00\n07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31\n07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.\n07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")
wd <- getwd()
write_lines(Data_raw_2, paste0(wd, '/', 'Data_raw_2.txt'))
Skeiff
  • 31
  • 4
  • Just an idea without having fully understood your task: [`str_split(Data_raw, "(?=\\d\\d:\\d\\d:\\d\\d\\.\\d{3},Buzz)(?!^)")`](https://tio.run/##rZJdS8MwFIbv@ytir1JIDyffbVDGiiCCKMzb4ugcSHVYySrqxN9ek@oQceBAIbw5F0me54Tjh2HVLnzjX@i69@39jc@SJDlu@mbumydymJNrmqJ1qBwWoJCz6nGzafySVL7r7hgXUmkDtvisLBQl46UBoVmOoHhMgwwBLTubzi6ZFKZABMmZsaq0pQBjWHV6QqruOR7Dr/hX7s4bP7l5YAW5iBZOa5Da7oPWW6YSMYPvGL@TLxa3RJBzmMLuMopIx8OHGbWniLBRQeAoIj/a@rtIGgYjjMh8/bBqe7qdEEZSOjmq62VY7ttW1xDyVb6N1hmdHFxl8Y1heAc) – bobble bubble Dec 11 '22 at 13:56

1 Answers1

1

Is this what you need?

library(stringr)
str_split(Data_raw, "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)")
[[1]]
[1] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00"                 
[2] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31"              
[3] "07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."
[4] "07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."

How this works:

  • (?<!^): negative look-behind to assert that we do not want to split at string start
  • (?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook): positive look-behind to assert that the point at which we split must be followed by a timestamp-like expression, a comma, and the string "Buzzard Brook"
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34