My text data is consistently delimited by vertical bars ("|"), but the text between the bars is rarely consistent and often includes characters that could themselves be used as separators ("-", ",", and carriage returns). I would like the result to have only two columns: report number and comment.

Goal:

ReportNumber Report
4312822 Comment: This person did a great job working with other
-Class standing was 15/265
-Final academic average/standing was 83.51% /209 out of 265
3059758 Comment, Part I: This is a dummy report.

What the data looks like:

4312822|Comment: This person did a great job working with others. -Class standing was 1/10
-Final academic average/standing was 83.51% /209 out of 265|

3059758|Comment, Part I: This is a dummy report.|

I've tried both read.delim and read.table:

Reports = read.delim('reports.txt', sep = "|", stringsAsFactors = FALSE, skipNul = TRUE, blank.lines.skip = TRUE)

The result, however, is jumbled and not split cleanly on the "|".

ArlJerry

1 Answer

One approach is to use read_delim from the readr package. readr is available on CRAN, so you can install it from within an R session:

# Install readr once, then parse a small literal example split on "|"
install.packages("readr")
readr::read_delim("a;a|b,b|c.c|d:d
", delim = "|", col_names = FALSE)

For the example above to work, the string must span two lines, i.e., it must contain a newline. readr treats a string that contains a newline as literal data rather than as a file path.

Then, to use it with your file, run the following in an R session:

readr::read_delim("reports.txt", delim = "|")
Fred Boehm
  • Unfortunately, that (and the fread method that Wimpel suggested) didn't work with the carriage returns. I will try stripping out the returns and try again with readr. – ArlJerry Jul 22 '21 at 12:42
  • How do you distinguish between a carriage return that denotes a new record and one that doesn't denote a new record? – Fred Boehm Jul 22 '21 at 16:04
  • I would like to ignore carriage returns completely when it comes to records and separate only by "|" (because the records differ so much: some span multiple lines, some have bullet points, and some are a single sentence). When I read the data in, however, it looks like the carriage returns are interpreted as new/different records. – ArlJerry Jul 22 '21 at 16:18
  • How are carriage returns encoded in your file? \n? "\n"? "\\n"? – Fred Boehm Jul 23 '21 at 17:30
  • Also, which character is at the end of your lines? How do you know where the end of a line is? – Fred Boehm Jul 23 '21 at 17:41
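
The idea discussed in the comments (ignore line breaks as record boundaries and split only on "|") can be sketched in base R as follows. This is a minimal sketch, not a tested solution for the real file. It assumes that every record has the form number|comment|, that no comment contains a "|" of its own, and that no field is empty. The helper name parse_reports and the sample string are hypothetical; the sample only mirrors the layout shown in the question.

```r
# Hypothetical helper: split one long string on "|" and pair up the fields.
# Assumes each record is "number|comment|" and comments contain no "|".
parse_reports <- function(txt) {
  # Split on the literal "|" only; newlines inside a field are kept.
  fields <- strsplit(txt, "|", fixed = TRUE)[[1]]
  # Trim whitespace (including the line breaks between records) at field edges.
  fields <- trimws(fields)
  # Whatever follows the final "|" is just trailing whitespace; drop it.
  if (length(fields) > 0 && !nzchar(fields[length(fields)])) {
    fields <- fields[-length(fields)]
  }
  # Under the stated assumptions the fields must come in pairs.
  stopifnot(length(fields) %% 2 == 0)
  m <- matrix(fields, ncol = 2, byrow = TRUE)
  data.frame(ReportNumber = m[, 1], Report = m[, 2], stringsAsFactors = FALSE)
}

# Sample text laid out like the data in the question (assumed layout).
sample_txt <- paste0(
  "4312822|Comment: This person did a great job working with others. ",
  "-Class standing was 1/10\n",
  "-Final academic average/standing was 83.51% /209 out of 265|\n",
  "\n",
  "3059758|Comment, Part I: This is a dummy report.|\n"
)

reports <- parse_reports(sample_txt)
print(reports$ReportNumber)
```

For a real file, the text could be read in one piece first, for example with txt <- paste(readLines("reports.txt", warn = FALSE, skipNul = TRUE), collapse = "\n"), and then passed to parse_reports(txt). Line breaks inside a comment are preserved in the Report column; if they are unwanted, gsub("[\r\n]+", " ", reports$Report) would replace them with spaces.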