1

I am trying to import a large .txt file that uses |,| to separate columns. The raw data looks like this: enter image description here

The original .txt file has 593 118 lines (entries). However, using my import line I can only import 191 838 lines, and a lot of these lines are imported incorrectly. The imported file looks like this (e.g., lines 189880:189889 are imported correctly, the others are not):

enter image description here

The number of columns is correct using this code, it only fails to import all rows correctly. In addition, when using my import code, the following warning message pops up:

Test<-read.csv("test2.txt", header = FALSE, sep = ",", quote = "|")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

I can import 544 605 lines using the following import command (switching the quote:"|" and sep=","):

Test<-read.csv("test2.txt", header = FALSE, sep = “|", quote = “,”)

Only now, the whole file looks messy, and the data are in the wrong columns and a wrong number of columns are created (41 instead of 39): enter image description here

Does someone know how to import this .txt file correctly?

Aurèle
  • 12,545
  • 1
  • 31
  • 49
M1ke
  • 67
  • 8
  • Could you include lines 189889 and 189890 from the text file, as text (not screenshot), in your question? (redacting confidential information if any, of course) – Aurèle Feb 22 '22 at 10:27
  • @Aurèle thank you for your comment. I have uploaded a small sample .txt file with an error upload. It only contains public info. – M1ke Feb 22 '22 at 10:42
  • 2
    The file contains this field on line 403 `|Cardlytics, Inc.Strike price: $1.11 | NeitherExpires: 01/25/2021|` with an enclosed `|` that is not meant as a quote. That's what breaks things I guess – Aurèle Feb 22 '22 at 11:00
  • Hi @Mike! If the separator is |,|, why don't you replace this separator with a more usual one, like: , (comma), ;, \t, among others? – jassis Feb 22 '22 at 11:12
  • 1
    Maybe you could adapt this answer to your problem https://stackoverflow.com/a/41613610/6197649 (similar issue with separator instead of quotes, could also be solved with regex I think). Or fix the problem at the source, escaping the enclosed `|` like `\|` – Aurèle Feb 22 '22 at 11:13
  • @Aurèle Excellent find, I think that that is the problem. Regarding your helpful regex post, which regex statement should I use for my file? Altering the [^,] and using ^([^|]*),([^|]*) (for 39 columns) results in a Large matrix in my R environment with only NA’s. – M1ke Feb 22 '22 at 11:57

1 Answers1

1

data.table automatically resolves the issue, while displaying an informative message:

library(data.table)

DT <- fread(
  "data/test3.txt",
  header = FALSE, 
  quote = "|"
)
Warning message:
In fread("data/test3.txt", header = FALSE, quote = "|") :
  Found and resolved improper quoting out-of-sample. First healed line 403: <<|B160001953|,|S|,|N00035516|,|16|,|Y|,| |,| |,|Cardlytics, Inc.Strike price: $1.11 | NeitherExpires: 01/25/2021|,|Cardlytics Inc|,||,||,||,||,||,||,||,||,||,|I |,,| |,| |,| |,| |,| |,| |,| |,|p |,| |,| |,| |,.0000,| |,| |,| |,,||,||,| |>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
DT[403]
           V1 V2        V3 V4 V5 V6 V7
1: B160001953  S N00035516 16  Y      
                                                                 V8             V9
1: Cardlytics, Inc.Strike price: $1.11 | NeitherExpires: 01/25/2021 Cardlytics Inc
   V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29
1:                                      I   NA                              p     
   V30 V31 V32 V33 V34 V35 V36 V37 V38 V39
1:           0 
Aurèle
  • 12,545
  • 1
  • 31
  • 49