
I have been trying to process a fairly large file (2 million-plus observations) using the readr package. However, the read_table2() function (and, for that matter, read_table()) generates the following warning:

Warning: 2441218 parsing failures. row col expected actual file 1 -- 56 columns 28 columns '//filsrc/Research/DataFile_1.txt'

With some additional research, I was able to calculate the maximum number of fields for each file:

max_fields <- max(count.fields("DataFile_1.txt", sep = "", quote = "\"'", skip = 0,
                                 blank.lines.skip = TRUE, comment.char = "#"))
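
Before re-reading the file in R, the ragged rows can also be confirmed outside R with standard shell tools. This is a rough sketch, assuming the file is whitespace-delimited as described above (DataFile_1.txt is the file name from the question):

```shell
# Print the number of whitespace-separated fields on each line, then
# summarise the distribution: one "count field_count" pair per distinct width.
# Any width other than the dominant one marks the ragged rows.
awk '{ print NF }' DataFile_1.txt | sort -n | uniq -c
```

If most lines show 28 fields while some show 56, that would match the "expected 56 / actual 28 columns" part of the warning above.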

and then set up the column names using max_fields for read_table2() as follows:

file_one <- read_table2("DataFile_1.txt", col_names = paste0("V", seq_len(max_fields)),
                        col_types = NULL, na = "NA", n_max = Inf,
                        guess_max = min(Inf, 3000000),
                        progress = show_progress(), comment = "")

The resulting output shows the same warning I mentioned earlier.

My question is:

Have we compromised data integrity? In other words, do we still have the same data, just spread across more columns during parsing because no appropriate col_type was assigned to each column, or have we actually lost some information in the process?

I have checked the dataset with another method, read.table(), and it seems to produce the same dimensions (rows and columns) as read_table2(). So what exactly do parsing failures mean in this context?

tamtam
  • Have you tried to open the text file without R? Is the format correct (namely white spaces and looking like the table you want)? – Érico Patto Apr 06 '21 at 17:13
  • @ÉricoPatto thank you for the response. I have exported a small chunk of the unprocessed file into notepad and opened it, which seemed like the delimiter is the white space. However, since each row has a different length, I set it up to the maximum possible fields (as I mentioned in the original question). I opened the output file in excel and it seemed to be fine. But it is a file with 2.5 M+ observations, and I could not say for sure if the data transfer was without error for the entire file. Hence, I wanted to understand the meaning of `parsing failures` in this context. – user3674508 Apr 06 '21 at 17:48
  • Compare results & diagnostics with `data.table::fread()` and/or `vroom::vroom()` ? Best is if you can get the diagnostics to localize the point of failure, then use shell-tools to extract those lines and examine them, e.g. https://stackoverflow.com/questions/50364556/how-to-extract-specific-rows-based-on-row-number-from-a-file – Ben Bolker Apr 06 '21 at 22:14
  • @BenBolker, thank you for suggesting the two options. I have used `data.table::fread()`, but it breaks down after 1860 lines and spews the following error code: `Error in setnames(ans, col.names) : Can't assign 48 names to a 30 column data.table In addition: Warning messages: 1: In fread("DataFile_1.txt", fill = TRUE, nrows = Inf, stringsAsFactors = F, : Stopped early on line 1861. Expected 30 fields but found 31. Consider fill=TRUE and comment.char=. First discarded non-empty line:`. – user3674508 Apr 07 '21 at 20:14
  • So, I have tried another variant of the code without setting the `max_fields` count for assigning column names, and it generated the following error: `Warning messages: 1: In fread("DataFile_1.txt", fill = TRUE, nrows = Inf, stringsAsFactors = F, : Stopped early on line 1861. Expected 30 fields but found 31. Consider fill=TRUE and comment.char=. First discarded non-empty line: ` . Note the difference is here instead of 48, it had 30 fields. However, there are still 31 values for that row. – user3674508 Apr 07 '21 at 20:18
  • Furthermore, there was another warning but I think it would be relatively easy to manage: `2: In require_bit64_if_needed(ans) : Some columns are type 'integer64' but package bit64 is not installed. Those columns will print as strange looking floating point data. There is no need to reload the data. Simply install.packages('bit64') to obtain the integer64 print method and print the data again.` and I suspect it stems from the fact that I have not defined the `col_types()` , I am not too worried about this. – user3674508 Apr 07 '21 at 20:22
  • The main problem is there are multiple files each approximately of the same size. So going line-by-line to fix the issue may not be feasible. – user3674508 Apr 07 '21 at 20:24
  • My main point is that now that you know that line 1861 is the problem, you can examine line 1861 in the raw data (you could *try* to use `readLines(your_file)[1861]` to inspect it but R might choke; better to use some other tool (e.g. python or `sed`) that can more easily/efficiently access an arbitrary line in the file without processing the lines before it), see what the issue is, and see whether you're happy with the way that `fread` decided to parse it. – Ben Bolker Apr 08 '21 at 01:45
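
The inspection step suggested in the last comment can be sketched with standard shell tools (the line number 1861 and the file name are taken from the comments above; adjust to whatever failing line fread reports):

```shell
# Print only line 1861, the line fread stopped on, without loading the file into R.
sed -n '1861p' DataFile_1.txt

# Count its whitespace-separated fields to see why 31 were found where 30 were expected.
sed -n '1861p' DataFile_1.txt | awk '{ print NF }'
```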

0 Answers