I have been trying to process a fairly large file (2 million-plus observations) using the readr package. However, the read_table2() function (and, for that matter, read_table()) generates the following warning:
Warning: 2441218 parsing failures.
row col   expected     actual file
  1  -- 56 columns 28 columns '//filsrc/Research/DataFile_1.txt'
After some additional research, I was able to calculate the maximum number of fields in each file:
# Widest row in the file, counted with the same separator, quote,
# and comment conventions the read call will use
max_fields <- max(count.fields("DataFile_1.txt", sep = "", quote = "\"'",
                               skip = 0, blank.lines.skip = TRUE,
                               comment.char = "#"))
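As a quick sanity check (my own addition, not part of the original pipeline), the same count.fields() output can be tabulated to see how many rows actually have each width:

field_counts <- count.fields("DataFile_1.txt", sep = "", quote = "\"'",
                             skip = 0, blank.lines.skip = TRUE,
                             comment.char = "#")
# Distribution of row widths: how many rows have 28 fields, how many 56, etc.
table(field_counts)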
I then set up the column names from max_fields for read_table2() as follows:
file_one <- read_table2("DataFile_1.txt",
                        col_names = paste0("V", seq_len(max_fields)),
                        col_types = NULL, na = "NA", n_max = Inf,
                        guess_max = min(Inf, 3000000),
                        progress = show_progress(), comment = "")
The resulting output still shows the same parsing-failure warning mentioned above.
My question is: have we compromised data integrity? In other words, do we still have the same data, just spread across more columns because no explicit col_types specification was assigned to each column, or have we actually lost some information during parsing?
I have also checked the dataset with another method, read.table() (a rough sketch of that check is below), and it produced the same dimensions (rows and columns) as read_table2(). So what exactly do the parsing failures mean in this context?
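For reference, the base-R cross-check looked roughly like this (a sketch rather than the exact call; fill = TRUE is assumed here, since that is what pads rows shorter than max_fields with NA so the dimensions line up):

# Base-R read of the same file; fill = TRUE pads short rows with NA
base_version <- read.table("DataFile_1.txt", fill = TRUE,
                           col.names = paste0("V", seq_len(max_fields)))
dim(base_version)  # compare with dim(file_one)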