
data.table 1.9.2

I'm reading in a large table and there appears to be at least one row which produces an error of the following nature:

Error in fread(paste(base_dir, filename, sep = "")) : 
Expected sep ('|') but '' ends field 23 on line 190333 when reading data:...

Is it possible to direct fread in the data.table package to skip erroneous rows?

Or is there any other way I can work around this sort of error in the future?

bibzzzz
  • Not that I can tell. But even so, if it does then it's an erroneous line that I would want to skip. I realize this could be very difficult, since it essentially requires fread to know how many separators were in the erroneous line. Perhaps it is impossible, in which case the answer to my question is no, but I just thought I'd ask in case. – bibzzzz May 13 '14 at 02:55
  • Can you post the output of `readLines` for, say, 3 lines before and 3 lines after line 190333? Are you specifying a `sep` in `fread`? – A5C1D2H2I1M1N2O1R2T1 May 13 '14 at 03:14
  • `count.fields()` can be your friend here. Along with the `skip` and `nrows` arguments in `fread()` – Rich Scriven May 01 '15 at 18:11
  • @bibzzzz Has this question been answered, or have you found an alternative solution? If so, please accept whichever answer applies, post and accept your own, or provide details as to what's missing. – npjc May 12 '15 at 09:24
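
The `count.fields()` suggestion in the comments above can be sketched roughly as follows. This is a minimal illustration on a small temporary file, not the asker's data: the file contents and the names `path`, `n_fields`, `expected` and `bad` are made up for the example, and it assumes the most common field count is the correct one.

```r
# Hypothetical sample file standing in for the real data.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "var1|var2|var3|var4",
  "a|1|10|TRUE",
  "b|2|10|FALSE",
  "c|3|10FALSE",   # malformed: separator missing between 10 and FALSE
  "d|4|10|TRUE"
), path)

# Number of fields on each line. Quotes and comment characters are
# disabled, and blank lines are kept so that positions match line numbers.
n_fields <- count.fields(path, sep = "|", quote = "",
                         comment.char = "", blank.lines.skip = FALSE)

# Assumption: the most common field count is the expected one.
expected <- as.integer(names(which.max(table(n_fields))))

# Line numbers whose field count differs from the expected count.
bad <- which(n_fields != expected)

# Inspect the offending lines directly.
readLines(path)[bad]
```

Once the bad line numbers are known, the `skip` and `nrows` arguments of `fread()` mentioned in the same comment could be used to read the chunks between them.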

2 Answers


One workaround if you wish to skip erroneous rows:

First, read the file in splitting only on newlines by using sep="\n". Then count the number of separators in each row, keep only the rows with the correct number of separators, collapse the data back into a single string, and re-read it with the true column separator. See the example below.

Example data:

require(data.table)

# Note the missing separator between 10 and FALSE in the row starting with "c".
wrong <- fread("
var1|var2|var3|var4
a|1|10|TRUE
b|2|10|FALSE
c|3|10FALSE
d|4|10|TRUE
e|5|10|TRUE",sep="\n")

Count the number of separators:

There are a number of ways to do this; see stringr's ?str_count for one:

library(stringr)
wrong[,n_seps := str_count(wrong[[1]],fixed("|"))] # see below for explanation.
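
If adding a stringr dependency is undesirable, the same count can be obtained in base R. The following is a minimal sketch in which `lines` is a made-up character vector standing in for `wrong[[1]]`:

```r
# Illustrative stand-in for the single column of raw lines.
lines <- c("a|1|10|TRUE", "b|2|10|FALSE", "c|3|10FALSE")

# gregexpr() finds every literal "|" (fixed = TRUE disables regex meaning);
# regmatches() extracts the matches and lengths() counts them per line.
# Lines without any match yield an empty vector, so their count is 0.
n_seps <- lengths(regmatches(lines, gregexpr("|", lines, fixed = TRUE)))
n_seps
```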

Or, with some simplifying assumptions, via an Rcpp analogue:

If the separator is a single character (which it usually is), then I have found the simple function below to be the most efficient. It is written in C++ and exported to R via the Rcpp package's sourceCpp() workhorse.

In a separate "helpers.cpp" file:

    #include <Rcpp.h>
    #include <algorithm>
    #include <string>

    using namespace Rcpp;
    using namespace std;

    // Count how often the single character y occurs in each element of x.
    // [[Rcpp::export]]
    NumericVector v_str_count_cpp(CharacterVector x, char y) {
        int n = x.size();
        NumericVector out(n);

        for(int i = 0; i < n; ++i) {
            // x[i] exposes begin()/end() over the characters of the i-th string.
            out[i] = std::count(x[i].begin(), x[i].end(), y);
        }
        return out;
    }

New column with counts:

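We then compile the helper and use it to count the number of occurrences of | in each row, storing the results in a new column called n_seps. The function is already vectorised over its first argument, so it can be called on the whole column at once.

Rcpp::sourceCpp("helpers.cpp")
wrong[,n_seps := v_str_count_cpp(wrong[[1]],"|")]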

Now wrong looks like:

> wrong
   var1|var2|var3|var4 n_seps
1:         a|1|10|TRUE      3
2:        b|2|10|FALSE      3
3:         c|3|10FALSE      2
4:         d|4|10|TRUE      3
5:         e|5|10|TRUE      3

Now keep only the well-formed rows and collapse them back into a single string:

collapsed <- paste0( wrong[n_seps == 3][[1]], collapse = "\n" )

Lastly, read it back with the proper separator:

correct <- fread(collapsed,sep="|")

which looks like:

> correct
   V1 V2 V3    V4
1:  a  1 10  TRUE
2:  b  2 10 FALSE
3:  d  4 10  TRUE
4:  e  5 10  TRUE
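
Note that the original column names are lost in this result, because the header line was consumed as the column name of the single-column read. One possible way to keep them is to carry the header along when re-reading. The sketch below is only an illustration under a few assumptions: the lines are held in a made-up character vector `raw`, the header itself is taken as the reference for the expected number of separators, and the installed data.table is recent enough to support the `text` argument of `fread()`.

```r
library(data.table)

# Illustrative raw lines, header first (e.g. as returned by readLines()).
raw <- c("var1|var2|var3|var4",
         "a|1|10|TRUE",
         "b|2|10|FALSE",
         "c|3|10FALSE",    # malformed row
         "d|4|10|TRUE")

# Helper (hypothetical name): count literal "|" characters per line.
count_seps <- function(x) lengths(regmatches(x, gregexpr("|", x, fixed = TRUE)))

header <- raw[1]
body   <- raw[-1]

# Keep only body rows with as many separators as the header has.
keep <- body[count_seps(body) == count_seps(header)]

# Re-read header plus kept rows with the real separator.
correct <- fread(text = c(header, keep), sep = "|", header = TRUE)
correct
```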

Hope this helps.

npjc
  • Nice answer. I had to `writeLines` rather than pasting and collapsing due to memory limitations. – jbaums Mar 16 '15 at 23:14
  • This works for what I'm trying to do, thanks very much. Thanks for taking the time to explain the inner workings as well – bibzzzz May 18 '15 at 10:00
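
The `writeLines` variant mentioned in the first comment above might look something like the sketch below. It assumes data.table is installed; the sample lines and the names `raw`, `good` and `tmp` are illustrative only. Writing the kept rows to a temporary file avoids building one very large string with `paste0(..., collapse = "\n")`.

```r
library(data.table)

# Illustrative raw lines; in practice these would come from the
# single-column read (e.g. wrong[[1]]).
raw <- c("a|1|10|TRUE", "b|2|10|FALSE", "c|3|10FALSE", "d|4|10|TRUE")

# Count "|" per line and keep only lines with the expected 3 separators.
n_seps <- lengths(regmatches(raw, gregexpr("|", raw, fixed = TRUE)))
good <- raw[n_seps == 3]

# Write the kept lines to a temporary file instead of one big string.
tmp <- tempfile(fileext = ".txt")
writeLines(good, tmp)

# Re-read with the real separator; no header row was written.
correct <- fread(tmp, sep = "|", header = FALSE)
unlink(tmp)
```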

No. There is no option to make fread do that.

There is a discussion about it on GitHub (https://github.com/Rdatatable/data.table/issues/810), but it does not identify an option that makes fread skip those lines.

userJT