I have used the readr package to import a .csv (let's call it x) which produced a tibble.

EDIT: Because the actual tibble generated by readr was confused with the problems(x) tibble posted below, here is the beginning of the actual tibble that causes the problem:

> x
# A tibble: 46,080 x 18
      x_1   x_2   x_3   x_4   x_5   x_6    x_7     x_8    x_9     x_10      x_11        x_12        x_13      x_14  x_15
    <int> <int> <int> <int> <int> <dbl>  <dbl>   <dbl>  <dbl>    <dbl>     <dbl>       <dbl>       <dbl>     <dbl> <int>
 1     1     1     1     1    29  84.4   72.5  10.1     48.5     35.3      34.2        293.        117.      24.5    20
 2     1     1     1     2   120 214.   142.   -0.488   55.8     42.1      36.3        589.        124.     257.     84
 3     1     1     1     3    28 258.    42.3   2.09    43.7     29.2      32.1        352.        117.      72.2    19
 4     1     1     1     4    39 623.   249.   12.1     95.7     75.7      58.6        998.        176.     243.     14
 5     1     1     1     5   222 320.   244.   -2.10    70.7     51.4      48.4       1232.        242.     711.    111
 6     1     1     1     6    33 485.   142.   12.3     61.8     51.9      34.6        764.        117.     160.     24
 7     1     1     1     7    32 884.   458.   11.0    110.      88.1      64.5       1525.        237.     283.      5
 8     1     1     1     8    58 695.   187.  -12.7     64.6     50.5      41.7       1090.        175.     403.     37
 9     1     1     2     1    46  58.0   65.3   5.10    49.4     35.2      34.7        234.        117.      26.7    18
10     1     1     2     2   136 217.   191.   -0.431   60.5     43.2      42.2        706.        185.     295.     72
# ... with 46,070 more rows, and 3 more variables: x_16 <dbl>, x_17 <dbl>, x_18 <dbl>

I tried various values for the na argument of read_csv to avoid misread data, but could not make it work for my case. During the import, readr reported parsing problems in some of the columns, so I called problems(x) to find out what was going on. This is the output:

> problems(x)
# A tibble: 264 x 5
     row col   expected   actual file                              
   <int> <chr> <chr>      <chr>  <chr>                             
 1  1992 x_5  an integer NaN    'raw-data/x.csv'
 2  1992 x_15 an integer NaN    'raw-data/x.csv'
 3  2320 x_5  an integer NaN    'raw-data/x.csv'
 4  2320 x_15 an integer NaN    'raw-data/x.csv'
 5  2581 x_5  an integer NaN    'raw-data/x.csv'
 6  2581 x_15 an integer NaN    'raw-data/x.csv'
 7  2582 x_5  an integer NaN    'raw-data/x.csv'
 8  2582 x_15 an integer NaN    'raw-data/x.csv'
 9  2583 x_5  an integer NaN    'raw-data/x.csv'
10  2583 x_15 an integer NaN    'raw-data/x.csv'
# ... with 254 more rows

As I understand it, reading the .csv failed in several rows of several columns, which led to NaNs in fields where an integer was expected.

I tried to convert those NaNs to "real" NAs using is.nan, but this fails because is.nan does not accept a whole tibble:

> x[is.nan(x)] <- NA #convert NaN to NA
    Error in is.nan(x): default method not implemented for type 'list'

I also tried replace_with_na_all from the naniar package, but this fails as well:

> replace_with_na_all(data = x, condition = ~.x == NaN)
    Error in .x[sel] <- map(.x[sel], .f, ...) : NAs are not allowed in subscripted assignments

Therefore, I'm looking for a way to convert all NaNs in all columns and rows to NAs in one go, or to avoid creating NaNs altogether during read_csv.

Dom42
  • `x$actual[is.nan(x$actual)] <- NA` – r2evans Jul 01 '18 at 12:41
  • `x[] <- lapply(x, function(a) ifelse(is.nan(a), NA_real_, a))` – r2evans Jul 01 '18 at 12:43
  • @r2evans Could you please elaborate on what the code does and if I need both lines posted here. Or could you post it as an answer? – Dom42 Jul 01 '18 at 13:00
  • Your issue deals with data types—your `x` column is a character, so `is.nan("NaN")` returns false, because it's just reading a string, not an actual `NaN` value. It would be more helpful if you post the output of `dput`, because that will make this issue more obvious to folks helping you – camille Jul 01 '18 at 13:32
  • @camille My `readr` output (the one you get automatically) specified that the columns in question (`x_5` and `x_15`) are recognized as col_integer(), so I'm not sure the "reading a string" explanation applies here. Concerning `dput`: do you mean I should post the output of dput(x)? That output is way too large to post here. – Dom42 Jul 01 '18 at 13:45
  • Oh, and @r2evans, I just tried to understand your code. Could it be that you mistake the `problems(x)`-output-tibble with the actual tibble generated by readr? As posted above, the actual tibble is 46,080 x 18 – Dom42 Jul 01 '18 at 13:53
  • @Dom42 I didn't realize that what you posted isn't your actual data, but instead the `problems` output. You can call `dput` on a representative sample of your data and post the output – camille Jul 01 '18 at 14:03
  • @camille I'm sorry about any confusion I produced. I thought my explanation was clear enough that the tibble shown was the one that was produced by `problems(x)` which returns a tibble with information about problems encountered by readr. The tibble that produced this problem is `x`. I now included some more information about this actual tibble in the question. Thanks for your help so far! – Dom42 Jul 01 '18 at 14:08
  • The data you posted doesn't have any `NaN`s, so we still can't reproduce the issue. `problems` gave you a set of rows that were read incorrectly; can you include some of them in your sample, and post the `dput`? – camille Jul 01 '18 at 14:11
  • @camille Actually, I think I solved my problem. Your idea of reproducing the problem pointed me in the right direction. Apparently, my .csv contained the string "NaN" wherever a value should be treated as NaN. I had always assumed R would recognize that as NaN. – Dom42 Jul 01 '18 at 14:32
  • Indeed, I mistook `problems(x)` for the real data (which wasn't up yet when I commented). – r2evans Jul 01 '18 at 23:24
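
For readers who, like the asker, want the one-liner from the comments spelled out: the sketch below applies the replacement column by column, because `is.nan` works on atomic vectors but not on a whole data frame or tibble. The data frame and the helper name `nan_to_na` are made up for illustration; only the column names echo the question. This is one possible approach under those assumptions, not necessarily what the commenters had in mind.

```r
# Hypothetical stand-in for the real data; values are invented for illustration.
x <- data.frame(
  x_5  = c(29, NaN, 28),
  x_6  = c(84.4, 214, NaN),
  note = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NaN with NA in numeric columns only,
# and return every other column unchanged.
nan_to_na <- function(col) {
  if (is.numeric(col)) col[is.nan(col)] <- NA
  col
}

# x[] <- ... keeps the data frame (or tibble) structure and only
# swaps out the column contents.
x[] <- lapply(x, nan_to_na)
```

Indexing with `is.nan(col)` instead of wrapping the column in `ifelse()` leaves the column's class and attributes alone, which matters if some columns are factors or dates.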

1 Answer

Although this is only a partial answer to my own question (it does not show how to convert NaNs to NAs), I want to point out a possible solution in case someone else's problem has the same root cause.

The .csv I wanted to import with readr was written by Matlab and contained the string NaN in every cell whose value was NaN in Matlab. So R did not have trouble recognizing a number; the file simply contained NaN as text, which readr could not parse as an integer.

Passing na = "NaN" to read_csv apparently solved the problem.
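
A minimal sketch of that import, assuming a tiny made-up file in place of the real Matlab export (the temporary file, its contents, and the explicit `col_types` are illustrative assumptions, not the original data). One detail worth noting: the default for `na` in `read_csv` is `c("", "NA")`, so passing `na = "NaN"` alone replaces those defaults. Listing all three strings keeps empty cells and literal `NA` treated as missing too.

```r
library(readr)

# Hypothetical miniature of the Matlab export: "NaN" is written as plain text.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x_5,x_6", "29,84.4", "NaN,214.0", "28,NaN"), csv_path)

x <- read_csv(
  csv_path,
  # Treat empty cells, "NA", and "NaN" as missing values.
  na = c("", "NA", "NaN"),
  # Declare the types explicitly so the result does not depend on type guessing.
  col_types = cols(x_5 = col_integer(), x_6 = col_double())
)

# problems(x) should now report no parsing failures for these columns.
```

With the `NaN` strings declared as missing, the integer column parses cleanly and the affected cells come back as `NA` directly, so no conversion step is needed afterwards.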

Dom42