Pandas: parse CSV differentiating between the missing value sentinel and a string that happens to equal it

Question

The R readr read_csv function allows passing quoted_na=FALSE, which allows the parser to distinguish between the unquoted NA meaning a missing value, and a quoted "NA" meaning a string that happens to equal NA.

Is there a way for the read_csv function from Python Pandas to have similar behaviour?

For example, the CSV

var_1,var_2
"NA",NA

I would like to ideally be parsed as

    var_1  var_2
0      NA    NaN

The argument na_values does not appear to be relevant: it appears to be applied to the strings after CSV parsing that removes the quotes.

For example, something like the below seems to still not be able to distinguish the two cases:

pd.read_csv('na.txt', na_values=['NA'], keep_default_na=False)

as in, both values are parsed as NaN.

`na_values`, `keep_default_na` and `na_filter` tweaking these parameters should help. Should `NA` and `"NA"` be detected as `NA`? — Ch3steR, Jun 06 '20 at 13:20
@Ch3steR `Should NA and "NA" be detected as NA?` Only `NA` should be detected as `NA` in my case, and `"NA"` should be a string. — Michal Charemza, Jun 06 '20 at 13:23
As @Ch3steR mentioned, you can set it the arguments for na values. This [na values](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-values) link in the docs could help — sammywemmy, Jun 06 '20 at 13:26
@sammywemmy I've investigated that, and I don't think it helps. Have added some details to the question. — Michal Charemza, Jun 06 '20 at 13:31

Ch3steR · Answer 1 · 2020-06-06T14:00:51.050

4

You can try this. Using pd.read_csv with parameter quoting=3.

#test.csv
A,B,C,D
NA,1,2,3
"NA",NA,4,5
6,7,NA,"NA"

 df = pd.read_csv(r'test.csv',quoting=3,na_values=['NA'],keep_default_na=False)

     A    B    C     D
0   NaN  1.0  2.0     3
1  "NA"  NaN  4.0     5
2     6  7.0  NaN  "NA"

Note :

If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.

edited Jun 06 '20 at 14:00

answered Jun 06 '20 at 13:55

Ch3steR

20,090
4
28
58

i was racking my brain with the quoting=3 angle. u figured it faster. +1 – sammywemmy Jun 06 '20 at 14:04
@sammywemmy `"` this single quote made us go mad. – Ch3steR Jun 06 '20 at 14:08
This is helpful, and indeed it does distinguish the cases... but the `quoting=3`: it preserves the quotes around strings? I think in general (at least for my cases) I would like them as type `str`, but without the quotes. – Michal Charemza Jun 06 '20 at 14:08
`df.replace('"NA"','NA')` this can be added but couldn't find anything on how to do it with `read_csv`.Consider adding bounty to attract more answerers. @MichalCharemza – Ch3steR Jun 06 '20 at 14:41

Pandas: parse CSV differentiating between the missing value sentinel and a string that happens to equal it

1 Answers1