Incorrect data output with pandas

Question

I have a csv file that looks as follows:

start_date,end_date,pollster,sponsor,sample_size,population,party,subject,tracking,text,approve,disapprove,url
2020-02-02,2020-02-04,YouGov,Economist,1500,a,all,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,42,29,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,376,a,R,Trump,FALSE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,75,6,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,523,a,D,Trump,TRUE,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,21,51,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-02,2020-02-04,YouGov,Economist,599,a,I,Trump,,Do you approve or disapprove of Donald Trump’s handling of the coronavirus outbreak?,39,25,https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/73jqd6u5mv/econTabReport.pdf
2020-02-07,2020-02-09,Morning Consult,"",2200,a,all,Trump,TURE,Do you approve or disapprove of the job each of the following is doing in handling the spread of coronavirus in the United States? President Donald Trump,57,22,https://morningconsult.com/wp-content/uploads/2020/02/200214_crosstabs_CORONAVIRUS_Adults_v4_JB.pdf

I am interested in the column "tracking", which has the values "TRUE", "FALSE" or NaN.

For some reason, when I read the file with pandas, all of the "tracking" column values are loaded as "False":

import pandas as pd

data = pd.read_csv("covid_approval_polls.csv")
data.head()

start_date  end_date    pollster    sponsor     sample_size     population  party   subject     tracking    text    approve     disapprove  url
0   2020-02-02  2020-02-04  YouGov  Economist   1500.0  a   all     Trump   False   Do you approve or disapprove of Donald Trump’s...   42.0    29.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...
1   2020-02-02  2020-02-04  YouGov  Economist   376.0   a   R   Trump   False   Do you approve or disapprove of Donald Trump’s...   75.0    6.0     https://d25d2506sfb94s.cloudfront.net/cumulus_...
2   2020-02-02  2020-02-04  YouGov  Economist   523.0   a   D   Trump   False   Do you approve or disapprove of Donald Trump’s...   21.0    51.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...
3   2020-02-02  2020-02-04  YouGov  Economist   599.0   a   I   Trump   False   Do you approve or disapprove of Donald Trump’s...   39.0    25.0    https://d25d2506sfb94s.cloudfront.net/cumulus_...
4   2020-02-07  2020-02-09  Morning Consult     NaN     2200.0  a   all     Trump   False   Do you approve or disapprove of the job each o...   57.0    22.0    https://morningconsult.com/wp-content/uploads/.

..

When I search for the unique values of that column with the command:

data.tracking.unique()

I get the correct output:

array([False, True, nan], dtype=object)

But when I execute the command:

print(data[data["tracking"] == "FALSE"])

I get:

Empty DataFrame
Columns: [start_date, end_date, pollster, sponsor, sample_size, population, party, subject, tracking, text, approve, disapprove, url]
Index: []

I am quite sure I am missing something here, but I have no idea what might be causing the problem. I would like to get the rows where the column "tracking" has the value "FALSE".

@HenryEcker No, the CSV column "tracking" originally contains the uppercase values "TRUE" and "FALSE", but for some reason, when I load it with pandas, the whole column is loaded with the value "False", with no trace of the "TRUE" values (nor "True"). — Olaola, Jun 01 '21 at 19:11
You're looking to prevent `read_csv`'s behaviour of converting boolean-like string values to `bool`? — Henry Ecker, Jun 01 '21 at 19:13
I only want to get the rows where the value of the column "tracking" is "FALSE". I guess there is some kind of problem with converting boolean-like string values to bool, so yes, I would like to prevent this conversion if possible. Do you know of any way to do that? I have tried in PyCharm and Jupyter and the output is the same. — Olaola, Jun 01 '21 at 19:14
I couldn't reproduce your issue: FALSE is a string and `df['tracking'] == 'FALSE'` returns two True values. — Scott Boston, Jun 01 '21 at 19:37

score 2 · Answer 1 · answered Jun 01 '21 at 19:31

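By default, `read_csv` converts boolean-like strings such as TRUE and FALSE to booleans, so the column holds `False`, `True` and NaN rather than the string "FALSE", and the comparison with "FALSE" matches nothing. To force the type, use the `dtype` parameter: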

data = pd.read_csv("covid_approval_polls.csv", dtype={"tracking": str})
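
A minimal sketch of the difference, using a hypothetical three-row sample in place of covid_approval_polls.csv (the inline data and the names `csv_text`, `default` and `as_text` are assumptions for illustration only). It reads the same text twice, once with default type inference and once with `dtype={"tracking": str}`, and applies the filter from the question to each result:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file: one FALSE, one TRUE
# and one empty "tracking" value, as in the question.
csv_text = (
    "pollster,tracking\n"
    "YouGov,FALSE\n"
    "YouGov,TRUE\n"
    "YouGov,\n"
)

# Default inference: TRUE/FALSE are parsed as booleans, so the column
# holds False, True and NaN instead of strings.
default = pd.read_csv(io.StringIO(csv_text))
print(len(default[default["tracking"] == "FALSE"]))  # no rows match the string

# Forcing str keeps the raw text, so the string comparison works.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"tracking": str})
print(len(as_text[as_text["tracking"] == "FALSE"]))  # one row matches

# Alternative without re-reading: compare against the boolean value.
print(len(default[default["tracking"].eq(False)]))  # one row matches
```

If the parsed booleans are acceptable, the last filter is likely the smaller change; forcing `str` is the option to reach for when the original text of the column has to be preserved.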

Corralien
