I took an exam that was done in R. During the exam I had to read a CSV file, which I did the way I usually do, with the read.csv function. More specifically, I ran:
mydata <- read.csv("data.csv", sep=",", header = T, stringsAsFactors = T)
I looked at the data.csv file: it had a header, it was separated with commas, and it had a mix of categorical variables (represented with letters) and numerical variables.
However, I could not filter the data on the values of a column when that column was a factor. When I ran something like
mydata[mydata[["Gender"]] == "Man",]
I received an error that I can't remember word for word, but it said something along the lines of not being able to compare strings and factors. The error itself made some sense, but I did not know what to do, since it had never happened to me before when filtering data frames on categorical variables in the same way.
I got home and tried to reproduce the error by creating the following dummy data file, testdata.csv:
Gender,Age
Man,45
Woman,33
Man,22
and executing the following code:
mydata <- read.csv("testdata.csv", sep=",", header = T, stringsAsFactors = T)
mydata[mydata[["Gender"]] == "Man",]
But now I got the output I had expected at the exam: all the rows for which the (categorical) Gender variable has the value Man:
  Gender Age
1    Man  45
3    Man  22
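For anyone who wants to run the reproduction without creating testdata.csv on disk, here is a minimal sketch that passes the same three rows to read.csv through its text argument. The names csv_text and men are placeholders for this illustration only, and it assumes that reading from a string behaves the same as reading from the file.

```r
# Same contents as testdata.csv, held in a string instead of a file.
csv_text <- "Gender,Age
Man,45
Woman,33
Man,22"

# Same call as above, but reading from the string (assumption: equivalent to the file).
mydata <- read.csv(text = csv_text, sep = ",", header = TRUE, stringsAsFactors = TRUE)

# Compare the factor column with a string and keep the matching rows.
men <- mydata[mydata[["Gender"]] == "Man", ]
print(men)
```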
The one other difference that I can remember is that at the exam, when executing something like mydata[1,1], the output would look like so:
[1] Man
Levels: Man Woman
but with some added information about it being a factor, if I remember correctly. In contrast, when I execute the command with my dummy data, I only get the output above. I don't remember doing anything differently, but I feel like I must have. (Note that I do not consider the added information to be an error; it is just an observation from trying to figure out what I might have done differently.)
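Since I am relying on memory for what the exam output looked like, here is a rough sketch of the checks that would record exactly what type a column has. It uses only base R functions (class, levels, is.ordered, str); the data frame is rebuilt inline so the snippet stands on its own, and gender is just a placeholder name. It only shows what to look at and makes no claim about what caused the error at the exam.

```r
# Rebuild the dummy data inline so this snippet does not depend on a file.
mydata <- read.csv(text = "Gender,Age\nMan,45\nWoman,33\nMan,22",
                   header = TRUE, stringsAsFactors = TRUE)

gender <- mydata[["Gender"]]

print(class(gender))       # class of the column
print(levels(gender))      # the factor levels, if the column is a factor
print(is.ordered(gender))  # whether the factor is an ordered factor
str(mydata)                # compact overview of every column's type
```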
I would like to avoid ending up in the same situation again, but since I can't reproduce the error, it is hard to correct whatever I did wrong. So my question is: does anybody have an idea of what could have caused this "error" (that I was unable to filter the data frame on a column's values when the column was a categorical variable)?