
I took an exam that was done in R. During the exam I had to read a CSV file, which I did the way I usually do, with the read.csv function. More specifically, I ran

mydata <- read.csv("data.csv", sep=",", header = T, stringsAsFactors = T)

I looked at the data.csv file: it had a header, was separated by commas, and contained a mix of categorical variables (represented with letters) and numerical variables.

However, I was unable to filter the data on a column when that column was a factor. When I ran something like mydata[mydata[["Gender"]] == "Man",], I received an error that I can't remember word for word, but it said something along the lines of not being able to compare strings and factors. The error itself made some sense, but I did not know what to do, since it had never happened to me before when filtering data frames on categorical variables in the same way.
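For reference, here is a minimal sketch of how base R treats these comparisons, using made-up vectors rather than the exam data. Comparing a factor with a string via `==` is allowed, whereas comparing two factors with different level sets does raise an error that mentions factors. Whether that is what happened at the exam is only a guess; the names `gender` and `other` are hypothetical.

```r
# Hypothetical vectors for illustration; this is not the exam data.
gender <- factor(c("Man", "Woman", "Man"))

# factor == string works: the comparison is made on the level labels.
matches <- gender == "Man"
print(matches)

# Comparing two factors whose level sets differ raises an error.
other <- factor("Man")  # its only level is "Man"
res <- tryCatch(
  gender == other,
  error = function(e) conditionMessage(e)  # capture the message as a string
)
print(res)
```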


I got home and tried to reproduce the error by creating the following dummy data file, testdata.csv:

Gender,Age
Man,45
Woman,33
Man,22

and executing the following code:

mydata <- read.csv("testdata.csv", sep=",", header = T, stringsAsFactors = T)
mydata[mydata[["Gender"]] == "Man",]

But this time I got the output I had expected at the exam: all the rows for which the (categorical) Gender variable has the value Man:

  Gender Age
1    Man  45
3    Man  22
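The same reproduction can be run without a file on disk by passing the CSV contents through the `text` argument of read.csv. This is a sketch for convenience only; `csv_text` and `men` are names introduced here.

```r
# Same dummy data as testdata.csv, supplied inline instead of from a file.
csv_text <- "Gender,Age
Man,45
Woman,33
Man,22"

mydata <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Filter rows where the factor column equals the string "Man".
men <- mydata[mydata[["Gender"]] == "Man", ]
print(men)
```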

The only other difference I can remember is that at the exam, executing something like mydata[1,1] gave output like this:

[1] Man
Levels: Man Woman

but with some added information about it being a factor, if I remember correctly. In contrast, when I run the command on my dummy data, I only get the output above. I don't recall doing anything differently, but I feel like I must have. (Note that I do not consider the added information to be an error; it is just an observation made while trying to figure out what I might have done differently.)
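To compare two sessions, the type of the column can be inspected directly rather than inferred from the printed output. A small sketch on the dummy data (loaded inline here so the snippet is self-contained):

```r
mydata <- read.csv(text = "Gender,Age\nMan,45\nWoman,33\nMan,22",
                   stringsAsFactors = TRUE)

col_class <- class(mydata$Gender)       # class of the whole column
cell_is_factor <- is.factor(mydata[1, 1])  # a single cell keeps the factor class
col_levels <- levels(mydata$Gender)     # all levels, not only those displayed

print(col_class)
print(cell_is_factor)
print(col_levels)
str(mydata)  # compact overview of every column's type
```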

I would like to avoid ending up in the same situation again, but since I struggle to reproduce the error, it will be hard to correct whatever I did wrong. So my question is: does anybody have an idea of what could have caused this "error" (being unable to filter the data frame on a column's values when the column is a categorical variable)?

  • The second part is because you have a `factor` column, so it shows all the `levels`. You can add `droplevels` to drop the unused levels: `subdat <- droplevels(mydata[mydata[["Gender"]] == "Man",])`. Or you may convert the column to `character`. `stringsAsFactors = TRUE` will keep string columns as `factor`; with `= FALSE` they will be of class `character`. – akrun Jan 14 '22 at 19:20
  • @akrun OK, good to know. But I do not consider the display of the factor levels to be an error; it was just an observation. And yes, `stringsAsFactors = TRUE` will keep string columns as factors, but I've done the same thing in my dummy example and I can still filter factor columns based on string values, so that should not be the problem. – DancingIceCream Jan 14 '22 at 19:24
  • You said there was an error. What error did you receive? If you read the column as a factor and there are leading/trailing spaces in it, a comparison with `==` may not match. You may need to check `levels(mydata$Gender)`. – akrun Jan 14 '22 at 19:25
  • @akrun Unfortunately I can't remember the error verbatim. As I wrote, it said something about not being able to compare factors to strings when I tried to execute something like `mydata[mydata[["Gender"]] == "Man",]`. – DancingIceCream Jan 14 '22 at 19:26
  • OK, then there is not much to go on for debugging the issue. My understanding is that the values are not matching because of leading/trailing spaces. – akrun Jan 14 '22 at 19:27
  • Try with `subdat <- droplevels(mydata[as.character(mydata[["Gender"]]) == "Man",])` – akrun Jan 14 '22 at 19:28
  • @akrun Thanks for your suggestions! Just to be clear, one problem I have is that I cannot reproduce the error. – DancingIceCream Jan 14 '22 at 19:30
  • Then your original error was probably the result of some environment issue, which got resolved when you tried in a new session. – akrun Jan 14 '22 at 19:31
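The whitespace hypothesis from the comments can be checked with a small sketch like the one below. The padded data is made up for illustration and is not the exam file; `padded`, `hits`, and `clean` are names introduced here. Note that in this scenario `==` silently matches nothing rather than raising an error, so it may not explain the original message.

```r
# Hypothetical data with a leading space on row 1 and a trailing space on row 3.
csv_text <- "Gender,Age\n Man,45\nWoman,33\nMan ,22"
padded <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(levels(padded$Gender))        # the padded labels show up as separate levels
n_exact <- sum(padded$Gender == "Man")  # exact comparison finds no rows

# Compare on trimmed character values instead of the raw factor.
hits <- trimws(as.character(padded$Gender)) == "Man"
subdat <- droplevels(padded[hits, ])

# Alternatively, strip the whitespace while reading the file.
clean <- read.csv(text = csv_text, stringsAsFactors = TRUE, strip.white = TRUE)
n_clean <- sum(clean$Gender == "Man")
```

Trimming at read time with `strip.white = TRUE` keeps the factor levels clean for every later comparison, while the `trimws` approach only fixes the one filter.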
