
I have a data.frame similar to the one below. I pre-process it by deleting rows that I am not interested in. Most of my columns are factors, and their levels are not updated when I filter the data.frame.

I can see that what I am doing below is not ideal. How do I get the factor levels to update as I modify the data.frame? Below is a demonstration of what is going wrong.

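# generate data
library(plyr)  # provides count()
set.seed(2013)
df <- data.frame(site = sample(c("A","B","C"), 50, replace = TRUE),
                 currency = sample(c("USD", "EUR", "GBP", "CNY", "CHF"), 50, replace = TRUE, prob = c(10, 6, 5, 6, 0.5)),
                 value = ceiling(rnorm(50) * 10),
                 stringsAsFactors = TRUE)  # needed on R >= 4.0, where strings are no longer converted to factors by default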

# check counts to see there is one entry where currency =  CHF
count(df, vars="currency")

>currency freq
>1      CHF    1
>2      CNY   13
>3      EUR   16
>4      GBP    6
>5      USD   14


# filter out all entries where site = A, i.e. take a subset of df
df <- df[!(df$site=="A"),]

# check counts again to see how this affected the currency frequencies
count(df, vars="currency")

>currency freq
>1      CNY   10
>2      EUR    8
>3      GBP    4
>4      USD   10

# But, the filtered data.frame's levels have not been updated:
levels(df$currency)

>[1] "CHF" "CNY" "EUR" "GBP" "USD"

levels(df$site)

>[1] "A" "B" "C"

desired outputs:

# levels(df$currency) = "CNY" "EUR" "GBP" "USD"
# levels(df$site) = "B" "C"
divibisan
Zhubarb

1 Answer


Use droplevels:

> df <- droplevels(df)
> levels(df$currency)
[1] "CNY" "EUR" "GBP" "USD"
> levels(df$site)
[1] "B" "C"
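
If you would rather not remember a separate clean-up step, here is a minimal sketch of two variations on the same idea: wrapping the filter in droplevels() so both happen together, and re-calling factor() on one column when only that column should be trimmed. The toy data and the names sub and sub2 are made up for illustration and are not from the question.

```r
# Toy data; stringsAsFactors = TRUE is needed on R >= 4.0 to get factor columns
df <- data.frame(site = c("A", "B", "C", "B"),
                 currency = c("CHF", "USD", "EUR", "USD"),
                 stringsAsFactors = TRUE)

# Filter and drop unused levels in a single step
sub <- droplevels(df[df$site != "A", ])
levels(sub$site)
levels(sub$currency)

# Drop unused levels for one column only; other factors keep their full level set
sub2 <- df[df$site != "A", ]
sub2$currency <- factor(sub2$currency)
levels(sub2$currency)
levels(sub2$site)
```

In this sketch sub has only the levels that survive the filter, while sub2$site still lists "A" because only currency was rebuilt.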
Jérôme B
A5C1D2H2I1M1N2O1R2T1
  • Thanks, so I do this after filtering stuff out and it drops "unused levels from a factor or, more commonly, from factors in a data frame."? It just seems a bit strange that I have to remember dropping levels every time I apply a filter to the data.frame. (But I am sure there is a reason for that). Do you know why this has to be the case? – Zhubarb Dec 10 '13 at 16:25
  • @Berkan Because factors serve a specific statistical/data purpose: data which can take a specific set of values. Frequently when doing analyses you need the information on what all those levels could be, even if they don't appear. If this behavior displeases you, you should probably just use character columns instead. – joran Dec 10 '13 at 16:28
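
To make joran's point concrete, here is a small sketch (toy values, hypothetical variable names f and chr) of where retained levels are useful and how a character column behaves instead. It is only meant as an illustration, not a recommendation for either approach.

```r
# A factor that knows "A" is a possible value even though it never occurs
f <- factor(c("B", "C", "B"), levels = c("A", "B", "C"))
table(f)               # "A" is still reported, with a count of 0
table(droplevels(f))   # "A" no longer appears

# Character columns carry no level set, so nothing lingers after filtering
chr <- data.frame(site = c("A", "B", "C"), stringsAsFactors = FALSE)
unique(chr$site[chr$site != "A"])
```

The zero count in the first table is the kind of information that would be lost if levels were dropped automatically on every subset.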