
I'm working on a data.frame with about 700,000 rows. It contains the ids of status updates and the corresponding usernames from Twitter. I just want to know how many different users are in there and how many times each of them has tweeted. I thought this would be a very simple task using tables, but now I've noticed that I'm getting different results.

Recently I did it by converting the column to character, like this:

>freqs <- as.data.frame(table(as.character(w_dup$from_user)))
>nrow(freqs)
[1] 239678

Two months ago I did it like this:

>freqs <- as.data.frame(table(w_dup$from_user))
>nrow(freqs)
[1] 253594

I noticed that this way the data frame contains usernames with a frequency of 0. How can that be? If a username is in the dataset, it must occur at least once.

?table didn't help me, and I wasn't able to reproduce the issue on smaller datasets.

What am I doing wrong? Or am I misunderstanding the use of tables?

Brian Tompsett - 汤莱恩
supersambo
  • I made a similar error in my question, though I wanted to [keep the zero-frequency counts](http://stackoverflow.com/q/13705060/610108) in my table. `table` produces a contingency table, `tabular` produces a frequency table. – ThomasH Nov 11 '15 at 13:18

1 Answer

The type of the column is the problem here: it is a factor, and the levels of a factor stay the same when you subset the data frame:

# Full data frame (since R 4.0.0, stringsAsFactors = TRUE is needed to get a factor column)
(df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE))
  x y
1 a 1
2 b 2
3 c 3
# Its structure - all three levels as it should be
str(df)
'data.frame':   3 obs. of  2 variables:
 $ x: Factor w/ 3 levels "a","b","c": 1 2 3
 $ y: int  1 2 3
# A smaller data frame
(newDf <- df[1:2, ])
  x y
1 a 1
2 b 2
# But the same three levels
str(newDf)
'data.frame':   2 obs. of  2 variables:
 $ x: Factor w/ 3 levels "a","b","c": 1 2
 $ y: int  1 2

So the first column is a factor. In this case:

table(newDf$x)

a b c 
1 1 0 

all the levels ("a", "b", "c") are counted, including the unused level "c", which gets a frequency of 0. And here

table(as.character(newDf$x))

a b 
1 1 

the values are no longer a factor, so only the values that actually occur are counted.
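
If you would rather keep the column as a factor, one option is to drop the unused levels before tabulating. The following is only a minimal sketch on the toy data from above: `cleanDf`, `counts` and `n_users` are names made up for illustration, and it assumes the column really is a factor (hence the explicit `stringsAsFactors = TRUE`).

```r
# Toy data from the example above; stringsAsFactors = TRUE forces a
# factor column regardless of the R version (assumption for this sketch).
df <- data.frame(x = letters[1:3], y = 1:3, stringsAsFactors = TRUE)
newDf <- df[1:2, ]

# The subset still carries all three levels of the original factor.
nlevels(newDf$x)

# droplevels() removes the levels that no longer occur in the data.
cleanDf <- droplevels(newDf)
counts <- table(cleanDf$x)
print(counts)

# Number of distinct values actually present, independent of the levels.
n_users <- length(unique(newDf$x))
print(n_users)
```

Applied to the question, the same idea would be `table(droplevels(w_dup$from_user))`, which should give the same row count as the `as.character()` version.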

Julius Vainora
  • Thanks. Now I see that the problem has to do with levels, but I'm not really sure why there are more levels than values occurring in the source of my table. In your example (table(df[1:2, 1])) you use just a part of your table, but I use the whole column. However, my df w_dup is a subset of another data frame, which I reduced to tweets within my investigation period. Are the levels kept although I create a totally new df? – supersambo Sep 01 '12 at 10:53