I am trying to run a regression model that includes fixed effects for cities in the United States. I have over 10,000,000 rows and 600 cities. The code below works, but it is really slow. When a model includes a factor with many levels, is there any way to fit it faster?
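# Simulated data: 1,000,000 rows, 250 cities
x <- data.frame(
  a        = sample(1:1000, 1000000, replace = TRUE),
  cityfips = sample(1:250,  1000000, replace = TRUE),
  d        = sample(1:4,    1000000, replace = TRUE)
)
# cityfips as a numeric covariate (fast, but not a fixed-effects model)
system.time(a1 <- lm(a ~ cityfips + d, x))
# cityfips as a factor: one dummy column per city (slow)
system.time(a2 <- lm(a ~ as.factor(cityfips) + d, x))
# Sparse-matrix fit; slm() is assumed here to come from the SparseM package
library(SparseM)
system.time(a3 <- slm(a ~ as.factor(cityfips) + d, x))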
> system.time(a1 <- lm( a~cityfips+d , x ) )
user system elapsed
0.22 0.00 0.22
> system.time(a2 <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
95.65 0.97 96.62
> system.time(a3 <- slm( a~as.factor(cityfips) + d , x ) )
user system elapsed
4.58 2.06 6.65
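For comparison, one standard way to avoid building a dummy column per city is the within transformation: subtract each city's mean from the outcome and from every other regressor, then regress the demeaned outcome on the demeaned regressors. By the Frisch-Waugh-Lovell theorem, the slope on d should match the dummy-variable fit. The sketch below uses only base R; the smaller sample size, the seed, and the names a_dm, d_dm, fit_within and fit_dummy are illustrative assumptions, not part of the original code.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
n <- 50000   # smaller than the original so the dummy fit finishes quickly

x <- data.frame(
  a        = sample(1:1000, n, replace = TRUE),
  cityfips = sample(1:250,  n, replace = TRUE),
  d        = sample(1:4,    n, replace = TRUE)
)

# Within transformation: remove each city's mean from outcome and regressor.
# ave() returns the group mean repeated for every row of that group.
x$a_dm <- x$a - ave(as.numeric(x$a), x$cityfips)
x$d_dm <- x$d - ave(as.numeric(x$d), x$cityfips)

# No intercept: the demeaned variables already average to zero in every city.
fit_within <- lm(a_dm ~ d_dm - 1, data = x)

# Reference fit with one dummy per city, for checking the slope only.
fit_dummy <- lm(a ~ as.factor(cityfips) + d, data = x)

# The two slope estimates should agree up to numerical error.
all.equal(unname(coef(fit_within)["d_dm"]), unname(coef(fit_dummy)["d"]))
```

The standard errors that summary(fit_within) reports are not directly comparable, because lm() does not know that 250 city means were estimated and so uses too many residual degrees of freedom. Packages written for this case, such as lfe (felm) and fixest (feols), apply the same idea and adjust the degrees of freedom.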