I have a data set in which I am trying to model the rate of TB cases per unit of population. Am I correct in thinking that finding the rate of TB per unit of population is as simple as this:

rate <- TBdata$TB/TBdata$Population

My data frame is called TBdata and has the following variables:

head(TBdata)
  Indigenous Illiteracy Urbanisation Density Poverty Poor_Sanitation Unemployment Timeliness  Year    TB Population Region   lon    lat    
1      0.335       6.35         84.1   0.714    31.3            15.3         5.41       59.2  2012   323     559543  11001 -60.7 -12.1  0.000577
2      6.45        8.49         71.4   0.743    48.6            29.4         5.92       58.1  2012    15      73193  11002 -64.0  -9.43
user438383
Joe

1 Answer

Yes! R is vectorized, which means arithmetic operators such as `/` work element-wise on whole vectors.

In many programming languages we need a for loop for this kind of calculation,

r <- numeric(nrow(TBdata))
for (i in seq_len(nrow(TBdata))) {
  r[i] <- TBdata[i, 'TB'] / TBdata[i, 'Population']
}
r
# [1] 0.0005772568 0.0002049376

whereas in R we simply do:

TBdata$TB/TBdata$Population
# [1] 0.0005772568 0.0002049376

This isn't magic, of course. The vectorized expression is passed to a C implementation under the hood, which is ultimately a for loop as well; it is just much faster than the same loop written in R.


Data:

TBdata <- structure(list(Indigenous = c(0.335, 6.45), Illiteracy = c(6.35, 
    8.49), Urbanisation = c(84.1, 71.4), Density = c(0.714, 0.743), 
    Poverty = c(31.3, 48.6), Poor_Sanitation = c(15.3, 29.4), 
    Unemployment = c(5.41, 5.92), Timeliness = c(59.2, 58.1), 
    Year = c(2012L, 2012L), TB = c(323L, 15L), Population = c(559543L, 
    73193L), Region = 11001:11002, lon = c(-60.7, -64), lat = c(-12.1, 
    -9.43)), class = "data.frame", row.names = c(NA, -2L))
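
If you want to keep the result, a minimal sketch along these lines should work. It assumes only the `TB` and `Population` columns, with the two rows copied from the question's `head()` output. The small data frame `tb` and the `rate_per_100k` column are names made up for illustration, and scaling to cases per 100,000 is just a common reporting convention, not something the question asks for.

```r
# Two rows taken from the question's head(TBdata) output,
# reduced to the columns needed here (tb is a hypothetical name).
tb <- data.frame(
  TB         = c(323L, 15L),
  Population = c(559543L, 73193L)
)

# Vectorized division: one rate per row, no explicit loop.
tb$rate <- with(tb, TB / Population)

# Assumed convenience column: cases per 100,000 inhabitants.
tb$rate_per_100k <- tb$rate * 1e5

print(tb)
```

The first value of `tb$rate` is about 0.000577, which agrees with the last (unnamed) column shown in the question's output.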
jay.sf
  • Nice, I wasn't 100% sure this would work. Would logging the data and using an offset give the same results? Not sure which is the 'correct' method here – Joe Jul 15 '22 at 17:46
  • @Joe Not sure what you mean by "logging"? – jay.sf Jul 15 '22 at 17:48
  • Sorry, that was my poor way of saying a log transform. Just `L_TB = log(TBdata$TB)`, then offsetting it in the model with `offset(L_TB)`. Or maybe we log and offset the population here, I'm not sure – Joe Jul 15 '22 at 17:51
  • @Joe Yes, by doing `TBdata <- transform(TBdata, L_TB=log(TB))` the entire vector gets "logged" at once. For `offset`ing, there's a nice [answer](https://stackoverflow.com/a/16920820/6574038) around that you might want to read. You may also use tricks like `lm(log(mpg) ~ I(hp^2), mtcars)` in model formulae to quickly check an idea, where the `I()` is needed if you use operators such as `^`. – jay.sf Jul 15 '22 at 17:57
  • Think I understand this now, so to put it into practice, this model would be correct: `mod = gam(TB ~ offset(log(Population)) + s(Indigenous, k = 10, bs = "tp") + s(Urbanisation, k = 10, bs = "tp") + s(Density, k = 10, bs = "tp") + Region + s(lon, lat), data = TBdata, family = poisson(link = 'log'))` – Joe Jul 15 '22 at 18:18
  • @Joe Looks technically okay to me and should do what is expected. I admit, though, I don't use `offset` very much and can't confirm if your model is really specified correctly. – jay.sf Jul 15 '22 at 18:25
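
To make the offset idea from the comments a bit more concrete, here is a hedged sketch using base R's `glm()` rather than `mgcv::gam()`, with an intercept-only model and the two rows from the question. Those simplifications are assumptions made to keep the example small; this is not the model from the thread. With a Poisson family, a log link, and `offset(log(Population))`, the linear predictor describes the log of the rate `TB / Population`, so the exponentiated intercept should match the pooled rate `sum(TB) / sum(Population)`.

```r
# Two rows from the question's head(TBdata) output (tb is a hypothetical name).
tb <- data.frame(
  TB         = c(323L, 15L),
  Population = c(559543L, 73193L)
)

# Poisson model for counts, with log(Population) as an offset,
# so the coefficients describe the log rate per unit of population.
fit <- glm(TB ~ 1 + offset(log(Population)),
           family = poisson(link = "log"),
           data = tb)

# Rate implied by the intercept-only model.
model_rate <- unname(exp(coef(fit)[1]))

# Pooled rate computed directly from the data.
pooled_rate <- sum(tb$TB) / sum(tb$Population)

print(c(model = model_rate, pooled = pooled_rate))
```

Note that the offset is the log of the population (the exposure), not the log of the TB counts; the counts stay untransformed on the left-hand side of the formula.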