library(survival)
library(survminer)
library(dplyr)


ovarian <- ovarian   # copy the lazy-loaded survival::ovarian data into the workspace so a column can be added
set.seed(1)          # no seed in the original question; added so the fake weights are reproducible
ovarian$weighting <- sample(1:100, 26, replace = TRUE)   # fake case weights, one per row

fitWEIGHT   <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian, weights = weighting)
fitNOWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian)

In the example above, the R-squared reported for fitWEIGHT equals 1, while the same model without the fake sample weights has an R-squared of less than 0.5. Why is this happening?
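
For reference, the values being compared can be recomputed from the fitted objects. This is only a sketch: rsq_from_fit is a hypothetical helper written here (not part of the original code), it uses the likelihood-ratio form 1 - exp(-LR/n), and depending on your survival version it may not match the Rsquare line that summary() prints to the last digit.

rsq_from_fit <- function(fit) {
  lr <- 2 * (fit$loglik[2] - fit$loglik[1])  # LR statistic of the fitted model vs. its null model
  1 - exp(-lr / fit$n)                       # n is the number of rows, not the sum of the weights
}
rsq_from_fit(fitWEIGHT)    # essentially 1 with the fake 1:100 weights
rsq_from_fit(fitNOWEIGHT)  # below 0.5 without weights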

bvowe

1 Answer


Weighting here is effectively repeating the observations. You're generating the weights with ovarian$weighting <- sample(1:100, 26, replace = TRUE), i.e. a uniform random draw between 1 and 100 for each of the 26 rows, and coxph treats them as case (replication) weights. Re-observing each row dozens of times over inflates the apparent agreement between your dependent and independent variables. It's probably not perfectly correlated; rather, the 1:100 range is likely blowing the pseudo R-squared out beyond the default number of significant digits, so it rounds to 1. If you change the sample to 1:10 or 40:50 or something similar, the bias would likely remain, but the R-squared should come back to a nearly-1 value instead of the rounded-to-1 value you're seeing under the current weighting strategy.
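
To illustrate the point about the weight range, here is a small sketch (the w_small column name is made up for this example); with narrower fake weights the likelihood-ratio based pseudo R-squared stays high but no longer rounds to 1:

set.seed(2)                                          # fake, narrower weights
ovarian$w_small <- sample(1:10, 26, replace = TRUE)
fit_small <- coxph(Surv(futime, fustat) ~ age + rx,
                   data = ovarian, weights = w_small)
1 - exp(-2 * diff(fit_small$loglik) / fit_small$n)   # high, but visibly less than 1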

For additional discussion of weights for this function, see the documentation linked below, to make sure the weights you're specifying are the kind of weights you expect for this analysis. coxph really weights the observation count (i.e., a form of over/re-sampling the observations you assign the weights to). https://www.rdocumentation.org/packages/survival/versions/2.43-3/topics/coxph

Where it states:

Case weights: Case weights are treated as replication weights, i.e., a case weight of 2 is equivalent to having 2 copies of that subject's observation. When computers were much smaller, grouping like subjects together was a common trick used to conserve memory. Setting all weights to 2 for instance will give the same coefficient estimate but halve the variance. When the Efron approximation for ties (default) is employed, replication of the data will not give exactly the same coefficients as the weights option, and in this case the weighted fit is arguably the correct one.

When the model includes a cluster term or the robust=TRUE option the computed variance treats any weights as sampling weights; setting all weights to 2 will in this case give the same variance as weights of 1.
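
Here is a small sketch of the replication-weight behaviour described in that quote. It uses ties = "breslow" so that the duplicated-rows fit matches the weighted fit exactly; with the default Efron approximation the two differ slightly, as the quote notes. The fit_w2, fit_dup and fit_1 names are just illustrative.

fit_w2 <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian,
                weights = rep(2, nrow(ovarian)), ties = "breslow")
fit_dup <- coxph(Surv(futime, fustat) ~ age + rx,
                 data = ovarian[rep(1:nrow(ovarian), 2), ], ties = "breslow")
fit_1 <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian, ties = "breslow")
coef(fit_w2)                            # same point estimates as...
coef(fit_dup)                           # ...literally fitting every row twice, and as...
coef(fit_1)                             # ...the unweighted fit
diag(vcov(fit_w2)) / diag(vcov(fit_1))  # variances roughly halved, as the documentation says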

Soren
  • Thanks so much, @Soren. But here is the key: the sample above is just one example, as I cannot share the actual data because it is school-level data. However, the weights in that data set range drastically from 200 to over a thousand (it is school-district-level data). So how do I use weighted analysis without getting a perfect R-squared? – bvowe Mar 03 '19 at 19:20
  • Your question is hard to answer without knowing your objectives or what you're interested in weighting your data by; it has more to do with your survey design and needs than with the R function. I modified the answer above with information about how weights work in this function. In your actual data, the weights range from 200 to 1000 of what, exactly? For example, maybe your "200" observations are undersampled for whatever reason but still representative; you could assign only those a weight=5 to compare with the "1000" sample – Soren Mar 03 '19 at 19:28
  • @bvowe You haven't convinced me that this is a problem. No one should pay any attention to R-squared for survival models. You should be looking at the log-likelihood and doing model comparisons for any sort of statistical inference. You should also be thinking about whether the case weights are really justified in your study. Just because your schools have 200 to 1000 students each does not warrant case weights of 200 to 1000 for the particular measurements. – IRTFM Mar 03 '19 at 21:00
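
Following the last comment's suggestion, here is a minimal sketch of comparing nested Cox models on the same (unweighted) data via their log-likelihoods instead of a pseudo R-squared; the fit_age and fit_full names are made up for illustration.

fit_age <- coxph(Surv(futime, fustat) ~ age, data = ovarian)
fit_full <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian)
fit_full$loglik           # log partial likelihoods: c(initial/null model, fitted model)
anova(fit_age, fit_full)  # likelihood-ratio test between the nested fits
AIC(fit_age, fit_full)    # or compare via an information criterion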