I'm looking to model an outcome along the lines of this:
Energy ~ Specimen_Region + Specimen_Thickness + Tissue_Coefficient + ... + Age + Max_Diam
Here Energy is a quantitative outcome to be modeled on ~15 covariates, some of which are patient-level demographic data and some of which are experiment-level (specimen-level) data. However, the observations are not independent.
We are most interested in evaluating the effect of age on energy when stratified by Specimen_Region, a factor variable with four levels (root, proximal, middle, distal).
The difficulties I am running into fall under the following headings:
- Clustering (Dependence)
- Varying Covariates
- Non-Linearity
Clustering: Some covariates (age, sex, etc.) are constant within a patient, and we have repeat tissue samples from the same patient (up to 4 samples per patient). The samples are therefore not independent.
Varying Covariates: With each repeat sample, the region from which the sample is drawn varies. Each patient contributes between 1 and 4 samples, and every sample comes from one of the four regions (root, proximal, middle, distal).
The above two issues present a problem because some covariates belong to the individual patient while others belong to the individual specimen. Accounting for both within one model seems to call for something akin to time-varying covariates. However, the covariate that varies is not time; it is a discrete factor (region). Additionally, when considering something like a hierarchical model clustered on individual patients, each cluster (patient) contains only a few data points (up to 4), which seems too few compared with the examples I have read about. Clustering on region instead of patient may address this problem, but we then cannot draw conclusions about region within a single patient.
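To make the hierarchical option concrete, here is one way a patient-clustered model of this kind could be written down. This is only a sketch: the notation is introduced here for illustration and is not from any fitted model, and it assumes a linear age effect, which is exactly the assumption questioned in this post. Index i denotes the patient and j the specimen within that patient; r[ij] is the region of specimen j, x_ij collects the remaining covariates, and root is taken as the reference region.

```latex
\begin{aligned}
\text{Energy}_{ij} &= \beta_0 + \beta_{r[ij]}
  + \bigl(\gamma + \delta_{r[ij]}\bigr)\,\text{Age}_i
  + \mathbf{x}_{ij}^{\top}\boldsymbol{\theta}
  + u_i + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}\!\left(0, \sigma_u^2\right), \qquad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma^2\right), \qquad
\beta_{\text{root}} = \delta_{\text{root}} = 0 .
\end{aligned}
```

In this formulation Age_i carries only the patient index, while region and the other specimen-level covariates carry both indices, so the two kinds of covariate sit in the same equation. The patient-specific intercept u_i represents the dependence among samples from one patient, and gamma + delta_r is the age slope within region r, which is the quantity of primary interest.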
Non-Linearity: The remaining problem is that I have little reason to assume this relationship is linear. My team and I have previously used methods such as Random Forest to reduce the linearity assumptions in our models. In trying to answer this question, we haven't been able to adapt these methods to account for the clustering/grouping and varying-covariate issues described above.
Ultimately, I've looked at things like Random Forest (R package randomForestSRC, function rfsrc) and tried very complex methods such as longitudinal boosting strategies (R packages boostmtree, BoostMLR). Each has left me and the clinical team somewhat more confused and concerned that we are shopping around for the plot we "like" rather than the one that is correct (e.g., changing tuning parameters until the curves change).
Given this, I was wondering if anyone had recommendations for appropriate methodologies, or experience with these types of problems.
Here is a picture of some sham data demonstrating the outcome (energy) and the varying covariate (region).