
I have a dataset of environmental data collected by 17 sensors (each with its own device_id). Readings are taken every 10 minutes over 3 months. I am trying to find the main variables that influence changes in a dependent variable (pm25).

The options I have considered are:

  • GEE (Generalized Estimating Equation) analysis
  • Random Forest
  • PCA

Rather than just analyzing the effect of IndVariableA and IndVariableB on DepVariableC, I would like the analysis to consider the date/time if possible and the device_id as a clustering factor. The Proximity variables are constant for a particular device_id.

I tried doing these in SPSS and in Python, but I am not able to interpret the results properly, and I am not sure whether the parameters I'm entering are correct. This is a snippet of the data I am using.

   device_id       date     time   temp    hum  pm1  pm10  pm25  tvoc  eco2
0      14384  7/11/2021  4:11:00  25.92  68.67    4     7     7    93  1016
1      14384  7/11/2021  4:21:00  26.21  66.66    3     4     4    62   813
2      14389  7/11/2021  4:22:00  29.12  55.52    8    13    13     7   450
3      14392  7/11/2021  4:22:00  24.33  51.44    0     0     0     0   400
4      14389  7/11/2021  4:31:00  28.52  56.60    7    11    11    12   483

   pressure  AQI  ProximitytoPark  ProximitytoAve
0     98.20   23             0.06            0.07
1     98.19   13             0.06            0.07
2     96.97   42             0.52            0.16
3     97.46    0             0.03            1.00
4     96.97   35             0.52            0.16

   Proximitytohighway  ProximitytoTrainTracksBusway
0                0.49                          0.56
1                0.49                          0.56
2                0.32                          1.60
3                0.78                          2.20
4                0.32                          1.60
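For the date/time part, one possible first step is to combine the `date` and `time` columns into a single timestamp and derive time features from it. This is only a minimal sketch on three rows copied from the snippet above. It assumes the dates are month/day/year (the snippet alone does not settle this), and the `timestamp`, `hour`, and `minutes_since_start` column names are made up for illustration.

```python
import pandas as pd

# Three rows copied from the snippet above; only the columns needed here.
df = pd.DataFrame({
    "device_id": [14384, 14384, 14389],
    "date": ["7/11/2021", "7/11/2021", "7/11/2021"],
    "time": ["4:11:00", "4:21:00", "4:22:00"],
    "pm25": [7, 4, 13],
})

# ASSUMPTION: dates are month/day/year. If they are day/month/year,
# the format string would be "%d/%m/%Y %H:%M:%S" instead.
df["timestamp"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%m/%d/%Y %H:%M:%S"
)

# Order readings in time within each device.
df = df.sort_values(["device_id", "timestamp"]).reset_index(drop=True)

# Hypothetical time features: hour of day, and minutes elapsed since
# each device's first reading.
df["hour"] = df["timestamp"].dt.hour
first_reading = df.groupby("device_id")["timestamp"].transform("min")
df["minutes_since_start"] = (df["timestamp"] - first_reading).dt.total_seconds() / 60

print(df[["device_id", "timestamp", "hour", "minutes_since_start"]])
```

Features like `hour` could then be added to a model formula, and a within-device time index is what a time-aware correlation structure would need.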

I used the statsmodels API to obtain the following results, but I am not confident the result is sound, as I would like the observations to be clustered by device_id.

Code:

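import statsmodels.api as sm
import statsmodels.formula.api as smf

# X is the DataFrame shown above
model = smf.gee(
    "pm25 ~ ProximitytoPark + ProximitytoAve + Proximitytohighway"
    " + ProximitytoTrainTracksBusway + temp + hum + device_id",
    groups="pm25",
    data=X,
    family=sm.families.Gaussian(),
)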

Results:

                          GEE Regression Results
Dep. Variable:         pm25                   No. Observations:    230793
Model:                 GEE                    No. clusters:        196
Method:                Generalized            Min. cluster size:   1
                       Estimating Equations   Max. cluster size:   26232
Family:                Gaussian               Mean cluster size:   1177.5
Dependence structure:  Independence           Num. iterations:     60
Date:                  Mon, 19 Jun 2023       Scale:               226.346
Covariance type:       robust                 Time:                16:57:21

                                  coef  std err       z  P>|z|   [0.025    0.975]
Intercept                     861.5242  357.682   2.409  0.016  160.481  1562.568
ProximitytoPark                 1.2047    4.722   0.255  0.799   -8.050    10.460
ProximitytoAve                  3.9539    1.870   2.114  0.034    0.288     7.619
Proximitytohighway             -2.1691    3.306  -0.656  0.512   -8.650     4.312
ProximitytoTrainTracksBusway   -2.4669    0.577  -4.274  0.000   -3.598    -1.336
temp                            0.7586    0.119   6.384  0.000    0.526     0.992
hum                             0.1702    0.054   3.170  0.002    0.065     0.275
device_id                      -0.0606    0.025  -2.409  0.016   -0.110    -0.011

Skew:           5.9555   Kurtosis:           179.9029
Centered skew:  0.2678   Centered kurtosis:  1.1036

Additionally, I have considered PCA and Random Forest, but I am not sure they are right for this, as they do not account for the time-series structure (if I understand correctly).

I would appreciate any help in identifying a method of statistical analysis that will help me identify the factors that influence the dependent variable (pm25).

  • Interesting question, but it's off topic here; try stats.stackexchange.com instead. Inferring influence or causation is a central problem, but unfortunately it's as difficult as it is important. I agree that PCA and RF aren't applicable here and you can put them aside. One way to assess influence is to build a model which predicts the dependent variable, and then look at all possible subsets of independent variables -- is there any subset which is as good as the whole set of independent variables? Is there any one variable or small set of variables which is almost as good as the whole set? – Robert Dodier Jun 20 '23 at 15:47
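The subset comparison suggested in the comment above can be sketched roughly as follows. This is a toy version under stated assumptions: the data are synthetic (pm25 is generated from temp and hum only, with pressure as a deliberately irrelevant variable), the model is plain least squares, and the score is in-sample R², none of which comes from the original post.

```python
from itertools import combinations

import numpy as np

# Synthetic data: the response depends on temp and hum, not on pressure.
rng = np.random.default_rng(1)
n = 500
predictors = {
    "temp": rng.normal(26, 2, n),
    "hum": rng.normal(60, 8, n),
    "pressure": rng.normal(98, 1, n),
}
y = 0.8 * predictors["temp"] + 0.2 * predictors["hum"] + rng.normal(0, 1, n)


def r_squared(names):
    """In-sample R^2 of a least-squares fit of y on the named predictors."""
    A = np.column_stack([np.ones(n)] + [predictors[k] for k in names])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()


# Score every non-empty subset of predictors.
scores = {
    subset: r_squared(subset)
    for k in range(1, len(predictors) + 1)
    for subset in combinations(predictors, k)
}

# Best subsets first: a small subset scoring close to the full set
# suggests the remaining variables add little.
for subset, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{' + '.join(subset):<24} R^2 = {r2:.3f}")
```

In-sample R² never decreases when variables are added, so on real data a held-out score (for example, leaving out whole devices) would likely be a fairer basis for comparison.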
