I have a dataset of environmental data collected by 17 sensors (each with its own device_id). Readings are taken every 10 minutes, over 3 months. I am trying to find the main variables that drive changes in a dependent variable (pm25).
The options I have considered are:
- GEE (Generalized Estimating Equation) analysis
- Random Forest
- PCA
Rather than just analyzing the effect of IndVariableA and IndVariableB on DepVariableC, I would like the analysis to consider the date/time if possible and the device_id as a clustering factor. The Proximity variables are constant for a particular device_id.
I tried doing these in SPSS and in Python, but I am not able to interpret the results properly, and I am not sure whether the parameters I am entering are correct. This is a snippet of the data I am using.
   device_id       date     time   temp    hum  pm1  pm10  pm25  tvoc  eco2
0      14384  7/11/2021  4:11:00  25.92  68.67    4     7     7    93  1016
1      14384  7/11/2021  4:21:00  26.21  66.66    3     4     4    62   813
2      14389  7/11/2021  4:22:00  29.12  55.52    8    13    13     7   450
3      14392  7/11/2021  4:22:00  24.33  51.44    0     0     0     0   400
4      14389  7/11/2021  4:31:00  28.52  56.60    7    11    11    12   483

   pressure  AQI  ProximitytoPark  ProximitytoAve
0     98.20   23             0.06            0.07
1     98.19   13             0.06            0.07
2     96.97   42             0.52            0.16
3     97.46    0             0.03            1.00
4     96.97   35             0.52            0.16

   Proximitytohighway  ProximitytoTrainTracksBusway
0                0.49                          0.56
1                0.49                          0.56
2                0.32                          1.60
3                0.78                          2.20
4                0.32                          1.60
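To bring date/time into the analysis, the separate date and time columns first need to be combined into one timestamp. A minimal sketch using a few columns of the snippet above; it assumes the dates are month/day/year (they could equally be day/month/year), and the derived column names `timestamp`, `hour` and `obs_index` are made up for illustration:

```python
import pandas as pd

# A few columns of the five snippet rows shown above.
df = pd.DataFrame({
    "device_id": [14384, 14384, 14389, 14392, 14389],
    "date": ["7/11/2021"] * 5,
    "time": ["4:11:00", "4:21:00", "4:22:00", "4:22:00", "4:31:00"],
    "pm25": [7, 4, 13, 0, 11],
})

# Assumption: dates are month/day/year. Swap %m and %d if they are day/month/year.
df["timestamp"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%m/%d/%Y %H:%M:%S"
)

# Order readings within each sensor so that time-based features make sense.
df = df.sort_values(["device_id", "timestamp"]).reset_index(drop=True)

# Hypothetical derived features: hour of day and position within each sensor's series.
df["hour"] = df["timestamp"].dt.hour
df["obs_index"] = df.groupby("device_id").cumcount()

print(df[["device_id", "timestamp", "hour", "obs_index", "pm25"]])
```

A per-sensor index like `obs_index` is the kind of column that could be passed to a model that needs to know the order of observations within a cluster.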
I used the statsmodels API to obtain the following results, but I am not confident they are correct, as I would like the observations to be clustered by device_id.
Code:
model = smf.gee("pm25 ~ ProximitytoPark + ProximitytoAve + Proximitytohighway + ProximitytoTrainTracksBusway + temp + hum + device_id", "pm25", X, family = sm.families.Gaussian())
Results:
GEE Regression Results

Dep. Variable:                          pm25   No. Observations:      230793
Model:                                   GEE   No. clusters:             196
Method:                          Generalized   Min. cluster size:          1
                        Estimating Equations   Max. cluster size:      26232
Family:                             Gaussian   Mean cluster size:     1177.5
Dependence structure:           Independence   Num. iterations:           60
Date:                       Mon, 19 Jun 2023   Scale:                226.346
Covariance type:                      robust   Time:                16:57:21

                                  coef   std err       z   P>|z|    [0.025    0.975]
Intercept                     861.5242   357.682   2.409   0.016   160.481  1562.568
ProximitytoPark                 1.2047     4.722   0.255   0.799    -8.050    10.460
ProximitytoAve                  3.9539     1.870   2.114   0.034     0.288     7.619
Proximitytohighway             -2.1691     3.306  -0.656   0.512    -8.650     4.312
ProximitytoTrainTracksBusway   -2.4669     0.577  -4.274   0.000    -3.598    -1.336
temp                            0.7586     0.119   6.384   0.000     0.526     0.992
hum                             0.1702     0.054   3.170   0.002     0.065     0.275
device_id                      -0.0606     0.025  -2.409   0.016    -0.110    -0.011

Skew:            5.9555   Kurtosis:            179.9029
Centered skew:   0.2678   Centered kurtosis:     1.1036
Additionally, I have considered PCA and Random Forest, but I am not sure they are right for this, as they do not account for the time-series structure of the data (if I understand correctly).
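One way a Random Forest could be given some awareness of time and of the sensor grouping is to add time-derived features and to cross-validate by device. A minimal sketch on synthetic data, assuming scikit-learn is available: the feature names `hour_sin` and `hour_cos`, the made-up pm25 relationship, and all sizes are illustrative assumptions. Note that this treats time only as a feature and does not model autocorrelation between consecutive readings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data (assumption): 17 devices, 60 readings at 10-minute spacing.
rng = np.random.default_rng(0)
n_devices, n_obs = 17, 60
timestamps = pd.date_range("2021-07-11 04:00", periods=n_obs, freq="10min")
df = pd.DataFrame({
    "device_id": np.repeat(np.arange(n_devices), n_obs),
    "timestamp": np.tile(timestamps, n_devices),
    "temp": rng.normal(26, 2, n_devices * n_obs),
    "hum": rng.normal(60, 8, n_devices * n_obs),
})

# Encode hour of day cyclically so that 23:50 and 00:00 end up close together.
hour = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Made-up relationship in which temp dominates, used only for the demonstration.
df["pm25"] = 10 + 2.0 * df["temp"] + 1.5 * df["hour_sin"] + rng.normal(0, 1, len(df))

features = ["temp", "hum", "hour_sin", "hour_cos"]
X, y, groups = df[features], df["pm25"], df["device_id"]

# GroupKFold keeps all readings of a device in the same fold, so importance is
# measured on sensors the forest has not seen during training.
fold_importances = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    forest = RandomForestRegressor(n_estimators=50, random_state=0)
    forest.fit(X.iloc[train_idx], y.iloc[train_idx])
    perm = permutation_importance(
        forest, X.iloc[test_idx], y.iloc[test_idx], n_repeats=5, random_state=0
    )
    fold_importances.append(perm.importances_mean)

importances = pd.Series(
    np.mean(fold_importances, axis=0), index=features
).sort_values(ascending=False)
print(importances)
```

Permutation importance on held-out devices ranks predictors by how much the score drops when each one is shuffled; it indicates predictive relevance, not a causal effect or a significance test.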
I would appreciate any help in identifying a method of statistical analysis that will let me identify the factors that influence the dependent variable (pm25).