0

In the prediction block Apparently I have an error associated to the index and the predict() function, for some reason the index passes as integer and not as datetimeIndex, producing an error in the forecast results.

I verified the index (X, y) and they look fine.

Model (auto_arima) function parameters:

    ```
    def model(endog, exog):
        from pmdarima import auto_arima
    
        ml = auto_arima(endog, exog,
                        start_p=1,
                        start_q=1,
                        max_p=5,
                        max_d=3,
                        max_q=5,
                        start_P=1,
                        start_Q=1,
                        max_P=5,
                        max_D=3,
                        max_Q=5,
                        max_order=6,
                        m=1,
                        seasonal=True,
                        stationary=True,
                        information_criterion='aic',
                        alpha=0.05,
                        test='kpss',
                        seasonal_test='ocsb',
                        stepwise=True,
                        n_jobs=1,
                        start_params=None,
                        trend=None,
                        method='lbfgs',
                        maxiter=50,
                        offset_test_args=None,
                        seasonal_test_args=None,
                        suppress_warnings=True,
                        error_action='ignore',
                        trace=False,
                        random=True,
                        random_state=True,
                        n_fits=10,
                        return_valid_fits=False,
                        out_of_sample_size=0,
                        scoring='mse',
                        scoring_args=None,
                        with_intercept="auto")
        return ml
    
    
    
    ```

The entire code:

# --------------------------------------------imports-----------------------------------------#
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import tseries, DateOffset
from pandas.plotting import register_matplotlib_converters
from testAdf import *
from autoArima import model

# ---------------------------Handly Matplotlib errors in Matplotlib---------------------------#
from sqlalchemy.dialects.mssql.information_schema import columns

register_matplotlib_converters()
# Convert the PeriodIndex of the target variable (y) to a DatetimeIndex, to solve some issues with Matplotlib


# -------------------DataBase import connection from dbconnection file-------------------------#
from dbconnection import dbconnect

# -----------------------------------Accessing to MongoDB---------------------------------------#
db = dbconnect()

# ----------------------------Extracting collection in database-------------------------------- #
guanajibo = db.hu21010003  # in this case hydrological units 21010003 Rio Guanajibo, Puerto Rico

# -----------------Creating a DataFrame from DataBase collection hu21010003---------------------#
df = pd.DataFrame(list(guanajibo.find()))

# ------------------------Converting Datetime Mongodb object to datetime------------------------#
df['Date'] = pd.to_datetime(df['Date']).dt.date

# ----------------------------Converting Date to DatetimeIndex----------------------------------#
df = df.set_index(pd.DatetimeIndex(df['Date']))

# ------------------------------------Converting to a Quality index-------------------------------------#

# Value to Index % of Barometric Pressure mmHg
df['cBarometricP'] = df.Barometric.apply(lambda x: (100 if (767 > x >= 762)  # High Pressure
                                                    else (80 if (761 >= x >= 760)  # Normal Pressure
                                                          else (
    60 if (759 >= x >= 749)  # Low Pressure, Posible RainFall
    else (40 if (748 >= x >= 753)  # Low Pressure, Posible Storm
          else 0)))))  # Hurricane or Tornado

# Value to Index % of Turbidity
df['cTurbidity'] = df.Turbidity.apply(lambda x: (100 if (10 >= x >= 0)
                                                 else (80 if (20 >= x >= 10)
                                                       else (60 if (30 >= x >= 20)
                                                             else (40 if (40 >= x >= 30)
                                                                   else 0)))))

# Value to Index % of Temperature
df['cTemperature'] = df.Temperature.apply(lambda x: (100 if (25.6 >= x >= 22)
                                                     else (80 if (27.6 >= x >= 25.6)
                                                           else (60 if (29.6 >= x >= 27.6)
                                                                 else (40 if (31.6 >= x >= 29.6)
                                                                       else 0)))))

# Value to Index % of pH
df['cpH'] = df.pH.apply(lambda x: (100 if (8.5 >= x >= 7)
                                   else (80 if (8.6 >= x >= 8.5) or (6.9 >= x >= 6.8)
                                         else (60 if (8.8 >= x >= 8.6) or (6.8 >= x >= 6.7)
                                               else (40 if (9 >= x >= 8.8) or (6.7 >= x >= 6.5)
                                                     else 0)))))

# Value to Index % of Dissolved oxygen mg/L
df['cDO'] = df.DO.apply(lambda x: (100 if (8.0 >= x >= 6.0)
                                   else (80 if (6.0 >= x >= 5.0)
                                         else (60 if (5.0 >= x >= 4.0)
                                               else (40 if (4.0 >= x >= 3.0)
                                                     else 0)))))

# Value to Index % of Nitrogen
df['cNitrogen'] = df.Nitrogen.apply(lambda x: (100 if (2.5 >= x >= 0.00)
                                               else (80 if (5.0 >= x >= 2.5)
                                                     else (60 if (7.5 >= x >= 5.0)
                                                           else (40 if (10.0 >= x >= 7.5)
                                                                 else 0)))))

# Value to Index % of Phosphorus
df['cPhosphorus'] = df.Phosphorus.apply(lambda x: (100 if (0.025 >= x >= 0.000)
                                                   else (80 if (0.050 >= x >= 0.025)
                                                         else (60 if (0.075 >= x >= 0.050)
                                                               else (40 if (0.10 >= x >= 0.075)
                                                                     else 0)))))

# Value to Index % of Ammonia
df['cAmmonia'] = df.Ammonia.apply(lambda x: (100 if (1.0 >= x >= 0.00)
                                             else (80 if (10.84 >= x >= 1.0)
                                                   else (60 if (21.6 >= x >= 10.84)
                                                         else (40 if (32.5 >= x >= 21.6)
                                                               else 0)))))

# Value to Index % of conductance
df['cSconductance'] = df.Sconductance.apply(lambda x: (100 if (200 >= x >= 100)
                                                       else (80 if (500 >= x >= 200)
                                                             else (60 if (1000 >= x >= 500)
                                                                   else (40 if (2000 >= x >= 1000)
                                                                         else 0)))))

# Value to Index % of Arsenic
df['cArsenic'] = df.Arsenic.apply(lambda x: (100 if (0.002 >= x >= 0.000)
                                             else (80 if (0.005 >= x >= 0.002)
                                                   else (60 if (0.007 >= x >= 0.005)
                                                         else (40 if (0.010 >= x >= 0.007)
                                                               else 0)))))

# Value to Index % of Mercury
df['cMercury'] = df.Mercury.apply(lambda x: (100 if (0.007 >= x >= 0.000)
                                             else (80 if (0.009 >= x >= 0.008)
                                                   else (60 if (0.0095 >= x >= 0.009)
                                                         else (40 if (0.010 >= x >= 0.0095)
                                                               else 0)))))

# Value to Index % of Chromium
df['cChromium'] = df.Chromium.apply(lambda x: (100 if (0.025 >= x >= 0.000)
                                               else (80 if (0.05 >= x >= 0.025)
                                                     else (60 if (0.075 >= x >= 0.05)
                                                           else (40 if (0.10 >= x >= 0.075)
                                                                 else 0)))))

# Value to Index % of Enterococci
df['cEnterococci'] = df.Enterococci.apply(lambda x: (100 if (15.25 >= x >= 0)
                                                     else (80 if (30.5 >= x >= 15.25)
                                                           else (60 if (45.7 >= x >= 30.5)
                                                                 else (40 if (61 >= x >= 45.7)
                                                                       else 0)))))

# Value to Index % of Total coliform
df['cTcoliform'] = df.Tcoliform.apply(lambda x: (100 if (0.01 >= x >= 0.00)
                                                 else (80 if (0.03 >= x >= 0.01)
                                                       else (60 if (0.04 >= x >= 0.03)
                                                             else (40 if (0.05 >= x >= 0.04)
                                                                   else 0)))))

# --------------------Converting Value to Index % Water Quality Index-------------------#
df['wqi'] = (df.cBarometricP +
             df.cTurbidity +
             df.cTemperature +
             df.cpH +
             df.cDO +
             df.cNitrogen +
             df.cPhosphorus +
             df.cAmmonia +
             df.cSconductance +
             df.cArsenic +
             df.cMercury +
             df.cChromium +
             df.cEnterococci +
             df.cTcoliform) / 14

# Value to Index % of Water Quality in datalabel = Excellent, Good, Fair, Deficient
df['cWqi'] = df.wqi.apply(lambda x: ('Excellet' if (100 >= x >= 90)
                                     else ('Good' if (89 >= x >= 80)
                                           else ('Fair' if (79 >= x >= 60)
                                                 else ('Deficient' if (59 >= x >= 40)
                                                       else 'Bad')))))

# ----------------------------------Exporting all df to a csv file------------------------------#
# df.to_csv('All Stations Water Quality Index Report.csv')

# -----------------------------------Query for Station 50138000---------------------------------#
data = df.query("Station == '50138000'")  # Query
# data = data.set_index(pd.DatetimeIndex(data['Date']))  # Converting Date to DatetimeIndex
# data = data.drop(columns=['Date'])  # Dropping column 'Date'
data = data[~data.index.duplicated(keep='first')].sort_index()  # Removing duplicates from DatetimeIndex

# --------------------------------Statistics for USGS-50138000----------------------------------#
s50138000 = data.query("Station == '50138000'").describe()  # .loc[['mean']]
print()
print('Statistics for USGS-50138000')
print(s50138000)
print()
# Exporting USGS-50138000 Statistics to a csv file
print('Statistics report for station USGS-50138000 exported to a CSV file')
s50138000.to_csv('USGS-50138000 Statistics Report.csv')
print()

# -------------Calculated variables DataFrame by station for Machine Learning Model------------------#
# Station USGS-50138000 on query variables "data"
USGS50138000 = data[[
    'DischargeFt',
    'cBarometricP',
    'cTurbidity',
    'Turbidity',
    'Temperature',
    'cTemperature',
    'cpH',
    'cDO',
    'cNitrogen',
    'cPhosphorus',
    'cAmmonia',
    'cSconductance',
    'cArsenic',
    'cMercury',
    'cChromium',
    'cEnterococci',
    'cTcoliform',
    'wqi']].sort_index()

# --------------------------split of some Dataseries  for the model--------------------------#
# Station USGS-50138000
X = USGS50138000[['cTurbidity']].sort_index()
y = USGS50138000[['wqi']].sort_index()

# ------------Check for Data Stationary using Augmented Dickey-Fuller(ADF) test----------------#
test_adf(y, 'Target WQI')  # Function test_adf(target variable, title)

# -------------------------------------AUTO ARIMA MODEL----------------------------------------#
model = model(y, X)  # Variable with Function model() and target variable "y"
print(model.summary())
print('Model AIC is ' + str(model.aic()))
# ------------------------Fit model with whole historical dataset------------------------------#

fit = model.fit(y)  # fit with target variable "y"

# ----------------------------Predict the next 4 quarters data--------------------------------#
future_dates = [y.index[-1] + pd.tseries.offsets.QuarterEnd(x) for x in
                range(5)]  # Always add 1 for quarter and 2 for months to range
# DateOffset(months=x) for x in range(1, 12) #<--- Optional for forecast next 12 months
# pd.tseries.offsets.QuarterEnd(x) for x in range(5) #<--- Optional for forecast next quarters
future_dates_df = pd.DataFrame(index=future_dates[1:], columns=y.columns)

# At this point the DatetimeIndex it's looks good.

future_dates_df['Forecast WQI'] = model.predict(index=y.index, n_periods=4)  # add periods to forecast
frames = [y, future_dates_df]
future_df = pd.concat(frames)  # Joining dataframes
future_df[['wqi', 'Forecast WQI']].plot(figsize=(12, 8))
plt.grid(color='green', linestyle='-', linewidth=0.5)

# --------------------------------Some Dataframes Prints-------------------------------------#
print('Future days DF')
print(future_dates_df)

# --------------------------------------Model Diagnosis-----------------------------------------#
model.plot_diagnostics(figsize=(14, 10))

# ------------------------------------For Plot all graphs---------------------------------------#
plt.show()

Here's the output:

```
Statistics for USGS-50138000
       Barometric  DischargeFt   Turbidity  ...  cEnterococci  cTcoliform        wqi
count   44.000000    44.000000   44.000000  ...     44.000000   44.000000  44.000000
mean   761.295455   172.181818   27.715909  ...     70.454545   45.454545  69.642857
std      1.636496   221.727706   41.531126  ...     46.152152   50.368620  11.618542
min    757.000000    21.000000    0.000000  ...      0.000000    0.000000  52.857143
25%    760.750000    43.250000    6.725000  ...      0.000000    0.000000  60.000000
50%    761.000000    96.000000   12.000000  ...    100.000000    0.000000  71.428571
75%    762.000000   203.250000   28.000000  ...    100.000000  100.000000  80.000000
max    766.000000  1260.000000  230.000000  ...    100.000000  100.000000  90.000000

[8 rows x 30 columns]

Statistics report for station USGS-50138000 exported to a CSV file

Strong evidence for Null Hypothesis
Data is not Stationary for: Target WQI  - Accept Null Hypothesis - Let’s check if I can make the data stationary by applying one difference using diff()
Strong evidence against Null Hypothesis
Data is Stationary - Reject Null Hypothesis
Proceeding to Model SARIMAX...
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                   44
Model:               SARIMAX(2, 0, 0)   Log Likelihood                -147.255
Date:                Mon, 05 Sep 2022   AIC                            304.511
Time:                        10:45:23   BIC                            313.432
Sample:                             0   HQIC                           307.819
                                 - 44                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     56.5442     11.367      4.974      0.000      34.266      78.823
cTurbidity     0.0952      0.020      4.878      0.000       0.057       0.133
ar.L1         -0.3770      0.109     -3.462      0.001      -0.590      -0.164
ar.L2          0.4744      0.115      4.133      0.000       0.249       0.699
sigma2        46.2233      9.941      4.650      0.000      26.739      65.708
===================================================================================
Ljung-Box (L1) (Q):                   0.49   Jarque-Bera (JB):                 4.11
Prob(Q):                              0.48   Prob(JB):                         0.13
Heteroskedasticity (H):               0.18   Skew:                            -0.66
Prob(H) (two-sided):                  0.00   Kurtosis:                         3.69
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Model AIC is 304.5109189128363
C:\Users\JCALDERIN\AppData\Local\Programs\Python\Python310\lib\site-packages\statsmodels\tsa\base\tsa_model.py:834: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
  return get_prediction_index(
Future days DF
            wqi  Forecast WQI
2019-12-31  NaN           NaN
2020-03-31  NaN           NaN
2020-06-30  NaN           NaN
2020-09-30  NaN           NaN

Process finished with exit code 0
```
jcalderin
  • 47
  • 5

0 Answers0