
I am experimenting with Kalman filters. I have prepared a very small time series dataset with three columns, formatted as follows. The full dataset is attached here for reproducibility, since I can't attach a file on Stack Overflow:

csv file

  time        X      Y
 0.040662  1.041667  1
 0.139757  1.760417  2
 0.144357  1.190104  1
 0.145341  1.047526  1
 0.145401  1.011882  1
 0.148465  1.002970  1
 ....      .....     .

I have read the documentation of the Kalman filter and managed to do a simple linear prediction. Here is my code:

import matplotlib.pyplot as plt 
from pykalman import KalmanFilter 
import numpy as np
import pandas as pd



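df = pd.read_csv('testdata.csv')
print(df)

# Treat +/-inf as missing, then drop incomplete rows
# (the 'use_inf_as_null' option no longer exists in current pandas).
df = df.replace([np.inf, -np.inf], np.nan).dropna()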


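# 'time' and 'X' are the filter input; 'Y' is kept as the reference column.
X = df.drop('Y', axis=1)
y = df['Y']

real_value = np.array(y)

# measurements[:, 0] is the time, measurements[:, 1] is the observed X value
measurements = np.asarray(X)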



kf = KalmanFilter(n_dim_obs=1, n_dim_state=1, 
                  transition_matrices=[1],
                  observation_matrices=[1],
                  initial_state_mean=measurements[0,1], 
                  initial_state_covariance=1,
                  observation_covariance=5,
                  transition_covariance=1)

# state_means has shape (n, 1); state_covariances has shape (n, 1, 1)
state_means, state_covariances = kf.filter(measurements[:, 1])
state_std = np.sqrt(state_covariances[:, 0])
print(state_std)
print(state_means)
print(state_covariances)


fig, ax = plt.subplots()
ax.margins(x=0, y=0.05)

plt.plot(measurements[:,0], measurements[:,1], '-r', label='Real Value Input') 
plt.plot(measurements[:,0], state_means, '-b', label='Kalman-Filter') 
plt.legend(loc='best')
ax.set_xlabel("Time")
ax.set_ylabel("Value")
plt.show()

This gives the following plot as output:

[Plot: input values (red) and Kalman filter estimate (blue) against time]

As the plot shows, the pattern seems to be captured reasonably well. How can I compute the root-mean-square error (RMSE), i.e. the error between the red and blue lines in the plot above? Any help would be appreciated.

Duck Dodgers
  • to find RMSE between two lists `x` and `y` you can do `np.sqrt(np.mean((x-y)**2))`. – overfull hbox Dec 29 '18 at 15:40
  • @TylerChen, that gives a `NaN` value sir. –  Dec 29 '18 at 15:54
  • are all of the entries in your arrays regular numbers, or are there some `inf` or `NaN`? – overfull hbox Dec 29 '18 at 15:59
  • @TylerChen, yes they are regular numbers sir. I have included the small dataset with my post for reproducibility. It is only about 400 rows and it will not take you much time to re-run and check if it works for you. Thanks. –  Dec 29 '18 at 16:03
  • could you post the two arrays you want to find the RMSE of? I don't have `pykalman` installed. – overfull hbox Dec 29 '18 at 16:10
  • Note `X` has two columns (`time` and `X`). Maybe you want to do it between `x=df['X']` and `y=df['Y']`? But in your plot it isn't this `y`. – overfull hbox Dec 29 '18 at 16:16
  • In your plot the red line is `x` but the blue line is `state_means` which came from the filter. – overfull hbox Dec 29 '18 at 16:29
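
The one-line formula from the comments can go wrong in two ways with the arrays in the question, so here is a minimal sketch of a more defensive version. As far as I can tell, `kf.filter` returns `state_means` with shape `(n, 1)` while `measurements[:, 1]` has shape `(n,)`, so subtracting them directly broadcasts to an `(n, n)` matrix; and a single `NaN` or `inf` in either array makes the mean `NaN`. The helper name `rmse` and the sample arrays are made up for illustration and are not from the original post.

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    # Flatten both inputs so an (n,) array and an (n, 1) array do not
    # broadcast into an (n, n) matrix when subtracted.
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    if actual.shape != predicted.shape:
        raise ValueError("inputs must have the same number of elements")
    # Keep only positions where both values are finite; one NaN or inf
    # would otherwise turn the whole mean into NaN.
    mask = np.isfinite(actual) & np.isfinite(predicted)
    if not mask.any():
        return float("nan")
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))


# Made-up stand-ins for measurements[:, 1] and state_means (shape (n, 1)).
observed = np.array([1.0, 2.0, np.nan, 4.0])
filtered = np.array([[1.0], [4.0], [3.0], [4.0]])
result = rmse(observed, filtered)
print(result)
```

With the arrays from the question, the call would presumably be `rmse(measurements[:, 1], state_means)`. Whether silently skipping non-finite entries is acceptable depends on the data; raising an error instead may be the safer choice.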

2 Answers


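Try this! Note that mean_squared_error returns the MSE, so take its square root to get the RMSE:

import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(measurements[:, 1], state_means))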
Venkatachalam

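In scikit-learn 0.22.0 and later you can pass the argument squared=False to mean_squared_error() to return the RMSE. Note that releases from 1.4 onward deprecate this argument in favor of root_mean_squared_error().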

from sklearn.metrics import mean_squared_error
mean_squared_error(y_actual, y_predicted, squared=False)
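
Since `squared=False` only exists in some scikit-learn releases, here is a minimal sketch of a helper that avoids it. It assumes `root_mean_squared_error` is importable from `sklearn.metrics` in scikit-learn 1.4 and later, and otherwise falls back to the plain NumPy definition. The name `rmse_score` and the sample numbers are hypothetical and only there to show the call.

```python
import numpy as np

try:
    # Assumed to be present in scikit-learn 1.4 and later.
    from sklearn.metrics import root_mean_squared_error as _sk_rmse
except ImportError:
    # Older scikit-learn, or scikit-learn not installed at all.
    _sk_rmse = None


def rmse_score(y_actual, y_predicted):
    """Hypothetical helper: RMSE that does not rely on squared=False."""
    # Flatten so that (n,) and (n, 1) inputs are treated alike.
    y_actual = np.asarray(y_actual, dtype=float).ravel()
    y_predicted = np.asarray(y_predicted, dtype=float).ravel()
    if _sk_rmse is not None:
        return float(_sk_rmse(y_actual, y_predicted))
    # Fallback: square root of the mean squared difference.
    return float(np.sqrt(np.mean((y_actual - y_predicted) ** 2)))


# Made-up numbers, only to show the call.
y_actual = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.0, 2.0, 3.0, 6.0]
score = rmse_score(y_actual, y_predicted)
print(score)
```

For the question's data, something like `rmse_score(measurements[:, 1], state_means)` should work under the same assumptions.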