I have a dataset from the internet and wanted to try different normality tests on different columns. I find it odd that different normality tests give me different results: not just a couple of decimals apart, but COMPLETELY different outputs.

Here is my code.

from pandas import read_csv
from scipy import stats

url = "https://raw.githubusercontent.com/rashida048/Datasets/master/cars.csv"
data = read_csv(url)

# Columns to test for normality
y_1 = 'HWY (Le/100 km)'
y_2 = 'HWY (kWh/100 km)'
y_3 = 'CITY (kWh/100 km)'
y_4 = '(km)'
m = data[y_1]
m_2 = data[y_2]
m_3 = data[y_3]
m_4 = data[y_4]
l = [m, m_2, m_3, m_4]

# Kolmogorov-Smirnov test for normality
# Note: 'norm' with no extra arguments means the standard normal N(0, 1)
for i in l:
    statistic, pvalue = stats.kstest(i, 'norm')
    print('statistic = %.2f, p = %.1f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')

Output:

statistic = 0.98, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
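
A statistic close to 1.00 for every column is what one would expect when raw, unscaled values are compared against the standard normal, because `kstest(x, 'norm')` uses mean 0 and standard deviation 1 unless told otherwise. The sketch below illustrates this on synthetic data (the mean of 20, standard deviation of 3, sample size, and seed are assumptions chosen for illustration, not values from the cars dataset). Passing the sample's own mean and standard deviation through `args` changes the comparison, though estimating those parameters from the same sample makes the resulting p-value optimistic, so treat it as a rough check only.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a data column (assumed values, not the cars data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=20.0, scale=3.0, size=200)

# Compared against N(0, 1): almost every value sits far in the right tail
raw_stat, raw_p = stats.kstest(sample, 'norm')

# Compared against a normal with the sample's own mean and standard deviation
fit_stat, fit_p = stats.kstest(
    sample, 'norm', args=(sample.mean(), sample.std(ddof=1))
)

print('raw:    statistic = %.2f, p = %.5f' % (raw_stat, raw_p))
print('fitted: statistic = %.2f, p = %.5f' % (fit_stat, fit_p))
```
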
# Normal test (D'Agostino-Pearson), based on skewness and kurtosis
for i in l:
    statistic, pvalue = stats.normaltest(i)
    print('statistic = %.2f, p = %.5f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')

Output:

statistic = 3.12, p = 0.21050
Gaussian
statistic = 3.28, p = 0.19423
Gaussian
statistic = 70.15, p = 0.00000
Not Gaussian
statistic = 188.31, p = 0.00000
Not Gaussian

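# Chi-square
# Note: with no f_exp, chisquare treats the values as observed counts
# and compares them against equal expected frequencies
for i in l:
    statistic, pvalue = stats.chisquare(i)
    print('statistic = %.2f, p = %.5f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')

Output:
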
statistic = 0.44, p = 1.00000
Gaussian
statistic = 3.73, p = 1.00000
Gaussian
statistic = 23.84, p = 0.99972
Gaussian
statistic = 4348.68, p = 0.00000
Not Gaussian

I am still learning data science and everything behind it, but I am confused about how to draw a conclusion from such different values. Is it just about picking one method and sticking with it? That can't be it, can it?

Noob Programmer
  • It's normal. Why would different methods give the same results? – Maciej M Jan 18 '21 at 13:17
  • Okay, I understand. Different methods, different procedures, different outcomes. But my question was how do I know which one to use. Should I just perform all of them and say "I like this outcome, I will use that"? That seems highly unreliable. – Noob Programmer Jan 18 '21 at 13:21
  • This is what data scientists do: pick the best-performing method for a given problem. If the data were different, another method might perform better. – Maciej M Jan 18 '21 at 13:58

0 Answers