
I am trying to run shapiro.test on my data in a dataframe. I have the following dataframe, called data1:

 columnA   columnB   columnC
 0.333     0.231    -0.123
 0.235    -0.114    -0.034
-0.111    -0.234     0.442

As you see, I have three columns. I would like to test all the data in one test, i.e. all the data as one sample. I know how to test one column, but is there a way to check the whole frame as one sample?

The data here is just an example. I have more columns and a lot more rows in the real data.

Thank you.

    `unlist(data1)` will return all data as a single vector; make sure that you have no non-numeric data in there, otherwise classes will be coerced and lost. – r2evans Oct 29 '21 at 15:06
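A minimal sketch of that suggestion, assuming every column of data1 is numeric: unlist() flattens the frame into a single vector, which shapiro.test() then treats as one sample.

data1 <- data.frame(columnA = c(0.333, 0.235, -0.111),
                    columnB = c(0.231, -0.114, -0.234),
                    columnC = c(-0.123, -0.034, 0.442))

# Flatten all columns into one numeric vector and test it as a single sample
shapiro.test(unlist(data1))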

1 Answer


Visual inspection of the distribution of values in the dataset is the quickest way of establishing whether the data is normally distributed:


df <- data.frame(columnA = c(0.333, 0.235, -0.111), 
                 columnB = c(0.231,  -0.114, -0.234), 
                 columnC = c(-0.123, -0.034, 0.442))

# Convert dataframe to vector (with loss of data structure information)
vec <- as.vector(t(df))

vec

# [1]  0.333  0.231 -0.123  0.235 -0.114 -0.034 -0.111 -0.234  0.442

hist(vec)

An alternative to the above is to use data.table to reshape the data before plotting, which retains the data structure information:


library(data.table)

df <- data.frame(columnA = c(0.333, 0.235, -0.111), 
                 columnB = c(0.231,  -0.114, -0.234), 
                 columnC = c(-0.123, -0.034, 0.442))

# Convert to data.table
dt <- as.data.table(df)

# Pivot long (columns to rows); naming measure.vars explicitly
# avoids data.table's "internally guessed" message
dt <- melt(dt, measure.vars = names(dt))

#   variable  value
# 1:  columnA  0.333
# 2:  columnA  0.235
# 3:  columnA -0.111
# 4:  columnB  0.231
# 5:  columnB -0.114
# 6:  columnB -0.234
# 7:  columnC -0.123
# 8:  columnC -0.034
# 9:  columnC  0.442

hist(dt$value)

Alternatively, you can use descriptive statistics to inform your interpretation of whether the data is normally distributed. For example, when data is normally distributed we expect the mean, median and mode to be approximately equal:

# Values are continuous so it is necessary to bin the data to calculate the mode
# The hist function does this for us
plt <- hist(dt$value)

# The mode is a bin range (ties would yield multiple ranges)
i <- which(plt$counts == max(plt$counts))
mode <- paste0("(", plt$breaks[i], ", ", plt$breaks[i + 1], "]")

# Summarise mean and median and add mode to data displayed
dt[, .(mean=mean(value), median=median(value))
     ][, lapply(.SD, round, 3)
       ][, .(mean, median, mode)]

#     mean median         mode
# 1: 0.069 -0.034 (-0.2, -0.1]

You then need to interpret the numbers (how close they are to one another) to determine whether the data is normally distributed.

In theory you could use a chi-square goodness-of-fit test to compare your empirical data with data drawn from a normal distribution parameterised by your sample. However, you would need to think through several questions first, e.g. how many bins to split the data into, and how many records are enough to keep bins from being empty at low sample sizes but not so many that the chi-square test becomes over-sensitive.
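A rough sketch of the mechanics, assuming the vec object from the first code block. Rather than simulating data, it compares binned counts directly against the theoretical normal bin probabilities; the quantile-based bin edges and the choice of four bins are arbitrary, and there are far too few observations here for the result to mean much:

# Bin edges at sample quantiles so no bin is empty
breaks <- quantile(vec, probs = seq(0, 1, length.out = 5))
observed <- table(cut(vec, breaks = breaks, include.lowest = TRUE))

# Expected bin probabilities under a normal with the sample's mean and sd
expected_p <- diff(pnorm(breaks, mean = mean(vec), sd = sd(vec)))

chisq.test(observed, p = expected_p, rescale.p = TRUE)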

There are other measures (e.g. skew, kurtosis, overdispersion) that you could also consider.
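For instance, skewness and kurtosis can be computed with the moments package (one of several packages providing these functions; shown here only as a sketch):

# install.packages("moments")
library(moments)

skewness(vec)  # approximately 0 for symmetric data
kurtosis(vec)  # approximately 3 for normally distributed data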

Sef
    As far as visual inspections go, it's worth adding that the OP should consider the Q-Q plot of their data to test for normality. – LMc Oct 29 '21 at 16:11
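For reference, the Q-Q plot suggested in the comment can be drawn in base R, assuming the vec object from the answer:

# Points should fall close to the reference line if the data is normal
qqnorm(vec)
qqline(vec)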