When I try to use shapiro.test as a summary function on my R data frame, I get this error:

df %>% summarize_all(shapiro.test)
Error: Column `A` must be length 1 (a summary value), not 4

Here is my setup:

library(dplyr)  # provides %>% and summarize_all
df = data.frame(A=sample(1:10,5), B=sample(1:10,5))
df
df %>% summarize_all(mean)
df %>% summarize_all(sd)
df %>% summarize_all(sum)
df %>% summarize_all(shapiro.test)
df$A %>% shapiro.test()

Output:

> df = data.frame(A=sample(1:10,5), B=sample(1:10,5))
> df
   A B
1  1 8
2  8 4
3  5 5
4 10 6
5  7 1
> df %>% summarize_all(mean)
    A   B
1 6.2 4.8
> df %>% summarize_all(sd)
         A        B
1 3.420526 2.588436
> df %>% summarize_all(shapiro.test)
Error: Column `A` must be length 1 (a summary value), not 4
> df$A %>% shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.96086, p-value = 0.814

What is special about shapiro.test that makes it not work vectorized on the columns?

camille
abalter
  • The function returns a list of class `"htest"` of length 4. Try defining a function `sh.test <- function(x) shapiro.test(x)$p.value` and use it in the pipe. – Rui Barradas Apr 25 '19 at 17:32
  • Look at the error message, and look at the output of your Shapiro test. Summary functions should return single values, the way that `mean` and `sd` do. `shapiro.test` clearly does not. What do you expect to have as the test output in a data frame? The W-stat? The p-value? – camille Apr 25 '19 at 17:32
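
To make the point in the comments concrete, here is a small sketch that inspects what `shapiro.test` returns. The seed and the variable names (`x`, `res`, `w_stat`, `p_value`) are my own choices for illustration, not part of the question.

```r
# Inspect the object returned by shapiro.test() (base R; stats is attached by default).
set.seed(1)                  # arbitrary seed, only so the sample is reproducible
x <- sample(1:10, 5)
res <- shapiro.test(x)

class(res)    # an "htest" object, not a bare number
length(res)   # 4 components, which is the "not 4" in the error message
names(res)    # statistic, p.value, method, data.name

# Each numeric component can be extracted as a single value:
w_stat  <- unname(res$statistic)
p_value <- res$p.value
```

Either `w_stat` or `p_value` is a length-1 value, which is what a summary function is expected to return.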

2 Answers

You can iterate over each column using map from the purrr package, as an alternative to apply:

df %>%
  map(~shapiro.test(.))

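Also consider using sapply and lapply. lapply returns a list with one test result per column, while sapply simplifies those results into a matrix:

df %>%
  sapply(shapiro.test)


df %>%
  lapply(shapiro.test)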
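
If you want the numbers rather than the printed test objects, one possible follow-up is to collect the W statistic and p-value per column into a data frame. This is only a sketch in base R; the seed and the names `tests` and `results` are assumptions made for the example.

```r
# Run the test on every column, then pull out the two numeric pieces.
set.seed(42)                 # arbitrary seed for a reproducible example
df <- data.frame(A = sample(1:10, 5), B = sample(1:10, 5))

tests <- lapply(df, shapiro.test)   # named list of htest objects

results <- data.frame(
  column    = names(tests),
  W         = vapply(tests, function(ht) unname(ht$statistic), numeric(1)),
  p_value   = vapply(tests, function(ht) ht$p.value, numeric(1)),
  row.names = NULL
)
results
```

`vapply` is used here instead of `sapply` so that each extracted value is checked to be a single numeric.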
Jilber Urbina
Just got it: shapiro.test doesn't return a single number. This, however, does work:

> df %>% apply(2, shapiro.test)
$A

    Shapiro-Wilk normality test

data:  newX[, i]
W = 0.96086, p-value = 0.814


$B

    Shapiro-Wilk normality test

data:  newX[, i]
W = 0.98396, p-value = 0.9546

Also, wrapping the test in a function that returns only the p-value works with summarise_all:

> f = function(x){st = shapiro.test(x); return(st$p.value)}
> f(df$A)
[1] 0.8139521
> df %>% summarise_all(f)
          A         B
1 0.8139521 0.9546435
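
For newer versions of dplyr, the same idea can be written with `across()` instead of `summarise_all()`. This is a sketch that assumes dplyr 1.0.0 or later is installed; the seed and the name `p_values` are mine.

```r
library(dplyr)   # assumes dplyr >= 1.0.0, where across() is available

set.seed(42)     # arbitrary seed for a reproducible example
df <- data.frame(A = sample(1:10, 5), B = sample(1:10, 5))

# One row, one p-value per column
p_values <- df %>%
  summarise(across(everything(), ~ shapiro.test(.x)$p.value))
p_values
```

The lambda returns a single number per column, so `summarise()` gets the length-1 value it needs.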
abalter