I am trying to write a custom function that, applied within a loop, gives me a table with all the information I need for every variable in my data frame. The function is built from dplyr and base R functions.

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))

My problem is that the base function names() needs the y argument (the variable name) in quotation marks, whereas the dplyr function n_distinct() needs it without quotation marks to give the right answer with na.rm=TRUE (if I use n_distinct(x[y], na.rm=TRUE), the NA values are not excluded from the result). So I don't know how to pass y in a form that works for both functions. I tried escaping the quotes with \" for the names() call, but it didn't seem to work. Here are the errors I get:

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, "cyl")

Error: Error in summarise_impl(.data, dots) : variable 'y' not found

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(y)), blank=n()-sum(!is.na(y)), distinct=n_distinct(y, na.rm=TRUE))
myfun(mtcars, cyl)

Error: Error in summarise_impl(.data, dots) : Evaluation error: object 'cyl' not found.

myfun <- function(x, y) summarise(x, var=names(x[y]), n=sum(!is.na(x[y])), blank=n()-sum(!is.na(x[y])), distinct=n_distinct(x[y], na.rm=TRUE))
myfun(mtcars, "cyl")

No error, but na.rm=TRUE seems to be ignored.
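For what it's worth, the three attempts differ in what y refers to: a string, a bare name, or a column selected with x[y]. Note that x[y] returns a one-column data frame rather than a vector, which may be why na.rm appears to be ignored. A minimal sketch of one way around this, assuming a dplyr version that provides the .data pronoun (0.7.0 or later); summarise_var and the toy data are hypothetical names, not from the post:

```r
library(dplyr)

# Hypothetical helper: `col` is the column name given as a string.
# Assumes `df` has no column that is itself called "col".
summarise_var <- function(df, col) {
  summarise(
    df,
    var      = col,                                    # the name, as passed in
    n        = sum(!is.na(.data[[col]])),              # non-missing values
    blank    = sum(is.na(.data[[col]])),               # missing values
    distinct = n_distinct(.data[[col]], na.rm = TRUE)  # distinct non-missing values
  )
}

# Toy data with one NA, so the effect of na.rm is visible.
toy <- data.frame(a = c(1, NA, 2, 2))
res <- summarise_var(toy, "a")
print(res)
```

Here .data[[col]] extracts the column as a plain vector, so the same quoted name serves both the label and the computations.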

My goal is then to apply it in a loop to build a table with one row per variable of my data frame, which I could export so that I have this information for all variables in a single table.
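The "one row per variable" table described here could be sketched in base R as follows. This is only an illustration under my own naming (summarise_column is hypothetical), not necessarily the approach the final solution should take:

```r
# Hypothetical helper: builds a one-row data frame describing one column.
summarise_column <- function(df, col) {
  v <- df[[col]]  # [[ ]] extracts the column as a vector
  data.frame(
    var      = col,
    n        = sum(!is.na(v)),               # non-missing values
    blank    = sum(is.na(v)),                # missing values
    distinct = length(unique(v[!is.na(v)])), # distinct non-missing values
    stringsAsFactors = FALSE
  )
}

# One row per column, stacked into a single table.
rows <- lapply(names(mtcars), summarise_column, df = mtcars)
result <- do.call(rbind, rows)
print(result)
```

The resulting data frame can then be exported, for example with write.csv(result, "summary.csv", row.names = FALSE), where the file name is just a placeholder.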

I tried to make a minimal reproducible example:

library(dplyr)
myfun <- function(x, y) summarise(x, var=names(x[, y]), n=sum(!is.na(x[, y])), blank=n()-sum(!is.na(x[, y])), n_distinct=n_distinct(x[, y], na.rm=TRUE))
a <- mtcars%>%
  summarise(n=sum(!is.na(cyl)), blank=n()-sum(!is.na(cyl)), n_distinct=n_distinct(cyl, na.rm=TRUE))
a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x))))
a <- data.frame(bind_rows(a, myfun(mtcars, "cyl")))
a <- a%>%
  filter(!is.na(var))%>%
  distinct(var, .keep_all=TRUE)

But for some reason I cannot understand, it doesn't work: the line a <- lapply(colnames(mtcars), function(x) data.frame(bind_rows(a, myfun(mtcars, x)))) fails with the error message Error in summarise_impl(.data, dots) : Column `var` is of unsupported type NULL. It works fine with my own data frame, and it still worked after I subsetted it. But when I recreated the same data manually, typing in the same values with the same classes, it didn't work. So I'm really lost: I don't understand why it works for my dataset but for no other. I'm new to R and learning by trial and error, without any formal training, so sometimes I have no idea what I'm really doing; it works (like the code above does for me), and then it doesn't.
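One possible explanation for the NULL, shown here in base R: on a plain data.frame, x[, y] with a single column drops to an unnamed vector, so names() returns NULL. If the original dataset happened to be a tibble (an assumption; the post does not say), x[, y] would stay a one-column table and keep its name, which would match the behaviour described.

```r
# Single-column selection with [ , ] on a plain data.frame drops to a vector.
dropped <- mtcars[, "cyl"]
names_dropped <- names(dropped)   # NULL: a bare numeric vector has no names

# Two ways to keep the data frame structure, and with it the column name.
names_kept_list   <- names(mtcars["cyl"])                 # list-style indexing
names_kept_nodrop <- names(mtcars[, "cyl", drop = FALSE]) # explicit drop = FALSE

print(names_dropped)
print(names_kept_list)
print(names_kept_nodrop)
```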

So this code works pretty well for me. The only problem, as mentioned, is that with n_distinct(x[, y]) the na.rm=TRUE argument is ignored, which I cannot understand.

Sorry for the rather unclear question. I would be glad to edit it if you leave a comment on how to clarify it. I'm simply lost with my attempts and have no idea how to present things more clearly. Thanks for the help, and sorry for the mess.

GaryDe

1 Answer

I'm not entirely clear on exactly what you are trying to do, but this might get at it.

First create a function that will be run for each column.

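fn <- function(x){
    # Column names are not passed to fn; apply() keeps them as row names of the transposed result
    n <- sum(!is.na(x))                    # non-missing values
    blank <- length(x) - sum(!is.na(x))    # missing values
    dist <- length(unique(x))              # distinct values; note that unique() keeps NA as a value
    c(n=n, blank=blank, distinct=dist)
}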

Then use apply to apply the function to each column of the data.frame. I've transposed it to provide rows.

t(apply(mtcars, 2, fn))
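One caveat worth illustrating: apply() first converts the data frame to a matrix, so a data frame with mixed column types reaches the function as character vectors. With an all-numeric table such as mtcars this makes no difference. The sketch below uses sapply(), which iterates over the columns as they are; the toy data and col_summary are hypothetical names of my own:

```r
# Toy data with a numeric and a character column, each containing NA.
toy <- data.frame(a = c(1, NA, 2, 2), b = c("x", "y", NA, NA))

# apply() goes through as.matrix(), so every column arrives as character.
classes_via_apply <- apply(toy, 2, class)

# Hypothetical per-column summary that ignores NA when counting distinct values.
col_summary <- function(v) {
  c(n        = sum(!is.na(v)),
    blank    = sum(is.na(v)),
    distinct = length(unique(v[!is.na(v)])))
}

# sapply() passes each column with its original type; t() gives one row per column.
res <- t(sapply(toy, col_summary))
print(classes_via_apply)
print(res)
```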
B Williams
  • Thank you very much! It did the job! I just had to replace length(unique(x)) with n_distinct(x, na.rm=TRUE), because otherwise it counted NA as a value, which I didn't want. Thanks a lot! – GaryDe Aug 15 '17 at 11:05