1

Here is a toy data:

df <- tibble::tribble( ~var2, ~var1, ~var3,   ~var4,
                      2L,   "A",   1.2,  "1/6/2018",
                      4L,   "A",  1.34,  "1/3/2018",
                      7L,   "B",  2.43,  "1/7/2018",
                      3L,   "C",     4,  "1/4/2018",
                      7L,   "A",   3.2,  "1/9/2018",
                      3L,   "D",   2.3, "1/10/2018",
                      4L,   "A",  0.34,  "1/9/2018",
                      5L,   "C",   4.2,  "1/7/2018",
                      5L,   "D",   6.5, "1/10/2018") %>% 
      mutate(var4 = mdy(var4))

I want to create a dataframe of unique values of every variable in df sorted from largest (at the top) to smallest values (at the bottom) and vice-versa for date variables. Also variables should be ordered (left-to-right) from least unique to most unique values. The desire output should be:

 df_of_unique_values <- tibble::tribble(~var1, ~var2,    ~var4,  ~var3,
                                        "D",    7L,  "1/3/2018",   6.5,
                                        "C",    5L,  "1/4/2018",   4.2,
                                        "B",    4L,  "1/6/2018",     4,
                                        "A",    3L,  "1/7/2018",   3.2,
                                         NA,    2L,  "1/9/2018",  2.43,
                                         NA,    NA, "1/10/2018",   2.3,
                                         NA,    NA,          NA,  1.34,
                                         NA,    NA,          NA,   1.2,
                                         NA,    NA,          NA,  0.34) %>% 
  mutate(var4 = mdy(var4))

How can I do that preferably using tidyverse?

Geet
  • 2,515
  • 2
  • 19
  • 42

2 Answers2

3

I suppose one could use tidyverse by this seems really simple with order:

df[order(df$var1, df$var2, df$var3, -as.numeric(df$var4)),]
# A tibble: 9 x 4
   var2 var1   var3 var4      
  <int> <chr> <dbl> <date>    
1     2 A      1.2  2018-01-06
2     4 A      0.34 2018-01-09
3     4 A      1.34 2018-01-03
4     7 A      3.2  2018-01-09
5     7 B      2.43 2018-01-07
6     3 C      4    2018-01-04
7     5 C      4.2  2018-01-07
8     3 D      2.3  2018-01-10
9     5 D      6.5  2018-01-10

This is the tidyverse equivalent. Needed to find the ?arrange help page which advises using desc() for reverse ordering (the equivalent of using - prefix when using `order):

df %>% arrange(var1, var2, var3, desc(as.numeric(var4)))
# A tibble: 9 x 4 
   var2 var1   var3 var4      
  <int> <chr> <dbl> <date>    
1     2 A      1.2  2018-01-06
2     4 A      0.34 2018-01-09
3     4 A      1.34 2018-01-03
4     7 A      3.2  2018-01-09
5     7 B      2.43 2018-01-07
6     3 C      4    2018-01-04
7     5 C      4.2  2018-01-07
8     3 D      2.3  2018-01-10
9     5 D      6.5  2018-01-10

A list would be the way to return values that are of unequal lengths and unrelated to each other:

lapply(df, unique)
$var2
[1] 2 4 7 3 5

$var1
[1] "A" "B" "C" "D"

$var3
[1] 1.20 1.34 2.43 4.00 3.20 2.30 0.34 4.20 6.50

$var4
[1] "2018-01-06" "2018-01-03" "2018-01-07" "2018-01-04" "2018-01-09" "2018-01-10"
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I need to only see the unique values of variables as I have millions of rows and sorted like the df_of_unique_values. – Geet Oct 25 '18 at 21:57
  • 2
    That result should not be either a tibble or dataframe but rather a list. – IRTFM Oct 25 '18 at 22:00
  • The intermediate step could be lists that should be stitched in one tibble somehow as there could be hundreds of variables. – Geet Oct 25 '18 at 22:03
  • 1
    Lists would not be a problematic storage structure. – IRTFM Oct 25 '18 at 22:05
  • @Geet - it would be quite contradictory to use the tidyverse coding logic and then use an output format that is close to the least tidy structure possible. – thelatemail Oct 25 '18 at 23:05
  • The objective is to show, not analyse, unique values for say 50 variables. And, between 50 different lists and a data frame, I found data frame to be more convenient to show. It’s for review, especially for variables with limited number of unique values hence ordering and sorting of variables was so important to me. – Geet Oct 26 '18 at 05:59
2

Borrowing from Combining lists of different lengths into data frame:

str(lists <- lapply(df, function(a) sort(unique(a), decreasing=!inherits(a,"Date"))))
# List of 4
#  $ var2: int [1:5] 7 5 4 3 2
#  $ var1: chr [1:4] "D" "C" "B" "A"
#  $ var3: num [1:9] 6.5 4.2 4 3.2 2.43 2.3 1.34 1.2 0.34
#  $ var4: Date[1:6], format: "2018-01-03" "2018-01-04" "2018-01-06" "2018-01-07" ...
str(lists <- lists[order(lengths(lists))])
# List of 4
#  $ var1: chr [1:4] "D" "C" "B" "A"
#  $ var2: int [1:5] 7 5 4 3 2
#  $ var4: Date[1:6], format: "2018-01-03" "2018-01-04" "2018-01-06" "2018-01-07" ...
#  $ var3: num [1:9] 6.5 4.2 4 3.2 2.43 2.3 1.34 1.2 0.34
(maxlen <- max(lengths(lists)))
# [1] 9
str(lists <- lapply(lists, function(l) c(l, rep(NA, maxlen-length(l)))))
# List of 4
#  $ var1: chr [1:9] "D" "C" "B" "A" ...
#  $ var2: int [1:9] 7 5 4 3 2 NA NA NA NA
#  $ var4: Date[1:9], format: "2018-01-03" "2018-01-04" "2018-01-06" "2018-01-07" ...
#  $ var3: num [1:9] 6.5 4.2 4 3.2 2.43 2.3 1.34 1.2 0.34
as.data.frame(lists)
#   var1 var2       var4 var3
# 1    D    7 2018-01-03 6.50
# 2    C    5 2018-01-04 4.20
# 3    B    4 2018-01-06 4.00
# 4    A    3 2018-01-07 3.20
# 5 <NA>    2 2018-01-09 2.43
# 6 <NA>   NA 2018-01-10 2.30
# 7 <NA>   NA       <NA> 1.34
# 8 <NA>   NA       <NA> 1.20
# 9 <NA>   NA       <NA> 0.34

Tidyverse equivalent:

library(dplyr)
library(purrr)

maxlen <- max(lengths(map(df, unique)))
df %>%
  map(~ sort(unique(.), decreasing = !inherits(., "Date"))) %>%
  .[order(lengths(.))] %>%
  map(`length<-`, maxlen) %>%                    # alternative 1
  # map(~ c(., rep(NA, maxlen - length(.)))) %>% # alternative 2
  tbl_df()

Bottom line, though: I agree with @42- and @thelatemail that this is really not the best format for storage. One interpretation of a data.frame is that everything on a row is related. In surveys, for example, each column is a question, each row is a respondent (survey-taker). By reordering differently between the columns, this association is completely discarded. The only rationale I can think of for not going with a simpler list format (as @42- ended with) is for report presentation, where I assume you'd do something like

options(knitr.kable.NA="")
knitr::kable(...)
# |var1 | var2|var4       | var3|
# |:----|----:|:----------|----:|
# |D    |    7|2018-01-03 | 6.50|
# |C    |    5|2018-01-04 | 4.20|
# |B    |    4|2018-01-06 | 4.00|
# |A    |    3|2018-01-07 | 3.20|
# |     |    2|2018-01-09 | 2.43|
# |     |     |2018-01-10 | 2.30|
# |     |     |           | 1.34|
# |     |     |           | 1.20|
# |     |     |           | 0.34|
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Just need to sort variables from largest to smallest values for variables and vice-versa for date variables. – Geet Oct 25 '18 at 22:11
  • Great! Thanks! Also curious if it can be written using purrr and dplyr? – Geet Oct 25 '18 at 22:18
  • 1
    You can also take advantage of `length<-` if you want - `lapply(lists, \`length<-\`, maxlen)` – thelatemail Oct 25 '18 at 22:41
  • Yeah, I knew that ... at some point. *sigh*. thanks for reminding me. – r2evans Oct 25 '18 at 22:50
  • @r2evans - I just realised that my suggestion actually breaks the Date variable. How unfortunate. – thelatemail Oct 25 '18 at 22:57
  • I'm missing it ... seems to work fine for me, I see `` in the tidy version with either `purrr::map` or `lapply`, still sorted correctly. – r2evans Oct 25 '18 at 23:08
  • Absolutely r3evans, presenting, not analysing, unique values is the objective! – Geet Oct 26 '18 at 06:12
  • r2evans, i can’t get my head around the `length<-`, can you please change that to the one you had previously suggested? – Geet Oct 26 '18 at 06:19
  • 1
    The expression `length(vec) <- newlen` changes the length by truncating or extending/padding with `NA`. That functionality was created with `\`length<-\` <- function(vec, newlen) {...}`, which is a special function in R. Other special functions are meant to be used either infix (`%in%`) or on the left hand side of an assignment (`names(vec) <- c(...)`). Defining the special functions requires the backticks; using "normally" does not, but using programmatically like this requires the backticks again. Make sense? – r2evans Oct 26 '18 at 13:55
  • Geet: I've re-added the alternative as a comment. Can you see the `edited x hours ago` to the left of my name under the answer? If you click on that, you can see the history of edits, including redactions, changes, additions. (I don't recall if this requires a minimum reputation.) – r2evans Oct 26 '18 at 14:49
  • It's a handy tool to see (a) what somebody thought was wrong or insufficient, and (b) remind yourself why you thought you saw something but now don't ... I use it frequently. Glad it helped. – r2evans Oct 26 '18 at 17:32