
Here is my toy data:

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

My original question, at https://stackoverflow.com/a/53110342/6762788, was:

How can I get the combination of a minimum number of variables that uniquely identifies the observations in the dataframe, i.e. which variables together can form the primary key? The following answer/code works absolutely fine; thanks so much to thelatemail.

# every combination of 1..ncol(df) column names
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)

# for each combination, the number of distinct rows it yields
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)
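
For reference, the minimal keys can then be read off out in base R (a small illustrative addition, mirroring the filtering applied in the answer below):

# number of variables in each combination
out$n_vars <- lengths(nms)

# combinations whose distinct-row count equals nrow(df) are unique keys;
# among those, keep the smallest
keys <- subset(out, counts == nrow(df))
keys[keys$n_vars == min(keys$n_vars), ]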

Now, to make this work on big data, I want to take it to SparkR. Leveraging that answer, how can I translate this code to SparkR? If it's difficult in SparkR, I can use sparklyr.

  • This is quite a broad question, especially as none of the linked brute-force solutions will scale. Have you done any research? Any attempt to solve this problem? Heuristically, you could use a Count-min sketch combined with some independence metric to build a greedy algorithm. – zero323 Nov 22 '18 at 21:28
  • I have recently learned SparkR but found it difficult to recreate the solution with it. It helped me in understanding the granularity of any small/medium-sized data. How should I use the Count-min sketch and the other things you suggested? – Geet Nov 24 '18 at 03:42
  • Think about the following heuristic: find the column with the highest cardinality; if it is equal to N, you're done, otherwise add the column with the highest cardinality from the remaining columns, and so on. Repeat until you find the solution. Then you can extend that by making the choice not only based on cardinality, but also on independence from the columns already in the candidate key. – zero323 Nov 29 '18 at 14:31
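
A minimal sketch of the greedy heuristic zero323 describes above, assuming a SparkR DataFrame; greedy_key is a hypothetical helper name, and the independence-based refinement is left out:

library(SparkR)

# Greedily grow a candidate key: at each step add the column that, together
# with the columns already chosen, yields the highest distinct count.
greedy_key <- function(sdf) {
  n <- SparkR::count(sdf)                  # total number of rows (N)
  remaining <- SparkR::colnames(sdf)
  key <- character(0)
  while (length(remaining) > 0) {
    card <- vapply(remaining, function(col) {
      SparkR::count(SparkR::distinct(SparkR::select(sdf, c(key, col))))
    }, numeric(1))
    key <- c(key, remaining[which.max(card)])
    if (max(card) == n) return(key)        # candidate uniquely identifies rows
    remaining <- setdiff(remaining, key)
  }
  NULL  # no combination of columns is unique
}

Being greedy, this finds a small key quickly, but it is not guaranteed to find the minimum-size key.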

1 Answer


I broke the above problem into small pieces and tried the following SparkR code. However, the "counts <- lapply(nms, ..." line is very slow. Leveraging this code, can you suggest further performance improvements, perhaps by rewriting the "counts <- lapply(nms, ..." line?

library(SparkR); library(tidyverse)

# demo data: copy mtcars into a Spark DataFrame
df_spark <- mtcars %>% as.DataFrame()

num_m <- seq_len(ncol(df_spark))

nam_list <- SparkR::colnames(df_spark)

# all combinations of m column names, for a given m
combinations <- function(m) {
  combn(nam_list, m, simplify = FALSE)
}

nms <- spark.lapply(num_m, combinations) %>% unlist(recursive = FALSE)

vars <- map_chr(nms, ~ paste(.x, collapse = ","))

# one distinct-count Spark job per combination -- this is the slow part
counts <- lapply(nms, function(x) {
  df_spark %>% SparkR::select(x) %>% SparkR::distinct() %>% SparkR::count()
}) %>% unlist()

out <- data.frame(
  vars = vars,
  counts = counts
)

n_rows <- SparkR::count(df_spark)

# keep the combinations that are unique keys, then the smallest ones
primarykeys <- out %>% 
  dplyr::mutate(n_vars = str_count(vars, ",") + 1) %>% 
  dplyr::filter(counts == n_rows) %>% 
  dplyr::filter(n_vars == min(n_vars))

primarykeys
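
One possible speed-up for the "counts <- lapply(nms, ..." line, sketched under the assumption that all the distinct counts can be computed in a single aggregation: build one countDistinct expression per combination and evaluate them in a single agg() call, so Spark runs one job instead of one job per combination. df_spark, nms, and vars are as defined above; the "c1", "c2", ... aliases are illustrative names.

# one countDistinct expression per column combination
exprs <- lapply(seq_along(nms), function(i) {
  cols <- lapply(nms[[i]], function(nm) df_spark[[nm]])
  SparkR::alias(do.call(SparkR::countDistinct, cols), paste0("c", i))
})

# evaluate all expressions in a single Spark job
counts_df <- do.call(SparkR::agg, c(list(df_spark), exprs))
counts <- unlist(SparkR::collect(counts_df)[1, ])

out <- data.frame(vars = vars, counts = counts)

If exact counts are not required, approx_count_distinct() (approxCountDistinct() in older Spark releases) can be substituted for countDistinct() to cut the shuffle cost further.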