1

Hi I have a deeply nested json file. I used sparklyr to read this json file and called this "data" object.

Firstly I will show what the data structure looks like:

# Database: spark_connection
data
    -a : string
    -b : string
    -c : (struct)
       c1 : string
       c2 : (struct)
          c21: string
          c22: string

Something like this. So if I extract "a" using:

data %>% sdf_select(a)

I can view what the data inside, like:

   # Database: spark_connection
   a
   <chr>
1  Hello world
2  Stack overflow is epic

THE PROBLEM now comes is when i use sdf_select() a deeper structure i.e.

data %>% sdf_select(c.c2.c22)

Viewing the data inside, I get this

  # Database: spark_connection
  c22
  <list>
1 <list [1]>
2 <list [1]>
3 <list [1]>
4 <lgl [1]>

so if I collect the data so that the spark data frame turns into R data frame and viewing the data using commands

View(collect(data %>% sdf_select(c.c2.c22)))

The data shows

1 list("Good")
2 list("Bad")
3 NA

How do I turn every entry in each list above to a data frame table so that it shows Good, Bad, NA only instead with list("") on it?

Gabriel
  • 423
  • 6
  • 21
  • try `unlist()` (see also https://stackoverflow.com/questions/49200863/sparklyr-spark-read-parquet-reading-string-fields-as-lists) – nachti Mar 15 '18 at 09:51

1 Answers1

0

I was unable to reproduce this. I used

[{"a":"jkl","b":"mno","c":{"c1":"ghi","c2":{"c21":"abc","c22":"def"}}}]

written to a test.json, followed by

spk_df <- spark_read_json(sc, "tmp", "file:///path/to/test.json")
spk_df %>% sdf_schema_viewer()

sdf_schema_viewer output

This seems to match the schema that you provided. However when I use sparklyr.nested::sdf_select() I get a different result.

spk_df %>% sdf_select(c.c2.c22)
# # Source:   table<sparklyr_tmp_7431373dca00> [?? x 1]
# # Database: spark_connection
#   c22  
#   <chr>
# 1 def

where c22 is a character column.

My guess is that in your real data, one of the levels is actually an array of structs. If this is the case, then indexing into an array forces a list wrapping (or else data would need to be dropped). You can resolve this in spark land using sdf_explode or you can resolve it locally in a variety of ways. For example, using purrr you would do something like:

df <- collect(spk_df)
df %>% mutate(c22=purrr::map(c22, ~unlist))

It is possible that you will need to write a function wrapping unlist to deal with different data types in different rows (the NA values are logical).

unlist_and_cast <- function(x) {
  as.charater(unlist(x))
}

df %>% mutate(c22=purrr::map(c22, ~unlist_and_cast))

would do the trick I think (untested).

Matt Pollock
  • 1,063
  • 10
  • 26
  • Its been a while and I figured a way out. So basically I copy the column, name it. Delete the old column and rename it back to the original name. data %>% sdf_select(c.c2.c22) %>% sdf_separate_column(column = "c22", into= "c221") %>% select(-c22) %>% rename(c22 = c221) – Gabriel Mar 21 '18 at 23:07
  • 1
    Yep, basically taking the make it wider approach (separate_column) instead of the make it longer approach (explode). In your case where everything seems to be a 1 element list the two are equivalent. You would start to notice differences if your `c22` column had any rows where the list length was greater than 1. – Matt Pollock Mar 22 '18 at 13:41
  • Hi Matt, I just updated dplyr, sparklyR to a newer version. When I select data %>% sdf_select(c.c2.c22), the column name displayed will become c2 instead of c22 so the name of the column is the parent's nest name. Can you take a look at what happen? – Gabriel Dec 21 '18 at 12:56
  • this looks like a bug - tracking here https://github.com/mitre/sparklyr.nested/issues/21 – Matt Pollock Dec 21 '18 at 17:29
  • Many thanks! is it a bug in sparklyr or sparklyr.nested library? – Gabriel Dec 21 '18 at 17:39
  • 1
    sparklyr.nested. The logic for picking default column names was not written with multiple-levels deep nesting in mind – Matt Pollock Dec 21 '18 at 17:53