0

I have top level data 'whatever' comprising many (thousands) of records. Each record has data of the form

structA structB ... structN, varA varB ... varN

each structure may also contain other structures and variables e.g. structAA, structAB... varAA, varAB... etc

I would like to query these data and return records meeting a given criteria, for example

SELECT structA.structBC.structNN FROM whatever WHERE structB.structBN.varNN = "value"

Which would return structNN from each record where varNN equals "value"

Databricks will allow me to do something like

SELECT structA FROM whatever as a WHERE a.varN = "value"

but I cant seem to find out if/how one does more levels between SELECT and FROM or after WHERE. I dont particularly want the variable I select-on to be in the structure I return either, so that further complicates things. I suspect the data is accessed as a table which may be part of the problem i.e. I imagine the records as objects but they are columns and rows in some sense. If I cant do this with an SQL query, what can I use in Databricks to achieve the desired result?

1 Answers1

0

it's pretty simple.

First, let's create a data with underlying structure, for example:

data = [{\
  'level1a': {\
              'level2a': {\
                         'a':1,\
                         'b':2\
                        }\
             }\
 ,'level1b': {\
              'level2b': 'some text'
             }\
 }]
df = spark.createDataFrame(data)
df.createOrReplaceTempView("data_view")

Now, we can select any element of the structure (or the whole or part of the structure):

%sql
select level1a, level1a.level2a, level1a.level2a.a, level1a.level2a.b, level1b, level1b.level2b from data_view;

Here is the result: enter image description here

EDIT:

The query with condition:

%sql

select level1a, level1a.level2a, level1a.level2a.a, level1a.level2a.b, level1b, level1b.level2b 
from data_view
where level1a.level2a.a = 1;
hopeman
  • 153
  • 1
  • 9