Unnesting nested JSON structures in Apache Drill

Question

I have the following JSON (roughly) and I'd like to extract the information from the header and defects fields separately:

{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890",
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}

I have tried to access the individual elements with file.header.timeStamp etc but that returns null. I have tried using flatten(file) but that gives me

Cannot cast org.apache.drill.exec.vector.complex.MapVector to org.apache.drill.exec.vector.complex.RepeatedValueVector

I've looked into kvgen() but don't see how that fits in my case. I tried kvgen(file.header) but that gets me

kvgen function only supports Simple maps as input

which is what I had expected anyway.

Does anyone know how I can get header and defects, so I can process the information contained in them. Ideally, I'd just select the information from header because it contains no arrays or maps, so I can take individual records as they are. For defects I'd simply use FLATTEN(defectParts) to obtain a table of the defective parts.

Any help would be appreciated.

score 6 · Answer 1 · answered Mar 15 '16 at 06:17

What version of Drill are you using ? I tried querying the following file on latest master (1.7.0-SNAPHOT):

{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890"
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}
{
  "file": {
    "header": {
      "timeStamp": "2016-03-14T00:20:15.005+04:00",
      "serialNo": "3456",
      "sensorId": "1234567890"
    },
    "defects": [
      {
        "info": {
          "systemId": "DEFCHK123",
          "numDefects": "3",
          "defectParts": [
            "003", "006", "008"
          ]
        }
      }
    ]
  }
}

And the following queries are working fine: 1.

select t.file.header.serialno as serialno from `parts.json` t;
+-----------+
| serialno  |
+-----------+
| 3456      |
| 3456      |
+-----------+
2 rows selected (0.098 seconds)

2.

select flatten(t.file.defects) defects from `parts.json` t;
+---------------------------------------------------------------------------------------+
|                                        defects                                        |
+---------------------------------------------------------------------------------------+
| {"info":{"systemId":"DEFCHK123","numDefects":"3","defectParts":["003","006","008"]}}  |
| {"info":{"systemId":"DEFCHK123","numDefects":"3","defectParts":["003","006","008"]}}  |
+---------------------------------------------------------------------------------------+

3.

select q.h.serialno as serialno, q.d.info.defectParts as defectParts from (select t.file.header h, flatten(t.file.defects) d from `parts.json` t) q;
+-----------+----------------------+
| serialno  |     defectParts      |
+-----------+----------------------+
| 3456      | ["003","006","008"]  |
| 3456      | ["003","006","008"]  |
+-----------+----------------------+
2 rows selected (0.126 seconds)

PS: This should've been a comment but I don't have enough rep yet!

@Ian this is working fine with drill 1.5 . I tested same queries with same data. Also, your JSON is invalid ==> there should be no comma after `"sensorId": "1234567890"` — Dev, Mar 16 '16 at 05:29
The comma is there because I forgot to remove it from a JSON record that was a lot bigger. — Ian, Mar 16 '16 at 10:20
You must make sure you use the table alias when referring to nested data. `file.header.timeStamp` won't work but `t.file.header.timeStamp` should work fine. Note how `adeneche` has the alias in all of his examples. Also if something doesn't work that seems like it should then it's probably a reserved word - try wrapping it in backticks (e.g. `\`metadata\``). — Joe Harris, May 10 '16 at 00:36

score 0 · Answer 2 · answered Mar 14 '16 at 08:56

0

I don't have experience with Apache Drill, but checked the manual. Isn't this what you're looking for?

answered Mar 14 '16 at 08:56

Jørgen

2,157
1
23
24

I should have said that I checked the manual, and that works for fields within arrays. If I do a simple `SELECT file.header.timeStamp` I get null. Why I have no idea and that's why I'm stuck. – Ian Mar 14 '16 at 09:29
I just figured out that `select t2.q2.sensorId from (select t1.q1.header as q2 from (select file as q1 from `test.json`) t1) t2` seems to do what I want. I have no idea why I have to use subqueries though... – Ian Mar 14 '16 at 09:40
Yeah, weird that it doesn't want to except timeStamp while it accepts serialNo and sensorId. Also like to see you fix this for defects. Drill might be a super good tool but seems like you have to design your JSON to work with it's limitations. Or at least it doesn't query like you'd expect it. (but like I said: just downloaded it to check this question out). – Jørgen Mar 14 '16 at 10:03
The JSON is as provided. There is nothing I can change about the structure unfortunately unless I want to write code to re-parse it and then I might as well use Spark, which seems more up to the task than Drill... – Ian Mar 14 '16 at 12:35

Unnesting nested JSON structures in Apache Drill

2 Answers2