I am having to parse some very large xml files. There are a few fields within those xml files that I want to extract and then perform some work on them. However, there are some rules that I need to follow, i.e. I can only select fields if they follow a certain pattern.
Here is an example of what I'm trying to achieve:
// Some made up data
val schema = new StructType()
.add("devices",
new StructType()
.add("thermostats", MapType(StringType,
new StructType()
.add("device_id", StringType)
.add("locale", StringType)
.add("E2EDK1000", StringType)
.add("E2EDK2000", StringType)
.add("E2EDK3000", StringType))))
val nestDataDS2 = Seq("""{
"devices": {
"thermostats": {
"peyiJNo0IldT2YlIVtYaGQ": {
"device_id": "peyiJNo0IldT2YlIVtYaGQ",
"locale": "en-US",
"E2EDK1000": "4.0",
"E2EDK2000": "VqFabWH21nwVyd4RWgJgNb292wa7hG_dUwo2i2SG7j3-BOLY0BA4sw",
"E2EDK3000": "Hallway Upstairs"}}}}""").toDS
val nestDF2 = spark
.read
.schema(nestSchema2)
.json(nestDataDS2.rdd)
root
|-- devices: struct (nullable = true)
| |-- thermostats: map (nullable = true)
| | |-- key: string
| | |-- value: struct (valueContainsNull = true)
| | | |-- device_id: string (nullable = true)
| | | |-- locale: string (nullable = true)
| | | |-- E2EDK1000: string (nullable = true)
| | | |-- E2EDK2000: string (nullable = true)
| | | |-- E2EDK3000: string (nullable = true)
Given this I want to get to value field so I do the following
val tmp = nestDF2.select($"devices.thermostats.value")
root
|-- value: struct (nullable = true)
| |-- device_id: string (nullable = true)
| |-- locale: string (nullable = true)
| |-- E2EDK1000: string (nullable = true)
| |-- E2EDK2000: string (nullable = true)
| |-- E2EDK3000: string (nullable = true)
Here is my issue: I want to select all the fields within value that start with the following pattern E2EDK1. However, I'm stuck on how to do that. Here is the end result that I want:
root
|-- E2EDK1000: string (nullable = true)
I know I can select that field outright, but in the data I'm working with it is not always the case that E2EDK1000 will always be there. What always will be there is E2EDK1.
I've tried using startsWith() but that does not seem to work, e.g.
val tmp2 = tmp
.select($"value".getItem(_.startsWith("E2EDK1")))