How can I extract particular fields from an RDD[String] into a list of maps that contain only those fields? I have an RDD[String]:

    org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19]

Each entry is JSON in this format:

{
  count: 1,
  itemId: "1122334",
  country: {
    code: {
      preferred: "USA"
    },
    name: {
      preferred: "America"
    }
  },
  states: "50",
  self: {
    otherInfo: [

    ],
    preferred: "National Parks"
  },
  Rating: 4  
}

How do I get a list of maps that have only itemId as the key and self.preferred as the value ({itemId, self.preferred}), like this:

itemId : 1122334 self.preferred : "National Parks"
itemId : 3444444 self.preferred : "State Parks"
...

Is it efficient to broadcast the resulting map across all nodes? I need this map to be shared/referenced by further calculations.

Nathaniel Ford
Swetha
  • Whether it's efficient or not depends on the size of the map. If you really want it to be a list (or a HashMap, which would be more suitable here), you'll need to `.collect()` the RDD to the driver, which may not work if the RDD is too large to fit in the driver's memory. In that case you'll need to use an `RDD[(String, String)]` to hold your mapping and then use `.join()` to translate item IDs to preferred values. – Dmitry Dzhus Sep 07 '16 at 23:17
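The `.join()` alternative from the comment above could look roughly like the sketch below. It assumes the mapping has already been parsed into an `RDD[(String, String)]`; the object name `JoinLookupSketch`, the `translate` helper, the `events` RDD, and the sample values are all hypothetical and only serve the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinLookupSketch {
  // Replace each item ID in `events` with its preferred name via a distributed join.
  // This is an inner join, so events whose ID is missing from `mapping` are dropped.
  def translate(mapping: RDD[(String, String)],
                events: RDD[(String, Int)]): RDD[(String, Int)] =
    events.join(mapping).map { case (_, (count, preferred)) => (preferred, count) }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying this out; a real job gets it from spark-submit.
    val conf = new SparkConf().setAppName("join-lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data standing in for the parsed JSON records.
      val mapping = sc.parallelize(Seq("1122334" -> "National Parks", "3444444" -> "State Parks"))
      val events = sc.parallelize(Seq("1122334" -> 1, "3444444" -> 2, "9999999" -> 3))
      translate(mapping, events).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is collected to the driver here until the final `collect()`, so the mapping itself can be larger than driver memory. Use `leftOuterJoin` instead of `join` if unmatched IDs must be kept.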

1 Answer

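You can try:

    // Assumes the org.json library is on the classpath.
    import org.json.JSONObject

    // Keep only records that contain both fields we need.
    val filteredMappingsList = countryMapping.filter { x =>
      val jsonObj = new JSONObject(x)
      jsonObj.has("itemId") && jsonObj.has("self")
    }

    // Extract (itemId, self.preferred) pairs and collect them to the driver as a Map.
    val finalMapping = filteredMappingsList.map { x =>
      val jsonObj = new JSONObject(x)
      val itemId = jsonObj.get("itemId").toString
      val preferred = jsonObj.getJSONObject("self").get("preferred").toString
      (itemId, preferred)
    }.collectAsMap()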

To broadcast the map:

    val broadcastedAsins = sc.broadcast(finalMapping)
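Once broadcast, the map is read on the executors through `.value`. The sketch below shows one way later computations might use it. The object name `BroadcastLookupSketch`, the `label` helper, the `ids` RDD, the `"unknown"` fallback, and the sample values are assumptions made for illustration; only `broadcastedAsins` and `finalMapping` come from the code above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import scala.collection.Map

object BroadcastLookupSketch {
  // Pair each ID with its preferred name, read from the broadcast map on the executors.
  // IDs missing from the map get the placeholder "unknown" (a hypothetical choice).
  def label(ids: RDD[String],
            lookup: Broadcast[Map[String, String]]): RDD[(String, String)] =
    ids.map(id => (id, lookup.value.getOrElse(id, "unknown")))

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying this out; a real job gets it from spark-submit.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the result of collectAsMap, which returns a scala.collection.Map.
      val finalMapping: Map[String, String] =
        Map("1122334" -> "National Parks", "3444444" -> "State Parks")
      val broadcastedAsins = sc.broadcast(finalMapping)
      val ids = sc.parallelize(Seq("1122334", "3444444", "9999999"))
      label(ids, broadcastedAsins).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Referencing `lookup.value` inside the closure, rather than the plain map, is what lets Spark ship the map to each executor once instead of with every task.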