
My sample data looks like this:

{ Line 1
Line 2
Line 3
Line 4
...
...
...
Line 6



Complete info:
Dept : HR
Emp name is Andrew lives in Colorodo
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Alex lives in Texas
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Mathew lives in California
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016

Dept : QC
Emp name is Nguyen lives in Nevada
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Cassey lives in Newyork
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Ronney lives in Alasca
DOB : 03/09/1958
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016


line21
line22
line23
...
}

Output I need:

{

Dept  Empname  State       Dob         Projectname        DOJ         DOL
HR    Andrew   Colorodo    03/09/1958  Healthcare         06/04/2011  09/21/2011
HR    Andrew   Colorodo    03/09/1958  Retail             11/04/2011  08/21/2013
HR    Andrew   Colorodo    03/09/1958  Audit              09/11/2013  09/01/2014
HR    Andrew   Colorodo    03/09/1958  ControlManagement  01/08/2015  02/14/2016
HR    Alex     Texas       03/09/1958  Healthcare         06/04/2011  09/21/2011
HR    Alex     Texas       03/09/1958  ControlManagement  01/08/2015  02/14/2016
HR    Mathew   California  03/09/1958  Healthcare         06/04/2011  09/21/2011
HR    Mathew   California  03/09/1958  Retail             11/04/2011  08/21/2013
HR    Mathew   California  03/09/1958  Audit              09/11/2013  09/01/2014
HR    Mathew   California  03/09/1958  ControlManagement  01/08/2015  02/14/2016
QC    Nguyen   Nevada      03/09/1958  Healthcare         06/04/2011  09/21/2011
QC    Nguyen   Nevada      03/09/1958  Retail             11/04/2011  08/21/2013
QC    Nguyen   Nevada      03/09/1958  Audit              09/11/2013  09/01/2014
QC    Nguyen   Nevada      03/09/1958  ControlManagement  01/08/2015  02/14/2016
QC    Cassey   Newyork     03/09/1958  Healthcare         06/04/2011  09/21/2011
QC    Cassey   Newyork     03/09/1958  ControlManagement  01/08/2015  02/14/2016
QC    Ronney   Alasca      03/09/1958  Audit              09/11/2013  09/01/2014
QC    Ronney   Alasca      03/09/1958  ControlManagement  01/08/2015  02/14/2016
}

I have tried the following options:

1) I thought of using a map inside another map and then matching, but got many errors. Then I read a post here which explained that a map can't contain another map; in fact, no RDD transformation can be invoked inside another (see the sketch after this list). Sorry, I'm a newbie to Spark.

2) I tried using a regular expression and then calling map over the captured groups. But since each dept has multiple employees and each employee has multiple project entries, I can't capture that repeated portion of the data and map it back to the corresponding employee. The same problem applies when mapping employees to their dept details.
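For reference, here is roughly the kind of nesting that fails. This is only a minimal sketch with made-up RDDs, assuming a SparkContext named sc; it is not part of my actual code:

import org.apache.spark.rdd.RDD

// Hypothetical RDDs, only to illustrate the limitation described above
val depts: RDD[String] = sc.parallelize(Seq("HR", "QC"))
val emps: RDD[String]  = sc.parallelize(Seq("Andrew", "Alex"))

// Does NOT work: the inner map would have to run on the executors, but RDDs
// only exist on the driver. This typically fails at runtime (e.g. with a
// NullPointerException or a SparkException saying that transformations and
// actions can only be invoked by the driver).
// val broken = depts.map(d => emps.map(e => (d, e)).collect())

// The relationship has to be expressed on the driver instead, e.g.:
val ok = depts.cartesian(emps)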

Q1: Is it even possible to convert the above sample data into the above output format in Spark/Scala?

Q2: If so, what is the logic/concept that I should be going after?

Thanks in advance.

  • It's not a great match for Spark. Anything that's a linear pass is usually not best done in Spark. It's pretty trivial to do in plain Scala though - just preprocess the file that way and put the result in Spark for later processing? – The Archetypal Paul Dec 07 '16 at 21:46
  • How big is the data? Do you really need Spark for it? – maasg Dec 07 '16 at 23:27
  • The data will be around 75 GB. If any solution/logic is available in Spark (even if the code is complex, lengthy, or inefficient), I want to give it a try before going with anything else. Any ideas? Thanks. – user7264473 Dec 08 '16 at 06:08

1 Answer


Q1: Is it possible to convert such a nested data format using Spark?

A1: Yes. If the records were more granular, I would suggest a multi-line record approach like the one discussed in this question: How to process multi line input records in Spark

But given that in this data each "Dept" block holds a large amount of data, I wouldn't recommend it here.
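For completeness, a rough sketch of that multi-line approach (not part of my actual recommendation): it assumes a SparkContext named sc, a hypothetical input path, and that each record starts with "Dept :" on a new line — the delimiter and the path are assumptions, not something given in the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Every chunk of text up to the next "\nDept :" becomes a single record
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\nDept :")

val deptBlocks = sc.newAPIHadoopFile(
    "hdfs:///path/to/employees.txt",        // hypothetical input path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    hadoopConf)
  .map { case (_, text) => text.toString }  // one multi-line "Dept" block per element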

Q2: What's the logic/concept that I should be going after?

A2: This kind of linear processing, where state is built up as we traverse the lines, is better approached with an iterator- or stream-based implementation:

We consume the input line by line and produce records only when they are complete; the context is preserved in some state. With this approach, it doesn't matter how big the file is, because the memory requirements are limited to the size of one record plus the overhead of the state handling.

Here's a working example of how to deal with it using an iterator in plain Scala:

// One flattened output row; toCSV joins the fields in declaration order
case class EmployeeRecord(dept: String, name: String, location: String, dob: String, project: String, joined: String, left: String) {
  def toCSV = this.productIterator.mkString(", ")
}


class EmployeeParser() {

  var currentStack : Map[String, String] = Map()

  val (dept, name, location, birthdate, project, joined, left) = ("dept", "name", "location", "birthdate", "project", "joined", "left")
  val keySequence = Seq(dept, name, location, birthdate, project, joined, left)
  val ParseKeys = Map("Project name" -> project, "DOJ" -> joined, "DOL" -> left, "DOB" -> birthdate, "Dept" -> dept)
  val keySet = keySequence.toSet

  // When a key is (re)assigned, drop it and every key after it in the
  // sequence, since those values belonged to the previous record
  def clearDependencies(key: String) : Unit = {
    val keysToClear = keySequence.dropWhile(k => k != key).toSet
    currentStack = currentStack.filterKeys(k => !keysToClear.contains(k))
  }

  // Only accept a key once the keys preceding it in keySequence
  // (apart from dept, the first one) are already present
  def isValidEntry(key: String) : Boolean = {
    val precedents = keySequence.takeWhile(k => k != key).drop(1)
    precedents.forall(k => currentStack.contains(k))
  }

  def add(key:String, value:String): Option[Unit] = {
    if (!isValidEntry(key)) None else {
      clearDependencies(key)
      currentStack = currentStack + (key -> value)
      Some(())
    }
  } 

  // A complete record can only be produced once all seven fields are present
  def record: Option[EmployeeRecord] = 
    for {
      _dept <- currentStack.get(dept)
      _name <- currentStack.get(name)
      _location <- currentStack.get(location)
      _dob <- currentStack.get(birthdate)
      _project <- currentStack.get(project)
      _joined <- currentStack.get(joined)
      _left <- currentStack.get(left)
    } yield EmployeeRecord(_dept, _name, _location, _dob, _project,_joined, _left)

  val EmpRegex = "^Emp name is (.*) lives in (.*)$".r
  def parse(line:String):Option[EmployeeRecord] = {
    if (line.startsWith("Emp")) { // have to deal with that inconsistency in a different way than using keys
      // findFirstMatchIn avoids a MatchError if a line starts with "Emp"
      // but does not match the expected pattern
      EmpRegex.findFirstMatchIn(line).foreach(m => { add(name, m.group(1)); add(location, m.group(2)) })
      None
    } else {
      val entry = line.split(":").map(_.trim)
      for { entryKey <- entry.lift(0)
            entryValue <- entry.lift(1)
            key <- ParseKeys.get(entryKey)
            _ <- add(key, entryValue)
            rec <- record
          } yield rec
    }
  }
}

To use it, we instantiate the parser and apply it to an iterator:

import scala.io.Source

val iterator = Source.fromFile(...).getLines
val parser = new EmployeeParser()
val parsedRecords = iterator.map(parser.parse).collect { case Some(record) => record }
val parsedCSV = parsedRecords.map(rec => rec.toCSV)
parsedCSV.foreach { line =>
  // write the line to the destination file
}
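Once the CSV is written out, it can be loaded into Spark for the later processing suggested in the comments. A minimal sketch, assuming Spark 2.x with a SparkSession named spark and a hypothetical output path:

// Read the generated CSV back as a DataFrame with the desired column names
val employees = spark.read
  .option("header", "false")
  .option("ignoreLeadingWhiteSpace", "true") // toCSV separates fields with ", "
  .csv("hdfs:///path/to/employees.csv")      // hypothetical path
  .toDF("Dept", "Empname", "State", "Dob", "Projectname", "DOJ", "DOL")

employees.show()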
maasg