My sample data looks like this:
{ Line 1
Line 2
Line 3
Line 4
...
...
...
Line 6
Complete info:
Dept : HR
Emp name is Andrew lives in Colorodo
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Alex lives in Texas
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Mathew lives in California
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Dept : QC
Emp name is Nguyen lives in Nevada
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Cassey lives in Newyork
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Ronney lives in Alasca
DOB : 03/09/1958
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ControlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
line21
line22
line23
...
}
The output I need:
{
Dept Empname State DOB Projectname DOJ DOL
HR Andrew Colorodo 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Andrew Colorodo 03/09/1958 Retail 11/04/2011 08/21/2013
HR Andrew Colorodo 03/09/1958 Audit 09/11/2013 09/01/2014
HR Andrew Colorodo 03/09/1958 ControlManagement 01/08/2015 02/14/2016
HR Alex Texas 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Alex Texas 03/09/1958 ControlManagement 01/08/2015 02/14/2016
HR Mathew California 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Mathew California 03/09/1958 Retail 11/04/2011 08/21/2013
HR Mathew California 03/09/1958 Audit 09/11/2013 09/01/2014
HR Mathew California 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Nguyen Nevada 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Nguyen Nevada 03/09/1958 Retail 11/04/2011 08/21/2013
QC Nguyen Nevada 03/09/1958 Audit 09/11/2013 09/01/2014
QC Nguyen Nevada 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Cassey Newyork 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Cassey Newyork 03/09/1958 ControlManagement 01/08/2015 02/14/2016
QC Ronney Alasca 03/09/1958 Audit 09/11/2013 09/01/2014
QC Ronney Alasca 03/09/1958 ControlManagement 01/08/2015 02/14/2016
}
I have tried the options below:
1) I thought of using a map inside a map and then matching, but got many errors. Then I read a post here explaining that a map can't contain another map; in fact, no RDD transformation can be performed inside another. Sorry, I'm a newbie to Spark.
2) I tried using a regular expression and then calling map over the captured groups. But since each dept has multiple employees and each employee has multiple projects, I can't capture that repeating portion of the data and map it back to the corresponding employee. The same goes for the employee and dept details.
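To show what I mean in attempt 2, here is a minimal sketch in plain Scala. It is not my exact code: the object name, the pattern names and the `classify` helper are made up for illustration. Each line matches fine on its own, but a project line carries no employee or dept, which is where I get stuck.

```scala
object RegexAttempt {
  // Hypothetical patterns, one per kind of line in the sample data.
  val Dept    = """Dept : (\w+)""".r
  val Emp     = """Emp name is (\w+) lives in (\w+)""".r
  val Project = """Project name : (\w+)""".r

  // Classify a single line in isolation, the way a per-line map would.
  def classify(line: String): String = line.trim match {
    case Dept(d)       => s"dept=$d"
    case Emp(name, st) => s"emp=$name state=$st"
    case Project(p)    => s"project=$p"
    case _             => "other"
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "Dept : HR",
      "Emp name is Andrew lives in Colorodo",
      "Project name : Healthcare",
      "Project name : Retail"
    )
    // Each project line matches, but nothing ties it back to Andrew or HR.
    lines.map(classify).foreach(println)
  }
}
```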
Q1: Is it even possible to convert the sample data above into the output format above in Spark/Scala?
Q2: If so, what logic or concept should I be going after?
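To make Q2 concrete, here is a sketch in plain Scala (no Spark) of the kind of logic I imagine might be needed: walk the lines in order, remember the most recent dept, employee and project, and emit one row each time a DOL line closes a project. Everything here (`Row`, `Ctx`, `flatten`, the field names) is hypothetical and only covers the line shapes in my sample; I don't know whether this is the right concept or how it carries over to RDDs.

```scala
object FlattenSketch {
  // One flattened output row; field names are hypothetical.
  final case class Row(dept: String, emp: String, state: String, dob: String,
                       project: String, doj: String, dol: String)

  // Context carried from line to line while scanning.
  final case class Ctx(dept: String = "", emp: String = "", state: String = "",
                       dob: String = "", project: String = "", doj: String = "",
                       rows: Vector[Row] = Vector.empty)

  private val Emp = """Emp name is (\S+) lives in (\S+)""".r

  // Value after the first " : " on a "Key : value" line.
  private def value(line: String): String = line.split(" : ", 2)(1).trim

  def flatten(lines: Iterable[String]): Vector[Row] =
    lines.map(_.trim).foldLeft(Ctx()) { (c, line) =>
      line match {
        case Emp(name, st)                        => c.copy(emp = name, state = st)
        case l if l.startsWith("Dept : ")         => c.copy(dept = value(l))
        case l if l.startsWith("DOB : ")          => c.copy(dob = value(l))
        case l if l.startsWith("Project name : ") => c.copy(project = value(l))
        case l if l.startsWith("DOJ : ")          => c.copy(doj = value(l))
        case l if l.startsWith("DOL : ")          =>
          // DOL closes a project block, so emit one complete row here.
          c.copy(rows = c.rows :+
            Row(c.dept, c.emp, c.state, c.dob, c.project, c.doj, value(l)))
        case _ => c // surrounding lines such as "Line 1" are ignored
      }
    }.rows

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "Line 1",
      "Complete info:",
      "Dept : HR",
      "Emp name is Andrew lives in Colorodo",
      "DOB : 03/09/1958",
      "Project name : Healthcare",
      "DOJ : 06/04/2011",
      "DOL : 09/21/2011",
      "Project name : Retail",
      "DOJ : 11/04/2011",
      "DOL : 08/21/2013",
      "Dept : QC",
      "Emp name is Nguyen lives in Nevada",
      "DOB : 03/09/1958",
      "Project name : Audit",
      "DOJ : 09/11/2013",
      "DOL : 09/01/2014"
    )
    // Print each row as space-separated fields.
    flatten(sample).foreach(r => println(r.productIterator.mkString(" ")))
  }
}
```

Since this depends on seeing the lines in their original order, I assume a per-line `map` over an RDD would not be enough on its own, and that the whole file would have to reach one task (perhaps via something like `sc.wholeTextFiles`), but I have not tried that.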
Thanks in advance.