I have a CSV file which is more or less "semi-structured":
rowNumber;ColumnA;ColumnB;ColumnC;
1;START; b; c;
2;;;;
4;;;;
6;END;;;
7;START;q;x;
10;;;;
11;END;;;
Now I would like to take the data of this row --> 1;START; b; c; and fill it into the rows below until an 'END' appears in ColumnA. Then it should take this row --> 7;START;q;x; and fill the cells below with its values until the next 'END' (here: 11;END;;;).
I am a complete beginner and it is pretty tough for me to figure out how I should start:
import au.com.bytecode.opencsv.CSVReader
import java.io.FileReader
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

// readAll consumes the reader, so call it once and keep the rows in memory
val masterList = new CSVReader(new FileReader("/root/csvfile.csv"), ';').readAll.toList

val startList = new ListBuffer[Int]()
val derivedList = new ListBuffer[Array[String]]()

// collect the row numbers of all START rows
for (row <- masterList) {
  if (row(1).trim == "START")
    startList += row(0).trim.toInt
}

// collect the rows between consecutive STARTs; compare the row numbers
// as integers, not strings, and stop before the last index so that
// startList(i + 1) never runs past the end of the list
for (i <- 0 until startList.length - 1) {
  for (row <- masterList) {
    val n = row(0).trim.toInt
    if (n > startList(i) && n < startList(i + 1))
      derivedList += row
  }
}
I started by reading the file with CSVReader into a masterList. I created a loop, iterated over the rows, and put all the START row numbers into a list (so I know the range from one START to the next). Then I created a second loop where I wanted to put the rows in between into a new ListBuffer. But this does not work.
The next step would be to merge masterList and derivedList.
I need some good ideas, or a push in the right direction, on how I could proceed or how I could do this a bit more easily. Help very much appreciated!
I don't know if it makes a big difference, but in the end I want to create an Apache Spark application. There is also the option to do this in Python (if that is easier).
The output should look like this:
1;START; b; c;
2;;b;c;
4;;b;c;
6;END;;;
7;START;q;x;
10;;q;x;
11;END;;;
The line with END is never touched. Just fill up the lines below START with the ColumnB and ColumnC values.
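Since the END lines stay as they are and only the blank rows below a START need ColumnB/ColumnC filled in, one simpler direction is a single pass over the rows that just remembers the most recent START values. A minimal sketch in plain Scala, assuming the layout shown above (the sample data is inlined here instead of reading /root/csvfile.csv, so adjust the input as needed):

```scala
import scala.io.Source

// sample input, same layout as the question: rowNumber;ColumnA;ColumnB;ColumnC;
val lines = Source.fromString(
  """1;START; b; c;
    |2;;;;
    |4;;;;
    |6;END;;;
    |7;START;q;x;
    |10;;;;
    |11;END;;;""".stripMargin).getLines.toList

var fillB = ""
var fillC = ""

val filled = lines.map { line =>
  // limit -1 keeps trailing empty fields from the trailing ';'
  val cols = line.split(";", -1).map(_.trim)
  cols(1) match {
    case "START" => fillB = cols(2); fillC = cols(3); line // remember the values
    case "END"   => fillB = ""; fillC = ""; line           // END rows stay untouched
    case _       => s"${cols(0)};;$fillB;$fillC;"          // fill the blanks below START
  }
}

filled.foreach(println)
```

The same idea should carry over to a Spark or Python version: process the rows in rowNumber order and carry the last seen START values forward until an END resets them.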