
I have a CSV file which is more or less "semi-structured":

    rowNumber;ColumnA;ColumnB;ColumnC;
    1;START; b; c;
    2;;;;
    4;;;;
    6;END;;;
    7;START;q;x;
    10;;;;
    11;END;;;

Now I would like the data of this row --> 1;START; b; c; to be populated downwards until an 'END' is found in ColumnA. Then it should take this row --> 7;START;q;x; and fill the cells below with its values until the next 'END' (here: 11;END;;;).

I am a complete beginner and it is pretty tough for me to figure out how to start:

    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._
    import scala.collection.mutable.ListBuffer


    val masterList = new CSVReader(new FileReader("/root/csvfile.csv"),';')
    var startList = new ListBuffer[String]()
    var derivedList = new ListBuffer[String]()



    for (row <- masterList.readAll) {
        row(1) = row(1).trim
        if (row(1) == "START")
          startList += row(0)      
    }
    var i = 0 
    for (i <- i to startList.length ) {
      for(row <- masterList.readAll)
      if (row(0) > startList(i) && row(0) < startList(i+1)) {
        derivedList += row
      }
    }

I started reading the file with CSVReader and created a masterList. I created a loop that iterates over the rows and puts all the START row numbers into startList (so I know the range from one START to the next START). Then I created a second loop where I wanted to put the matching rows into a new ListBuffer, but this does not work.

The next step would be to merge masterList and derivedList.

I need some good ideas, or a push in the right direction, on how I could proceed or how I could do this a bit more easily. Help very much appreciated!

I don't know if it makes a big difference in the first place, but I want to create an Apache Spark application. There is also the option to do this in Python (if that is easier).

The output should look like this:

    1;START; b; c;
    2;;b;c;
    4;;b;c;
    6;END;;;
    7;START;q;x;
    10;;q;x;
    11;END;;;

The line with END is never touched. Just fill up the lines below START with the values of ColumnB and ColumnC.

meecrob
5 Answers


Apparently you misuse the readAll function on the stream.

for (i <- i to startList.length ) {
  for(row <- masterList.readAll)
  if (row(0) > startList(i) && row(0) < startList(i+1)) {
    derivedList += row
  }
}

After the first readAll call, the second call can't read anything, because the first readAll moved the stream pointer to the end of the file. You would need to reset the stream pointer, or simply read the file once and reuse the result.
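
For illustration, a minimal sketch of the second option (assuming the same opencsv setup and file path as in the question): read all rows once into a collection and then traverse that collection as often as needed.

    import au.com.bytecode.opencsv.CSVReader
    import java.io.FileReader
    import scala.collection.JavaConversions._

    // Read the whole file once and keep the rows in memory,
    // so the underlying stream is only consumed a single time.
    val reader = new CSVReader(new FileReader("/root/csvfile.csv"), ';')
    val rows = reader.readAll.toList   // List[Array[String]]
    reader.close()

    // `rows` can now be iterated as often as needed, e.g. to collect the START row numbers.
    val startRows = rows.filter(r => r(1).trim == "START").map(_(0))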

wherby

I recommend the solutions already given in the other answers (using a library); here's an answer on how NOT to do it :P

Code

val text = """rowNumber;ColumnA;ColumnB;ColumnC;
1;START; b; c;
2;;;;
4;;;;
6;END;;;
7;START;q;x;
10;;;;
11;END;;;"""

val groupsRe = """(?s)\S*?START.*?END\S*""".r
val valuesRe = """^.*?START;\s*(\S*);\s*(\S*);""".r

groupsRe.findAllIn(text)
  .map(_.split("\n"))
  .map({ lines =>
    lines.head match {
      case valuesRe(a, b) =>
        lines.map(_.replaceAll(";;;;", s";;$a;$b;"))
      case g =>
        println(s"Don't recognise $g")
        lines
    }
  }).map(_.mkString("\n"))
  .mkString("\n")

Output

res0: String = 1;START; b; c;
2;;b;c;
4;;b;c;
6;END;;;
7;START;q;x;
10;;q;x;
11;END;;;
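
To run the same snippet against the actual file instead of the inline string, the text could be read first with, for example, scala.io.Source (a small sketch, not part of the original answer; the path is the one from the question):

    import scala.io.Source

    // Read the whole CSV file into one string, then apply groupsRe/valuesRe as above.
    val text = Source.fromFile("/root/csvfile.csv").mkString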
bjfletcher

A somewhat functional approach using scanLeft [1]. There is a drop(1) after readAll because it reads the header too, and another drop(1) at the end because scanLeft's result starts with the initial value.

scanLeft lets you work with the previous value of the computation. It takes a function with two parameters: the first is the result of the previous computation and the second is the current value of the list (or iterator). It also needs a sentinel value to serve as the "previous" value for the first element, and I am providing an empty list in this case.
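
As a tiny illustration (not from the original answer) of how scanLeft keeps every intermediate result, including the sentinel:

    // The leading 0 in the result is the sentinel value.
    List(1, 2, 3).scanLeft(0)(_ + _)   // List(0, 1, 3, 6)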

Now, in the function I am passing to scanLeft, (prev, curr) => ...: if curr, the current row of the CSV, starts with "START" or "END", nothing needs to be done; that is the expected value. Otherwise (case _) we take the first two values of the current row and append the previous row without its first two columns (i.e. drop 2). You could also append prev(2) and prev(3) instead.

And finally there is a drop(1), because scanLeft returns the sentinel (the initial value) as the first element of the result, which we don't need.

import au.com.bytecode.opencsv.CSVReader
import java.io.FileReader
import scala.collection.JavaConversions._

val reader = new CSVReader(new FileReader("test.in"), ';')
val start = List("", "", "", "")          // sentinel "previous" row
val res = reader.readAll.drop(1)          // drop(1): skip the header row
  .scanLeft(start)((prev, curr) => {
    curr(1) match {
      case "START" => curr.toList         // keep START rows unchanged
      case "END"   => curr.toList         // keep END rows unchanged
      case _       => curr(0) :: curr(1) :: (prev drop 2)  // fill the remaining columns from the previous row
    }
  }).drop(1)                              // drop(1): remove the sentinel from the result

To view the result:

for(r <- res) {
  println(r)
}

This will output

List(1, START,  b,  c, )
List(2, ,  b,  c, )
List(4, ,  b,  c, )
List(6, END,  ,  , )
List(7, START, q, x, )
List(10, , q, x, )
List(11, END, , , )

[1]. Reduce, fold or scan (Left/Right)?

Biswanath
  • This solution is great, but could you please add a few comments to your lines, as I have a few problems completely understanding your code. Thanks in advance! – meecrob Jun 21 '15 at 19:38
  • @meecrob, added more explanation; please let me know if some of the things are not clear. – Biswanath Jun 22 '15 at 07:43

Well, if you want an easy, already available solution, try spark-csv: https://github.com/databricks/spark-csv

It will load the file as a DataFrame, so you will be in a better position to handle the data.

Use the following command to launch spark-shell:

$ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3


val doc1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .load("/Users/abc/Documents/doc1.csv")

And the content will look like:

+-----+------+----+------+----+
|  NOX|    RM| AGE|   DIS|CHAS|
+-----+------+----+------+----+
|0.538| 6.575|65.2|  4.09|   1|
|0.469| 6.421|78.9|4.9671|   2|
|0.469| 7.185|61.1|4.9671|   3|
|0.458| 6.998|45.8|6.0622|   4|
|0.458| 7.147|54.2|6.0622|   5|
|0.458|  6.43|58.7|6.0622|   6|
|0.524| 6.012|66.6|5.5605|   7|
|0.524| 6.172|96.1|5.9505|   1|
|0.524| 5.631| 100|6.0821|   2|
|0.524| 6.004|85.9|6.5921|   3|
|0.524| 6.377|94.3|6.3467|   4|
|0.524| 6.009|82.9|6.2267|   5|
|0.524| 7.009|  39|5.4509|   6|
|0.538| 8.009|61.8|4.7075|   7|
|0.538| 9.009|84.5|4.4619|   8|
|0.538|10.009|56.5|4.4986|   9|
|0.538|11.009|29.3|4.4986|  10|
|0.538|12.009|81.7|4.2579|  11|
|0.538|13.009|36.6|3.7965|  12|
|0.538|14.009|69.5|3.7965|   6|
+-----+------+----+------+----+

Now you can use Spark SQL or DataFrame operations like filtering to pre-process your data as needed.
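
For the file from the question, a minimal sketch could look like this (assuming a Spark version that supports sqlContext.read, as in the snippet above, with the spark-csv package on the classpath; the path and filter are only an illustration):

    // Load the semicolon-delimited file from the question as a DataFrame.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", ";")
      .option("header", "true")
      .load("/root/csvfile.csv")

    // For instance, keep only the START rows to get the values that should be filled down.
    val startRows = df.filter(df("ColumnA") === "START")
    startRows.show()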

Abhishek Choudhary
  • This looks awesome, but unfortunately Spark 1.3+ is needed and my cluster is running with 1.2.1 -- so not an option. I get this error "bad option: '--packages'" – meecrob Jun 21 '15 at 19:36
  • Aah, that's unfortunate... then map it to a function and use an iterative way, using a sequence or parallelize – Abhishek Choudhary Jun 21 '15 at 22:25

You could use the MasterDetailProcessor that is built into uniVocity-parsers (uniVocity-parsers is used by spark-csv; I am not sure whether the RowProcessor implementations are made readily available through spark-csv).

If you go with the library itself, you can have your problem solved with the following code (sorry it's in Java):

    CsvParserSettings settings = new CsvParserSettings(); //many more options here, check the documentation.
    settings.getFormat().setDelimiter(';');
    settings.getFormat().setLineSeparator("\n");
    settings.setHeaderExtractionEnabled(true); // extract headers from first row

    //A processor of master detail records. The master row is at the top, so you have RowPlacement.TOP here.
    //The second arg is an instance of ObjectRowListProcessor - this allows you to configure data conversions - check the docs
    MasterDetailListProcessor processor = new MasterDetailListProcessor(RowPlacement.TOP, new ObjectRowListProcessor()) {
        @Override
        protected boolean isMasterRecord(String[] row, ParsingContext context) {
            return row.length > 1 && "START".equals(row[1]);
        }
    };
    //Configures the parser to use your master detail processor.
    settings.setRowProcessor(processor);

    //The parse method will process the input and submit the rows to your processor
    new CsvParser(settings).parse(new FileReader(new File("path/to/your.csv")));

    //All you need to do is to get the (master) records here:
    List<MasterDetailRecord> masterRecords = processor.getRecords();

    for(MasterDetailRecord master : masterRecords){
        System.out.println("Master row:" + Arrays.toString(master.getMasterRow()));
        // Each master record has detail rows (everything after your "START" row)
        for(Object[] childRow : master.getDetailRows()){
            //By default you will get NULLS instead of blank Strings - probably easier for you to work with
            //This can be configured in the parser settings if you need.
            System.out.println("\t" + Arrays.toString(childRow));
        }
    }

Output:

Master row:[1, START, b, c, null]
    [2, null, null, null, null]
    [4, null, null, null, null]
    [6, END, null, null, null]
Master row:[7, START, q, x, null]
    [10, null, null, null, null]
    [11, END, null, null, null]

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Jeronimo Backes