
I have a file and I want to feed it to an MLlib algorithm, so I am following the example and doing something like this:

val data = sc.textFile(my_file).map { line =>
  val parts = line.split(",")
  Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))
}

This works, except that sometimes a feature is missing: one column of a row has no data, and I want to throw such rows away.

So I want to do something like this:

map { line => if (containsMissing(line)) skipLine else ... // same as before }

How can I implement this skipLine action?


3 Answers


You can use the filter function to filter out such lines:

val data = sc.textFile(my_file)
  .filter(_.split(",").length == cols)
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))
  }

This assumes the variable cols holds the number of columns in a valid row.
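
Note that split(",") with the default limit drops trailing empty strings but keeps interior ones, so a length check alone can miss a row like "1,,3" where a column is present but empty. A minimal sketch that also rejects blank fields, assuming the same sc, my_file and cols as above:

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile(my_file)
  .filter { line =>
    // split with limit -1 keeps trailing empty fields, so "1,2,"
    // is not silently shortened; nonEmpty also rejects interior blanks
    val parts = line.split(",", -1)
    parts.length == cols && parts.forall(_.nonEmpty)
  }
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))
  }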


You can use flatMap with Some and None for this:

def missingFeatures(parts: Array[String]): Boolean = ??? // determine whether a feature is missing

val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    if (missingFeatures(parts)) None
    else Some(Vectors.dense(parts.slice(1, parts.length).map(_.toDouble)))
  }

This way you avoid traversing the RDD more than once.
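
If a field can also be present but non-numeric, a variant of the same idea (a sketch, not part of the original answer) wraps the parse in scala.util.Try, so any row whose features fail to parse as Double is dropped in the same single pass:

import scala.util.Try
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    // Try(...).toOption is None whenever toDouble throws, e.g. on an
    // empty or malformed field, so that row is silently filtered out
    Try(Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))).toOption
  }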


Java code to skip empty lines and the header line of a Spark RDD:

First the imports:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

The filter keeps rows that have all 17 columns and are not the header row, which starts with VendorID (so the predicate is named isValid below):

Function<String, Boolean> isValid = row -> row.split(",").length == 17 && !row.startsWith("VendorID");
JavaRDD<String> taxis = sc.textFile("datasets/trip_yellow_taxi.data")
                          .filter(isValid);