Apache Spark Java - how to iterate through row dataset and remove null fields

Question

I'm trying to build a Spark application that reads data from a Hive table and writes the output as JSON.

In the code below, I need to iterate through the rows of the dataset and remove the null fields before writing the output.

I expect output like the following. How can I achieve this?

{"personId":"101","personName":"Sam","email":"Sam@gmail.com"}
{"personId":"102","personName":"Smith"}  // as email is null or blank should not be included in output

Here is my code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.fdc.model.Person;

public class ExtractionExample {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("ExtractionExample")
                .config("spark.sql.warehouse.dir", "/user/hive/warehouse/").enableHiveSupport().getOrCreate();
        Dataset<Row> sqlDF = spark.sql("SELECT person_id as personId, person_name as personName, email_id as emailId FROM person");
        Dataset<Person> person = sqlDF.as(Encoders.bean(Person.class));

        /*  
         * Iterate through all the columns, identify null values, and drop them.
         * It looks like this would drop the column from the entire table, but when I tried it, it did nothing.
         * String[] columns = sqlDF.columns();
        for (String column : columns) {
            String colValue = sqlDF.select(column).toString();
            System.out.println("printing the column: "+ column +" colvalue:"+colValue.toString());
            if(colValue != null && colValue.isEmpty() && (colValue).trim().length() == 0) {
                System.out.println("dropping the null value");
                sqlDF = sqlDF.drop(column);
            }

        }
        sqlDF.write().json("/data/testdb/test/person_json");
        */

        /* 
         * 
         * I was unable to get to the bottom of this approach.
         * Also, collect() is a heavy operation. Is there a better way to do this?
         * List<Row> rowListDf = person.javaRDD().map(new Function<Row, Row>() {
                @Override
                public Row call(Row record) throws Exception {
                   String[] fieldNames =  record.schema().fieldNames();
                    Row modifiedRecord = new RowFactory().create();
                   for(int i=0; i < fieldNames.length; i++ ) {
                       String value = record.getAs(i).toString();
                      if (value!= null && !value.isEmpty() && value.trim().length() > 0) {
                          //   RowFactory.create(record.get(i)); ---> throwing this error
                      }
                   }
                    // return RowFactory object
                    return null;
                }
            }).collect();*/


        person.write().json("/data/testdb/test/person_json");

    }
}

There is nothing to be done here. The JSON writer ignores `NULL` fields by default. If you have blank strings, you'll have to convert these to `NULL` as well. - Alper t. Turker, May 18 '18 at 17:50
Thank you; I had assumed that we needed to iterate through each row of the dataset and remove the null values. - Srinivas, May 18 '18 at 19:34

score 0 · Answer 1 · answered May 18 '18 at 19:36

As suggested by user9613318 in the comments, the JSON writer ignores NULL fields by default, so there is no need to iterate through the rows and remove them manually.
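
The comment also notes that blank strings have to be converted to NULL before they are dropped. A minimal sketch of one way to do that is below. It is not taken from the original code: the class name `BlankToNullExample`, the helpers `blankToNull` and `sampleJson`, and the in-memory sample rows (used in place of the Hive table) are all hypothetical and for illustration only.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.trim;
import static org.apache.spark.sql.functions.when;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class BlankToNullExample {

    // Hypothetical helper: replaces empty or whitespace-only values
    // in every string column with NULL. Other column types are untouched.
    static Dataset<Row> blankToNull(Dataset<Row> df) {
        Dataset<Row> result = df;
        for (StructField field : df.schema().fields()) {
            if (field.dataType().equals(DataTypes.StringType)) {
                String name = field.name();
                result = result.withColumn(name,
                        when(trim(col(name)).equalTo(""), lit(null).cast("string"))
                                .otherwise(col(name)));
            }
        }
        return result;
    }

    // Builds two sample rows (assumed data standing in for the Hive query)
    // and returns them as JSON strings after blank values are nulled out.
    static List<String> sampleJson(SparkSession spark) {
        StructType schema = new StructType()
                .add("personId", DataTypes.StringType)
                .add("personName", DataTypes.StringType)
                .add("emailId", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("101", "Sam", "Sam@gmail.com"),
                RowFactory.create("102", "Smith", "   "));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        return blankToNull(df).toJSON().collectAsList();
    }

    public static void main(String[] args) {
        // Local master is an assumption so the sketch runs without a cluster.
        SparkSession spark = SparkSession.builder()
                .appName("BlankToNullExample")
                .master("local[1]")
                .getOrCreate();
        sampleJson(spark).forEach(System.out::println);
        spark.stop();
    }
}
```

In the job from the question, the same helper could presumably be applied as `blankToNull(sqlDF).write().json("/data/testdb/test/person_json")`. On Spark 3.0 and later, the JSON writer also has an `ignoreNullFields` option that controls this behavior; it defaults to dropping null fields.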

answered May 18 '18 at 19:36

Srinivas