We tried using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching with a plain SELECT statement.
Using:
JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("AIRecordLinkage").setMaster("local[*]"));
Dataset<Row> sourceFileContent = spark.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
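Here connectionProperties is a standard java.util.Properties holding the Oracle JDBC settings; the values below are placeholders rather than our real credentials:

java.util.Properties connectionProperties = new java.util.Properties();
connectionProperties.put("user", "db_user");          // placeholder
connectionProperties.put("password", "db_password");  // placeholder
connectionProperties.put("driver", "oracle.jdbc.OracleDriver");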
We are able to import around 1.8 million records into the Spark environment, held in a Dataset&lt;Row&gt; object. We then filter the target dataset with targetFileContent.filter(col("TARGETUPC").equalTo(upcValue)).
This filter statement sits inside a loop in which upcValue is updated for approximately 46k UPC IDs.
The program runs for several hours, whereas the same lookup done with a SQL IN operator containing all 46k UPC IDs executes in under a minute.
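For reference, the direct-SQL comparison boils down to a single IN query. The sketch below shows its rough shape over plain JDBC as an illustration; credentials are placeholders and the 46k-value IN list is abbreviated, so it is not our exact statement:

// Rough JDBC sketch of the single-statement comparison; table/column names match
// the Spark code below, credentials are placeholders, and the IN list is abbreviated.
String sql = "SELECT TARGETUPC, PART_NUMBER FROM Database_name.Table_name"
        + " WHERE TARGETUPC IN ( /* ~46k UPC values */ )";
try (java.sql.Connection conn = java.sql.DriverManager.getConnection(
             "jdbc:oracle:thin:@//Connection_IP/stage", "db_user", "db_password");
     java.sql.Statement stmt = conn.createStatement();
     java.sql.ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        System.out.println(rs.getString("TARGETUPC") + "\t" + rs.getString("PART_NUMBER"));
    }
} catch (java.sql.SQLException e) {
    e.printStackTrace();
}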
Configuration:
spark-sql_2.11
spark-core_2.11
JDK 8
Windows 10, single node, 4 cores @ 3 GHz, 16 GB RAM
C drive -> 12 GB free space
Eclipse -> Run configuration -> -Xms15000m
Kindly help us analyze whether we are making any mistakes, and suggest what we should change to improve the performance.
import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

@Component("upcExactMatch")
public class UPCExactMatch {

    @Autowired
    private Environment environment;

    @Autowired
    private LoadCSV loadCSV;        // our CSV-loading component

    @Autowired
    private SQLHandler sqlHandler;  // our SQL helper component

    public ArrayList<Row> perform() {
        ArrayList<Row> upcNonMatchedItemIDs = new ArrayList<Row>();
        ArrayList<Row> upcMatchedItemIDs = new ArrayList<Row>();

        JavaSparkContext javaSparkContext = new JavaSparkContext(
                new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(javaSparkContext);
        SparkSession sparkSession = SparkSession.builder().appName("JavaStopWordshandlerTest").getOrCreate();

        try {
            // Source records (~46k rows) loaded from CSV; sourceFileName and sourceFileLocation
            // come from our configuration.
            Dataset<Row> sourceFileContent = loadCSV.load(sourceFileName, sourceFileLocation, javaSparkContext, sqlContext);

            // Target records (~1.8 million rows) loaded from the Oracle database;
            // connectionProperties holds the Oracle credentials.
            Dataset<Row> targetFileContent = sparkSession.read()
                    .jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);

            System.out.println("File counts :" + sourceFileContent.count() + " : " + targetFileContent.count());

            ArrayList<String> upcMatched = new ArrayList<String>();
            ArrayList<String> partNumberMatched = new ArrayList<String>();

            // Bring all source rows to the driver and iterate over them.
            List<Row> sourceFileContents = sourceFileContent.collectAsList();

            int upcColumnIndex = -1;
            int itemIDColumnIndex = -1;
            int partNumberTargetIndex = -1;
            String upcValue = "";

            // Empty dataset with the target schema; reassigned on every iteration.
            StructType schema = targetFileContent.schema();
            List<Row> data = Arrays.asList();
            Dataset<Row> upcMatchedRows = sparkSession.createDataFrame(data, schema);

            // For each source UPC, filter the target dataset for exact matches.
            for (Row rowSourceFileContent : sourceFileContents) {
                upcColumnIndex = rowSourceFileContent.fieldIndex("Vendor UPC");

                if (!rowSourceFileContent.isNullAt(upcColumnIndex)) {
                    upcValue = rowSourceFileContent.get(upcColumnIndex).toString();
                    upcMatchedRows = targetFileContent.filter(col("TARGETUPC").equalTo(upcValue));

                    if (upcMatchedRows.count() > 0) {
                        for (Row upcMatchedRow : upcMatchedRows.collectAsList()) {
                            partNumberTargetIndex = upcMatchedRow.fieldIndex("PART_NUMBER");

                            if (partNumberTargetIndex != -1) {
                                upcMatched.add(upcValue);
                                partNumberMatched.add(upcMatchedRow.get(partNumberTargetIndex).toString());
                                System.out.println("Source UPC : " + upcValue + "\tTarget part number :" + upcMatchedRow.get(partNumberTargetIndex));
                            }
                        }
                    }
                }
            }

            // Report the matched UPC / part-number pairs.
            for (int i = 0; i < upcMatched.size(); i++) {
                System.out.println("Matched Exact UPC ids are :" + upcMatched.get(i) + "\t:Target\t" + partNumberMatched.get(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            sparkSession.stop();
            sqlContext.clearCache();
            javaSparkContext.close();
        }

        return upcMatchedItemIDs;
    }
}