We tried using Apache Spark to match records faster, but we are finding it far less efficient than SQL matching with a plain SELECT statement.
Using:
JavaSparkContext javaSparkContext = new JavaSparkContext(new SparkConf().setAppName("AIRecordLinkage").setMaster("local[*]"));
Dataset<Row> sourceFileContent = spark.read().jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);
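Here connectionProperties is a standard java.util.Properties holding the Oracle JDBC settings; the values below are placeholders rather than our real credentials:

java.util.Properties connectionProperties = new java.util.Properties();
connectionProperties.put("user", "db_user");          // placeholder
connectionProperties.put("password", "db_password");  // placeholder
connectionProperties.put("driver", "oracle.jdbc.OracleDriver");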
We are able to import around 1.8 million records into the Spark environment, held in a Dataset&lt;Row&gt; object. We then filter the target dataset with targetFileContent.filter(col("TARGETUPC").equalTo(upcValue)).
This filter statement sits inside a loop in which upcValue is updated for approximately 46k UPC IDs.
The program runs for several hours, whereas the same lookup done with a SQL IN operator containing all 46k UPC IDs executes in under a minute.
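For reference, the direct-SQL comparison boils down to a single IN query. The sketch below shows its rough shape over plain JDBC as an illustration; credentials are placeholders and the 46k-value IN list is abbreviated, so it is not our exact statement:

// Rough JDBC sketch of the single-statement comparison; table/column names match
// the Spark code below, credentials are placeholders, and the IN list is abbreviated.
String sql = "SELECT TARGETUPC, PART_NUMBER FROM Database_name.Table_name"
        + " WHERE TARGETUPC IN ( /* ~46k UPC values */ )";
try (java.sql.Connection conn = java.sql.DriverManager.getConnection(
             "jdbc:oracle:thin:@//Connection_IP/stage", "db_user", "db_password");
     java.sql.Statement stmt = conn.createStatement();
     java.sql.ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        System.out.println(rs.getString("TARGETUPC") + "\t" + rs.getString("PART_NUMBER"));
    }
} catch (java.sql.SQLException e) {
    e.printStackTrace();
}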
Configuration:
spark-sql_2.11
spark-core_2.11
JDK 8
Windows 10, single node, 4 cores @ 3 GHz, 16 GB RAM
C drive -> 12 GB free space
Eclipse -> Run configuration -> -Xms15000m
Kindly help us analyze whether we are making any mistakes, and suggest what we should change to improve the performance.
import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

@Component("upcExactMatch")
public class UPCExactMatch {

    @Autowired
    private Environment environment;

    @Autowired
    private LoadCSV loadCSV;        // our CSV-loading component

    @Autowired
    private SQLHandler sqlHandler;  // our SQL helper component

    public ArrayList<Row> perform() {
        ArrayList<Row> upcNonMatchedItemIDs = new ArrayList<Row>();
        ArrayList<Row> upcMatchedItemIDs = new ArrayList<Row>();

        JavaSparkContext javaSparkContext = new JavaSparkContext(
                new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(javaSparkContext);
        SparkSession sparkSession = SparkSession.builder().appName("JavaStopWordshandlerTest").getOrCreate();

        try {
            // Source records (~46k rows) loaded from CSV; sourceFileName and sourceFileLocation
            // come from our configuration.
            Dataset<Row> sourceFileContent = loadCSV.load(sourceFileName, sourceFileLocation, javaSparkContext, sqlContext);

            // Target records (~1.8 million rows) loaded from the Oracle database;
            // connectionProperties holds the Oracle credentials.
            Dataset<Row> targetFileContent = sparkSession.read()
                    .jdbc("jdbc:oracle:thin:@//Connection_IP/stage", "Database_name.Table_name", connectionProperties);

            System.out.println("File counts :" + sourceFileContent.count() + " : " + targetFileContent.count());

            ArrayList<String> upcMatched = new ArrayList<String>();
            ArrayList<String> partNumberMatched = new ArrayList<String>();

            // Bring all source rows to the driver and iterate over them.
            List<Row> sourceFileContents = sourceFileContent.collectAsList();

            int upcColumnIndex = -1;
            int itemIDColumnIndex = -1;
            int partNumberTargetIndex = -1;
            String upcValue = "";

            // Empty dataset with the target schema; reassigned on every iteration.
            StructType schema = targetFileContent.schema();
            List<Row> data = Arrays.asList();
            Dataset<Row> upcMatchedRows = sparkSession.createDataFrame(data, schema);

            // For each source UPC, filter the target dataset for exact matches.
            for (Row rowSourceFileContent : sourceFileContents) {
                upcColumnIndex = rowSourceFileContent.fieldIndex("Vendor UPC");

                if (!rowSourceFileContent.isNullAt(upcColumnIndex)) {
                    upcValue = rowSourceFileContent.get(upcColumnIndex).toString();
                    upcMatchedRows = targetFileContent.filter(col("TARGETUPC").equalTo(upcValue));

                    if (upcMatchedRows.count() > 0) {
                        for (Row upcMatchedRow : upcMatchedRows.collectAsList()) {
                            partNumberTargetIndex = upcMatchedRow.fieldIndex("PART_NUMBER");

                            if (partNumberTargetIndex != -1) {
                                upcMatched.add(upcValue);
                                partNumberMatched.add(upcMatchedRow.get(partNumberTargetIndex).toString());
                                System.out.println("Source UPC : " + upcValue + "\tTarget part number :" + upcMatchedRow.get(partNumberTargetIndex));
                            }
                        }
                    }
                }
            }

            // Report the matched UPC / part-number pairs.
            for (int i = 0; i < upcMatched.size(); i++) {
                System.out.println("Matched Exact UPC ids are :" + upcMatched.get(i) + "\t:Target\t" + partNumberMatched.get(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            sparkSession.stop();
            sqlContext.clearCache();
            javaSparkContext.close();
        }

        return upcMatchedItemIDs;
    }
}