I have looked at other similar questions already asked on this site, but did not find a satisfactory answer.
I am a total newbie to Apache Spark and Hadoop. My problem is that I have an input file (35 GB) containing multi-line reviews of merchandise from online shopping sites. The information appears in the file as shown below:
productId: C58500585F
product: Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.
This is one review block. There are thousands of such blocks, separated by blank lines. What I need from each block is the productId, userId, and score, so I tried to filter the JavaRDD down to just the lines I need, so each block would look like the following:
productId: C58500585F
userId: A3NM6WTIAE
score: 2.0
Code:

SparkConf conf = new SparkConf().setAppName("org.spark.program").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);

JavaRDD<String> input = context.textFile("path");

// Keep only the lines I care about; the field names must match the file
// exactly, so it is "userId" (lower-case u), not "UserId". Blank lines are
// kept too, since they are the only separator between review blocks.
JavaRDD<String> requiredLines = input.filter(new Function<String, Boolean>() {
    public Boolean call(String s) throws Exception {
        return s.startsWith("productId")
                || s.startsWith("userId")
                || s.startsWith("score")
                || s.isEmpty();
    }
});
Now I need to read these three lines as one (key, value) pair, something like (productId, (userId, score)), but I do not know how, since textFile gives me one line per record. There will only be a blank line between two blocks of reviews.
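To show exactly what I mean, here is the plain-Java parsing I have in mind for a single block (the class and method names are just for illustration); this is the logic I imagine running inside a Spark map() once I can get whole blocks delivered as single records instead of individual lines:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class BlockParse {

    // Parse one blank-line-separated review block into
    // (productId, "userId,score").
    static Map.Entry<String, String> parseBlock(String block) {
        String productId = null, userId = null, score = null;
        for (String line : block.split("\n")) {
            if (line.startsWith("productId:")) {
                productId = line.substring("productId:".length()).trim();
            } else if (line.startsWith("userId:")) {
                userId = line.substring("userId:".length()).trim();
            } else if (line.startsWith("score:")) {
                score = line.substring("score:".length()).trim();
            }
        }
        return new SimpleEntry<>(productId, userId + "," + score);
    }

    public static void main(String[] args) {
        String block = "productId: C58500585F\nuserId: A3NM6WTIAE\nscore: 2.0";
        Map.Entry<String, String> kv = parseBlock(block);
        System.out.println(kv.getKey() + " -> " + kv.getValue());
        // prints: C58500585F -> A3NM6WTIAE,2.0
    }
}
```

What I am missing is the Spark side: how to make each multi-line block arrive as one record so that something like this parsing can be applied.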
I have looked at several websites but did not find a solution to my problem. Can anyone please help me with this? Thanks a lot! Please let me know if you need more information.