
I have looked at other similar questions asked already on this site, but did not get a satisfactory answer.

I am a total newbie to Apache Spark and Hadoop. My problem is that I have an input file (35 GB) which contains multi-line reviews of merchandise from online shopping sites. The information is given in the file as shown below:

productId: C58500585F
product:  Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.

This is one block of a review. There are thousands of such blocks separated by blank lines. What I need from each block are the productId, userId and score, so I have filtered the JavaRDD to keep just the lines I need. It will then look like the following:

productId: C58500585F
userId: A3NM6WTIAE
score: 2.0

Code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

SparkConf conf = new SparkConf().setAppName("org.spark.program").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);

JavaRDD<String> input = context.textFile("path");

JavaRDD<String> requiredLines = input.filter(new Function<String, Boolean>() {
    public Boolean call(String s) throws Exception {
        // keep the three fields I need, plus the blank lines that separate blocks
        return s.contains("productId") || s.contains("userId") || s.contains("score") || s.isEmpty();
    }
});

Now I need to read these three lines as part of one (key, value) pair, but I do not know how to do that. There will only be a blank line between two blocks of reviews.

I have looked at several websites, but did not find a solution to my problem. Can anyone please help me with this? Thanks a lot! Please let me know if you need more information.

Student
  • Have you looked into playing around with `textinputformat.record.delimiter`? Something like [this](http://stackoverflow.com/questions/27541637/how-to-process-multi-line-input-records-in-spark). Doing so would allow you to get an RDD where each record consists of the entire block of text. – Junjun Olympia Oct 14 '16 at 08:46
  • @Student: Are the block fields (productId, product, etc.) split by any delimiter? – Shankar Oct 14 '16 at 09:24
  • @Student: Also what you need is `map` not `filter` – Shankar Oct 14 '16 at 09:25
  • @Shankar No, the only delimiter they have is that they are on separate lines, so they are separated only by the newline delimiter and no other special delimiter. – Student Oct 14 '16 at 15:23
  • @Junjun Olympia I have looked into it, but as I said in the question, there is no special delimiter. The blocks are only separated by an empty line. – Student Oct 14 '16 at 15:37
  • @Student: After the `JavaRDD<String> input = context.textFile("path");` line, can you foreach the input RDD and let me know whether it prints the entire block as one record, or each line in the block as a separate record? – Shankar Oct 14 '16 at 15:58
  • @Shankar oh ok, but how would I use `map`? – Student Oct 14 '16 at 17:10

1 Answer


Continuing on from my previous comments, `textinputformat.record.delimiter` can be used here. If the only delimiter is a blank line, then the value should be set to `"\n\n"`.

Consider this test data:

productId: C58500585F
product:  Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.

productId: ABCDEDFG
product:  Teddy Bear
product/price: 6.50
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Second comment.

productId: 12345689
product:  Hot Wheels
product/price: 12.00
userId: JJ
profileName: JJ
helpfulness: 1/1
score: 4.0
time: 1624609
summary: Summarized
text: Some text

Then the code (in Scala) would look something like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Treat a blank line as the record separator, so each review block comes in as one record
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
val raw = sc.newAPIHadoopFile("test.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)

// Split each block into "field: value" lines, build a map, and pull out the three fields
val data = raw.map(e => {
  val m = e._2.toString
    .split("\n")
    .map(_.split(":", 2))
    .filter(_.size == 2)
    .map(e => (e(0), e(1).trim))
    .toMap

  (m("productId"), m("userId"), m("score").toDouble)
})

Output is:

data.foreach(println)
(C58500585F,A3NM6WTIAE,2.0)
(ABCDEDFG,A3NM6WTIAE,2.0)
(12345689,JJ,4.0)

I wasn't sure exactly what you wanted for the output, so I just turned each block into a 3-element tuple. Also, the parsing logic could definitely be made more efficient if you need it to be, but this should give you something to work with.
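Since your code is in Java, a rough Java equivalent of the same idea might look like the sketch below. I haven't run this; it reuses the `context` from your snippet, names the Hadoop configuration `hadoopConf` so it doesn't clash with your `SparkConf`, and guesses that the (key, value) pair you want is ((productId, userId), score):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

// Treat a blank line as the record separator, so each review block comes in as one record
Configuration hadoopConf = new Configuration();
hadoopConf.set("textinputformat.record.delimiter", "\n\n");

JavaPairRDD<LongWritable, Text> raw = context.newAPIHadoopFile(
        "path", TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

JavaPairRDD<Tuple2<String, String>, Double> data = raw.mapToPair(
        new PairFunction<Tuple2<LongWritable, Text>, Tuple2<String, String>, Double>() {
            public Tuple2<Tuple2<String, String>, Double> call(Tuple2<LongWritable, Text> record) throws Exception {
                // Copy the Text into a String right away, then parse "field: value" lines into a map
                Map<String, String> fields = new HashMap<String, String>();
                for (String line : record._2().toString().split("\n")) {
                    String[] kv = line.split(":", 2);
                    if (kv.length == 2) {
                        fields.put(kv[0].trim(), kv[1].trim());
                    }
                }
                return new Tuple2<Tuple2<String, String>, Double>(
                        new Tuple2<String, String>(fields.get("productId"), fields.get("userId")),
                        Double.parseDouble(fields.get("score")));
            }
        });

From there you can reshape the key however you like, for example group the scores per key with `data.groupByKey()`, depending on what you want to do next.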