I have tokenized the sentences into a word RDD, and now I need bigrams.
Example: "This is my test" => (This is), (is my), (my test)
I searched and found the .sliding operator for this purpose, but that option does not show up in my Eclipse (maybe it is only available in a newer version of Spark).
So how can I do this without .sliding?

Adding code to get started:

public static void biGram (JavaRDD<String> in)
{
    JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
    //get bigram from sentence w/o sliding - CODE HERE
}
insomniac
  • Could you post your code? sliding only works on iterables, so you can use mapPartitions with sliding. Once you upload the code I can write something up. – z-star May 18 '16 at 14:59

2 Answers

You can simply use the NGram feature transformer in Spark ML.

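import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

//Assumes an existing SQLContext named sqlContext is in scope
public static void biGram (JavaRDD<String> in)
{
    //Converting each string into a row (map over the input RDD "in")
    JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));

    StructType schema = new StructType(new StructField[] {
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });

    //Creating dataframe
    DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);

    //Tokenizing sentence into words, splitting on whitespace.
    //The minimum token length is left at its default (1): setMinTokenLength(4)
    //would drop short words such as "is" and "my" before the bigrams are built.
    RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
            .setPattern("\\s+");
    DataFrame rtDF = rt.transform(dataFrame);

    //Creating bigrams
    NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2);  //Here setN(2) means bigram
    DataFrame bigramDF = bigram.transform(rtDF);

    System.out.println("Result :: "+bigramDF.select("bigram").collectAsList());
}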
Seeni

sliding is indeed the way to go for n-grams. The thing is, sliding works on iterators, so just split each sentence and slide over the resulting array. Here is some Scala code.

val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))     
z-star
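
For the original question (Java, without .sliding), the same split-and-slide idea can be written as a plain loop over the word array. The following is only a minimal sketch, not code from either answer: the class name BigramSketch and the method name bigrams are made up for illustration, and it assumes words are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
final class BigramSketch {

    // Returns the adjacent word pairs of one sentence, lowercased.
    // A sentence with fewer than two words yields an empty list.
    static List<String> bigrams(String sentence) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return result;
        }
        // Assumption: words are separated by one or more whitespace characters.
        String[] words = trimmed.split("\\s+");
        // Manual window of size 2, standing in for sliding(2).
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [this is, is my, my test]
        System.out.println(bigrams("This is my test"));
    }
}
```

With Spark this could probably be plugged into the question's method as in.flatMap(s -> BigramSketch.bigrams(s)) on Spark 1.x, where the flatMap function returns an Iterable; on Spark 2.x and later the function has to return an Iterator, so .iterator() would be appended to the call.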