I'm stuck on this issue. I searched extensively online but couldn't find a solution, and I'm new to spark-shell (Scala). The ngrams function works perfectly well in Hive with the command below:

select ngrams(split(name, '\\W+'), 2, 3) from mytable

which returns the top 3 bigrams of the column "name". When I call it in spark-shell with this command

val df = hiveContext.sql("select ngrams(split(name, '\\W+'), 2, 3) from mytable")    

I got these errors:

Spark 2

org.apache.spark.sql.AnalysisException: Undefined function: 'ngrams'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.

Spark 1.6

org.apache.spark.sql.AnalysisException: No handler for udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFnGrams

I also tried the following, with no success:

  1. I separated split from ngrams, i.e. I ran split first and then ngrams. Surprisingly, split works fine but ngrams does not.
  2. I tried sqlContext.register.udf("ngrams", ngrams) and received: error: not found: value ngrams
  3. I added two different versions of the hive-exec jar (hive-exec-1.2.0.jar and hive-exec-3.0.0.jar) using these commands:

    spark-shell --jars /hive-exec-1.2.0.jar

    spark-shell --jars /hive-exec-3.0.0.jar

and got the same errors.

I found the open-source implementation of the ngrams function in this GitHub repository, but it is written in Java and I don't know whether I can call it from spark-shell (Scala).

Maybe this is a trivial issue, and I would really appreciate it if someone could help me.

I'm using Scala 2.11.8, Java 1.8, Spark 2.3.0 and Spark 1.6
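
To make the expected result concrete, here is a minimal plain-Scala sketch of what I understand ngrams(split(name, '\\W+'), 2, 3) to compute: the most frequent n-grams over all rows of the column. The helper name `topNGrams` and the sample strings are made up for illustration; Hive additionally reports an estimated frequency as a double and may treat empty tokens differently.

```scala
// Hypothetical helper (not part of Hive or Spark): split each row on
// non-word characters, build the n-grams of each row, count them over
// all rows, and keep the k most frequent ones.
def topNGrams(rows: Seq[String], n: Int, k: Int): Seq[(List[String], Int)] = {
  val grams = rows.flatMap { row =>
    val tokens = row.split("\\W+").filter(_.nonEmpty).toList
    // sliding(n) yields one short group when a row has fewer than n tokens,
    // so keep only groups of exactly n tokens.
    tokens.sliding(n).filter(_.size == n)
  }
  grams
    .groupBy(identity)
    .map { case (gram, occurrences) => (gram, occurrences.size) }
    .toSeq
    .sortBy { case (_, count) => -count }
    .take(k)
}

// Made-up sample data standing in for the "name" column.
val names = Seq("new york city", "new york state", "york city hall")
val top = topNGrams(names, 2, 3)
top.foreach { case (gram, count) => println(gram.mkString(" ") + " -> " + count) }
```

A row with a single word contributes no bigrams here, which is the behaviour I would want from a Spark version as well.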

Dio
  • http://spark.apache.org/docs/latest/ml-features.html#n-gram. About the error - are you sure you've enabled Hive support? – Alper t. Turker Jul 25 '18 at 22:46
  • so how should I do that? – Dio Jul 25 '18 at 22:49
  • If you create `SparkSession` (Spark 2.x) `val spark = SparkSession.builder.enableHiveSupport().getOrCreate()` – Alper t. Turker Jul 25 '18 at 22:51
  • Using Spark 2, I got this: error: value builder is not a member of org.apache.spark.sql.SparkSession – Dio Jul 25 '18 at 22:56
  • you would want to use https://spark.apache.org/docs/2.3.0/ml-features.html#n-gram – m-bhole Jul 26 '18 at 08:33
  • @hadooper thank you. How can I get, say, the top 3 1-grams or 2-grams using that Scala function? Also, how can I apply it to a column, i.e. input > column, output > column, like the ngrams function in Hive does? – Dio Jul 26 '18 at 13:28
  • @user8371915 I created a SparkSession with the builder call you mentioned, but with no success, or I might have run the wrong command. If you run it on your system, please let me know. You also posted code for ngrams here: https://stackoverflow.com/questions/48461076/how-do-i-create-a-set-of-ngrams-in-spark. That code would help me a great deal with a few changes. First, I think it does not handle the case where a string contains only one word but we are looking for, say, 3-grams. Second, I don't know how to apply it to my table to get all sets of ngrams for one specific column. – Dio Jul 26 '18 at 13:45
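
Following the `NGram` suggestion in the comments, the column-to-column question could be sketched roughly as below. This is an untested sketch under assumptions, not a confirmed fix for the Hive `ngrams` error: the helper name `topNGramsOfColumn` is hypothetical, rows with a null value are dropped before tokenizing, and the result is an exact count per n-gram rather than Hive's estimated frequency.

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, explode, split}

// Hypothetical helper: the k most frequent n-grams of one string column.
def topNGramsOfColumn(df: DataFrame, column: String, n: Int, k: Int): DataFrame = {
  // Tokenize the column on non-word characters, skipping null values.
  val words = df
    .filter(col(column).isNotNull)
    .select(split(col(column), "\\W+").as("words"))

  // NGram joins each group of n consecutive tokens with a space; a row
  // with fewer than n tokens produces an empty array.
  val grams = new NGram()
    .setN(n)
    .setInputCol("words")
    .setOutputCol("ngrams")
    .transform(words)

  // One row per n-gram, then count and keep the k most frequent.
  grams
    .select(explode(col("ngrams")).as("ngram"))
    .groupBy("ngram")
    .count()
    .orderBy(desc("count"))
    .limit(k)
}
```

In a Spark 2.x spark-shell this might be called as `topNGramsOfColumn(spark.table("mytable"), "name", 2, 3).show(false)`; on Spark 1.6, `hiveContext.table("mytable")` would take the place of `spark.table`.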
