
I'm trying to query MongoDB from the Spark SQL shell. I'm limited to SQL only: no Scala, Python, etc. I intend to use the Thrift server eventually, but for this proof of concept I'm using spark-sql. I'm on EMR with Spark 2.4.4. More info:

Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_242
Branch HEAD
Compiled by user ec2-user on 2019-12-14T00:54:30Z
Revision 5f788d5e8f90539ee331702c753fa250727128f4
Url git@aws157git.com:/pkg/Aws157BigTop
Type --help for more information.

I start my shell with a pointer to MongoDB Spark maven coordinates:

spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.12:2.4.1 --conf spark.mongodb.input.uri=mongodb://something.real/development?readPreference=secondary

Spark SQL seems to recognise the package, judging by the logs:

org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency

Then I run

CREATE TEMPORARY VIEW mongo 
USING com.mongodb.spark.sql.DefaultSource
OPTIONS (
  collection 'accounts'
);

And I get the following error:

java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
        at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<init>(DefaultMongoPartitioner.scala:64)
        at com.mongodb.spark.rdd.partitioner.DefaultMongoPartitioner$.<clinit>(DefaultMongoPartitioner.scala)
        at com.mongodb.spark.config.ReadConfig$.<init>(ReadConfig.scala:48)
        at com.mongodb.spark.config.ReadConfig$.<clinit>(ReadConfig.scala)
        at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:91)
        at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:93)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Any idea how to set up this view using SQL only? Ideally, a solution without any startup command-line options other than the Maven coordinates would be ace.

  • Does this answer your question? [Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)](https://stackoverflow.com/questions/46293697/exception-in-thread-main-java-lang-nosuchmethoderror-scala-product-initlsc) – mazaneicha Apr 05 '20 at 12:44
  • Not sure why, but this works: `spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1` and then ```CREATE TEMPORARY VIEW source USING mongo OPTIONS ( uri '{{mongo_uri}}', collection '{{ collection }}', partitioner 'MongoSamplePartitioner', samplesPerPartition '{{ SAMPLES_PER_PARTITION }}' ); ``` This is running outside of `spark-sql` so there are variables, but works with real values in spark-sql. – Scott Arbeitman Apr 05 '20 at 14:01
  • I believe the error is caused by incompatibility of Scala versions (2.11 vs 2.12) in connector and Spark. – mazaneicha Apr 05 '20 at 14:06

1 Answer


Looks like a mismatch between Scala versions: Spark 2.4.4 on EMR is built against Scala 2.11 (note "Using Scala version 2.11.12" in the banner above), while mongo-spark-connector_2.12 is built for Scala 2.12. That binary incompatibility is what surfaces as java.lang.NoSuchMethodError: scala.Product.$init$.
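If in doubt about which Scala build of Spark you're running, spark-submit --version prints the same banner; pick the connector artifact whose suffix matches:

spark-submit --version
# should report e.g. "Using Scala version 2.11.12" -> use the _2.11 connector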

Starting spark-sql with the Scala 2.11 build of the connector worked alright:

spark-sql --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1

To create the Mongo view using SQL only:

CREATE TEMPORARY VIEW source 
USING mongo
OPTIONS (
  uri 'mongodb://…',
  collection 'my_collection'
);
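
To sanity-check that the view actually reaches Mongo (the URI and collection name above are placeholders for your own values), a couple of simple queries should do, e.g.:

SELECT COUNT(*) FROM source;
SELECT * FROM source LIMIT 10;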