
My setup is a 3-node cluster running in AWS. I have already ingested my data (30 million rows) and have no problems running queries from a Jupyter notebook. But now I am trying to run a query using Spark and Java, as shown in the following snippet.

import java.util.HashMap;
import java.util.Map;

import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.geotools.data.DataStoreFinder;

public class SparkSqlTest {

    private static final Logger log = Logger.getLogger(SparkSqlTest.class);


    public static void main(String[] args) {
        Map<String, String> dsParams = new HashMap<>();
        dsParams.put("instanceId", "gis");
        dsParams.put("zookeepers", "server ip");
        dsParams.put("user", "root");
        dsParams.put("password", "secret");
        dsParams.put("tableName", "posiciones");

        try {
            DataStoreFinder.getDataStore(dsParams);
            SparkConf conf = new SparkConf();
            conf.setAppName("testSpark");
            conf.setMaster("yarn");
            SparkContext sc = SparkContext.getOrCreate(conf);
            SparkSession ss = SparkSession.builder().config(conf).getOrCreate();

            Dataset<Row> df = ss.read()
                .format("geomesa")
                .options(dsParams)
                .option("geomesa.feature", "posicion")
                .load();
            df.createOrReplaceTempView("posiciones");

            long t1 = System.currentTimeMillis();
            Dataset<Row> rows = ss.sql("select count(*) from posiciones where id_equipo = 148 and fecha_hora >= '2015-04-01' and fecha_hora <= '2015-04-30'");
            long t2 = System.currentTimeMillis();
            rows.show();

            log.info("Tiempo de la consulta: " + ((t2 - t1) / 1000) + " segundos.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
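
As an aside, the snippet above discards the return value of DataStoreFinder.getDataStore. GeoTools returns null (rather than throwing) when no matching DataStore can be loaded, so keeping the reference and checking it inside the same try block can surface classpath problems early. This is only a sketch, and it assumes an extra import of org.geotools.data.DataStore:

DataStore ds = DataStoreFinder.getDataStore(dsParams);
if (ds == null) {
    // null means GeoTools could not find a matching DataStore factory,
    // typically because the GeoMesa Accumulo jars are not on the classpath
    throw new IllegalStateException("Could not load the GeoMesa Accumulo data store");
}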

I upload the jar to my master EC2 box (inside the Jupyter notebook image) and run it using the following commands:

docker cp myjar-0.1.0.jar jupyter:myjar-0.1.0.jar
docker exec jupyter sh -c '$SPARK_HOME/bin/spark-submit --master yarn --class mypackage.SparkSqlTest file:///myjar-0.1.0.jar --jars $GEOMESA_SPARK_JARS'
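
One thing worth double-checking in the spark-submit line above: spark-submit treats everything after the application jar as arguments to the application itself, so a --jars option placed after file:///myjar-0.1.0.jar is not seen by Spark. A sketch of the same invocation with --jars moved before the application jar:

docker exec jupyter sh -c '$SPARK_HOME/bin/spark-submit --master yarn --class mypackage.SparkSqlTest --jars $GEOMESA_SPARK_JARS file:///myjar-0.1.0.jar'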

But I got the following error:

17/09/15 19:45:01 INFO HSQLDB4AD417742A.ENGINE: dataFileCache open start
17/09/15 19:45:02 INFO execution.SparkSqlParser: Parsing command: posiciones
17/09/15 19:45:02 INFO execution.SparkSqlParser: Parsing command: select count(*) from posiciones where id_equipo = 148 and fecha_hora >= '2015-04-01' and fecha_hora <= '2015-04-30'
java.lang.RuntimeException: Could not find a SpatialRDDProvider
at org.locationtech.geomesa.spark.GeoMesaSpark$$anonfun$apply$2.apply(GeoMesaSpark.scala:33)
at org.locationtech.geomesa.spark.GeoMesaSpark$$anonfun$apply$2.apply(GeoMesaSpark.scala:33)

Any ideas why this happens?

  • Which jars does $GEOMESA_SPARK_JARS include? If it doesn't include geomesa-accumulo-spark-runtime_2.11-${version}.jar, that might explain the issue. – GeoJim Sep 15 '17 at 21:27
  • Oh, the other suggestion/question would be to check the return value of "DataStoreFinder.getDataStore(dsParams);". If the GeoMesa AccumuloDataStore is not on the classpath, that line will happily return 'null'. – GeoJim Sep 15 '17 at 21:29
  • This is the value of $GEOMESA_SPARK_JARS: file:///opt/geomesa/dist/spark/geomesa-accumulo-spark-runtime_2.11-1.3.2.jar,file:///opt/geomesa/dist/spark/geomesa-spark-converter_2.11-1.3.2.jar,file:///opt/geomesa/dist/spark/geomesa-spark-geotools_2.11-1.3.2.jar – jramirez Sep 15 '17 at 22:03
  • How about the value of the 'getDataStore' call? I'd guess that it is null (in which case, there might be an issue with the Accumulo dependencies not being on the classpath). – GeoJim Sep 16 '17 at 04:58
  • I've just checked and it is not null. Any other idea why my SQL queries work when using Jupyter, but not when using this approach? – jramirez Sep 16 '17 at 20:05
  • I changed my version of GeoMesa to 1.3.2 and that error went away, but now I am seeing: 17/09/17 12:23:46 INFO execution.SparkSqlParser: Parsing command: select count(*) from posiciones where id_equipo = 148 and fecha_hora >= '2015-04-01' and fecha_hora <= '2015-04-30' java.lang.RuntimeException: Could not find a SparkGISProvider at org.locationtech.geomesa.spark.GeoMesaSpark$$anonfun$apply$2.apply(GeoMesaSpark.scala:33) – jramirez Sep 17 '17 at 12:38

1 Answer


I finally sorted it out: my problem was that I had not included the following entries in my pom.xml:

    <dependency>
        <groupId>org.locationtech.geomesa</groupId>
        <artifactId>geomesa-accumulo-spark_2.11</artifactId>
        <version>${geomesa.version}</version>
    </dependency>

    <dependency>
        <groupId>org.locationtech.geomesa</groupId>
        <artifactId>geomesa-spark-converter_2.11</artifactId>
        <version>${geomesa.version}</version>
    </dependency>