I am getting different results when running the same query in Athena against an S3 source versus running it from a PySpark script on an EMR cluster (1 master, 10 workers). Athena returns data, but all I get from the script are nulls. Any suggestions/thoughts/guesses as to why?
Here is the Athena query:
SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM <my Athena table name>
WHERE partition_datetime LIKE '2019-01-01-14'
limit 10
Which returns this result:
reg_plate model_num
515355 961-824
515355 961-824
515355 961-824
515355 961-824
341243 047-891
727027 860-403
619656 948-977
576345 951-657
576345 951-657
113721 034-035
However, when I run this query as a script against the same S3 source with:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stk_overflow").getOrCreate()

# Define SQL query (same columns and partition filter as the Athena query)
load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM s3_table
WHERE partition_datetime LIKE '2019-01-01-14'
LIMIT 10"""

# Read the Parquet data directly from S3 and register it as a temp view
df1 = spark.read.parquet("<s3:path to my data>")
df1.createOrReplaceTempView("s3_table")

sqlDF = spark.sql(load_qry)
sqlDF.show(10)
I get nothing but nulls, like this:
+---------+---------+
|reg_plate|model_num|
+---------+---------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
+---------+---------+
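For what it's worth, here is a quick diagnostic I can add to the script to compare the schema Spark infers from the Parquet files against the column names the query expects (a sketch using the same df1 as above; <real_col1>/<real_col2> are placeholders for my actual column names):

# Inspect the schema Spark infers from the Parquet files; a case or name
# mismatch between these fields and the ones used in load_qry would make
# every selected column come back null.
df1.printSchema()

# Show a few raw rows without the SELECT, to confirm the data itself loads
df1.show(5, truncate=False)

# List the inferred column names explicitly for comparison
print(df1.columns)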
My cluster is 1 master r3.xlarge and 10 r3.xlarge workers.
Here is the command string I use to initiate the Spark job:

PYSPARK_PYTHON=/opt/miniconda/bin/python nohup spark-submit \
  --driver-memory 8g \
  --executor-memory 30g \
  --conf spark.executor.cores=8 \
  --conf spark.driver.maxResultSize=8g \
  --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER \
  --conf spark.debug.maxToStringFields=100 \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  stk_overflow.py > stk_oflow0120.txt 2>&1
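One variant I have not shown above but plan to try: querying the table through Spark's Hive support instead of reading the Parquet path directly, so Spark uses the catalog's column definitions rather than its own schema inference. A minimal sketch, assuming the Glue Data Catalog is configured as the EMR cluster's metastore and the table name matches the Athena table (<my_glue_database> is a placeholder for my database name):

from pyspark.sql import SparkSession

# Enable Hive support so Spark can see catalog tables
# (assumes the EMR cluster uses Glue as its metastore)
spark = SparkSession.builder \
    .appName("stk_overflow_catalog") \
    .enableHiveSupport() \
    .getOrCreate()

qry = """SELECT <real_col1> AS reg_plate, <real_col2> AS model_num
FROM <my_glue_database>.<my Athena table name>
WHERE partition_datetime LIKE '2019-01-01-14'
LIMIT 10"""

spark.sql(qry).show(10)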