
I am getting different results when running the same query in Athena against an S3 source vs. doing it from within a pyspark script on an EMR (1 x 10) cluster. I get data back from Athena, but all I get are nulls with the script. Any suggestions/thoughts/guesses as to why?

Here is the Athena query:

SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM <my Athena table name> 
WHERE partition_datetime LIKE '2019-01-01-14' 
limit 10

Which returns this result:

reg_plate   model_num
   515355  961-824
   515355  961-824
   515355  961-824
   515355  961-824
   341243  047-891
   727027  860-403
   619656  948-977
   576345  951-657
   576345  951-657
   113721  034-035

However, when I run this query as a script against the same S3 source with:

# Define SQL query
load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM s3_table
WHERE partition_datetime LIKE '2019-01-01-14'
limit 10  """

df1 = spark.read.parquet("<s3:path to my data>")
df1.createOrReplaceTempView("s3_table")

sqlDF = spark.sql(load_qry)
sqlDF.show(10)

I get nothing but nulls, like this

+---------+---------+
|reg_plate|model_num|
+---------+---------+
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
|     null|     null|
+---------+---------+
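To narrow down whether the nulls come from the Parquet read itself or from the SQL layer on top of it, here is a minimal diagnostic sketch (reusing the placeholder column names and S3 path from the script above):

# Diagnostic sketch: <real_col1>, <real_col2> and the S3 path are the same
# placeholders used above, not real names
df1 = spark.read.parquet("<s3:path to my data>")

# Does the inferred schema contain the columns the query expects?
df1.printSchema()

# Are the rows populated when the columns are selected directly, without SQL?
df1.select("<real_col1>", "<real_col2>") \
    .filter("partition_datetime LIKE '2019-01-01-14'") \
    .show(10, truncate=False)

If these rows come back populated while the spark.sql version does not, the problem sits in how the query resolves the temp view rather than in the Parquet files themselves.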

Here is the configuration on my cluster, which is 1 master r3.xlarge and 10 r3.xlarge workers: [screenshot of cluster config]

Here is the command string I use to initiate the spark job: PYSPARK_PYTHON=/opt/miniconda/bin/python nohup spark-submit --driver-memory 8g --executor-memory 30g --conf spark.executor.cores=8 --conf spark.driver.maxResultSize=8g --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER --conf spark.debug.maxToStringFields=100 --conf spark.sql.hive.convertMetastoreParquet=false stk_overflow.py > stk_oflow0120.txt 2>&1

Thom Rogers
  • I cannot replicate the error. Does `df.show()` show anything? – Foxan Ng Jan 21 '19 at 03:03
  • df.show shows 10 rows of 'null' for each column. The image in the OP is a screenshot. Have also edited my post to include the cluster config. – Thom Rogers Jan 21 '19 at 03:42
  • Or sorry I mean `df1.show()`. Just want to see if spark.read.parquet can read anything at all – Foxan Ng Jan 21 '19 at 04:08
  • I did a printSchema() on both df1 and sqlDF and they both showed the same columns and dataTypes, so I'm convinced that at least the schema is correct. I'll go back and see if df1.show() displays and post it here. Thanks for your help on this. – Thom Rogers Jan 21 '19 at 04:29
  • df1.show() gives me fully populated rows. I compared against the data returned by Athena and noticed that where the source data has a value of 'NULL', that value is correctly returned by df1.show() as NULL, but where the source data is blank in a column/row, that column/row intersection is returned not as blank, but as 'null' in the df1.show(). I further confirmed that the data for these columns in the source data has actual values (as shown in the Athena qry above) and are not blank. – Thom Rogers Jan 21 '19 at 05:16
  • I'm a bit confused. You said "df1.show() gives me fully populated rows", but then "...but as 'null' in the df1.show() ... has actual values and are not blank." 1. Do the row counts of df1 / sqlDF match the query in Athena? 2. Does `df1.show()` show the expected result? If so, we can focus on the part with `load_qry`; if not, we can contain the problem within df1. 3. I suspect it may have something to do with AWS Glue Data Catalog settings, but I'm not familiar with AWS EMR. – Foxan Ng Jan 21 '19 at 07:01
  • Can you show folder structure of the files you are trying to read? – Serge Harnyk Jan 21 '19 at 08:55
  • Folder structure: s3://<bucket>/<prefix_01>/<prefix_02>/<prefix_03>/<prefix_04>/. I do not list the parquet files in my spark.read.parquet call; the path ends with the fwd slash right after prefix_04 – Thom Rogers Jan 21 '19 at 14:40
  • @Foxan Ng I probably should have said 'correctly' populated instead of 'fully'. Meaning the data matches the source (except that blanks in the source show up as 'null' in the df), but those 'nulls' ARE in the same columns as in the source. Regarding row counts, df1 is over 100 billion rows, while sqlDF has 10. – Thom Rogers Jan 21 '19 at 15:32
  • Here is the command string I use to initiate the spark job: PYSPARK_PYTHON=/opt/miniconda/bin/python nohup spark-submit --driver-memory 8g --executor-memory 30g --conf spark.executor.cores=8 --conf spark.driver.maxResultSize=8g --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER --conf spark.debug.maxToStringFields=100 --conf spark.sql.hive.convertMetastoreParquet=false stk_overflow.py > stk_oflow0120.txt 2>&1 – Thom Rogers Jan 21 '19 at 15:39
  • Were the parquet files migrated from legacy systems, or were they generated using pyspark on EMR? Did you rename the columns later, i.e. do the column names differ in later partitions? Please look here if it matches your scenario: https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Reading-old-parquet-tables-in-Hive-Spark-returns-NULL-in/m-p/54735 – Tanveer Uddin Jan 22 '19 at 10:13
  • Did you check whether pointing to a table path that doesn't exist gives the same result? The idea is to evaluate whether your script is actually finding the table definition. Also, could you please post the table definition using `SHOW CREATE TABLE`? – Roberto Jan 26 '19 at 10:07

1 Answer


I found a simple solution.

Instead of

load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num 
FROM s3_table WHERE partition_datetime LIKE '2019-01-01-14' limit 10 """ 
df1 = spark.read.parquet("<s3:path to my data>") 
df1.createOrReplaceTempView("s3_table") 

I used

load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM <my_athena_db>.table WHERE partition_datetime LIKE '2019-01-01-14'
limit 10 """
df1 = spark.sql(load_qry)

Which works because Glue knows how to get to "my_athena_db.table".
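For completeness, here is a self-contained sketch of the working version. The database, table and column names are placeholders, and this assumes the EMR cluster is configured to use the AWS Glue Data Catalog as its Hive metastore:

from pyspark.sql import SparkSession

# Assumes the Glue Data Catalog is the Hive metastore on this EMR cluster;
# <my_athena_db>, <my_table>, <real_col1> and <real_col2> are placeholders
spark = (SparkSession.builder
         .appName("query_glue_table")
         .enableHiveSupport()
         .getOrCreate())

load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
FROM <my_athena_db>.<my_table>
WHERE partition_datetime LIKE '2019-01-01-14'
limit 10"""

df1 = spark.sql(load_qry)
df1.show(10)

The key difference from the first attempt is that the table is resolved through the Glue/Athena catalog instead of a temp view over a raw spark.read.parquet path.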

Thom Rogers