I'm running Apache Spark 1.6.1 on a small YARN cluster. I'm attempting to pull data from a Hive table, using a query like:
df = hiveCtx.sql("""
SELECT *
FROM hive_database.gigantic_table
WHERE loaddate = '20170502'
""")
However, the resulting DataFrame is the entire table, no matter what value I give for loaddate. The only odd thing I can think of is that the Hive table is partitioned by that loaddate column.
Hive alone runs this query correctly and returns only the matching partition. I've tried casting loaddate to an int, using .filter() instead of a WHERE clause, and various quoting styles, but no luck in Spark.
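For what it's worth, the .filter() variant I tried looked roughly like this (same HiveContext, table, and column as above; this is a sketch of my attempt, not a working solution):

```python
# Sketch of the DataFrame-API attempt, assuming the same hiveCtx
# (HiveContext) and partitioned table as in the SQL query above.
from pyspark.sql.functions import col

df = (hiveCtx.table("hive_database.gigantic_table")
             .filter(col("loaddate") == "20170502"))

# Checks I used to see whether the filter was being applied:
df.explain(True)   # extended plan - look for a pushed-down partition filter
print(df.count())  # still returns the count of the entire table
```

Like the SQL version, this returns every row rather than just the one partition.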