I am querying data from HDFS using Impala in a python script using the python library Impyla. The specific data is proxy data and there is tons of it. I have a script that runs daily to pull the previous day and runs statistics. Currently I am using the devicereceipttime
field for this query which is stored as a timestamp.
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
#Pull desired features from the proxy_realtime_p table
cursor.execute('select request, count(*) as count \
from default.proxy_realtime_p \
where devicereceipttime BETWEEN concat(to_date(now() - interval 1 days), " 00:00:00") and concat(to_date(now() - interval 1 days), " 23:59:59") \
group by request \
order by count desc')
This query takes a little bit and would like to speed this up if possible. From the given fields below is my query the most efficient?
devicereceipttime (timestamp)
year (int)
month (int)
day (int)
hour (int)
minute (int)
seconds (int)