I'm experiencing extremely slow write speed when inserting rows into a partitioned Hive table using impyla.
This is an example of the code I wrote in Python:
import datetime
import os

from impala.dbapi import connect

targets = ...  # targets is a dictionary of objects of a specific class

yesterday = datetime.date.today() - datetime.timedelta(days=1)
log_datetime = datetime.datetime.now()

# Static-partition INSERT; the three %s placeholders are filled per row.
query = """
INSERT INTO my_database.mytable
PARTITION (year={year}, month={month}, day={day})
VALUES ('{yesterday}', '{log_ts}', %s, %s, %s, 1, 1)
""".format(yesterday=yesterday, log_ts=log_datetime,
           year=yesterday.year, month=yesterday.month,
           day=yesterday.day)
print(query)

# One parameter tuple per row to insert.
rows = tuple((i.campaign, i.adgroup, i.adwordsid)
             for i in targets.values())

connection = connect(host=os.environ["HADOOP_IP"],
                     port=10000,
                     user=os.environ["HADOOP_USER"],
                     password=os.environ["HADOOP_PASSWD"],
                     auth_mechanism="PLAIN")
cursor = connection.cursor()
# Session setting for dynamic partitioning (the INSERT above uses a static partition spec).
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
cursor.executemany(query, rows)
Interestingly, even though I'm issuing a single executemany call, impyla still resolves it into multiple MapReduce jobs. In fact, I can see one MapReduce job launched for each tuple in the tuple of tuples I pass to executemany.
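As far as I understand, this is consistent with the DB-API (PEP 249), which allows executemany to simply run the statement once per parameter tuple, i.e. something equivalent to:

for params in rows:
    # each execute() is a separate INSERT and, on Hive, a separate MapReduce job
    cursor.execute(query, params)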
Do you see anything wrong? To give you an idea: after more than an hour it had written just 350 rows.
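Would batching all rows into a single multi-row VALUES clause, so that only one statement (and hopefully one job) runs, be the right fix? Here is an untested sketch; I'm inlining the three fields as quoted strings, which is an assumption about their types and skips proper escaping:

# Untested sketch: build one INSERT carrying every row in a single VALUES clause.
values = ", ".join(
    "('{}', '{}', '{}', '{}', '{}', 1, 1)".format(
        yesterday, log_datetime, i.campaign, i.adgroup, i.adwordsid)
    for i in targets.values())
cursor.execute("""
INSERT INTO my_database.mytable
PARTITION (year={year}, month={month}, day={day})
VALUES {values}
""".format(year=yesterday.year, month=yesterday.month,
           day=yesterday.day, values=values))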