
I'm experiencing extremely slow writing speed when trying to insert rows into a partitioned Hive table using impyla.

This is an example of the code I wrote in Python:

import datetime
import os

from impala.dbapi import connect

targets = ...  # targets is a dictionary of objects of a specific class
yesterday = datetime.date.today() - datetime.timedelta(days=1)
log_datetime = datetime.datetime.now()

# Static partition insert; the three %s placeholders are bound per row by executemany
query = """
        INSERT INTO my_database.mytable
        PARTITION (year={year}, month={month}, day={day})
        VALUES ('{yesterday}', '{log_ts}', %s, %s, %s, 1, 1)
        """.format(yesterday=yesterday, log_ts=log_datetime,
                   year=yesterday.year, month=yesterday.month,
                   day=yesterday.day)
print(query)

rows = tuple(tuple([i.campaign, i.adgroup, i.adwordsid])
             for i in targets.values())

connection = connect(host=os.environ["HADOOP_IP"],
                     port=10000,
                     user=os.environ["HADOOP_USER"],
                     password=os.environ["HADOOP_PASSWD"],
                     auth_mechanism="PLAIN")
cursor = connection.cursor()
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
cursor.executemany(query, rows)

Interestingly, even though I'm launching an executemany command, impyla still resolves it into multiple MapReduce jobs. In fact, I can see as many MapReduce jobs launched as there are tuples in the tuple of tuples I'm passing to impyla's executemany method.

Do you see anything wrong? To give you an idea, after more than an hour it had written just 350 rows.
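
For what it's worth, the only workaround I've found so far is to build one multi-row INSERT ... VALUES statement and execute it once, so Hive launches a single job instead of one per row. This is just a rough sketch reusing the variables from the code above; it interpolates the values straight into the SQL with no escaping, so it's only meant to illustrate the idea:

# Sketch only: one multi-row INSERT instead of executemany, so a single job runs.
# Reuses yesterday, log_datetime, targets and cursor from the code above; values
# are formatted inline for illustration and are NOT properly escaped.
values_sql = ",\n".join(
    "('{yesterday}', '{log_ts}', '{campaign}', '{adgroup}', '{adwordsid}', 1, 1)".format(
        yesterday=yesterday, log_ts=log_datetime,
        campaign=i.campaign, adgroup=i.adgroup, adwordsid=i.adwordsid)
    for i in targets.values())

batched_query = """
        INSERT INTO my_database.mytable
        PARTITION (year={year}, month={month}, day={day})
        VALUES {values}
        """.format(year=yesterday.year, month=yesterday.month,
                   day=yesterday.day, values=values_sql)

cursor.execute(batched_query)  # one statement, one Hive job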

Gianluca
    IMHO, running *INSERT* commands in Hive *(using the "transaction mode" and ORC files plus a background compaction process)* is about as efficient as driving to the mall with a bulldozer. Personally, I would use an _EXTERNAL_ table backed by text files, then use an HTTP client to connect to WebHDFS and "create" new text files. Or, "append" data to a single, global text file. – Samson Scharfrichter Sep 23 '16 at 17:41
  • Can you elaborate a little more on your answer, @SamsonScharfrichter? I'm not a Data Engineer, so I'm not that familiar with the solution you suggested. Thanks in advance. – Gianluca Sep 23 '16 at 21:05
  • Just create a Hive *EXTERNAL* table *STORED AS TEXTFILE* with *LOCATION '/some/hdfs/dir'*, then drop a couple of TXT files in that directory and run some SQL, then experiment with the HDFS command line `hdfs dfs -appendToFile`, then look into WebHDFS https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#APPEND *(note that the 2-step process is totally unnecessary, unless you use a very old and buggy HTTP client)* – Samson Scharfrichter Sep 24 '16 at 17:20
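
Here is a rough, untested sketch of the approach Samson describes: an EXTERNAL table stored as text files, with new rows appended through WebHDFS instead of per-row INSERTs. The table name my_database.mytable_ext, the column names, the directory /some/hdfs/dir, the file data.csv and the namenode URL are all hypothetical placeholders, and the sketch ignores partitioning:

# Sketch of the EXTERNAL-table-plus-WebHDFS approach from the comments.
# All names below (table, columns, paths, namenode URL) are placeholders.
import os
import requests
from impala.dbapi import connect

connection = connect(host=os.environ["HADOOP_IP"], port=10000,
                     user=os.environ["HADOOP_USER"],
                     password=os.environ["HADOOP_PASSWD"],
                     auth_mechanism="PLAIN")
cursor = connection.cursor()

# One-off DDL: the table simply points at a directory of comma-delimited text files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_database.mytable_ext (
        load_date STRING, log_ts STRING,
        campaign STRING, adgroup STRING, adwordsid STRING,
        flag1 INT, flag2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/some/hdfs/dir'
""")

# Build the new rows as CSV lines, reusing the variables from the question.
lines = ["{0},{1},{2},{3},{4},1,1".format(yesterday, log_datetime,
                                          i.campaign, i.adgroup, i.adwordsid)
         for i in targets.values()]
payload = "\n".join(lines) + "\n"

# WebHDFS APPEND is a 2-step call: the namenode answers with a 307 redirect
# to a datanode, and the data is POSTed there. The file must already exist
# (create it first with op=CREATE if it does not).
WEBHDFS_URL = "http://namenode:50070/webhdfs/v1"  # adjust to your cluster
path = "/some/hdfs/dir/data.csv"
resp = requests.post(
    "{0}{1}?op=APPEND&user.name={2}".format(WEBHDFS_URL, path,
                                            os.environ["HADOOP_USER"]),
    allow_redirects=False)
requests.post(resp.headers["Location"], data=payload)  # datanode URL from the redirect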

0 Answers