
Context

I use Python (3.7) to run several queries on a Hadoop server.

After several tests, I think Impala is the most efficient engine to query the database. So I set up a connection using the Ibis framework in order to force the use of Impala (Hive is used by default).

Considering the number of queries, I'm trying to run these queries concurrently.

I think I'm getting close, but I'm stuck on a problem when trying to share the Ibis connection to the server among the multiple processes I start.

I'm quite new to Python, but I'm going to do my best to explain my problem clearly and to use the right vocabulary. Please forgive me in advance for any mistakes!

How the queries are submitted

For submitting my queries, the code looks like this:

  • Connection to the database:

hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X', port=Y, hdfs_client=hdfs)

  • Creation of the query (done several times):

query = "SELECT ... FROM ... WHERE ..."

  • Send the query and retrieve the results (done for each query):

query = client.sql(query)
data = query.execute(limit=None)
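
Putting those three steps together, a minimal sequential version looks roughly like the sketch below. The host, the query text, and the queries list are placeholders; 50070 and 21050 are just the usual WebHDFS and Impala default ports:

import ibis

# Placeholder cluster coordinates; 50070/21050 are the usual defaults.
hdfs = ibis.hdfs_connect(host='X.X.X.X', port=50070)
client = ibis.impala.connect(host='X.X.X.X', port=21050, hdfs_client=hdfs)

queries = ["SELECT ... FROM ... WHERE ..."]  # one entry per query to run

for sql in queries:
    expr = client.sql(sql)           # wrap the raw SQL in an Ibis expression
    data = expr.execute(limit=None)  # run it on Impala and fetch the result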

What has been done to run these queries concurrently

For now, I've created a Process class using multiprocessing, and I'm passing it the client parameter (which, I thought, would provide the connection) and a list containing the information required to configure the queries to run on the server:

import multiprocessing

class ParallelDataRetrieving(multiprocessing.Process):

    """Process in charge of retrieving the data."""

    def __init__(self, client, someInformations):

        multiprocessing.Process.__init__(self)

        self.client = client
        self.someInformations = someInformations

    def run(self):

        """Code to run during the execution of the process."""

        cpt1 = 0

        while cpt1 < len(self.someInformations):

            query = ...  # use self.someInformations[cpt1] to create the query

            query = self.client.sql(query)
            data = query.execute(limit=None)

            # Some work on the data...

            cpt1 += 1

        return 0

Then, from the main script, I (try to) establish the connection and start several Processes using this connection:

hdfs = ibis.hdfs_connect(host='X.X.X.X', port=Y)
client = ibis.impala.connect(host='X.X.X.X', port=Y, hdfs_client=hdfs)

process_1 = ParallelDataRetrieving(client,someInformations)
process_1.start()
process_2 = ...

But this code does not work. I get the error "TypeError: can't pickle _thread.lock objects".

From what I understand, this comes from the fact that multiprocessing uses pickle to "encapsulate" the parameters and transfer them to the processes (which run in separate memory spaces on Windows), and it does not seem to be possible to pickle the "client" parameter.
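
As a side note, this diagnosis is easy to confirm without multiprocessing at all: the client object holds thread locks and open sockets internally, and trying to pickle it directly should fail the same way. A minimal check, assuming the client created above:

import pickle

# The Impala client wraps sockets and locks that pickle cannot serialize;
# this should raise "TypeError: can't pickle _thread.lock objects".
pickle.dumps(client)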

I then found several ideas on the internet for solving this issue, but none of them seems applicable to my particular case (Ibis, Impala...):

  • I tried to create the connection directly in the run method of the Process object (which means one connection per Process): this leads to "BrokenPipeError: [Errno 32] Broken pipe". (A sketch of this per-process-connection pattern follows this list.)

  • I tried to use multiprocessing.sharedctypes.RawValue, but if this is the right solution, I'm not very confident I implemented it correctly in my code...
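
For reference, here is a rough sketch of the first idea (one connection per process), written the way it usually has to be done on Windows: only picklable data (host, ports, query parameters) crosses the process boundary, each process opens its own connection inside run, and process creation is guarded by if __name__ == '__main__': (the spawn start method re-imports the main module, and a missing guard is a classic source of pipe errors). The connectionInfo dictionary is my own placeholder, not an Ibis API:

import multiprocessing
import ibis

class ParallelDataRetrieving(multiprocessing.Process):

    """Process that opens its own connection and retrieves the data."""

    def __init__(self, connectionInfo, someInformations):
        multiprocessing.Process.__init__(self)
        # Only plain, picklable data is passed to the child process.
        self.connectionInfo = connectionInfo
        self.someInformations = someInformations

    def run(self):
        # Each process creates (and owns) its own connection.
        hdfs = ibis.hdfs_connect(host=self.connectionInfo['host'],
                                 port=self.connectionInfo['hdfs_port'])
        client = ibis.impala.connect(host=self.connectionInfo['host'],
                                     port=self.connectionInfo['impala_port'],
                                     hdfs_client=hdfs)
        for info in self.someInformations:
            sql = ...  # build the query text from info
            data = client.sql(sql).execute(limit=None)
            # Some work on the data...

if __name__ == '__main__':  # mandatory on Windows (spawn start method)
    connectionInfo = {'host': 'X.X.X.X',
                      'hdfs_port': 50070,    # usual WebHDFS default
                      'impala_port': 21050}  # usual Impala default
    # someInformations: your per-process query configuration (picklable)
    process_1 = ParallelDataRetrieving(connectionInfo, someInformations)
    process_1.start()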

That is pretty much my situation at the moment. I will keep trying to solve this problem, but as something of a newcomer to Python and to multiprocessing of database queries, I thought a more advanced user could probably help me!

Thank you in advance for the time you will devote to this request!

Philippe
  • I think pyspark and the spark-submit command could be an answer. This way you can run a regular Python script as an application on the cluster, with easy access to the database via a SparkContext, for instance. – Grzegorz Skibinski Oct 02 '19 at 15:14
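
A minimal sketch of what that suggestion might look like, assuming a cluster where Spark is available and configured with Hive metastore access (the application name and the query are placeholders):

# launched with: spark-submit --master yarn this_script.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('parallel-queries')  # placeholder application name
         .enableHiveSupport()          # read tables through the Hive metastore
         .getOrCreate())

df = spark.sql("SELECT ... FROM ... WHERE ...")  # placeholder query
df.show()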

0 Answers