Need an explanation of the internal workings of the read_table method in pyarrow.parquet

Question

I store all the required Parquet tables in a Hadoop filesystem (HDFS), and each file has a unique path that identifies it. These paths are pushed into a RabbitMQ queue as JSON and consumed by a consumer (running in CherryPy) for processing. After a message is consumed, the first path is sent for reading, and each following path is read once the preceding read has finished. To read a specific table I use the following line of code:

data_table = parquet.read_table(path_to_the_file)

Let's say I have five read tasks in the message. The first read completes successfully, and then, before the remaining reads are performed, I manually stop my server. Because four reads are still pending, no success acknowledgement for the message is sent to the queue. Once I restart the server, consumption and reading start again from the beginning. But now, when read_table is called on the first path, it hangs completely.

Digging into the workflow of the read_table method, I found where it actually gets stuck, but I need a further explanation of how this method reads a file from a Hadoop filesystem.

path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet' 
data_table = parquet.read_table(path)

Can somebody please give me a picture of the internal implementation that runs after this method is called, so that I can find where the issue actually occurs and how to solve it?
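One way to narrow down where the call hangs is to perform the connection and the read as two separate steps instead of passing the full URI to read_table. The following is only a sketch, assuming a recent pyarrow release that ships the pyarrow.fs module and a working libhdfs setup; the helper names split_hdfs_uri and read_parquet_in_two_steps are hypothetical and not part of pyarrow.

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri):
    """Split an hdfs:// URI into (host, port, path). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {uri!r}")
    return parsed.hostname, parsed.port, parsed.path


def read_parquet_in_two_steps(uri):
    """Connect first, then read, so a hang can be attributed to one step."""
    # Imported lazily so the URI helper above works without pyarrow installed.
    import pyarrow.parquet as pq
    from pyarrow import fs

    host, port, path = split_hdfs_uri(uri)

    # Step 1: connect to the NameNode. If execution stops here, the problem
    # is in the HDFS connection rather than in reading the Parquet file.
    hdfs = fs.HadoopFileSystem(host, port)

    # Step 2: read the file through the filesystem that is already open.
    return pq.read_table(path, filesystem=hdfs)
```

Calling read_parquet_in_two_steps with the path from the question and logging before and after each step should show whether the connection or the read itself is the part that blocks.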

This might not be a suitable question for StackOverflow (I would try the mailing list of Apache Arrow instead: https://arrow.apache.org/community/). Also, you say "I found out where it actually gets stuck", so where did it get stuck? — joris, Sep 10 '20 at 08:37
I went into the source code of the library and found that it gets stuck in the `hdfs.connect()` function. Thanks for the direction :) — Blackdeath, Sep 11 '20 at 03:51
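If a blocking call such as the connection step never returns, wrapping it in a timeout at least turns the silent hang into an error that can be logged. The sketch below uses only the standard library; call_with_timeout is a hypothetical helper, not a pyarrow API. It does not cancel the underlying call: the worker thread keeps running in the background, so this is a diagnostic aid rather than a fix.

```python
import threading


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread and raise TimeoutError if it takes too long."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = func(*args, **kwargs)
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if the call
    # never returns.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {timeout_s} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

For example, the connection could be attempted as call_with_timeout(connect, 30), where connect stands for whatever function opens the HDFS connection in the application.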


0 Answers