I would like to process a huge XML file that is distributed across an HDFS file system, using the iterparse function from the lxml.etree package.
I have tried it both locally and on an Amazon EMR cluster:
- Locally: the address of my XML file is hdfs://localhost:9000/user/hadoop/history.xml
- On the EMR cluster: the address is /user/hadoop/history.xml
In both cases, running my simple Python program crashes with the following error:
Traceback (most recent call last):
  File "xml_parser.py", line 18, in <module>
    main()
  File "xml_parser.py", line 12, in main
    for event, elem in ET.iterparse(inputFile, tag='page'):
  File "src/lxml/iterparse.pxi", line 78, in lxml.etree.iterparse.__init__
FileNotFoundError: [Errno 2] No such file or directory: <file_path>
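If I read the traceback correctly, lxml.etree.iterparse opens the path with ordinary local file I/O, so it presumably cannot resolve an hdfs:// URI (or a path that only exists inside HDFS), which would explain the FileNotFoundError.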
Here is my Python program:
import sys
import lxml.etree as ET
from pyspark import SparkContext
from pyspark.sql import SparkSession


def main():
    sc = SparkContext()
    spark = SparkSession(sc)
    inputFile = sys.argv[1]

    for event, elem in ET.iterparse(inputFile, tag='page'):
        elem.clear()
    print('finished')


if __name__ == "__main__":
    main()
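Since iterparse also accepts a file-like object, one direction I am considering is opening an HDFS input stream myself and handing it to iterparse. Below is a minimal, untested sketch, assuming pyarrow built with HDFS support is available on the nodes; the host and port come from my local setup above and are only placeholders for the EMR cluster:

import lxml.etree as ET
from pyarrow import fs

# Connect to the HDFS NameNode (placeholder host/port from my local setup).
hdfs = fs.HadoopFileSystem("localhost", 9000)

# open_input_stream returns a file-like object, which iterparse can read from.
with hdfs.open_input_stream("/user/hadoop/history.xml") as f:
    for event, elem in ET.iterparse(f, tag='page'):
        elem.clear()

Would this be the right approach, or is there a more idiomatic way to feed an HDFS file to iterparse?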