
I would like to process a huge XML file distributed across an HDFS file system, using the iterparse function from the lxml.etree package.

I have tried it both locally and on an Amazon EMR cluster:

  • Locally: the address of my XML file is hdfs://localhost:9000/user/hadoop/history.xml
  • EMR cluster: the address is /user/hadoop/history.xml

In both cases, running my simple Python program crashes with the following error:

Traceback (most recent call last):
  File "xml_parser.py", line 18, in <module>
    main()
  File "xml_parser.py", line 12, in main
    for event, elem in ET.iterparse(inputFile, tag='page'):
  File "src/lxml/iterparse.pxi", line 78, in lxml.etree.iterparse.__init__
FileNotFoundError: [Errno 2] No such file or directory: <file_path>

Here is my Python program:

import sys
import lxml.etree as ET
from pyspark import SparkContext
from pyspark.sql import SparkSession


def main():
    sc = SparkContext()
    spark = SparkSession(sc)
    inputFile = sys.argv[1]
    for event, elem in ET.iterparse(inputFile, tag='page'):
        elem.clear()
    print('finished')


if __name__ == "__main__":
    main()
MMasmoudi
  • What if you just open the file yourself and pass the open file object to `iterparse`? – larsks Sep 15 '20 at 12:33
  • I tried this. But, iterparse accepts only a file path, not a file content – MMasmoudi Sep 15 '20 at 12:41
  • 1
    That doesn't appear to be true. I was able to pass a file object to `iterparse` with no errors. Note that you need to open the file in binary mode (`open("somefile.html", "rb")`). – larsks Sep 15 '20 at 12:46
  • My bad. I did not think of this, passing a file object to iterparse. Indeed, we can use a file object instead of a file path. Thank you :) – MMasmoudi Sep 15 '20 at 13:41
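The fix from the comment thread can be sketched as follows: instead of handing iterparse a path string, open the data yourself in binary mode and pass the file-like object. The sketch below uses the standard-library xml.etree.ElementTree so it is self-contained; lxml.etree.iterparse accepts binary file objects the same way (lxml also offers a tag='page' filter, which is emulated by hand here). The in-memory BytesIO stands in for the real file; on HDFS one would obtain the binary stream from a client library (for example the hdfs package or pyarrow), since plain open() cannot read hdfs:// URIs.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree.iterparse takes file objects too

# In-memory stand-in for history.xml; on a cluster, an HDFS client library
# would supply this binary stream instead (open() cannot read hdfs:// URIs).
data = b"<history><page><title>A</title></page><page><title>B</title></page></history>"

titles = []
with io.BytesIO(data) as f:                 # any binary file-like object works
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":              # stand-in for lxml's tag='page' filter
            titles.append(elem.findtext("title"))
            elem.clear()                    # free memory as we stream

print(titles)
```

Opening the file in binary mode matters: the XML parser inspects the raw bytes to detect the document's declared encoding, so a text-mode file object can raise errors or mis-decode.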

0 Answers