I would like to process a huge XML file that is distributed across an HDFS file system, using the iterparse function from the lxml.etree package.
I have tried it both locally and on an Amazon EMR cluster:
- Locally: the address of my XML file is hdfs://localhost:9000/user/hadoop/history.xml
- On the EMR cluster: the address is /user/hadoop/history.xml
In both cases, running my simple Python program crashes with the following error:
Traceback (most recent call last):
  File "xml_parser.py", line 18, in <module>
    main()
  File "xml_parser.py", line 12, in main
    for event, elem in ET.iterparse(inputFile, tag='page'):
  File "src/lxml/iterparse.pxi", line 78, in lxml.etree.iterparse.__init__
FileNotFoundError: [Errno 2] No such file or directory: <file_path>
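If I read the traceback correctly, lxml.etree.iterparse opens the path with ordinary local file I/O, so it presumably cannot resolve an hdfs:// URI (or a path that only exists inside HDFS), which would explain the FileNotFoundError.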
Here is my Python program:
import sys
import lxml.etree as ET
from pyspark import SparkContext
from pyspark.sql import SparkSession


def main():
    sc = SparkContext()
    spark = SparkSession(sc)
    inputFile = sys.argv[1]

    for event, elem in ET.iterparse(inputFile, tag='page'):
        elem.clear()
    print('finished')


if __name__ == "__main__":
    main()
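Since iterparse also accepts a file-like object, one direction I am considering is opening an HDFS input stream myself and handing it to iterparse. Below is a minimal, untested sketch, assuming pyarrow built with HDFS support is available on the nodes; the host and port come from my local setup above and are only placeholders for the EMR cluster:

import lxml.etree as ET
from pyarrow import fs

# Connect to the HDFS NameNode (placeholder host/port from my local setup).
hdfs = fs.HadoopFileSystem("localhost", 9000)

# open_input_stream returns a file-like object, which iterparse can read from.
with hdfs.open_input_stream("/user/hadoop/history.xml") as f:
    for event, elem in ET.iterparse(f, tag='page'):
        elem.clear()

Would this be the right approach, or is there a more idiomatic way to feed an HDFS file to iterparse?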