I want to load the first 10 XML files in each iteration from a directory containing 100 files, and move each XML file that has already been read to another directory.
What I have tried so far in PySpark:
li = ["/mnt/dev/tmp/xml/100_file/M800143.xml", "/mnt/dev/tmp/xml/100_file/M8001422.xml"]
df1 = spark.read.format("com.databricks.spark.xml").option("rowTag", "Quality").load(li)
df1.show()
But I am getting an error: IllegalArgumentException: 'path' must be specified for XML data.
Is there any way to read the files after storing their full paths in a list? Or please suggest another approach.
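For reference, here is a rough sketch of one approach I have seen suggested: since the spark-xml reader can reject a Python list, the paths can be joined into a single comma-separated string before calling load, and the already-read files moved with standard library tools. Everything below is an untested assumption on my side: the helper names (next_batch, move_batch), the batch size, and the demo directories are all hypothetical, and on Databricks mounts you would likely use dbutils.fs.mv instead of shutil.

```python
import shutil
import tempfile
from pathlib import Path

BATCH_SIZE = 10  # assumed batch size from the question (10 files per iteration)

def next_batch(src_dir, batch_size=BATCH_SIZE):
    """Return up to batch_size XML file paths from src_dir, sorted by name."""
    return sorted(Path(src_dir).glob("*.xml"))[:batch_size]

def move_batch(files, dest_dir):
    """Move already-read files out of the source directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.move(str(f), str(dest / f.name))

# Demo with throwaway directories standing in for the real /mnt paths:
src = Path(tempfile.mkdtemp())
done = Path(tempfile.mkdtemp())
for i in range(25):
    (src / f"M{i:07d}.xml").write_text("<Quality/>")

batch = next_batch(src)  # first 10 files by name
# Hypothetical Spark call, mirroring the question's reader options;
# joining the paths into one string avoids passing a list to load():
# df1 = (spark.read.format("com.databricks.spark.xml")
#          .option("rowTag", "Quality")
#          .load(",".join(str(p) for p in batch)))
move_batch(batch, done)
print(len(list(src.glob("*.xml"))), len(list(done.glob("*.xml"))))  # 15 10
```

The sorted-then-slice step is what gives a stable "first 10" per iteration; after the move, the next call to next_batch naturally picks up the following 10 files.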