Questions tagged [apache-spark-xml]
81 questions
2
votes
1 answer
javax.xml.stream.XMLStreamException: Trying to output second root Spark-XML Spark Program
I am trying to run this small spark-xml example and it fails with an exception when I do a spark-submit.
Sample REPO : https://github.com/punithmailme/spark-xml-new
command : ./dse spark-submit --class MainDriver…

Punith Raj
- 2,164
- 3
- 27
- 45
2
votes
1 answer
Why does spark-xml fail with NoSuchMethodError with Spark 2.0.0 dependency?
Hi, I am new to Scala and IntelliJ, and I am just trying to do this in Scala:
import org.apache.spark
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml.XmlReader
object SparkSample {
def main(args: Array[String]): Unit = {
…

Solo
- 193
- 2
- 12
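A NoSuchMethodError at runtime usually signals a binary mismatch: the spark-xml artifact was built against a different Spark or Scala version than the one on the classpath. As a sketch only (the version numbers below are illustrative; check the spark-xml releases for the build matching your Spark/Scala pair), the sbt dependencies for Spark 2.0.0 on Scala 2.11 might look like:

```scala
// build.sbt -- illustrative versions; pick the spark-xml release built for your Spark/Scala pair
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
  "com.databricks"   %% "spark-xml"  % "0.4.1"
)
```

The `%%` operator appends the Scala binary version to the artifact name, which is what keeps spark-xml and Spark on the same Scala build.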
1
vote
0 answers
Unable to load xml files using spark-xml
Can someone please help me understand why I'm not able to load my XML file from S3 using spark-xml?
I have downloaded the spark-xml jar file and added its path under "Dependent JARs path" in the AWS Glue job details. What I added…

AJR
- 569
- 3
- 12
- 30
1
vote
0 answers
Pyspark Dataframe - String Column having xml Data
I have a PySpark dataframe with a string column named "xml", but the column has nested XML data inside it.
This is the dataframe.
df = spark.createDataFrame([['

Rohan Kapoor
- 23
- 5
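In Spark the usual tool for a question like this is spark-xml's from_xml on the string column, paired with a schema. As a language-neutral illustration of the parsing step only (the record shape and element names below are hypothetical, not from the question), plain Python's xml.etree shows what a parser has to recover from such a nested string:

```python
import xml.etree.ElementTree as ET

# A hypothetical value of the "xml" string column, with a nested element.
record = "<order><id>42</id><customer><name>Ada</name></customer></order>"

root = ET.fromstring(record)               # parse the string into an element tree
order_id = root.findtext("id")             # simple child element
customer = root.findtext("customer/name")  # nested element reached via a path

print(order_id, customer)  # -> 42 Ada
```

In PySpark, the same nesting shows up as a StructType inside the schema passed to from_xml, with the nested field addressed as `parsed.customer.name`.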
1
vote
1 answer
Corrupt record while reading an XML file using PySpark
I am trying to read an XML file into a dataframe in PySpark.
Code:
df_xml=spark.read.format("com.databricks.spark.xml").option("rootTag","dataset").option("rowTag","AUTHOR").load(FilePath)
When I display the dataframe, it shows a single column…

Ankit Tyagi
- 175
- 2
- 17
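A dataframe that comes back as a single corrupt-record column often means the rowTag does not match any element in the file, and tag matching is case-sensitive. As a minimal stand-alone check, with a hypothetical document shaped like the one in the question (a dataset root wrapping AUTHOR rows), plain Python can show how many row elements a given tag would actually select:

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the question: <dataset> wrapping <AUTHOR> rows.
doc = """<dataset>
  <AUTHOR><name>Jane</name></AUTHOR>
  <AUTHOR><name>Ravi</name></AUTHOR>
</dataset>"""

root = ET.fromstring(doc)
# Tag names are case-sensitive, as they are for spark-xml's rowTag option.
print(len(root.findall("AUTHOR")))  # -> 2: this tag selects both rows
print(len(root.findall("author")))  # -> 0: a mismatched rowTag selects nothing
```

Checking the actual file's element names against the rowTag value (including case) is a quick first step before digging into schema issues.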
1
vote
0 answers
Parsing XML processing instructions in PySpark
I am trying to parse an XML file that has processing instructions, using Databricks spark-xml.
Example XML
Spark Tutorial
Spark Tutorial for…

Mithu Tokder
- 11
- 1
1
vote
0 answers
Spark-XML; Reading from S3 using explicit schema on read. Problem with Array Type in XML
I am attempting to use the spark-xml library through the Scala Spark API (https://github.com/databricks/spark-xml) in order to read a large number of XML files from S3.
The schema across the XML files in S3 differs such that simply reading them all at…

swaythecat
- 11
- 2
1
vote
0 answers
Spyder IDE with PySpark keeps logging everything despite turning INFO off to WARN
I followed this SO link to turn off log4j INFO logging, but I still see a huge volume of logs in my Spyder IDE's console.
I don't want to see even the warnings; I want to show just the errors generated by my script.
=> I am launching spyder from anaconda…

deathrace
- 908
- 4
- 22
- 48
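One knob that reliably quiets the console regardless of the IDE is Spark's own log4j configuration: copy `conf/log4j.properties.template` to `conf/log4j.properties` under SPARK_HOME and raise the root level. A minimal fragment:

```properties
# $SPARK_HOME/conf/log4j.properties -- show only errors on the console
log4j.rootCategory=ERROR, console
```

At runtime the same effect is available per session via `sc.setLogLevel("ERROR")`, though that only takes effect after the SparkContext exists, so startup messages still appear.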
1
vote
1 answer
(spark-xml) Receiving only null when parsing xml column using from_xml function
I'm trying to parse a very simple XML string column using spark-xml, but I only manage to receive null values, even when the XML is correctly populated.
The XSD that I'm using to parse the XML is:

Alejandro Arévalo
- 301
- 2
- 11
1
vote
1 answer
How to parse XML with XSD using spark-xml package?
I am trying to parse simple XML by supplying an XSD schema, using the approach given here:
https://github.com/databricks/spark-xml#xsd-support
XML is here:
My Readers
Chaitanya
A…

ungalVicky
- 53
- 1
- 9
1
vote
0 answers
Load only the first few XML files (e.g. 10) from a directory containing 100 files into a PySpark dataframe
I want to load the first 10 XML files in each iteration from a directory containing 100 files, and move each XML file that has already been read to another directory.
What I have tried so far in PySpark:
li =…

sizo_abe
- 411
- 1
- 3
- 13
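Independent of Spark, the batching itself can be done with the standard library: take the first 10 names in sorted order, hand them to the reader (in PySpark that would be a `spark.read.format("xml")` call over the returned paths), then move them to a processed directory so the next iteration sees only unread files. A sketch with hypothetical directory names:

```python
import os
import shutil

def take_batch(src_dir: str, done_dir: str, batch_size: int = 10) -> list:
    """Pick the next batch of XML files, move them to done_dir, return new paths."""
    os.makedirs(done_dir, exist_ok=True)
    # Deterministic order, so repeated runs walk the directory front to back.
    xml_files = sorted(f for f in os.listdir(src_dir) if f.endswith(".xml"))
    moved = []
    for name in xml_files[:batch_size]:
        dest = os.path.join(done_dir, name)
        shutil.move(os.path.join(src_dir, name), dest)
        moved.append(dest)
    return moved
```

Moving the batch out of the source directory is what makes each iteration see a fresh set of at most ten files.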
1
vote
1 answer
Load XML file to dataframe in PySpark using DBR 7.3.x+
I'm trying to load an XML file into a dataframe using PySpark in a Databricks notebook.
df = spark.read.format("xml").options(
rowTag="product" , mode="PERMISSIVE", columnNameOfCorruptRecord="error_record"
).load(filePath)
On doing so, I get…

Aman Sehgal
- 546
- 4
- 13
1
vote
2 answers
How to access array type value and set in two different columns spark?
I am learning Spark. I have the XML below, from which I want to read two values and create two different columns:
8.52544
8.52537
…
happy
- 2,550
- 17
- 64
- 109
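Once spark-xml reads the repeated element as an array column, the usual move is `col("value").getItem(0)` and `getItem(1)`, each aliased to its own column. The underlying idea, shown stand-alone in plain Python with the two numbers from the question (the surrounding element names are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the two repeated <value> elements in the question.
doc = "<position><value>8.52544</value><value>8.52537</value></position>"

# Repeated elements with the same tag arrive as one list/array.
values = [float(v.text) for v in ET.fromstring(doc).findall("value")]

# Two columns are just index 0 and index 1 of that array, which is
# exactly what getItem(0)/getItem(1) does on the Spark array column.
col1, col2 = values[0], values[1]
print(col1, col2)  # -> 8.52544 8.52537
```

In Spark the aliased select would produce the two named columns in one pass over the dataframe.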
1
vote
1 answer
File read from ADLS Gen2 Error - Configuration property xxx.dfs.core.windows.net not found
I am using ADLS Gen2 and, from a Databricks notebook, am trying to process a file using an 'abfss' path.
I am able to read Parquet files just fine, but when I try to load the XML files, I get the error that the configuration is not found - Configuration…

Satya Azure
- 459
- 7
- 22
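A plausible explanation, given that Parquet works but XML does not: a session-level `spark.conf.set` is visible to the built-in readers, but spark-xml goes through the Hadoop filesystem API, which reads the Hadoop configuration rather than the session conf. One common workaround is to set the key in the cluster's Spark config, where the `spark.hadoop.` prefix copies the entry into the Hadoop configuration (storage account name and key below are placeholders):

```properties
# Cluster Spark config -- "spark.hadoop." prefix propagates into the Hadoop conf
spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net <access-key>
```

Whether this applies depends on the runtime and library versions, so it is an assumption worth verifying against the spark-xml and Databricks documentation for your DBR release.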
1
vote
1 answer
Install com.databricks.spark.xml on an EMR cluster
Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster?
I succeeded in connecting to the EMR master but don't know how to install packages on the cluster.
code
sc.install_pypi_package("com.databricks.spark.xml")

salsa
- 33
- 1
- 7
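The call in the question cannot work as written: `sc.install_pypi_package` installs Python packages from PyPI, while spark-xml is a JVM (Maven) package. In an EMR notebook the usual route is to have Spark pull the Maven coordinates at session start, e.g. via the `%%configure` magic in the first cell (coordinates are illustrative; match the Scala version of the cluster's Spark build):

```json
%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.12:0.14.0" } }
```

The same coordinates can be passed to `spark-submit --packages` when running jobs directly on the cluster instead of through a notebook.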