Questions tagged [apache-spark-xml]
81 questions
0
votes
1 answer
spark-xml problem with encoding windows-1251
I have a problem parsing an XML document in PySpark using the spark-xml API (pyspark 2.4.0). I have a file with Cyrillic content and the following opening tag:
So when I try to open it with some text…

Владислав Черкасов
- 33
- 4
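A minimal sketch of the usual fix, assuming the file really is encoded as windows-1251: spark-xml exposes a charset read option (default UTF-8). The rowTag element name and path below are hypothetical.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "row")            # hypothetical row element name
      .option("charset", "windows-1251")  # decode the raw bytes as windows-1251
      .load("/path/to/file.xml"))         # assumes an active SparkSession `spark`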
0
votes
1 answer
Load XML file to dataframe in PySpark using 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.types import _parse_datatype_json_string
def ext_from_xml(xml_column, schema, options={}):
java_column = _to_java_column(xml_column.cast('string'))
java_schema =…

Ujjal
- 11
- 2
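The wrapper in the excerpt is cut off; the version that commonly circulates looks like the sketch below. It is a community pattern, not an official API: it reaches through the py4j gateway to call spark-xml's from_xml, so the spark-xml jar must be installed on the cluster.
from pyspark.sql.column import Column, _to_java_column

def ext_from_xml(xml_column, schema, options={}):
    # Convert the PySpark column and schema to their JVM counterparts,
    # then invoke com.databricks.spark.xml.functions.from_xml directly.
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema = spark._jsparkSession.parseDataType(schema.json())
    scala_map = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    jc = spark._jvm.com.databricks.spark.xml.functions.from_xml(
        java_column, java_schema, scala_map)
    return Column(jc)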
0
votes
0 answers
How to split an XML file into multiple XML files based on a tag
I have one large XML file that looks like the following. I would like to split it into multiple XML files/chunks based on a tag, with each output file containing 1000 PRVDR elements. What is the best way to do this in pyspark? So,…

AJR
- 569
- 3
- 12
- 30
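A hedged sketch of one pyspark approach: read each PRVDR element as a row with spark-xml, assign every block of 1,000 rows a chunk id, and write each chunk as its own XML output. The rootTag name and paths are hypothetical, and an active SparkSession `spark` is assumed.
from pyspark.sql import functions as F, Window

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "PRVDR")
      .load("/path/large.xml"))

# Number the rows (this window pulls everything into one partition,
# which is acceptable for a one-off split job) and bucket by 1,000.
w = Window.orderBy(F.monotonically_increasing_id())
chunked = df.withColumn("chunk", ((F.row_number().over(w) - 1) / 1000).cast("int"))

n_chunks = chunked.agg(F.max("chunk")).first()[0] + 1
for i in range(n_chunks):
    (chunked.filter(F.col("chunk") == i).drop("chunk")
     .write.format("com.databricks.spark.xml")
     .option("rowTag", "PRVDR")
     .option("rootTag", "PRVDRS")      # hypothetical wrapper element
     .save(f"/path/chunks/part_{i}"))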
0
votes
1 answer
pyspark: org.xml.sax.SAXParseException Current config of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000
I am trying to parse XML files against an XSD using the spark-xml library in pyspark.
Below is the code:
xml_df = spark.read.format("com.databricks.spark.xml") \
.option("rootTag", "Document") \
.option("rowTag", "row01") \
…

newbee123
- 21
- 1
- 2
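The 5,000 cap is a JDK secure-processing limit on maxOccurs rather than a spark-xml setting, so one workaround (an assumption worth verifying on your cluster) is raising jdk.xml.maxOccurLimit on the driver and executor JVMs; 0 removes the limit.
from pyspark.sql import SparkSession

# Note: driver-side extraJavaOptions only take effect if set before the
# JVM starts, e.g. via spark-submit --conf or spark-defaults.conf.
spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", "-Djdk.xml.maxOccurLimit=0")
         .config("spark.executor.extraJavaOptions", "-Djdk.xml.maxOccurLimit=0")
         .getOrCreate())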
0
votes
1 answer
spark-xml: Crashing out of memory trying to parse single large XML file
I'm attempting to process bz2-compressed XML files with a nested XML schema into normalized tables, where each level of the schema is stored as a row and any child elements are stored as rows in a separate table with a foreign key relating back to…

Rimer
- 2,054
- 6
- 28
- 43
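One point worth noting, sketched under the assumption that the schema has a small repeating element below the root: spark-xml materializes one whole rowTag element per row, so memory scales with the largest single element rather than with the file size. Pointing rowTag at a deeper element keeps rows bounded; "record" is a hypothetical name.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")    # hypothetical: a small, deeply nested element
      .load("/path/big.xml.bz2"))    # compressed input is decompressed transparently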
0
votes
1 answer
XML Parsing with Spark-XML
I have an XML file like this:

Ryan
- 33
- 4
0
votes
2 answers
How to install spark-xml library using dbx
I am trying to install the library spark-xml_2.12-0.15.0 using dbx.
The documentation I found says to include it in the conf/deployment.yml file, like:
custom:
basic-cluster-props: &basic-cluster-props
spark_version: "10.4.x-cpu-ml-scala2.12"
…

jalazbe
- 1,801
- 3
- 19
- 40
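The excerpt stops before the library section; below is a sketch of how a Maven dependency typically attaches to a workflow in a dbx deployment file (the exact layout varies by dbx version). Only the Maven coordinates come from the question; the workflow name is hypothetical.
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

environments:
  default:
    workflows:
      - name: "my-workflow"          # hypothetical
        libraries:
          - maven:
              coordinates: "com.databricks:spark-xml_2.12:0.15.0"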
0
votes
1 answer
Spark xpath function to return null if no value present for an attribute
I am using Spark's xpath function to get attribute values from an XML string. The xpath returns an array of values from the XML tag. If a tag contains multiple rows and one of those rows has a null attribute, the xpath function is…

shaz_nwaz
- 11
- 2
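A small reproduction of the behavior being described, assuming an active SparkSession: Spark's xpath function simply omits non-matching nodes, so a missing attribute shortens the array instead of producing a null, and positions stop lining up. One common workaround is parsing with from_xml and an explicit schema, where absent attributes do surface as nulls.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('<rows><row attr="a"/><row/><row attr="c"/></rows>',)], ['xml'])

# Yields ["a", "c"], not ["a", null, "c"]: the attribute-less row is
# dropped from the result entirely.
df.select(F.expr("xpath(xml, '/rows/row/@attr')").alias("attrs")).show(truncate=False)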
0
votes
1 answer
How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark
I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:…

oneDerer
- 287
- 3
- 10
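If Spark really must be avoided, one option is to download the bytes with the ADLS Gen2 SDK and parse them with the standard library. A sketch assuming the azure-storage-file-datalake package; the account, container, and path are hypothetical.
import xml.etree.ElementTree as ET
from azure.storage.filedatalake import DataLakeServiceClient

client = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # hypothetical
    credential="<account-key>")
file_client = (client.get_file_system_client("mycontainer")
               .get_file_client("path/to/file.xml"))

# Parse locally, sidestepping Spark's schema inference (and therefore
# the duplicate-column AnalysisException) entirely.
root = ET.fromstring(file_client.download_file().readall())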
0
votes
1 answer
Getting empty dataframe on parsing XML with XSD using spark-xml package
I am trying to parse simple XML by supplying an XSD schema, using the approach given here:
https://github.com/databricks/spark-xml#xsd-support
XML is here:
aa
bb
cc
dd
XSD is…

Keds
- 1
- 1
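A frequent cause of an empty dataframe that is independent of the XSD: spark-xml's rowTag defaults to "ROW", and a rowTag that matches nothing yields zero rows silently. A minimal sketch with hypothetical names:
df = (spark.read.format("com.databricks.spark.xml")
      .schema(schema_from_xsd)      # hypothetical: schema derived from the XSD
      .option("rowTag", "record")   # must match the repeating element exactly
      .load("/path/data.xml"))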
0
votes
1 answer
Migrating Apache Spark XML from 2.11 to 2.12 gives the below warning. How to use the XmlReader directly?
Code:
val xmlDf: DataFrame = spark.read
.format("xml")
.option("nullValue", "")
.xml(df.select("payload").map(x => x.getString(0)))
warning: method xml in class XmlDataFrameReader is deprecated (since 0.13.0): Use XmlReader…

Vikram Pawar
- 25
- 8
0
votes
0 answers
Explode simple XML file in pyspark (not using Databricks)
I have an XML file which is given below:
-
Cake
0.55
Regular
…

SIDDHARTHA SUMAN
- 11
- 2
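Without the spark-xml package, one sketch is to parse the file with the standard library and hand the extracted rows to Spark. The element names ("food", "name", "price") are hypothetical, guessed from the Cake/0.55 values in the excerpt; an active SparkSession `spark` is assumed.
import xml.etree.ElementTree as ET

tree = ET.parse("/path/menu.xml")   # hypothetical path
rows = [(item.findtext("name"), float(item.findtext("price")))
        for item in tree.getroot().iter("food")]

df = spark.createDataFrame(rows, ["name", "price"])
df.show()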
0
votes
0 answers
How to write Spark XML output with orderBy?
I am trying to write an XML file from my dataframe like below:
myDf.orderBy("name")
.repartition(1).write
.format("com.databricks.spark.xml)
.option("rootTag","colname")
.option("rowTag","colname2")
.save("filename")
This is writing a file but not…

Karthik
- 1
- 1
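A likely explanation, with a sketch of the fix: repartition() shuffles, which discards the ordering established by orderBy(). Sorting again after collapsing to a single partition keeps the rows in order when the file is written.
(myDf.repartition(1)
     .sortWithinPartitions("name")   # re-sort *after* the shuffle
     .write
     .format("com.databricks.spark.xml")
     .option("rootTag", "colname")
     .option("rowTag", "colname2")
     .save("filename"))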
0
votes
0 answers
Unable to insert the parsed xml data into delta tables in spark with a changing input schema
I am trying to insert data from a dataframe into a Delta table. Initially, I am parsing an XML file based on a target schema and saving the result into a dataframe. Below is the code used for parsing.
def parseAsset (nodeSeqXml: scala.xml.NodeSeq) :…

SanjanaSanju
- 261
- 2
- 18
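For the changing-schema part, one common approach (an assumption, since the question's code is cut off) is Delta's schema evolution on write, which lets the table absorb new columns as they appear in the parsed dataframe.
(parsed_df.write.format("delta")    # parsed_df: hypothetical result of the XML parse
 .mode("append")
 .option("mergeSchema", "true")     # evolve the table schema on append
 .saveAsTable("assets"))            # hypothetical table name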
0
votes
1 answer
Exploding multiple array columns in spark for a changing input schema
Below is my sample schema.
|-- provider: string (nullable = true)
|-- product: string (nullable = true)
|-- asset_name: string (nullable = true)
|-- description: string (nullable = true)
|-- creation_date: string (nullable = true)
|--…

SanjanaSanju
- 261
- 2
- 18
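One common pattern for exploding several array columns in step (a sketch; the array column names are hypothetical): zip them with arrays_zip and explode once, so corresponding elements stay on the same row.
from pyspark.sql import functions as F

exploded = (df
    .withColumn("z", F.explode(F.arrays_zip("arr1", "arr2")))
    .select("provider", "product", "asset_name",
            F.col("z.arr1").alias("arr1"),
            F.col("z.arr2").alias("arr2")))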