Questions tagged [apache-spark-xml]

81 questions
0 votes · 1 answer

spark-xml problem with encoding windows-1251

I have a problem with parsing an XML document in pyspark using the spark-xml API (pyspark 2.4.0). I have a file with Cyrillic content with the following opening tag: So when I try to open it with some text…
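A minimal sketch of the usual fix, assuming the file's prolog declares windows-1251: spark-xml decodes UTF-8 by default, so the codepage has to be passed explicitly through the charset option (the rowTag and path below are placeholders).

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "record")            # placeholder row element
    .option("charset", "windows-1251")     # match the declared encoding
    .load("/data/cyrillic.xml")
)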
0 votes · 1 answer

Load XML file to dataframe in PySpark using 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.types import _parse_datatype_json_string

def ext_from_xml(xml_column, schema, options={}):
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema =…
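The snippet above appears to be the PySpark interop wrapper circulated in the spark-xml README. A completed version, assuming the spark-xml JAR is already attached to the cluster and spark is the active session:

from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.types import _parse_datatype_json_string  # used by the README's companion schema helper

def ext_from_xml(xml_column, schema, options={}):
    # Cast to string, hand the column to the JVM, and call spark-xml's from_xml.
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema = spark._jsparkSession.parseDataType(schema.json())
    scala_map = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    jc = spark._jvm.com.databricks.spark.xml.functions.from_xml(
        java_column, java_schema, scala_map)
    return Column(jc)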
0 votes · 0 answers

How to split an XML file into multiple XML files based on a tag

I have one large XML file that looks like the following. I would like to split this large XML file into multiple XML files/chunks based on a tag, with each output file holding 1,000 PRVDR elements. What is the best way to do this in pyspark? So,…
AJR · 569 · 3 · 12 · 30
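A single-machine sketch for the split (not a distributed job): stream the file with the standard library's iterparse and start a new output file every 1,000 PRVDR elements. The wrapper root tag and output naming are assumptions.

import xml.etree.ElementTree as ET

def split_xml(path, row_tag="PRVDR", chunk_size=1000, out_prefix="chunk"):
    buffer, file_no = [], 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == row_tag:
            buffer.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the element so the whole file never sits in memory
            if len(buffer) == chunk_size:
                write_chunk(buffer, out_prefix, file_no)
                buffer, file_no = [], file_no + 1
    if buffer:
        write_chunk(buffer, out_prefix, file_no)

def write_chunk(rows, prefix, n):
    with open(f"{prefix}_{n:05d}.xml", "w", encoding="utf-8") as f:
        f.write("<root>\n" + "\n".join(rows) + "\n</root>")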
0 votes · 1 answer

pyspark: org.xml.sax.SAXParseException Current config of the parser doesn't allow a maxOccurs attribute value to be set greater than the value 5,000

I am trying to parse XML files with an XSD using the spark-xml library in pyspark. Below is the code:

xml_df = spark.read.format("com.databricks.spark.xml") \
    .option("rootTag", "Document") \
    .option("rowTag", "row01") \
    …
newbee123 · 21 · 1 · 2
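The 5,000 cap comes from the JDK's JAXP secure-processing limit rather than from spark-xml itself. A hedged sketch of one workaround, assuming the standard jdk.xml.maxOccurLimit property (a value of 0 or less disables the limit) and that it must reach both the driver and executor JVMs:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Djdk.xml.maxOccurLimit=0")
    .config("spark.executor.extraJavaOptions", "-Djdk.xml.maxOccurLimit=0")
    .getOrCreate()
)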
0 votes · 1 answer

spark-xml: Crashing out of memory trying to parse single large XML file

I'm attempting to process bz2 compressed XML files with a nested XML schema into normalized tables where each level of the schema is stored as a row, and any child elements are stored as rows in a separate table with a foreign key relating back to…
Rimer · 2,054 · 6 · 28 · 43
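One hedged observation for this scenario: bzip2 is a splittable codec, but every element matched by rowTag still has to fit in a single task's memory, so the usual first fix is pointing rowTag at a deeper, smaller element and spreading the parsed rows before any heavy transformation. The tag and path names below are placeholders.

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "childRecord")   # a smaller unit than the document root
    .load("/data/big_file.xml.bz2")
    .repartition(200)                  # fan out before the normalization joins
)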
0 votes · 1 answer

XML Parsing with Spark-XML

I have an XML like this:
Ryan · 33 · 4
0 votes · 2 answers

How to install spark-xml library using dbx

I am trying to install the library spark-xml_2.12-0.15.0 using dbx. The documentation I found says to include it in the conf/deployment.yml file like:

custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"
    …
jalazbe · 1,801 · 3 · 19 · 40
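A hedged sketch of the usual placement, assuming dbx's Jobs-API-style deployment file: Maven coordinates are attached under the workflow's libraries key rather than the shared cluster props. The environment and job names below are placeholders.

environments:
  default:
    workflows:
      - name: "my-xml-job"
        libraries:
          - maven:
              coordinates: "com.databricks:spark-xml_2.12:0.15.0"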
0 votes · 1 answer

Spark xpath function to return null if no value present for an attribute

I am using Spark's xpath function to get attribute values from an XML string. The xpath function returns an array of values from the XML tag. If there are multiple rows present in a tag, with one of the rows having a null attribute, the xpath function is…
shaz_nwaz · 11 · 2
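A hedged illustration of why the positions shift: xpath() silently drops non-matching nodes, so the returned array is shorter than the number of rows. The element and attribute names here are invented for the example; the positional workaround assumes Spark's built-in xpath_int/xpath_string functions.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('<root><row a="1"/><row/><row a="3"/></root>',)], ["xml"])

# Returns ["1", "3"]: the <row> without @a disappears, misaligning positions.
df.select(F.expr("xpath(xml, '/root/row/@a')")).show(truncate=False)

# Workaround: address each row by index so a missing attribute comes back
# empty rather than vanishing, then null it out with nullif.
df.select(F.expr(
    "transform(sequence(1, xpath_int(xml, 'count(/root/row)')), "
    "i -> nullif(xpath_string(xml, concat('/root/row[', i, ']/@a')), ''))"
)).show(truncate=False)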
0 votes · 1 answer

How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark

I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error: org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:…
oneDerer · 287 · 3 · 10
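A hedged sketch of the Spark-free route the title asks for: pull the bytes with the azure-storage-file-datalake SDK and parse them with the standard library, sidestepping spark-xml's duplicate-column check entirely. The account URL, container, and path are placeholders, and the notebook's identity needs read access.

import xml.etree.ElementTree as ET
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential=DefaultAzureCredential())
file_client = (service.get_file_system_client("container")
                      .get_file_client("path/to/file.xml"))
root = ET.fromstring(file_client.download_file().readall())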
0 votes · 1 answer

Getting empty dataframe on parsing XML with XSD using spark-xml package

I am trying to parse a simple XML file by supplying an XSD schema, using the approach given here: https://github.com/databricks/spark-xml#xsd-support. XML is here: aa bb cc dd XSD is…
Keds · 1 · 1
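One hedged point worth checking here: per the linked README, the XSD only derives or validates the schema; it does not choose the row element. An empty DataFrame usually means rowTag does not match anything in the file. A sketch with placeholder names:

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "record")  # must name an element that actually occurs in the XML
    .option("rowValidationXSDPath", "/path/schema.xsd")
    .load("/path/data.xml")
)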
0 votes · 1 answer

Migrating Apache Spark XML from 2.11 to 2.12 gives the below warning. How to use the XmlReader directly?

Code:

val xmlDf: DataFrame = spark.read
  .format("xml")
  .option("nullValue", "")
  .xml(df.select("payload").map(x => x.getString(0)))

warning: method xml in class XmlDataFrameReader is deprecated (since 0.13.0): Use XmlReader…
0 votes · 0 answers

Explode simple XML file in pyspark (Not using databricks)

I have an XML file which is given below: Cake 0.55 Regular
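A minimal sketch without the Databricks package, assuming the file is small enough to parse on the driver: read it with the standard library and build the DataFrame by hand. The tag names (name/price/type) are guesses from the values visible in the question.

import xml.etree.ElementTree as ET

root = ET.parse("/path/menu.xml").getroot()
rows = [(i.findtext("name"), i.findtext("price"), i.findtext("type"))
        for i in root]
df = spark.createDataFrame(rows, ["name", "price", "type"])
df.show()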
0 votes · 0 answers

How to write Spark XML output while preserving orderBy?

I am trying to write an XML file from my dataframe like below:

myDf.orderBy("name")
  .repartition(1).write
  .format("com.databricks.spark.xml")
  .option("rootTag", "colname")
  .option("rowTag", "colname2")
  .save("filename")

This is writing a file but not…
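The likely culprit: repartition(1) shuffles and throws away the orderBy. A hedged sketch of one fix, sorting again inside the single partition so spark-xml serializes the rows in order:

(myDf.repartition(1)
     .sortWithinPartitions("name")
     .write
     .format("com.databricks.spark.xml")
     .option("rootTag", "colname")
     .option("rowTag", "colname2")
     .save("filename"))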
0 votes · 0 answers

Unable to insert the parsed XML data into Delta tables in Spark with a changing input schema

I am trying to insert data from a dataframe into a Delta table. Initially, I am parsing an XML file based on a target schema and saving the result into a dataframe. Below is the code used for parsing:

def parseAsset(nodeSeqXml: scala.xml.NodeSeq):…
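For the changing-schema part, a hedged sketch (shown in PySpark rather than the question's Scala): Delta can evolve the table's columns on write with the mergeSchema option. The dataframe name and table path are placeholders.

(parsed_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # add any new columns the latest XML introduced
    .save("/delta/assets"))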
0 votes · 1 answer

Exploding multiple array columns in spark for a changing input schema

Below is my sample schema:
 |-- provider: string (nullable = true)
 |-- product: string (nullable = true)
 |-- asset_name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- creation_date: string (nullable = true)
 |--…
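A hedged sketch for a schema that changes between loads: find the array columns from the DataFrame's own schema and explode each one, rather than hard-coding names. explode_outer keeps rows whose array is null or empty; note that exploding several arrays in one row multiplies the row count.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

def explode_all_arrays(df):
    array_cols = [f.name for f in df.schema.fields
                  if isinstance(f.dataType, ArrayType)]
    for c in array_cols:
        df = df.withColumn(c, F.explode_outer(c))
    return df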