Questions tagged [apache-spark-xml]

81 questions
2 votes · 1 answer

javax.xml.stream.XMLStreamException: Trying to output second root Spark-XML Spark Program

I am trying to run this small spark-xml example and it fails with an exception when I do a spark-submit. Sample repo: https://github.com/punithmailme/spark-xml-new Command: ./dse spark-submit --class MainDriver…
2 votes · 1 answer

Why does spark-xml fail with NoSuchMethodError with Spark 2.0.0 dependency?

Hi, I am new to Scala and IntelliJ, and I am just trying to do this in Scala: import org.apache.spark import org.apache.spark.sql.SQLContext import com.databricks.spark.xml.XmlReader object SparkSample { def main(args: Array[String]): Unit = { …
Solo (193)
1 vote · 0 answers

Unable to load xml files using spark-xml

Can someone please help me understand why I'm not able to load my XML file from S3 using spark-xml? I have downloaded the spark-xml JAR file and added its path in the job details under "Dependent JARs path" in AWS Glue. What I added…
AJR (569)
1 vote · 0 answers

Pyspark Dataframe - String Column having xml Data

I have a PySpark dataframe with a string column named "xml", but the column has nested XML data inside it. This is the dataframe. df = spark.createDataFrame([['
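In Spark, a string column holding nested XML is usually parsed into fields with spark-xml's from_xml. The underlying idea can be sketched with the standard library; the XML value and field names below are made up for illustration, not taken from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical value of a string column that contains nested XML.
xml_value = "<person><name>Ada</name><address><city>London</city></address></person>"

def flatten(elem, prefix=""):
    """Recursively flatten a nested XML element into a flat dict,
    joining nested tag names with dots."""
    out = {}
    for child in elem:
        key = f"{prefix}{child.tag}"
        if len(child):                      # element has nested children
            out.update(flatten(child, key + "."))
        else:
            out[key] = child.text
    return out

row = flatten(ET.fromstring(xml_value))
print(row)  # {'name': 'Ada', 'address.city': 'London'}
```

In Spark itself the equivalent step would be from_xml(col("xml"), schema), which produces a struct column that can then be expanded into separate columns.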
1 vote · 1 answer

corrupt record while reading xml file using pyspark

I am trying to read an XML file into a dataframe in PySpark. Code: df_xml=spark.read.format("com.databricks.spark.xml").option("rootTag","dataset").option("rowTag","AUTHOR").load(FilePath) When I display the dataframe, it shows a single column…
Ankit Tyagi (175)
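A common cause of a single corrupt-record column with spark-xml is a rowTag that does not exactly match the element names in the file (tag names are case-sensitive). That can be checked outside Spark with the standard library; the document below is a made-up sketch of the dataset/AUTHOR layout the question describes:

```python
import xml.etree.ElementTree as ET

# Hypothetical file shaped like the question's <dataset>/<AUTHOR> layout.
doc = """<dataset>
  <AUTHOR><name>Jane</name></AUTHOR>
  <AUTHOR><name>John</name></AUTHOR>
</dataset>"""

root = ET.fromstring(doc)
print(len(root.findall("AUTHOR")))  # 2 -> rowTag="AUTHOR" would yield 2 rows
print(len(root.findall("Author")))  # 0 -> a case mismatch matches no elements,
                                    #      so spark-xml falls back to a corrupt record
```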
1 vote · 0 answers

Parsing XML processing instructions in PySpark

I am trying to parse an XML file that has processing instructions using Databricks spark-xml. Example XML Spark Tutorial Spark Tutorial for…
1 vote · 0 answers

Spark-XML; Reading from S3 using explicit schema on read. Problem with Array Type in XML

I am attempting to use the spark-xml library through the Scala Spark API (https://github.com/databricks/spark-xml) in order to read a large number of XML files from S3. The schema across the XML files in S3 differs, such that simply reading them all at…
swaythecat (11)
1 vote · 0 answers

Spyder IDE with PySpark keeps logging everything in spite of changing the log level from INFO to WARN

I followed this SO link to turn off log4j INFO logging, but I still see a huge volume of logs in my Spyder IDE's console. I don't want to see even these warnings; I want to show just the errors generated by my script. I am launching Spyder from Anaconda…
deathrace (908)
1 vote · 1 answer

(spark-xml) Receiving only null when parsing xml column using from_xml function

I'm trying to parse a very simple XML string column using spark-xml, but I only manage to receive null values, even when the XML is correctly populated. The XSD that I'm using to parse the xml is:
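With from_xml, all-null output typically means the schema's field names do not line up with the element names in the string (case and namespaces included), so every field fails to match. The effect can be illustrated with the standard library; the XML value and tag names here are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML string column value.
xml_value = "<Order><Id>7</Id><Total>19.90</Total></Order>"
elem = ET.fromstring(xml_value)

# Lookups are exact: a case mismatch finds nothing, which is analogous
# to from_xml silently returning null for every field of the struct.
print(elem.findtext("Id"))  # '7'
print(elem.findtext("id"))  # None (wrong case)
```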
1 vote · 1 answer

How to parse XML with XSD using spark-xml package?

I am trying to parse simple XML by supplying an XSD schema, using the approach given here: https://github.com/databricks/spark-xml#xsd-support XML is here: My Readers Chaitanya A…
ungalVicky (53)
1 vote · 0 answers

Load only the first few XML files (e.g. 10) from a directory containing 100 files into a PySpark dataframe

I want to load the first 10 XML files in each iteration from a directory containing 100 files, and move each XML file that has already been read to another directory. Here is what I have tried so far in PySpark. li =…
sizo_abe (411)
1 vote · 1 answer

Load XML file to dataframe in PySpark using DBR 7.3.x+

I'm trying to load an XML file into a dataframe using PySpark in a Databricks notebook. df = spark.read.format("xml").options( rowTag="product" , mode="PERMISSIVE", columnNameOfCorruptRecord="error_record" ).load(filePath) On doing so, I get…
1 vote · 2 answers

How to access array type value and set in two different columns spark?

I am learning Spark. I have the XML below, from which I want to read two values and create two different columns: 8.52544 8.52537
happy (2,550)
1 vote · 1 answer

File read from ADLS Gen2 Error - Configuration property xxx.dfs.core.windows.net not found

I am using ADLS Gen2 from a Databricks notebook, trying to process a file using an 'abfss' path. I am able to read Parquet files just fine, but when I try to load the XML files I get an error that the configuration is not found - Configuration…
1 vote · 1 answer

Install com.databricks.spark.xml on emr cluster

Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster? I managed to connect to the EMR master, but I don't know how to install packages on the cluster. Code: sc.install_pypi_package("com.databricks.spark.xml")
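One thing worth noting on this last question: spark-xml is a JVM (Scala) package, not a Python package, so sc.install_pypi_package cannot install it. The usual route is to pass it as a Spark package at launch time; a sketch, with the version number and script name as examples only:

```shell
# Submit a job with spark-xml on the classpath (version is an example):
spark-submit --packages com.databricks:spark-xml_2.12:0.15.0 my_job.py

# Or start an interactive session on the EMR master with the package loaded:
pyspark --packages com.databricks:spark-xml_2.12:0.15.0
```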