Questions tagged [spark-csv]

A Databricks library for reading and writing CSV files in Apache Spark; its functionality has been built into Spark core since Spark 2.0.

External links:

  • https://github.com/databricks/spark-csv

139 questions
1 vote · 1 answer

How to upload a DataFrame as a stream without saving it to disk?

I want to upload a DataFrame to a server as a gzip-encoded CSV file without saving it to disk. It is easy to build a gzip-encoded CSV file using the spark-csv library: df.write .format("com.databricks.spark.csv") .option("header",…
Makrushin Evgenii
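
A minimal sketch of one way to do this: gzip the rendered CSV in the driver instead of writing to disk. The naive row rendering below does no quoting, and the result is assumed to fit in driver memory:

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPOutputStream

    val buf  = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buf)
    // Render each row as a CSV line (no quoting) and compress on the fly.
    df.rdd.toLocalIterator.foreach { row =>
      gzip.write((row.mkString(",") + "\n").getBytes("UTF-8"))
    }
    gzip.close()
    val csvGzBytes = buf.toByteArray // stream these bytes to the server
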
1 vote · 0 answers

How to prevent the read() method from splitting on a delimiter inside quotation marks in CSV?

For example, if I have a field like "I like, cookies", I do NOT want Spark's read().csv method to split between "like" and "cookies". I want it to be parsed as "I like, cookies". I thought this was common CSV practice, but this is being…
aberlasters
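
For reference, quoted fields survive intact when the reader's quote and escape settings match the data; a minimal sketch assuming Spark 2.x's built-in CSV reader and a placeholder file path:

    val df = spark.read
      .option("header", "true")
      .option("quote", "\"")  // the default quote character
      .option("escape", "\"") // treat "" inside a quoted field as a literal quote
      .csv("path/to/file.csv")
    // A field written as "I like, cookies" now stays one column.
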
1 vote · 0 answers

Incorrect count when reading a CSV file in Spark with the multiLine option set to true

I am facing an issue when reading a CSV file in Spark with the multiLine option set to true. Are there any criteria for when multiLine should be set to true or false? I am using Windows 10, Scala 2.11.11, and Spark 2.2.0. The dataset that I am using to test…
Dwarrior
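
As a rule of thumb, multiLine is only needed when a quoted field itself contains newline characters; without it, every physical line is treated as one record, which changes the count. A minimal sketch (the path is a placeholder):

    val df = spark.read
      .option("header", "true")
      .option("multiLine", "true") // quoted fields may span physical lines
      .csv("path/to/file.csv")
    df.count()
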
1 vote · 1 answer

Read a CSV whose last column is an array of values (inside parentheses and separated by commas) in Spark

I have a CSV file where the last column is inside parentheses and the values are separated by commas. The number of values in the last column is variable. When I read it into a DataFrame with some column names as follows, I get Exception in thread…
Sunil Kumar
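
One common workaround is to read each line as plain text and split the parenthesized tail yourself, since the embedded commas defeat a plain comma-delimited read. A sketch with hypothetical column names:

    import org.apache.spark.sql.functions._

    val lines = spark.read.text("path/to/file.csv")
    val parsed = lines
      // everything before the opening parenthesis
      .withColumn("prefix", regexp_extract(col("value"), "^(.*?)\\(", 1))
      // the variable-length list inside the parentheses, as an array
      .withColumn("values", split(regexp_extract(col("value"), "\\((.*)\\)", 1), ","))
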
1 vote · 0 answers

spark-csv fails to parse a file with embedded HTML and quotes

I have a CSV file which contains descriptions of several cities: Cities_information_extract.csv. I can parse this file just fine using Python's pandas.read_csv or R's read.csv; both return 693 rows and 25 columns. I am trying,…
revy
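
Two spark-csv options often help with embedded quotes: switching to the univocity parser and declaring the escape character. A sketch, not a guaranteed fix for this particular file:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("parserLib", "univocity") // more tolerant alternative parser
      .option("escape", "\"")
      .load("Cities_information_extract.csv")
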
1 vote · 0 answers

Getting java.lang.NumberFormatException on a DataFrame created by spark-csv

When I read the CSV file with spark-csv and inferSchema=true, I am able to get the count on the DataFrame (df.count). But after I removed the spaces in the column names, created a new schema, and created a new DataFrame with the help of the first…
A srinivas
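
A sketch of one way to sidestep the exception: rename the inferred columns in place rather than re-reading the file with a hand-built schema whose types may not match what was parsed:

    val cleaned = df.columns.foldLeft(df) { (acc, name) =>
      acc.withColumnRenamed(name, name.replaceAll("\\s+", ""))
    }
    cleaned.count() // the data and inferred types are untouched
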
1 vote · 0 answers

How to convert a Spark Streaming Dataset[String] to DataFrame[Row]

I have non-standard Kafka message formats, so the code looks like the following: val df: Dataset[String] = spark .readStream .format("kafka") .option("subscribe", topic) .options(kafkaParams) .load() .select($"value".as[Array[Byte]])…
Julias
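
A minimal sketch of one approach: split each message on its delimiter and project named columns. The delimiter and column names here are assumptions about the custom format:

    import org.apache.spark.sql.functions._

    val parsed = df // Dataset[String]; its single column is named "value"
      .withColumn("parts", split(col("value"), ","))
      .select(
        col("parts").getItem(0).as("field1"),
        col("parts").getItem(1).as("field2"))
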
1 vote · 2 answers

Spark: Error While Writing a DataFrame to CSV

I'm trying to write a DataFrame to a *.csv file on HDFS using Databricks' spark-csv_2.10 dependency. The dependency seems to work fine, as I'm able to read a .csv file into a DataFrame. But when I perform a write, I get the following error. The…
Amber
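
For comparison, the basic spark-csv 1.x write call looks like this (a sketch; the HDFS path is a placeholder):

    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///user/output/result")
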
1 vote · 0 answers

Write a Spark DataFrame with an array data type to a CSV file

I am trying to write a Spark DataFrame with an array-of-string column to a CSV file. I followed the instructions provided in the site here, but my column also contains nulls. How can I handle the nulls and write the DataFrame to a file?
Arjun
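
CSV has no array type, so one common approach is to map nulls to an empty string and join the array elements into one delimited string before writing. A sketch; the column name "tags" and the "|" separator are assumptions:

    import org.apache.spark.sql.functions._

    val out = df.withColumn("tags",
      when(col("tags").isNull, lit(""))          // nulls become empty strings
        .otherwise(concat_ws("|", col("tags")))) // arrays become "a|b|c"
    out.write.option("header", "true").csv("output/path")
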
1 vote · 0 answers

spark-xml: root tag is generated in every part file

So I am trying to generate an XML of the structure below. 234 34 234 34 Now I…
Punith Raj
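
spark-xml writes a complete document, root tag included, into every part file, so coalescing to a single partition is the usual way to get one well-formed file. A sketch with placeholder tag names:

    df.coalesce(1).write
      .format("com.databricks.spark.xml")
      .option("rootTag", "records") // written once per part file
      .option("rowTag", "record")
      .save("output/path")
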
1 vote · 2 answers

CSV custom schema in Spark

I have a CSV file: 1577,true,false,false,false,true. I tried to load the CSV file with a custom schema: val customSchema = StructType(Array( StructField("id", StringType, nullable = false), StructField("flag1", BooleanType, nullable =…
John
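
A sketch of the complete schema for the sample row 1577,true,false,false,false,true, assuming Spark 2.x's built-in reader; the field names beyond the excerpt are assumptions:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("id",    StringType,  nullable = false),
      StructField("flag1", BooleanType, nullable = false),
      StructField("flag2", BooleanType, nullable = false),
      StructField("flag3", BooleanType, nullable = false),
      StructField("flag4", BooleanType, nullable = false),
      StructField("flag5", BooleanType, nullable = false)))

    val df = spark.read.schema(customSchema).csv("path/to/file.csv")
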
1 vote · 0 answers

Spark CSV: specify the newline character

I am writing a DataFrame using the spark-csv library on Spark 1.6. I was wondering if there is a way to specify the newline character. Usually, I think, it is \n. If not, is there a good solution for changing the newline character?…
Defcon
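
spark-csv 1.x exposes no option for the record separator, so one workaround is to render the rows yourself and append "\r"; saveAsTextFile adds the trailing "\n", yielding "\r\n" line endings. A sketch that does no quoting:

    df.rdd
      .map(_.mkString(",") + "\r") // "\r" + the "\n" saveAsTextFile adds = "\r\n"
      .saveAsTextFile("output/path")
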
1 vote · 1 answer

Spark: java.io.FileNotFoundException: File does not exist in copyMerge

I am trying to merge all Spark output part files in a directory and create a single file in Scala. Here is my code: import org.apache.spark.sql.functions.input_file_name import org.apache.spark.sql.functions.regexp_extract def merge(srcPath:…
Sudarshan kumar
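
For reference, the Hadoop 2.x copyMerge pattern the question builds on; a common cause of this FileNotFoundException is the source directory not existing when merge is called. A sketch (copyMerge was removed in Hadoop 3):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    def merge(srcPath: String, dstPath: String): Unit = {
      val conf = new Configuration()
      val fs   = FileSystem.get(conf)
      FileUtil.copyMerge(fs, new Path(srcPath), fs, new Path(dstPath),
        false, // keep the source part files after merging
        conf, null)
    }
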
1 vote · 0 answers

Shuffle read and write make the Spark job finish very slowly

I am doing a join on two DataFrames holding 280 GB and 1 GB of data respectively. The actual Spark job that computes the join is fast, but the shuffle read and write take a very long time, which makes the overall Spark job very slow. I am using m3.2xlarge 10…
user7547751
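
With only 1 GB on the small side, a broadcast join can avoid shuffling the 280 GB side entirely; a minimal sketch where the join key is an assumption and the broadcast threshold may need raising:

    import org.apache.spark.sql.functions.broadcast

    // Ship the small DataFrame to every executor instead of shuffling both sides.
    val joined = bigDf.join(broadcast(smallDf), Seq("join_key"))
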
1 vote · 1 answer

Write records per partition in a Spark DataFrame to an XML file

I have to count the records per partition in a Spark DataFrame and then write the output to an XML file. Here is my DataFrame: dfMainOutputFinalWithoutNull.coalesce(1).write.partitionBy("DataPartition","StatementTypeCode") …
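
A sketch of one way to get there: compute the per-partition counts first, then write that small result with spark-xml; the tag names are placeholders:

    val counts = dfMainOutputFinalWithoutNull
      .groupBy("DataPartition", "StatementTypeCode")
      .count()

    counts.coalesce(1).write
      .format("com.databricks.spark.xml")
      .option("rootTag", "counts")
      .option("rowTag", "partition")
      .save("output/path")
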