Questions tagged [spark-csv]

A Databricks library for reading and writing CSV files in Apache Spark; its functionality has been built into Spark core since Spark 2.0.

External links:

  • https://github.com/databricks/spark-csv

139 questions
1 vote · 1 answer

How to upload a DataFrame as a stream without saving it to disk?

I want to upload a DataFrame to a server as a gzip-encoded CSV file without saving it to disk. It is easy to build a gzip-encoded CSV file using the spark-csv library: df.write .format("com.databricks.spark.csv") .option("header",…
Makrushin Evgenii
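
A minimal sketch of one way to do this: gzip the rendered CSV in the driver instead of writing to disk. The naive row rendering below does no quoting, and the result is assumed to fit in driver memory:

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPOutputStream

    val buf  = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buf)
    // Render each row as a CSV line (no quoting) and compress on the fly.
    df.rdd.toLocalIterator.foreach { row =>
      gzip.write((row.mkString(",") + "\n").getBytes("UTF-8"))
    }
    gzip.close()
    val csvGzBytes = buf.toByteArray // stream these bytes to the server
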
1 vote · 0 answers

How to prevent the read() method from splitting on a delimiter inside quotation marks in CSV?

For example, if I have a field like "I like, cookies", I do NOT want Spark's read().csv method to split between "like" and "cookies". I want it to be parsed as "I like, cookies". I thought this was common CSV practice, but this is being…
aberlasters
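
For reference, quoted fields survive intact when the reader's quote and escape settings match the data; a minimal sketch assuming Spark 2.x's built-in CSV reader and a placeholder file path:

    val df = spark.read
      .option("header", "true")
      .option("quote", "\"")  // the default quote character
      .option("escape", "\"") // treat "" inside a quoted field as a literal quote
      .csv("path/to/file.csv")
    // A field written as "I like, cookies" now stays one column.
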
1 vote · 0 answers

Incorrect count when reading a CSV file in Spark with the multiLine option set to true

I am facing an issue when reading a CSV file in Spark with the multiLine option set to true. Are there any criteria for when multiLine should be set to true or false? I am using Windows 10, Scala 2.11.11, and Spark 2.2.0. The dataset that I am using to test…
Dwarrior
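
As a rule of thumb, multiLine is only needed when a quoted field itself contains newline characters; without it, every physical line is treated as one record, which changes the count. A minimal sketch (the path is a placeholder):

    val df = spark.read
      .option("header", "true")
      .option("multiLine", "true") // quoted fields may span physical lines
      .csv("path/to/file.csv")
    df.count()
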
1 vote · 1 answer

Read a CSV whose last column is an array of values (inside parentheses and separated by commas) in Spark

I have a CSV file where the last column is inside parentheses and the values are separated by commas. The number of values in the last column is variable. When I read it into a DataFrame with some column names as follows, I get Exception in thread…
Sunil Kumar
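
One common workaround is to read each line as plain text and split the parenthesized tail yourself, since the embedded commas defeat a plain comma-delimited read. A sketch with hypothetical column names:

    import org.apache.spark.sql.functions._

    val lines = spark.read.text("path/to/file.csv")
    val parsed = lines
      // everything before the opening parenthesis
      .withColumn("prefix", regexp_extract(col("value"), "^(.*?)\\(", 1))
      // the variable-length list inside the parentheses, as an array
      .withColumn("values", split(regexp_extract(col("value"), "\\((.*)\\)", 1), ","))
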
1 vote · 0 answers

spark-csv fails to parse a file with embedded HTML and quotes

I have a CSV file which contains descriptions of several cities: Cities_information_extract.csv. I can parse this file just fine using Python's pandas.read_csv or R's read.csv; both return 693 rows and 25 columns. I am trying,…
revy
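
Two spark-csv options often help with embedded quotes: switching to the univocity parser and declaring the escape character. A sketch, not a guaranteed fix for this particular file:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("parserLib", "univocity") // more tolerant alternative parser
      .option("escape", "\"")
      .load("Cities_information_extract.csv")
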
1 vote · 0 answers

Getting java.lang.NumberFormatException on a DataFrame created by spark-csv

When I read the CSV file with spark-csv and inferSchema=true, I am able to get the count on the DataFrame (df.count). But after I removed the spaces in the column names, created a new schema, and created a new DataFrame with the help of the first…
A srinivas
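
A sketch of one way to sidestep the exception: rename the inferred columns in place rather than re-reading the file with a hand-built schema whose types may not match what was parsed:

    val cleaned = df.columns.foldLeft(df) { (acc, name) =>
      acc.withColumnRenamed(name, name.replaceAll("\\s+", ""))
    }
    cleaned.count() // the data and inferred types are untouched
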
1 vote · 0 answers

How to convert a Spark Streaming Dataset[String] to DataFrame[Row]

I have non-standard Kafka message formats, so the code looks like the following: val df: Dataset[String] = spark .readStream .format("kafka") .option("subscribe", topic) .options(kafkaParams) .load() .select($"value".as[Array[Byte]])…
Julias
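
A minimal sketch of one approach: split each message on its delimiter and project named columns. The delimiter and column names here are assumptions about the custom format:

    import org.apache.spark.sql.functions._

    val parsed = df // Dataset[String]; its single column is named "value"
      .withColumn("parts", split(col("value"), ","))
      .select(
        col("parts").getItem(0).as("field1"),
        col("parts").getItem(1).as("field2"))
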
1 vote · 2 answers

Spark: Error While Writing a DataFrame to CSV

I'm trying to write a DataFrame to a *.csv file on HDFS using Databricks' spark-csv_2.10 dependency. The dependency seems to work fine, as I'm able to read a .csv file into a DataFrame. But when I perform a write, I get the following error. The…
Amber
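
For comparison, the basic spark-csv 1.x write call looks like this (a sketch; the HDFS path is a placeholder):

    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///user/output/result")
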
1 vote · 0 answers

Write a Spark DataFrame with an array data type to a CSV file

I am trying to write a Spark DataFrame with an array-of-string column to a CSV file. I followed the instructions provided in the site here, but my column also contains nulls. How can I handle the nulls and write the DataFrame to a file?
Arjun
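
CSV has no array type, so one common approach is to map nulls to an empty string and join the array elements into one delimited string before writing. A sketch; the column name "tags" and the "|" separator are assumptions:

    import org.apache.spark.sql.functions._

    val out = df.withColumn("tags",
      when(col("tags").isNull, lit(""))          // nulls become empty strings
        .otherwise(concat_ws("|", col("tags")))) // arrays become "a|b|c"
    out.write.option("header", "true").csv("output/path")
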
1 vote · 0 answers

spark-xml: root tag is generated in every part file

So I am trying to generate an XML of the structure below. 234 34 234 34 Now I…
Punith Raj
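
spark-xml writes a complete document, root tag included, into every part file, so coalescing to a single partition is the usual way to get one well-formed file. A sketch with placeholder tag names:

    df.coalesce(1).write
      .format("com.databricks.spark.xml")
      .option("rootTag", "records") // written once per part file
      .option("rowTag", "record")
      .save("output/path")
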
1 vote · 2 answers

CSV custom schema in Spark

I have a CSV file: 1577,true,false,false,false,true. I tried to load the CSV file with a custom schema: val customSchema = StructType(Array( StructField("id", StringType, nullable = false), StructField("flag1", BooleanType, nullable =…
John
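
A sketch of the complete schema for the sample row 1577,true,false,false,false,true, assuming Spark 2.x's built-in reader; the field names beyond the excerpt are assumptions:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("id",    StringType,  nullable = false),
      StructField("flag1", BooleanType, nullable = false),
      StructField("flag2", BooleanType, nullable = false),
      StructField("flag3", BooleanType, nullable = false),
      StructField("flag4", BooleanType, nullable = false),
      StructField("flag5", BooleanType, nullable = false)))

    val df = spark.read.schema(customSchema).csv("path/to/file.csv")
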
1 vote · 0 answers

Spark CSV: specify the newline character

I am writing a DataFrame using the spark-csv library on Spark 1.6. I was wondering if there is a way to specify the newline character. Usually, I think, it is \n. If not, is there a good solution for changing the newline character?…
Defcon
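
spark-csv 1.x exposes no option for the record separator, so one workaround is to render the rows yourself and append "\r"; saveAsTextFile adds the trailing "\n", yielding "\r\n" line endings. A sketch that does no quoting:

    df.rdd
      .map(_.mkString(",") + "\r") // "\r" + the "\n" saveAsTextFile adds = "\r\n"
      .saveAsTextFile("output/path")
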
1 vote · 1 answer

Spark: java.io.FileNotFoundException: File does not exist in copyMerge

I am trying to merge all Spark output part files in a directory and create a single file in Scala. Here is my code: import org.apache.spark.sql.functions.input_file_name import org.apache.spark.sql.functions.regexp_extract def merge(srcPath:…
Sudarshan kumar
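
For reference, the Hadoop 2.x copyMerge pattern the question builds on; a common cause of this FileNotFoundException is the source directory not existing when merge is called. A sketch (copyMerge was removed in Hadoop 3):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    def merge(srcPath: String, dstPath: String): Unit = {
      val conf = new Configuration()
      val fs   = FileSystem.get(conf)
      FileUtil.copyMerge(fs, new Path(srcPath), fs, new Path(dstPath),
        false, // keep the source part files after merging
        conf, null)
    }
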
1 vote · 0 answers

Shuffle read and write make the Spark job finish very slowly

I am doing a join on two DataFrames holding 280 GB and 1 GB of data respectively. The actual Spark job that computes the join is fast, but the shuffle read and write take a very long time, which makes the overall Spark job very slow. I am using m3.2xlarge 10…
user7547751
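
With only 1 GB on the small side, a broadcast join can avoid shuffling the 280 GB side entirely; a minimal sketch where the join key is an assumption and the broadcast threshold may need raising:

    import org.apache.spark.sql.functions.broadcast

    // Ship the small DataFrame to every executor instead of shuffling both sides.
    val joined = bigDf.join(broadcast(smallDf), Seq("join_key"))
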
1 vote · 1 answer

Write records per partition in a Spark DataFrame to an XML file

I have to count the records per partition in a Spark DataFrame and then write the output to an XML file. Here is my DataFrame: dfMainOutputFinalWithoutNull.coalesce(1).write.partitionBy("DataPartition","StatementTypeCode") …
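
A sketch of one way to get there: compute the per-partition counts first, then write that small result with spark-xml; the tag names are placeholders:

    val counts = dfMainOutputFinalWithoutNull
      .groupBy("DataPartition", "StatementTypeCode")
      .count()

    counts.coalesce(1).write
      .format("com.databricks.spark.xml")
      .option("rootTag", "counts")
      .option("rowTag", "partition")
      .save("output/path")
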