Questions tagged [spark-csv]

A library for reading and writing CSV files in Apache Spark. Originally a separate Databricks package (spark-csv); CSV support has been built into Spark itself since version 2.0.

139 questions
6 votes · 2 answers

How to save CSV with all fields quoted?

The code below does not add double quotes, which is the default. I also tried adding # and a single quote using the quote option, with no success. I also used quoteMode with the ALL and NON_NUMERIC options; still no change in the…
Arvind Kandaswamy
5 votes · 2 answers

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text. s3_single_file_inpath='s3a://bucket-name/file_name' indata =…
makansij
5 votes · 1 answer

Scala: Spark SQL to_date(unix_timestamp) returning NULL

Spark version: spark-2.0.1-bin-hadoop2.7, Scala: 2.11.8. I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format includes…
Sai Wai Maung
4 votes · 2 answers

inferSchema=true isn't working for CSV file reading in Spark Structured Streaming

I'm getting the error message java.lang.IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static…
Eljah
4 votes · 3 answers

Spark CSV package not able to handle \n within fields

I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields contain \n within them, e.g. the following two rows: "XYZ", "Test Data", "TestNew\nline", "OtherData" "XYZ", "Test…
Umesh K
4 votes · 1 answer

Spark CSV 2.1 File Names

I'm trying to save a DataFrame to CSV using the new Spark 2.1 CSV option: df.select(myColumns: _*).write .mode(SaveMode.Overwrite) .option("header", "true") .option("codec",…
Avi P
4 votes · 2 answers

Spark Standalone - Last stage saveAsTextFile takes many hours using very little resources to write CSV part files

We run Spark in standalone mode with 3 nodes on a 240 GB "large" EC2 box, merging three CSV files read into DataFrames (via JavaRDDs) into output CSV part files on S3 using s3a. We can see from the Spark UI that the first stages, reading and merging to…
twiz911
4 votes · 0 answers

Spark-csv returns an empty DataFrame when passed a compressed file

I'm looking to consume some compressed csv files into DataFrames so that I can eventually query them using SparkSQL. I would normally just use sc.textFile() to consume the file and use various map() transformations to parse and transform the data…
justafisch
3 votes · 1 answer

Spark - CSV - Write Options - Quotes

Hope everyone is doing well. While going through the Spark CSV data source options, I am quite confused about the differences between the various quote-related options available. Is there any detailed documentation on the differences between them? Does…
rainingdistros
3 votes · 0 answers

Streaming from CSV files with Spark

I am trying to use Spark Streaming to collect data from CSV files located on NFS. The code I have is very simple, and so far I have been running it only in spark-shell, but even there I am running into some issues. I am running spark-shell with a…
Dan Markhasin
3 votes · 1 answer

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming dataset, reading from kafka and trying to write to CSV case class Event(map: Map[String,String]) def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation val eventDataset: Dataset[Event] = spark .readStream …
3 votes · 2 answers

Spark 2.1 cannot write Vector field on CSV

I was migrating my code from Spark 2.0 to 2.1 when I stumbled into a problem related to Dataframe saving. Here's the code import org.apache.spark.sql.types._ import org.apache.spark.ml.linalg.VectorUDT val df =…
CARREAU Clément
3 votes · 1 answer

Spark CSV issue with new line (LF) character in the field of file imported using scala

I am trying to load a CSV (tab-delimited) using Spark CSV with Scala. What I observed is that if a column contains the newline character LF (\n), Spark considers it the end of the line, even though we have double quotes on both sides of the column in the…
3 votes · 2 answers

How to add a header and a column to a Spark dataframe?

I have a dataframe to which I want to add a header and a first column manually. Here is the dataframe: import org.apache.spark.sql.SparkSession val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate() val df =…
user3637823
3 votes · 2 answers

filter and save first X lines of a dataframe

I'm using pySpark to read and calculate statistics for a dataframe. The dataframe looks like: TRANSACTION_URL START_TIME END_TIME SIZE FLAG COL6 COL7 ... www.google.com 20170113093210 20170113093210 150 1 …
Adiel