Questions tagged [spark-csv]

A library for handling CSV files in Apache Spark.

139 questions
2
votes
1 answer

How to split the input file name and add a specific value to a Spark DataFrame column

This is how I load my CSV file into a Spark DataFrame: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import org.apache.spark.{SparkConf, SparkContext} import java.sql.{Date, Timestamp} import…
user7547751
2
votes
1 answer

Error while reading very large files with the spark-csv package

We are trying to read a 3 GB file which has multiple newline characters in one of its columns, using spark-csv and the univocity 1.5.0 parser, but some rows of the file are getting split into multiple columns on the basis of the newline character. This scenario…
Rajat Mishra
  • 3,635
  • 4
  • 27
  • 41
2
votes
2 answers

Error while exporting spark sql dataframe to csv

I have referred to the following links in order to understand how to export a Spark SQL DataFrame in Python: https://github.com/databricks/spark-csv and "How to export data from Spark SQL to CSV". My code: df = sqlContext.createDataFrame(routeRDD,…
Hardik Gupta
  • 4,700
  • 9
  • 41
  • 83
2
votes
1 answer

Index out of bounds error when doing dataframe union on bzip2 csv data

The problem is pretty weird. If I work with the uncompressed file, there is no issue. But if I work with the compressed bz2 file, I get an index out of bounds error. From what I've read, it is apparently the spark-csv parser that doesn't detect the end of…
flipper2gv
  • 41
  • 3
2
votes
1 answer

Dynamically loading the com.databricks:spark-csv Spark package into my application

I need to load the spark-csv package dynamically into my application. Using spark-submit it works: spark-submit --class "DataLoaderApp" --master yarn --deploy-mode client --packages com.databricks:spark-csv_2.11:1.4.0 …
Mahdi
  • 787
  • 1
  • 8
  • 33
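A minimal sketch of the programmatic alternative to the `--packages` flag in the question above, assuming Spark 1.x: the `spark.jars.packages` configuration key takes the same Maven coordinates, but it must be set before the SparkContext is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "spark.jars.packages" is the configuration counterpart of
// spark-submit's --packages flag; set it before the context exists.
val conf = new SparkConf()
  .setAppName("DataLoaderApp") // app name taken from the question
  .set("spark.jars.packages", "com.databricks:spark-csv_2.11:1.4.0")
val sc = new SparkContext(conf)
```

Dependencies are resolved from Maven Central at startup, so the driver needs network access.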
2
votes
0 answers

Spark: spark-csv partitioning and parallelism in subsequent DataFrames

I'm wondering how to enforce usage of subsequent, more appropriately partitioned DataFrames in Spark when importing source data with spark-csv. Summary: spark-csv doesn't seem to support explicit partitioning on import like sc.textFile() does.…
chucknelson
  • 2,328
  • 3
  • 24
  • 31
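Since spark-csv exposes no minPartitions argument the way sc.textFile() does, one common workaround (a sketch assuming Spark 1.x with spark-csv; the path and the partition count of 48 are illustrative) is an explicit repartition right after the load:

```scala
// sqlContext is the spark-shell's pre-built SQLContext (Spark 1.x).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data.csv")  // illustrative path
  .repartition(48)   // illustrative partition count

// repartition triggers a shuffle; subsequent transformations then run
// with the requested parallelism instead of the input-split count.
```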
2
votes
1 answer

Spark saving df as csv throws error

I am using pyspark and have a dataframe loaded. When I try to save it as a CSV file, I get the error below. I initialize spark like this: ./pyspark --master local[4] --executor-memory 14g --driver-memory 14g --conf…
skunkwerk
  • 2,920
  • 2
  • 37
  • 55
2
votes
0 answers

Python-Spark IllegalArgumentException when loading CSV into a DataFrame with DateType using spark-csv_2.10-1.3.0

I am trying to load a CSV file into a DataFrame using spark-csv_2.10-1.3.0: sqlContext.read.format('com.databricks.spark.csv') .options(header='true',dateFormat='dd/MM/YYYY hh:mm') .load('test.csv',schema =…
Bo Wan
  • 35
  • 5
2
votes
4 answers

Decimal data type not storing the values correctly in both Spark and Hive

I am having a problem storing the decimal data type, and I am not sure if it is a bug or I am doing something wrong. The data in the file looks like this: Column1 column2 column3 steve 100 100.23 ronald 500 20.369 maria 600 …
newSparkbabie
  • 73
  • 1
  • 1
  • 9
1
vote
2 answers

DateType column read as StringType from CSV file even when appropriate schema provided

I am trying to use PySpark to read a CSV file containing a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in the schema definition and also provided the option "dateFormat" in the DataFrame CSV reader. However, the output…
Monami Sen
  • 119
  • 1
  • 1
  • 12
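A minimal sketch of the combination the question describes, assuming Spark 2.x's built-in CSV reader (written in Scala rather than PySpark; the column names are illustrative). A common pitfall with this option: in SimpleDateFormat patterns, lowercase yyyy is the calendar year while uppercase YYYY is the ISO week year, which silently misparses dates near year boundaries.

```scala
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

// Explicit schema: without it, inferSchema leaves date columns as strings.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),  // illustrative column
  StructField("dob",  DateType,   nullable = true))) // illustrative column

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy") // lowercase yyyy, not YYYY
  .schema(schema)
  .csv("test.csv")                    // path taken from the question
```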
1
vote
1 answer

Why am I getting "CSVHeaderChecker:69 - CSV header does not conform to the schema"?

When reading the CSV data I'm getting a warning like that, and no data is picked up into the DataFrame batches. The schema is exactly as it exists in the CSV. What could be the reason for the warning and the wrong behavior?
Eljah
  • 4,188
  • 4
  • 41
  • 85
1
vote
2 answers

Spark CSV reader: garbled Japanese text and handling multiline records

In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files contain Japanese characters. Also they can have the ^M character (U+000D), so I need to parse them as multiline. First I used the following code to read the CSV files: implicit class…
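A minimal sketch of the two reader options involved, assuming Spark 2.4 (the bucket path and Shift_JIS encoding are illustrative; whether a non-UTF-8 encoding combines with multiLine depends on the Spark version):

```scala
val df = spark.read
  .option("header", "true")
  .option("encoding", "Shift_JIS") // illustrative source encoding
  .option("multiLine", "true")     // keep embedded CR/LF inside quoted fields
  .csv("s3://bucket/path/")        // illustrative path
```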
1
vote
0 answers

Spark CSV: Parse files delimited by the character æ (hex E6)

I have large data files delimited by the character æ (hex E6). My code snippet for parsing the file is as follows, but it seems the parser does not split values properly (I use Spark 2.4.1): implicit class DataFrameReadImplicits (dataFrameReader:…
Ashika Umanga Umagiliya
  • 8,988
  • 28
  • 102
  • 185
1
vote
3 answers

CSV format is not loading in spark-shell

Using Spark 1.6, I tried the following code: val diamonds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/got_own/com_sep_fil.csv") which caused the error: error: not found: value spark
abdul sattar
  • 11
  • 1
  • 2
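The error above is a version mismatch: the spark session object only exists in Spark 2.0+, and the built-in "csv" format arrived with it. On Spark 1.6 the spark-shell provides sqlContext instead, and CSV support comes from the external spark-csv package (started with spark-shell --packages com.databricks:spark-csv_2.10:1.5.0; coordinates illustrative). A sketch under those assumptions:

```scala
// Spark 1.6: sqlContext is pre-built in the spark-shell; there is no "spark".
val diamonds = sqlContext.read
  .format("com.databricks.spark.csv") // external package, not built-in "csv"
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/got_own/com_sep_fil.csv")   // path taken from the question
```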
1
vote
1 answer

Spark - handle blank values in CSV file

Let's say I've got a simple pipe-delimited file with missing values: A|B||D I read that into a DataFrame: val foo = spark.read.format("csv").option("delimiter","|").load("/path/to/my/file.txt") The missing third column, instead of being a null…
Andrew
  • 8,445
  • 3
  • 28
  • 46
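For the blank-field case above, a hedged sketch assuming Spark 2.x: how empty fields come back varies by version, and the nullValue option (plus emptyValue on Spark 2.4+) controls the mapping between empty strings and nulls.

```scala
val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("nullValue", "")  // map empty fields back to SQL nulls
  .load("/path/to/my/file.txt")
```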