Questions tagged [spark-csv]

A library for reading and writing CSV files in Apache Spark; equivalent functionality has been built into Spark itself since version 2.0.

139 questions
3
votes
1 answer

Using Spark to merge data in sorted order into CSV files

I have a data set like this:

    name  time   val
    ----  -----  ---
    fred  04:00  111
    greg  03:00  123
    fred  01:00  411
    fred  05:00  921
    fred  11:00  157
    greg  12:00  333

And CSV files in some folder, one for each unique name from the data set: fred.csv, greg.csv. The…
Greg Clinton
  • 365
  • 1
  • 7
  • 18
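A minimal sketch of one way to approach this with the DataFrame API, assuming Spark 2.x; the paths and column names mirror the question but are otherwise illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-sorted").getOrCreate()
    val df = spark.read.option("header", "true").csv("/data/input.csv")

    df.repartition(df("name"))                // group each name's rows together
      .sortWithinPartitions("name", "time")   // keep rows sorted within each partition
      .write
      .partitionBy("name")                    // one output folder per unique name
      .option("header", "true")
      .mode("append")                         // append to any existing per-name data
      .csv("/data/out")

Note that Spark writes a folder per name (name=fred/, name=greg/) rather than a bare fred.csv, so a rename or merge step would still be needed to match the question's layout exactly.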
3
votes
1 answer

Programmatically generate the schema AND the data for a dataframe in Apache Spark

I would like to dynamically generate a DataFrame containing a header record for a report, creating the DataFrame from the value of the string below: val headerDescs : String = "Name,Age,Location" val headerSchema =…
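A minimal sketch of one way to do this, assuming Spark 2.x; headerDescs is taken from the question, and the single "data" row simply repeats the column names to act as a header record:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("header-df").getOrCreate()

    val headerDescs: String = "Name,Age,Location"

    // One StructField per comma-separated column name.
    val headerSchema = StructType(
      headerDescs.split(",").map(StructField(_, StringType, nullable = true)))

    // The data: a single row repeating the column names.
    val headerRow = Row.fromSeq(headerDescs.split(",").toSeq)
    val headerDf = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(headerRow)), headerSchema)

    headerDf.show()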
3
votes
2 answers

How to create a custom org.apache.spark.sql.types.StructType schema object from a JSON file programmatically

I have to create a custom org.apache.spark.sql.types.StructType schema object from the info in a JSON file. The JSON file can be anything, so I have parameterized it within a property file. This is how the property file looks: //path to the schema…
aironman
  • 837
  • 5
  • 26
  • 55
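If the JSON file holds a schema serialized with StructType#json, it can be round-tripped directly; a minimal sketch, assuming that serialization format (the path is illustrative and would come from the property file):

    import scala.io.Source
    import org.apache.spark.sql.types.{DataType, StructType}

    // Load the serialized schema and rebuild the StructType.
    val schemaJson = Source.fromFile("/conf/schema.json").mkString
    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

    // It can then be applied when reading:
    // val df = spark.read.schema(schema).csv("/data/input.csv")

An arbitrary hand-written JSON layout would instead need custom parsing into StructField objects.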
3
votes
2 answers

Adding spark-csv package in PyCharm IDE

I have successfully loaded the spark-csv library in Python standalone mode through $ --packages com.databricks:spark-csv_2.10:1.4.0. While running the above command, it creates two folders (jars and cache) at this…
mahima
  • 1,875
  • 1
  • 11
  • 15
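On recent Spark versions the same dependency can also be resolved programmatically instead of on the command line; a minimal sketch, assuming the session is created inside the IDE (on older 1.x builds the --packages flag on the launcher is the reliable route, which matches what the question is doing):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-csv-in-ide")
      .master("local[*]")
      // Resolved from Maven at session start and cached under ~/.ivy2
      .config("spark.jars.packages", "com.databricks:spark-csv_2.10:1.4.0")
      .getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/input.csv")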
3
votes
0 answers

Saving a dataframe using the spark-csv package throws exceptions and crashes (PySpark)

I am running a script on Spark 1.5.2 in standalone mode (using 8 cores), and at the end of the script I attempt to serialize a very large dataframe to disk using the spark-csv package. The code snippet that throws the exception is: numfileparts =…
Magnus
  • 371
  • 1
  • 14
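A minimal sketch of the kind of write the snippet describes, shown with the Spark 2 API (on 1.5 the same options go through sqlContext with format("com.databricks.spark.csv")); numfileparts mirrors the variable named in the question, the rest is illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("big-write").getOrCreate()
    val df = spark.read.option("header", "true").csv("/data/input.csv")

    val numfileparts = 32                  // more, smaller parts eases memory pressure
    df.repartition(numfileparts)
      .write
      .format("com.databricks.spark.csv")  // the spark-csv package's data source
      .option("header", "true")
      .save("/data/out")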
3
votes
1 answer

How to avoid Spark NumberFormatException: null

I have a general question derived from a specific exception I have encountered. I'm querying data on Dataproc using Spark 1.6. I need to get 1 day of data (~10000 files) from 2 logs and then do some transformations. However, my data may (or may…
Zahiro Mor
  • 1,708
  • 1
  • 16
  • 30
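One common way to sidestep this is to read the suspect columns as strings and cast afterwards, since a failed cast yields SQL NULL instead of throwing; a minimal sketch, assuming the Spark 2 reader (paths and column names are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.LongType

    val spark = SparkSession.builder().appName("null-safe").getOrCreate()
    val raw = spark.read.option("header", "true").csv("/logs/2016-08-01/")

    val parsed = raw
      .withColumn("bytes", raw("bytes").cast(LongType))  // "null" and junk become NULL
      .na.fill(Map("bytes" -> 0L))                       // then fill a default if wanted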
3
votes
2 answers

Characters get corrupted if spark.executor.memory is not set properly when importing CSV to a DataFrame

UPDATE: Please hold on to this question. I found this might be a problem with Spark 1.5 itself, as I am not using the official build of Spark. I'll keep updating this question. Thank you! I noticed a strange bug recently when using spark-csv to…
DarkZero
  • 2,259
  • 3
  • 25
  • 36
2
votes
1 answer

How to split an array structure to CSV in PySpark

Here are example data and a schema:

    mySchema = StructType([
        StructField('firstname', StringType()),
        StructField('lastname', StringType()),
        StructField('langages', ArrayType(StructType([
            StructField('lang1', StringType()),
            …
Fabrice
  • 355
  • 4
  • 9
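CSV cannot hold arrays or structs, so the usual fix is to explode the array and flatten the struct fields before writing; a minimal sketch mirroring the question's shape in Scala (names and data are illustrative, and the tuple fields _1/_2 stand in for lang1/lang2):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, explode}

    val spark = SparkSession.builder().appName("flatten").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("John", "Doe", Seq(("fr", "5"), ("en", "3")))
    ).toDF("firstname", "lastname", "langages")

    val flat = df
      .select(col("firstname"), col("lastname"),
              explode(col("langages")).as("lang"))   // one row per array element
      .select(col("firstname"), col("lastname"),
              col("lang._1").as("lang1"),            // flatten struct fields
              col("lang._2").as("lang2"))

    flat.write.option("header", "true").csv("/data/out")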
2
votes
1 answer

Reading a file in Spark with newlines (\n) in fields, escaped with a backslash (\) and not quoted

I have an input file with the following structure:

    col1, col2, col3
    line1filed1,line1filed2.1\
    line1filed2.2, line1filed3
    line2filed1,line2filed2.1\
    line2filed2.2, line2filed3
    line3filed1,…
Kishore Indraganti
  • 1,296
  • 3
  • 17
  • 34
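Since the reader has no option for backslash-escaped unquoted newlines, one workaround is to stitch the records back together before parsing; a minimal sketch, assuming Spark 2.2+ and files small enough for wholeTextFiles to hold in memory:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("escaped-newlines").getOrCreate()
    import spark.implicits._

    val stitched = spark.sparkContext
      .wholeTextFiles("/data/input.csv")                          // (path, content) pairs
      .map { case (_, content) => content.replace("\\\n", "") }   // undo backslash-newline
      .flatMap(_.split("\n"))                                     // back to logical records

    val df = spark.read.csv(stitched.toDS())   // csv(Dataset[String]) exists since 2.2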
2
votes
2 answers

Parsing micro/nanosecond timestamps in the spark-csv DataFrame reader: inconsistent results

I'm trying to read a CSV file that has timestamps down to nanoseconds. Sample content of file TestTimestamp.csv (Spark 2.4.0, Scala 2.11.11):

    101,2019-SEP-23 11.42.35.456789123 AM

Tried to…
ValaravausBlack
  • 691
  • 5
  • 12
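Spark's TimestampType only stores microseconds, so under 2.4 a common workaround is to trim the fraction before parsing; a minimal sketch, assuming the two-column sample above (the regex keeps millisecond precision, which the 2.4-era SimpleDateFormat patterns handle reliably; keep the raw string column if the full nanoseconds matter):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, regexp_replace, to_timestamp}

    val spark = SparkSession.builder().appName("nanos").getOrCreate()

    val raw = spark.read.csv("TestTimestamp.csv").toDF("id", "ts_raw")

    val parsed = raw.withColumn("ts",
      to_timestamp(
        regexp_replace(col("ts_raw"), "(\\.\\d{3})\\d+", "$1"),  // trim to millis
        "yyyy-MMM-dd hh.mm.ss.SSS a"))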
2
votes
2 answers

Spark - loading many small CSV files takes very long

Description: At my workplace we have a large amount of data that needs processing. It concerns a rapidly growing number of instances (currently ~3000) which all have a few megabytes' worth of data stored in gzipped CSV files on S3. I have set up a…
Jeroen Bos
  • 87
  • 9
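Most of the cost in this shape of job is usually per-read planning, S3 listing, and schema inference rather than the data itself, so one mitigation is a single read over every path with an explicit schema; a minimal sketch (bucket layout and schema are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("many-small-csv").getOrCreate()

    val paths = (1 to 3000).map(i => s"s3a://bucket/instances/$i/data.csv.gz")

    val schema = StructType(Seq(       // explicit schema: no inference pass
      StructField("ts", StringType),   // per gzipped file
      StructField("value", DoubleType)))

    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv(paths: _*)                  // one job instead of ~3000 separate reads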
2
votes
1 answer

Spark 2.4 CSV load issue with the option "nullValue"

We were using Spark 2.3 before; now we're on 2.4: Spark version 2.4.0, Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212). We had a piece of code running in production that converted CSV files to Parquet format. One of the options…
KK2486
  • 353
  • 2
  • 3
  • 13
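A minimal sketch of the kind of load involved; one behaviour change worth knowing is that Spark 2.4 distinguishes the nullValue token from empty strings via the separate emptyValue option (the paths and token are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("null-options").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("nullValue", "null")   // this exact token becomes SQL NULL
      .option("emptyValue", "")      // empty fields stay empty strings (2.4+)
      .csv("/data/input.csv")

    df.write.mode("overwrite").parquet("/data/out")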
2
votes
1 answer

Strange behavior in the Spark 2 CSV parser when the multiLine option is enabled

When creating a DataFrame from a CSV file with the multiLine option enabled, some file columns are parsed incorrectly. Below is the code execution; I'll point out the strange behavior as the code goes. First, I load the file into two variables:…
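With multiLine enabled, each file is parsed as a whole and the quote/escape settings matter far more, so making them explicit is a sensible first step when columns shift; a minimal sketch with the Spark 2 defaults spelled out (the path is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("multiline").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("multiLine", "true")   // quoted fields may span lines
      .option("quote", "\"")         // default quote character
      .option("escape", "\\")        // default escape character
      .csv("/data/input.csv")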
2
votes
1 answer

Prevent delimiter collision while reading CSV in Spark

I'm trying to create an RDD from a CSV dataset. The problem is that I have a location column with a structure like (11112,222222) that I don't use, so when I use the map function with split(",") it results in two columns. Here is my code: …
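A plain split(",") breaks on values like (11112,222222), while the DataFrame CSV reader tracks quoting; a minimal sketch, assuming the location field is quoted in the file (path and column names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("delimiters").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("quote", "\"")   // commas inside quoted fields are not delimiters
      .csv("/data/input.csv")
      .drop("location")        // the unused column from the question

    val rdd = df.rdd           // back to an RDD if one is really needed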
2
votes
3 answers

How to write data as a single (normal) CSV file in Spark?

I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written within it. Is there any suggestion to overcome this? My requirement: to get a normal CSV file with…
Ramkumar
  • 444
  • 1
  • 7
  • 22
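The folder-of-part-files layout is how Spark writes by design; the usual workaround is to coalesce to one partition and then rename the single part file, as in the minimal sketch below (paths are illustrative, and funnelling everything through one partition only suits modest data sizes):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("single-csv").getOrCreate()
    val df = spark.read.option("header", "true").csv("/data/input.csv")

    df.coalesce(1)                       // one partition => one part file
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv("/tmp/out_dir")

    // Rename the lone part file to a normal CSV name.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val part = fs.globStatus(new Path("/tmp/out_dir/part-*"))(0).getPath
    fs.rename(part, new Path("/tmp/out.csv"))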