Questions tagged [spark-csv]

A Databricks library for reading and writing CSV files as Spark DataFrames; its functionality has been built into Apache Spark itself since version 2.0.


139 questions
1 vote · 2 answers

Drop column(s) in spark csv data frame

I have a DataFrame whose fields I concatenate together. The concatenation produces another DataFrame, and I finally write its output to a CSV file partitioned on two of its columns. One of those columns is present in the first DataFrame… (a sketch of one approach follows below)
user7547751
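A minimal sketch of one way to do this, assuming placeholder paths and column names (unwanted_col, part_col1, part_col2, all_fields are inventions for illustration, not the question's actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

val spark = SparkSession.builder().appName("drop-then-write").getOrCreate()

// Hypothetical input path; the question's real schema is not shown.
val df = spark.read.option("header", "true").csv("/path/to/input.csv")

// Concatenate every field into one column, as described in the question.
val concatenated = df.withColumn("all_fields", concat_ws("|", df.columns.map(df(_)): _*))

// Drop the column that is no longer needed, then write partitioned by two columns.
concatenated
  .drop("unwanted_col")                   // placeholder for the column to remove
  .write
  .option("header", "true")
  .partitionBy("part_col1", "part_col2")  // placeholders for the two partition columns
  .csv("/path/to/output")
```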
1 vote · 0 answers

Spark CSV Handle Corrupt GZip Files

I have a Spark 2.0 Java application that uses Spark's CSV reading utilities to read a CSV file into a DataFrame. The problem is that sometimes 1 out of 100 input files may be invalid (corrupt gzip), which causes the whole job to fail… (a hedged workaround is sketched below)
Nathan Case · 655 · 1 · 6 · 15
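A hedged workaround sketch: from Spark 2.1 onward the file sources honour spark.sql.files.ignoreCorruptFiles, which skips files that throw I/O errors instead of failing the job; whether that catches a truncated gzip stream depends on where the decompression error surfaces, so treat this as something to test rather than a guaranteed fix. Paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skip-corrupt-gzip").getOrCreate()

// Ask Spark's file-based sources to skip files that raise I/O errors while being read.
// (Available from Spark 2.1; a Spark 2.0 job would need an upgrade or a pre-validation pass.)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // drops malformed rows, but does not help with a corrupt archive
  .csv("/data/input/*.csv.gz")       // placeholder glob over the ~100 input files

df.count()
```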
1 vote · 1 answer

Loading nested csv files from S3 with Spark

I have hundreds of gzipped CSV files in S3 that I am trying to load. The directory structure resembles the following: bucket -- level1 ---- level2.1 -------- level3.1 ------------ many files -------- level3.2 ------------ many files ----… (a glob-path sketch follows below)
Nathan Case · 655 · 1 · 6 · 15
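A minimal sketch, assuming an s3a filesystem is configured and the gzipped files sit three directory levels below the bucket (the bucket name and depth are placeholders). Hadoop glob patterns let a single read call cover every leaf directory:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-s3-csv").getOrCreate()

// One * per directory level; *.gz matches the gzipped CSVs in each leaf directory.
// Spark decompresses .gz input transparently (each gzip file becomes a single, unsplittable partition).
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/level1/*/*/*.gz")   // placeholder bucket and nesting depth

df.printSchema()
df.count()
```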
1 vote · 1 answer

Parquet schema and Spark

I am trying to convert CSV files to Parquet and I am using Spark to accomplish this. SparkSession spark = SparkSession.builder().appName(appName).config("spark.master", master).getOrCreate(); Dataset logFile =… (a sketch of the conversion follows below)
changepicture · 466 · 1 · 4 · 10
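A minimal sketch of the CSV-to-Parquet round trip, written in Scala rather than the question's Java and with placeholder paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Read the CSV; inferSchema samples the data to pick column types,
// and those types are what end up encoded in the Parquet schema.
val logFile = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/input.csv")

logFile.printSchema()   // inspect the schema Parquet will receive

// The DataFrame's schema becomes the Parquet schema on write.
logFile.write.mode("overwrite").parquet("/path/to/output.parquet")
```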
1 vote · 2 answers

Not able to read text file from local file path - Spark CSV reader

We are using the Spark CSV reader to read a CSV file into a DataFrame, and we are running the job with yarn-client; it works fine in local mode. We submit the Spark job from an edge node. But when we place the file on a local file path instead… (a hedged sketch of the usual fix follows below)
Shankar · 8,529 · 26 · 90 · 159
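A hedged sketch of the usual fix: use an explicit file:// URI and make sure the file exists at that same path on every executor node, since on yarn-client the read happens on the executors rather than only on the edge node. The path is a placeholder; copying the file to HDFS is often the simpler route:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("local-file-on-yarn").getOrCreate()

// file:// forces the local filesystem; without it the path is resolved against
// the default filesystem (usually HDFS) when running on a cluster.
val df = spark.read
  .option("header", "true")
  .csv("file:///home/user/data/input.csv")   // must exist at this path on all executor nodes

df.show(5)
```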
1 vote · 1 answer

NumberFormatException when I try to create a parquet file with a custom schema and BigDecimal types

I need to create a Parquet file from CSV files using a customized JSON schema file, like this one: {"type" : "struct", "fields" : [ {"name" : "tenor_bank", "type" : "string", "nullable" : false}, {"name" : "tenor_frtb", "type" : "string",… (a sketch of declaring decimal columns follows below)
aironman · 837 · 5 · 26 · 55
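A minimal sketch of declaring the decimal columns programmatically with DecimalType(precision, scale) instead of leaving them as strings; the field names, precision, and scale are placeholders, not the question's actual layout:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-decimal-to-parquet").getOrCreate()

// Columns read as DecimalType are backed by java.math.BigDecimal; a
// NumberFormatException at read time usually points at a value that does not
// match the declared type (e.g. an empty field or a stray thousands separator).
val schema = StructType(Seq(
  StructField("tenor_bank", StringType, nullable = false),
  StructField("tenor_frtb", StringType, nullable = true),
  StructField("amount",     DecimalType(18, 4), nullable = true)   // placeholder decimal column
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/input.csv")

df.write.parquet("/path/to/output.parquet")
```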
1 vote · 2 answers

Would spark dataframe read from external source on every action?

In the Spark shell I use the code below to read from a CSV file: val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") (spark here is the SparkSession) df.show() Assuming… (a caching sketch follows below)
Andy Dufresne · 6,022 · 7 · 63 · 113
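A short sketch of the usual answer: a DataFrame re-evaluates its lineage on every action, so without caching each action scans the CSV again; cache() keeps the parsed rows around after the first action materialises them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/opt/person.csv")

// Mark the DataFrame for caching; this is lazy and does not read the file yet.
df.cache()

df.show()     // first action: reads the CSV and populates the cache
df.count()    // later actions are served from the cached data, not the file
```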
1 vote · 1 answer

Spark CSV Escape Not Working

I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read a CSV file which contains \ escapes: val myDA = spark.read.option("quote", null).schema(mySchema).csv(filePath) As per the documentation, \ is the default escape for… (a hedged sketch of spelling the options out follows below)
JNish · 145 · 2 · 10
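A hedged sketch that spells out both the quote and the escape character instead of nulling the quote option; this is an assumption about the intent of the original code, not a confirmed fix, and mySchema and the path are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-escape").getOrCreate()

// Placeholder schema standing in for the question's mySchema.
val mySchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

// Declare the quote and escape characters explicitly: with these options a
// backslash in the data escapes a quote character inside a quoted field.
val myDA = spark.read
  .option("quote", "\"")
  .option("escape", "\\")
  .schema(mySchema)
  .csv("/path/to/file.csv")

myDA.show(5)
```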
1 vote · 1 answer

How to provide parserLib and inferSchema options together for spark-csv

sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "UNIVOCITY").option("escape", "\"").load("file.csv") When I create a DataFrame using the above code I get the following…
nirali.gandhi · 221 · 1 · 11
1 vote · 0 answers

Databricks CSV write after applying a UDF - Spark 2.0.0, Scala 2.11.8

I have a standalone instance of: Hadoop 2.7.3, Scala 2.11.8, Spark 2.0.0, SBT 0.13.11. Everything is built locally. The code is developed in IntelliJ and I run it by clicking debug. Everything works fine unless I try to use a UDF: def testGeolocation… (a sketch of the UDF-plus-CSV-write pattern follows below)
MPękalski · 6,873 · 4 · 26 · 36
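A minimal sketch of the UDF-plus-write pattern on Spark 2.0, where the CSV writer is built in and the separate Databricks package is not needed; testGeolocation's real signature is not shown in the excerpt, so the one below (and the column names) are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("udf-then-csv").getOrCreate()

// Placeholder UDF standing in for testGeolocation; the function it wraps must be
// serializable (avoid capturing non-serializable state from the enclosing class).
val testGeolocation = udf((lat: Double, lon: Double) => s"$lat,$lon")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/input.csv")

val withGeo = df.withColumn("geo", testGeolocation(col("lat"), col("lon")))   // placeholder columns

withGeo.write.option("header", "true").csv("/path/to/output")
```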
1 vote · 2 answers

How to convert column type from str to date when the str is of format dd/mm/yyyy?

I have a large table in SQL that I imported from a large CSV file. A column is recognized as a str when it contains date information in dd/mm/yyyy format. I tried select TO_DATE('12/31/2015') as date, but that does not work because the TO_DATE function needs… (a conversion sketch follows below)
Semihcan Doken · 776 · 3 · 10 · 23
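A hedged conversion sketch with a placeholder column name; note the pattern is dd/MM/yyyy, since lowercase mm means minutes. On Spark 2.2+ the two-argument to_date(col, "dd/MM/yyyy") does the same in one call:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

val spark = SparkSession.builder().appName("str-to-date").getOrCreate()

val df = spark.read.option("header", "true").csv("/path/to/table.csv")

// Parse the dd/MM/yyyy string with an explicit pattern, then narrow it to a date.
val withDate = df.withColumn(
  "event_date",                                                        // placeholder output column
  to_date(unix_timestamp(col("date_str"), "dd/MM/yyyy").cast("timestamp"))
)

withDate.printSchema()
```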
1 vote · 0 answers

What's the best way to read a multiline input format into one record in Spark?

Below is what the input file (CSV) looks like: Carrier_create_date,Message,REF_SHEET_CREATEDATE,7/1/2008 Carrier_create_time,Message,REF_SHEET_CREATETIME,8:53:57 Carrier_campaign,Analog,REF_SHEET_CAMPAIGN,25 Carrier_run_no,Analog,REF_SHEET_RUNNO,7 Below… (a pivot-based sketch follows below)
Gangadhar Kadam · 536 · 1 · 4 · 15
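One hedged approach sketch: read the four comma-separated fields as ordinary columns, then pivot the attribute-name column so each name becomes a column of a single wide row. The column positions (_c0.._c3) and the one-record-per-file assumption are guesses from the excerpt:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, lit}

val spark = SparkSession.builder().appName("keyvalue-to-record").getOrCreate()

// Each input line looks like: attribute_name,type,reference_name,value
val raw = spark.read.csv("/path/to/input.csv")   // no header, so columns arrive as _c0.._c3

// Collapse the lines into one wide record: attribute names (_c0) become columns,
// their values (_c3) become the cells of that single row.
val record = raw
  .withColumn("grp", lit(1))   // constant grouping key, since the whole file is one record
  .groupBy("grp")
  .pivot("_c0")
  .agg(first("_c3"))
  .drop("grp")

record.show(false)
```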
1 vote · 2 answers

spark-csv falls apart with SparkR & RStudio

I've tried several permutations of the suggestions in "How to load csv file into SparkR on RStudio?", but I am only able to get the in-memory-to-Spark solution to…
Chris · 1,219 · 2 · 11 · 21
1 vote · 2 answers

Using Spark SQL and spark-csv with Spark JobServer

I am trying to JAR a simple Scala application which makes use of spark-csv and Spark SQL to create a DataFrame from a CSV file stored in HDFS, and then just run a simple query to return the max and min of a specific column in the CSV file. I am getting an error… (the query part is sketched below)
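A minimal sketch of the max/min part in plain Scala, leaving the JobServer packaging aside; the HDFS path and column name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

val spark = SparkSession.builder().appName("csv-min-max").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/input.csv")            // placeholder HDFS path

// Max and min of one column in a single aggregation pass.
df.agg(max("price"), min("price")).show()   // "price" is a placeholder column name
```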
1 vote · 2 answers

PySpark: How to compare two dataframes

I have two DataFrames which I've loaded from two CSV files. Examples: old +--------+---------+----------+ |HOTEL ID|GB |US | +--------+---------+----------+ | 80341| 0.78| 0.7| | 255836| 0.6| 0.6| | 245281| … (a diff sketch follows below)
Rafael · 572 · 5 · 9
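A hedged sketch of one way to diff the two frames, written in Scala to match the rest of this listing; in PySpark the equivalent call is subtract (or exceptAll on newer versions). Paths are placeholders and both files are assumed to share the same columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compare-frames").getOrCreate()

val oldDf = spark.read.option("header", "true").csv("/path/to/old.csv")
val newDf = spark.read.option("header", "true").csv("/path/to/new.csv")

// Rows that appear in one file but not the other; except is a set difference,
// so both DataFrames must have the same column layout.
val added   = newDf.except(oldDf)
val removed = oldDf.except(newDf)

added.show()
removed.show()
```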