I have a dataframe whose fields I concatenate together.
After concatenation it becomes another dataframe, and finally I write its output to a CSV file partitioned on two of its columns. One of those columns is present in the first dataframe…
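A minimal sketch of that flow, assuming a hypothetical input with columns a, b, part1 and part2 (the real dataframe, separator and output path are not given in the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

val spark = SparkSession.builder().appName("concat-and-partition").getOrCreate()
import spark.implicits._

// Hypothetical input; the real dataframe and column names come from the question.
val df = Seq(("a1", "b1", "p1", "q1"), ("a2", "b2", "p2", "q2"))
  .toDF("a", "b", "part1", "part2")

// Concatenate every field into a single column, producing a second dataframe.
val concatenated = df.select(
  concat_ws("|", df.columns.map(df(_)): _*).as("all_fields"),
  df("part1"), df("part2"))

// Write the result as CSV, partitioned on two of the columns.
concatenated.write
  .partitionBy("part1", "part2")
  .option("header", "true")
  .csv("/tmp/output")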
I have a Spark 2.0 Java application that uses Spark's CSV reading utilities to read a CSV file into a dataframe. The problem is that sometimes 1 out of 100 input files may be invalid (corrupt gzip), which causes the job to fail…
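One commonly used mitigation (an assumption, not something stated in the question) is to let Spark skip files it cannot read via spark.sql.files.ignoreCorruptFiles; note that this setting was only added in Spark 2.1, so on 2.0 the files may need to be pre-validated instead. A Scala sketch with a placeholder input path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-ingest").getOrCreate()

// Skip unreadable files (e.g. truncated/corrupt gzip) instead of failing the job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read
  .option("header", "true")
  .csv("/data/input/*.csv.gz")   // hypothetical path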
I have hundreds of gzipped CSV files in S3 that I am trying to load. The directory structure resembles the following (a read sketch follows the listing):
bucket
-- level1
---- level2.1
-------- level3.1
------------ many files
-------- level3.2
------------ many files
----…
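A sketch of one way to load that layout, assuming the s3a connector is configured and the files end in .csv.gz; the bucket name and the depth of the glob are taken from the structure above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-gz-csv").getOrCreate()

// One wildcard per directory level covers the level1/levelN.x/levelM.y/<files> layout.
val df = spark.read
  .option("header", "true")
  .csv("s3a://bucket/level1/*/*/*.csv.gz")   // hypothetical file suffix

df.printSchema()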
I am trying to convert CSV files to Parquet and I am using Spark to accomplish this.
SparkSession spark = SparkSession
    .builder()
    .appName(appName)
    .config("spark.master", master)
    .getOrCreate();
Dataset logFile =…
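A minimal Scala sketch of the CSV-to-Parquet step the snippet above is building towards; the input and output paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-parquet")
  .master("local[*]")          // placeholder master
  .getOrCreate()

// Read the CSV with a header row and let Spark infer column types.
val logFile = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input.csv")      // hypothetical path

// Write the same data back out as Parquet.
logFile.write.parquet("/data/output.parquet")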
We are using the Spark CSV reader to read a CSV file and convert it to a DataFrame, and we are running the job on yarn-client; it works fine in local mode.
We are submitting the Spark job on an edge node.
But when we place the file in a local file path instead…
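One common explanation (an assumption, since the question is cut off) is that in yarn-client mode the executors run on cluster nodes and cannot see a path that exists only on the edge node, so the file is usually copied to HDFS first and read from there:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-on-yarn").getOrCreate()

// Copy the file from the edge node to HDFS first, e.g.:
//   hdfs dfs -put /local/path/file.csv /user/me/file.csv
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///user/me/file.csv")   // hypothetical HDFS path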
I need to create a Parquet file from CSV files using a customized JSON schema file, like this one:
{"type" : "struct","fields" : [ {"name" : "tenor_bank","type" : "string","nullable" : false}, {"name":"tenor_frtb", "type":"string",…
In a Spark shell I use the code below to read from a CSV file
val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") //spark here is the spark session
df.show()
Assuming…
I am using spark-core version 2.0.1 with Scala 2.11. I have a simple piece of code to read a CSV file which contains \ escapes.
val myDA = spark.read
.option("quote",null)
.schema(mySchema)
.csv(filePath)
As per the documentation, \ is the default escape for…
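A sketch of the same read with the escape character stated explicitly; whether this matches the intended behaviour depends on the data, which the question does not show, and the schema and path below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("escape-read").master("local[*]").getOrCreate()

// Placeholders standing in for the schema and path from the question.
val mySchema = StructType(Seq(StructField("col1", StringType), StructField("col2", StringType)))
val filePath = "/data/escaped.csv"

val myDA = spark.read
  .option("quote", null)      // disable quoting, as in the original read
  .option("escape", "\\")     // make the backslash escape explicit
  .schema(mySchema)
  .csv(filePath)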
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("parserLib", "UNIVOCITY").option("escape","\"").load("file.csv")
When I create a dataframe using the above code I am getting the following…
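If the failure concerns embedded double quotes (an assumption, since the error itself is cut off), one frequently used combination is to make the quote and the escape the same character; sqlContext is the shell's SQLContext as in the question:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("parserLib", "UNIVOCITY")
  .option("quote", "\"")      // field quoting character
  .option("escape", "\"")     // quotes inside a field are doubled, so escape with the same character
  .load("file.csv")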
I have a standalone instance of:
- Hadoop 2.7.3
- Scala 2.11.8
- Spark 2.0.0
- SBT 0.13.11
Everything is built locally. The code is developed in IntelliJ and I run it by clicking debug.
Everything works fine, unless I try to use a UDF
def testGeolocation…
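A minimal sketch of a DataFrame UDF in that setup; the real body of testGeolocation is cut off, so the function below is only a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .appName("udf-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Placeholder logic; the real testGeolocation body is not shown in the question.
def testGeolocation(lat: Double, lon: Double): String =
  if (lat >= 0) "north" else "south"

val testGeolocationUdf = udf(testGeolocation _)

val df = Seq((48.85, 2.35), (-33.87, 151.21)).toDF("lat", "lon")
df.withColumn("hemisphere", testGeolocationUdf($"lat", $"lon")).show()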
I have a large table in SQL that I imported from a large CSV file.
A column is recognized as a string when it contains date information in the format dd/mm/yyyy.
I tried select TO_DATE('12/31/2015') as date but that does not work because the TO_DATE function needs…
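One common workaround (an assumption about the Spark version; to_date only gained a format argument in Spark 2.2) is to go through unix_timestamp with an explicit pattern and cast the result down to a date. A Scala sketch with a hypothetical column name:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, unix_timestamp}

val spark = SparkSession.builder().appName("date-parse").getOrCreate()
import spark.implicits._

// Hypothetical frame standing in for the imported table; date_str is the dd/MM/yyyy column.
val df = Seq("31/12/2015", "01/02/2016").toDF("date_str")

// Parse with an explicit pattern, then cast down to a DATE.
val withDate = df.withColumn(
  "date",
  unix_timestamp(col("date_str"), "dd/MM/yyyy").cast("timestamp").cast("date"))

withDate.show()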
I've tried several permutations of the suggestions in How to load csv file into SparkR on RStudio? but I am only able to get the in-memory-to-Spark solution to…
I am trying to JAR a simple Scala application which makes use of spark-csv and Spark SQL to create a DataFrame from a CSV file stored in HDFS, and then just run a simple query to return the max and min of a specific column in the CSV file.
I am getting an error…
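A sketch of what that application might look like; the packaging error itself is cut off, and the HDFS path and column name below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

object CsvMinMax {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-min-max").getOrCreate()

    // Read the CSV from HDFS; the path and column name are placeholders.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input.csv")

    df.agg(max("value"), min("value")).show()

    spark.stop()
  }
}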
I have two dataframes which I've loaded from two CSV files (a loading sketch follows the example). Examples:
old
+--------+---------+----------+
|HOTEL ID|GB |US |
+--------+---------+----------+
| 80341| 0.78| 0.7|
| 255836| 0.6| 0.6|
| 245281| …
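Since the rest of the question is cut off, here is only a sketch of loading the two files and lining them up on HOTEL ID for comparison; the file names are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compare-csv").getOrCreate()

// Hypothetical file names for the two dataframes shown above.
val old = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/old.csv")
val newDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/data/new.csv")

// Join on the shared key so the per-country columns sit side by side.
val joined = old.as("o").join(newDf.as("n"), Seq("HOTEL ID"), "outer")
joined.show()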