I have a data set like this:
name time val
---- ----- ---
fred 04:00 111
greg 03:00 123
fred 01:00 411
fred 05:00 921
fred 11:00 157
greg 12:00 333
And CSV files in a folder, one for each unique name from the data set:
fred.csv
greg.csv
The…
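A minimal sketch of how such per-name files could be produced with partitionBy (the question is cut off above, so the intent and the paths here are assumptions):

import org.apache.spark.sql.SparkSession

// Minimal sketch: one CSV directory per unique name via partitionBy.
// Input and output paths are hypothetical.
val spark = SparkSession.builder().appName("per-name-csv").getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("data.csv")

// partitionBy creates one subdirectory per distinct name,
// e.g. out/name=fred/part-*.csv and out/name=greg/part-*.csv
df.write
  .partitionBy("name")
  .option("header", "true")
  .csv("out")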
I would like to dynamically generate a DataFrame containing a header record for a report, i.e. create a DataFrame from the value of the string below:
val headerDescs : String = "Name,Age,Location"
val headerSchema =…
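A minimal sketch of deriving the schema from that string, assuming every field can be treated as a string (the original code after this point is cut off):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Split the header string and map each name to a StructField.
// StringType for every column is an assumption; the string itself
// carries no type information.
val headerDescs: String = "Name,Age,Location"

val headerSchema: StructType = StructType(
  headerDescs.split(",").map(name => StructField(name, StringType, nullable = true))
)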
I have to create a custom org.apache.spark.sql.types.StructType schema object from the info in a JSON file. The JSON file can be anything, so I have parameterized it within a property file.
This is what the property file looks like:
// path to the schema…
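A minimal sketch of wiring the pieces together, assuming the property file points at a JSON file that holds a schema in Spark's own JSON representation (the property key and file names are hypothetical):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.types.{DataType, StructType}
import scala.io.Source

// Load the property file and read the (hypothetical) schema path.
val props = new Properties()
props.load(new FileInputStream("app.properties"))
val schemaPath = props.getProperty("schema.path")

// DataType.fromJson parses the JSON format produced by StructType.json.
val schemaJson = Source.fromFile(schemaPath).mkString
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]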
I have successfully loaded the spark-csv library in Python standalone mode through
$ --packages com.databricks:spark-csv_2.10:1.4.0
While running the above command, it creates two folders (jars and cache) at this…
I am running a script on Spark 1.5.2 in standalone mode (using 8 cores), and at the end of the script I attempt to serialize a very large DataFrame to disk using the spark-csv package. The code snippet that throws the exception is:
numfileparts =…
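A minimal sketch of the kind of write being described, using the spark-csv data-source syntax that applies to Spark 1.5 (the variable name matches the snippet, but its value and the paths are assumptions):

import org.apache.spark.sql.DataFrame

// Repartition, then save through the spark-csv package.
def writeLargeCsv(df: DataFrame, outputPath: String): Unit = {
  val numfileparts = 32 // hypothetical; tune to the cluster

  df.repartition(numfileparts)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save(outputPath)
}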
I have a general question derived from the specific exception I have encountered.
I'm querying data on Dataproc using Spark 1.6. I need to get one day of data (~10,000 files) from two logs and then do some transformations.
However, my data may (or may…
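A minimal sketch of loading one day from the two logs on Spark 1.6, assuming date-partitioned directories (all paths are hypothetical):

import org.apache.spark.sql.SQLContext

def loadDay(sqlContext: SQLContext, day: String) = {
  val log1 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(s"gs://bucket/log1/$day/*")

  val log2 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(s"gs://bucket/log2/$day/*")

  log1.unionAll(log2) // unionAll is the Spark 1.6 spelling of union
}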
UPDATE: Please hold on to this question. I found that this might be a problem in Spark 1.5 itself, since I am not using the official release of Spark. I'll keep updating this question. Thank you!
I noticed a strange bug recently when using Spark-CSV to…
Here is some example data and a schema:
mySchema = StructType([
    StructField('firstname', StringType()),
    StructField('lastname', StringType()),
    StructField('langages', ArrayType(StructType([
        StructField('lang1', StringType()),
        …
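For reference, the same nested shape written out in full (in Scala here, for illustration; only the fields visible above are reproduced, since the rest is cut off):

import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("firstname", StringType),
  StructField("lastname", StringType),
  StructField("langages", ArrayType(StructType(Seq(
    StructField("lang1", StringType)
  ))))
))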
I have an input file with the following structure:
col1, col2, col3
line1filed1,line1filed2.1\
line1filed2.2, line1filed3
line2filed1,line2filed2.1\
line2filed2.2, line2filed3
line3filed1,…
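A minimal sketch of one way to handle this, assuming the trailing backslash marks a record that continues on the next physical line: stitch the lines back together first, then hand the result to the CSV reader (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("continued-lines").getOrCreate()
import spark.implicits._

// wholeTextFiles keeps each file intact; fine for small files only.
val raw = spark.sparkContext.wholeTextFiles("input.txt")

val stitched = raw.flatMap { case (_, content) =>
  // Remove "backslash + (optional CR) + newline" so continued
  // records collapse into one logical line.
  content.replaceAll("\\\\\\r?\\n", "").split("\n")
}

// Spark 2.2+ can parse CSV straight from a Dataset[String].
val df = spark.read
  .option("header", "true")
  .csv(stitched.toDS())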
I'm trying to read a CSV file which has timestamps with nanosecond precision.
Sample content of the file TestTimestamp.csv:
Spark 2.4.0, Scala 2.11.11
/**
* TestTimestamp.csv -
* 101,2019-SEP-23 11.42.35.456789123 AM
*
*/
Tried to…
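A minimal sketch of one workaround, assuming the layout of the sample above. Spark's TimestampType stores at most microseconds, so the nanosecond digits cannot survive a cast; here the raw string is kept and a truncated timestamp is derived next to it (the SimpleDateFormat-style patterns in Spark 2.4 handle fractions only up to milliseconds):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("nano-ts").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("ts_raw", StringType) // keep full precision as text
))

val df = spark.read.schema(schema).csv("TestTimestamp.csv")

// Truncate the fraction to milliseconds, then parse with a pattern
// that mirrors "2019-SEP-23 11.42.35.456789123 AM".
val parsed = df.withColumn(
  "ts",
  to_timestamp(
    regexp_replace(col("ts_raw"), "(\\.\\d{3})\\d+", "$1"),
    "yyyy-MMM-dd hh.mm.ss.SSS a"
  )
)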
Description
At my workplace we have a large amount of data that needs processing. It concerns a rapidly growing number of instances (currently ~3000), each with a few megabytes' worth of data stored in gzipped CSV files on S3.
I have set up a…
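A minimal sketch of the read side of such a setup (the bucket layout, schema inference, and partition count are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-gzip-csv").getOrCreate()

// Spark decompresses .gz transparently, but gzip is unsplittable,
// so each file becomes exactly one input partition.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://my-bucket/instances/*/data.csv.gz")

// Rebalance after the unsplittable read before heavy processing.
val balanced = df.repartition(200)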
We were using Spark 2.3 before, now we're on 2.4:
Spark version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
We had a piece of code running in production that converted CSV files to Parquet format.
One of the options…
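A minimal sketch of the kind of conversion described (the paths and options are assumptions, since the snippet cuts off before naming the option in question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("in/*.csv")

df.write
  .mode("overwrite")
  .parquet("out/")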
When creating a DataFrame from a CSV file, if the multiLine option is enabled, some columns of the file are parsed incorrectly.
Here is the code execution; I'll point out the strange behaviors as the code goes.
First, I load the file into two variables:…
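A minimal sketch of the two loads being compared (the file path is a placeholder; the original variable names are cut off above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-csv").getOrCreate()

// multiLine = false (the default): every physical line is one record.
val dfDefault = spark.read
  .option("header", "true")
  .csv("file.csv")

// multiLine = true: quoted fields may span several physical lines,
// which changes how records are split.
val dfMultiLine = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("file.csv")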
I'm trying to create an RDD using a CSV dataset.
The problem is that I have a location column with a structure like (11112,222222) that I don't use.
So when I use the map function with split(","), it results in two columns.
Here is my code:
…
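A minimal sketch of one way around the embedded comma, assuming the file is unquoted and the only embedded commas sit inside a (...) pair: split on commas that are outside parentheses (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-rdd").getOrCreate()

val lines = spark.sparkContext.textFile("data.csv")

// ",(?![^(]*\\))" matches a comma NOT followed by "text then )"
// without an intervening "(", i.e. a comma outside any parentheses.
val fields = lines.map(_.split(",(?![^(]*\\))"))

// The location column now survives as one token like (11112,222222)
// and can simply be dropped by index, since it is unused.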
I am trying to save a DataFrame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with…
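A minimal sketch of the usual workaround: Spark always writes a directory of part files, so shrink to a single partition and then pick up the lone part file (paths are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

def saveAsSingleCsv(df: DataFrame, dir: String): Unit = {
  df.coalesce(1) // one partition -> one part-*.csv inside the folder
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(dir)
}

// Afterwards the single part-*.csv inside `dir` can be moved or
// renamed with ordinary file APIs (or Hadoop's FileSystem.rename).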