Questions tagged [apache-spark-1.6]
Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
111 questions
4
votes
1 answer
How to read a CSV file with commas within a field using pyspark?
I have a csv file containing commas within a column value. For example,
Column1,Column2,Column3
123,"45,6",789
The values are wrapped in double quotes when they have extra commas in the data. In the above example, the values are Column1=123,…
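In Spark 1.6, CSV is usually read via the Databricks spark-csv package, whose parser keeps commas inside double-quoted fields. A minimal Scala sketch under that assumption (the same options apply from PySpark's sqlContext.read):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")  // the default, shown for clarity
  .load("data.csv")       // the sample row parses as 123 | 45,6 | 789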

Bob
4
votes
2 answers
NullPointerException while reading a column from the row
The following Scala (Spark 1.6) code for reading a value from a Row fails with a NullPointerException when the value is null.
val test = row.getAs[Int]("ColumnName").toString
while this works fine
val test1 = row.getAs[Int]("ColumnName") // returns…
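Due to type erasure, getAs[Int] happily hands back a null for a null cell; the NPE fires only when .toString is invoked on it. A defensive sketch using the Row API:

val idx = row.fieldIndex("ColumnName")
val test = if (row.isNullAt(idx)) null else row.getInt(idx).toString  // guard before dereferencing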

Anurag Sharma
4
votes
1 answer
How to join on binary field?
In Scala/Spark, I am trying to do the following:
val portCalls_Ports =
portCalls.join(ports, portCalls("port_id") === ports("id"), "inner")
However I am getting the following error:
Exception in thread "main"…
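If the error is Spark refusing to compare binary columns directly, one hedged workaround is to join on a string encoding of the key instead, e.g. base64:

import org.apache.spark.sql.functions.base64

// Compare the binary keys through an ordinary, joinable string encoding.
val portCalls_Ports =
  portCalls.join(ports, base64(portCalls("port_id")) === base64(ports("id")), "inner")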

Paul Reiners
4
votes
3 answers
Why does single test fail with "Error XSDB6: Another instance of Derby may have already booted the database"?
I use Spark 1.6.
We have an HDFS write method that wrote to HDFS using SQLContext. We now need to switch over to HiveContext. With that change, the existing unit tests no longer run and fail with the error
Error XSDB6: Another instance of Derby may have…
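Embedded Derby allows a single live connection per metastore directory, so a second HiveContext in the same test JVM (or a leftover metastore_db directory) triggers XSDB6. A sketch of one common fix, with illustrative paths: give each run its own throwaway metastore, or share one HiveContext across all tests.

import java.nio.file.Files
import org.apache.spark.sql.hive.HiveContext

// Point this context at a private Derby metastore so test runs don't collide.
val metastoreDir = Files.createTempDirectory("metastore").toString
val hiveContext = new HiveContext(sc)
hiveContext.setConf("javax.jdo.option.ConnectionURL",
  s"jdbc:derby:;databaseName=$metastoreDir/db;create=true")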

Satyam
4
votes
3 answers
Spark CSV package not able to handle \n within fields
I have a CSV file that I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields contain \n within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test…

Umesh K
4
votes
3 answers
How to change hdfs block size in pyspark?
I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
Does this have to be set before starting…
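One likely cause is that this Hadoop property expects a size in bytes rather than "128m", and Parquet additionally keeps its own row-group size. A Scala sketch (the same calls work on sc._jsc.hadoopConfiguration() in PySpark):

// Set both sizes, in bytes, before the write; "dfs.blocksize" is the
// non-deprecated spelling of "dfs.block.size".
val bytes128m = 128L * 1024 * 1024
sc.hadoopConfiguration.setLong("dfs.blocksize", bytes128m)
sc.hadoopConfiguration.setLong("parquet.block.size", bytes128m)
df.write.parquet("/path/to/output")  // df: the DataFrame being written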

Sean Nguyen
4
votes
1 answer
Apache Spark: setting executor instances
I run my Spark application on YARN with parameters:
in spark-defaults.conf:
spark.master yarn-client
spark.driver.cores 1
spark.driver.memory 1g
spark.executor.instances 6
spark.executor.memory 1g
in…
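For reference, the same settings expressed through SparkConf; note that a fixed executor count only holds when dynamic allocation is off:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn-client")
  .set("spark.driver.cores", "1")
  .set("spark.driver.memory", "1g")
  .set("spark.executor.instances", "6")
  .set("spark.executor.memory", "1g")
  .set("spark.dynamicAllocation.enabled", "false")  // otherwise YARN may resize the set of executors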

Anna
3
votes
1 answer
Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?
This may sound naive, but it is a problem I recently faced in my project, and I need a better understanding of it.
df.persist(StorageLevel.MEMORY_AND_DISK)
Whenever we use such a persist on an HBase read, the same data is…
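For RDDs, cache() is exactly persist(MEMORY_ONLY): partitions evicted from memory are recomputed from the source, which for HBase means a fresh scan that can observe different data, while MEMORY_AND_DISK spills instead of rescanning. A minimal sketch (hbaseRdd is an illustrative RDD backed by an HBase scan):

import org.apache.spark.storage.StorageLevel

val cached    = hbaseRdd.cache()                               // MEMORY_ONLY: eviction => re-scan HBase
val persisted = hbaseRdd.persist(StorageLevel.MEMORY_AND_DISK) // eviction => read back from disk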

Dasarathy D R
3
votes
2 answers
How to replace nulls in Vector column?
I have a column of type [vector] containing null values that I can't get rid of. Here's an example:
import org.apache.spark.mllib.linalg.Vectors
val sv1: Vector = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0))
val df_1 =…
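Since na.fill does not cover vector columns, one hedged approach is to coalesce the column with a default SparseVector produced by a zero-argument UDF (the column name "features" is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{coalesce, col, udf}

val emptyVec = Vectors.sparse(58, Array.empty[Int], Array.empty[Double])
val fillVec = udf(() => emptyVec)  // constant default vector for null cells
val df_2 = df_1.withColumn("features", coalesce(col("features"), fillVec()))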

Alexvonrass
3
votes
1 answer
How to load spark.mllib model without SparkContext to predict?
With Spark 1.6.0 MLlib, I built a model (e.g., a RandomForest) and saved it to HDFS; I would then like to load that RandomForest model from HDFS to predict without a SparkContext. Currently, we load the model like this:
val loadModel =…
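In spark.mllib, load() itself always needs a SparkContext, but once the model is in memory, predict(Vector) is pure local computation; a hedged sketch with an illustrative path:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

val loadModel = RandomForestModel.load(sc, "hdfs:///models/rf")  // the one step that needs sc
val score = loadModel.predict(Vectors.dense(0.1, 0.2, 0.3))      // no SparkContext involved here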

shaojie
3
votes
1 answer
scala dataframe filter array of strings
Spark 1.6.2 and Scala 2.10 here.
I want to filter a Spark DataFrame column with an array of strings.
val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2")
+---+-------+
| c1| …
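Column.isin takes varargs, so the array just needs to be splatted; a minimal sketch against the df1 above:

// Keep only the rows whose c2 value appears in the array.
val wanted = Array("L-00417", "L-00645")
val filtered = df1.filter(df1("c2").isin(wanted: _*))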

Ramesh
3
votes
1 answer
Where can I find the jars folder in Spark 1.6?
From the Spark downloads page, if I download the tar file for v2.0.1, I see that it contains some jars that I find useful to include in my app.
If I download the tar file for v1.6.2 instead, I don't find the jars folder in there. Is there an…

sudheeshix
3
votes
2 answers
Combining Spark schema without duplicates?
To process my data, I extract the schema beforehand, so that when I read the dataset I can provide the schema instead of going through the expensive step of inferring it.
In order to construct the schema, I need to merge in several…
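One hedged way to merge several StructTypes is to concatenate their fields and keep the first occurrence of each name (mergeSchemas is an illustrative helper, and first-wins is a design choice):

import org.apache.spark.sql.types.StructType

def mergeSchemas(schemas: Seq[StructType]): StructType = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[String]
  StructType(schemas.flatMap(_.fields).filter(f => seen.add(f.name)))  // first definition of a name wins
}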

THIS USER NEEDS HELP
3
votes
1 answer
Spark Streaming application fails with KafkaException: String exceeds the maximum size or with IllegalArgumentException
TL;DR:
My very simple Spark Streaming application fails in the driver with "KafkaException: String exceeds the maximum size". I see the same exception in the executor, but I also found, further down in the executor's logs, an…

Gideon
3
votes
2 answers
How to control number of partition while reading data from Cassandra?
I use:
cassandra 2.1.12 - 3 nodes
spark 1.6 - 3 nodes
spark cassandra connector 1.6
I use tokens in Cassandra (not vnodes).
I am writing a simple job that reads data from a Cassandra table and displays its count; the table has around 70…
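With connector 1.6, the number of Spark partitions per scan is driven by the input split size; a hedged sketch (host, keyspace, and table names are illustrative):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.split.size_in_mb", "32")  // connector default is 64; smaller => more partitions
val sc = new SparkContext(conf)
println(sc.cassandraTable("my_ks", "my_table").partitions.length)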

deenbandhu