Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark use the tag [apache-spark].

111 questions
4
votes
1 answer

How to read a CSV file with commas within a field using pyspark?

I have a csv file containing commas within a column value. For example, Column1,Column2,Column3 123,"45,6",789 The values are wrapped in double quotes when they have extra commas in the data. In the above example, the values are Column1=123,…
Bob
  • 335
  • 1
  • 4
  • 16
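In Spark 1.6 the spark-csv package handles this via its quoting support; the sketch below uses Python's standard csv module (not Spark) just to illustrate the quoting rule the file relies on.

```python
import csv
import io

# A minimal sketch of the quoting rule itself, using Python's standard
# csv module rather than the spark-csv package: fields wrapped in double
# quotes may contain commas and still parse as a single value.
data = 'Column1,Column2,Column3\n123,"45,6",789\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['123', '45,6', '789']
```

The quoted `"45,6"` survives as one field, which is the behavior a quote-aware CSV reader (including spark-csv's default quote handling) should give.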
4
votes
2 answers

NullPointerException while reading a column from the row

The following Scala (Spark 1.6) code for reading a value from a Row fails with a NullPointerException when the value is null. val test = row.getAs[Int]("ColumnName").toString while this works fine val test1 = row.getAs[Int]("ColumnName") // returns…
Anurag Sharma
  • 2,409
  • 2
  • 16
  • 34
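The NPE comes from calling .toString on the null that getAs returns; the usual guard is to test for null first (e.g. with Row.isNullAt). A plain-Python model of that guard, with a dict standing in for the Row (names are illustrative, not Spark API):

```python
# Plain-Python model of the null guard: check for null before converting,
# rather than calling toString on a value that may be null.
row = {"ColumnName": None}  # stands in for a Row whose int column is NULL
value = row["ColumnName"]
test = str(value) if value is not None else None
print(test)  # None, instead of an exception
```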
4
votes
1 answer

How to join on binary field?

In Scala/Spark, I am trying to do the following: val portCalls_Ports = portCalls.join(ports, portCalls("port_id") === ports("id"), "inner") However I am getting the following error: Exception in thread "main"…
Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
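A workaround often suggested when an engine cannot compare a binary key directly is to join on an encoded form of the bytes instead. A plain-Python model of that idea, with hypothetical data (the names below are not from the question):

```python
# Plain-Python model: key the join on a hex encoding of the binary id
# so the comparison happens on strings rather than raw bytes.
port_calls = [{"port_id": b"\x01\x02", "vessel": "A"}]
ports = {b"\x01\x02".hex(): "Rotterdam"}  # id stored as hex string
joined = [
    (call["vessel"], ports[call["port_id"].hex()])
    for call in port_calls
    if call["port_id"].hex() in ports
]
print(joined)  # [('A', 'Rotterdam')]
```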
4
votes
3 answers

Why does single test fail with "Error XSDB6: Another instance of Derby may have already booted the database"?

I use Spark 1.6. We have an HDFS write method that used SQLContext. We now need to switch to HiveContext, but after the change the existing unit tests no longer run and fail with Error XSDB6: Another instance of Derby may have…
Satyam
  • 645
  • 2
  • 7
  • 20
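Embedded Derby allows only one JVM instance per database directory, so two contexts pointing at the same metastore trigger XSDB6. A sketch of the commonly suggested workaround, giving each test run its own Derby directory (derby.system.home is a standard Derby property):

```python
import tempfile

# Give each test JVM its own Derby system directory so two HiveContexts
# never open the same embedded metastore database.
derby_home = tempfile.mkdtemp(prefix="derby-")
java_opts = "-Dderby.system.home=" + derby_home
print(java_opts)
```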
4
votes
3 answers

Spark CSV package not able to handle \n within fields

I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields contain \n within them, e.g. the following two rows "XYZ", "Test Data", "TestNew\nline", "OtherData" "XYZ", "Test…
Umesh K
  • 13,436
  • 25
  • 87
  • 129
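A quote-aware parser keeps an embedded newline inside a single field; Spark's CSV support only gained multi-line record parsing in later releases. The stdlib sketch below (plain Python, not spark-csv) shows the behavior the question expects:

```python
import csv
import io

# A quoted field may span a line break; a quote-aware parser keeps the
# embedded newline inside one field and still yields a single record.
data = '"XYZ","Test Data","TestNew\nline","OtherData"\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))     # 1 record, not 2
print(rows[0][2])    # 'TestNew\nline' stays a single field
```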
4
votes
3 answers

How to change hdfs block size in pyspark?

I use PySpark to write a Parquet file and I would like to change the HDFS block size of that file. I set the block size like this and it doesn't work: sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m") Does this have to be set before starting…
Sean Nguyen
  • 12,528
  • 22
  • 74
  • 113
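One commonly reported pitfall is the value format: older Hadoop versions accept only a plain byte count here, so "128m" may be ignored, and the property must be set on the Hadoop configuration before the first write. A sketch of computing the byte value (property names are the standard HDFS ones):

```python
# Compute the block size as a plain byte count; older Hadoop versions do
# not parse size suffixes like "128m" for this property.
block_size_bytes = 128 * 1024 * 1024
conf = {
    "dfs.blocksize": str(block_size_bytes),   # newer property name
    "dfs.block.size": str(block_size_bytes),  # older, deprecated name
}
print(conf["dfs.blocksize"])  # 134217728
```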
4
votes
1 answer

Apache Spark: setting executor instances

I run my Spark application on YARN with parameters: in spark-defaults.conf: spark.master yarn-client spark.driver.cores 1 spark.driver.memory 1g spark.executor.instances 6 spark.executor.memory 1g in…
Anna
  • 98
  • 1
  • 7
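The same settings can be passed on the command line; a sketch of the roughly equivalent spark-submit invocation (note that --num-executors is honored only when dynamic allocation is disabled):

```shell
spark-submit \
  --master yarn-client \
  --driver-memory 1g \
  --num-executors 6 \
  --executor-memory 1g \
  app.jar
```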
3
votes
1 answer

Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

This may sound naive, but it is a problem I recently faced in my project and need a better understanding of. df.persist(StorageLevel.MEMORY_AND_DISK) Whenever we use such persist on an HBase read - the same data is…
3
votes
2 answers

How to replace nulls in Vector column?

I have a column of type [vector] and I have null values in it that I can't get rid of, here's an example import org.apache.spark.mllib.linalg.Vectors val sv1: Vector = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0)) val df_1 =…
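A fix often suggested is to coalesce each null entry to an explicit default vector of the same size (Spark-side, typically a UDF or when/otherwise producing Vectors.sparse(58, Array(), Array())). A plain-Python model of that replacement, with tuples standing in for vectors:

```python
# Plain-Python model: replace each null entry with an explicit "empty"
# default of the same size, mimicking coalesce over a vector column.
size = 58
default = (size, (), ())                      # stands in for an empty sparse vector
column = [(size, (8, 45), (1.0, 1.0)), None]  # second entry is NULL
filled = [v if v is not None else default for v in column]
print(filled[1])  # (58, (), ())
```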
3
votes
1 answer

How to load spark.mllib model without SparkContext to predict?

With Spark 1.6.0 MLlib, I build a model (like RandomForest) and save it to HDFS; I would then like to load the RandomForest model from HDFS to predict without a SparkContext. Currently, to load the model we use: val loadModel =…
shaojie
  • 121
  • 1
  • 11
3
votes
1 answer

scala dataframe filter array of strings

Spark 1.6.2 and Scala 2.10 here. I want to filter a Spark DataFrame column with an array of strings. val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2") +---+-------+ | c1| …
Ramesh
  • 1,563
  • 9
  • 25
  • 39
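In Spark 1.6 the analogous DataFrame call is Column.isin; the plain-Python sketch below models the same filter on the question's sample data:

```python
# Plain-Python model of filtering a column against a list of wanted values;
# in Spark this corresponds to df.filter(col("c2").isin(wanted: _*)).
rows = [(1, "L-00417"), (3, "L-00645"), (4, "L-99999"), (5, "L-00623")]
wanted = {"L-00417", "L-00645", "L-00623"}
kept = [r for r in rows if r[1] in wanted]
print(kept)  # the L-99999 row is dropped
```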
3
votes
1 answer

Where can I find the jars folder in Spark 1.6?

From the Spark downloads page, if I download the tar file for v2.0.1, I see that it contains some jars that I find useful to include in my app. If I download the tar file for v1.6.2 instead, I don't find the jars folder in there. Is there an…
sudheeshix
  • 1,541
  • 2
  • 17
  • 28
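Spark 1.x distributions bundle everything into a single assembly jar under lib/ rather than a jars/ directory (the split into individual jars arrived with 2.0). A sketch of the 1.6.2 layout (file names vary by Hadoop build):

```shell
# Spark 1.6.2 layout (hadoop2.6 build shown; exact names vary by build):
ls spark-1.6.2-bin-hadoop2.6/lib/
# spark-assembly-1.6.2-hadoop2.6.0.jar  datanucleus-*.jar  ...
```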
3
votes
2 answers

Combining Spark schema without duplicates?

To process the data I have, I am extracting the schema before, so that when I read the dataset, I provide the schema instead of going through the expensive step of inferring schema. In order to construct the schema, I need to merge in several…
THIS USER NEEDS HELP
  • 3,136
  • 4
  • 30
  • 55
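A common approach is to fold the schemas together while keeping each field name once. The plain-Python sketch below models that deduplicating merge with (name, type) pairs; Spark-side the same fold works over StructType field lists:

```python
# Plain-Python model of merging schemas without duplicate fields: fold the
# (name, type) pairs into a dict, keeping the first occurrence of a name.
schema_a = [("id", "int"), ("name", "string")]
schema_b = [("name", "string"), ("age", "int")]
merged = dict(schema_a)
for field, dtype in schema_b:
    merged.setdefault(field, dtype)
print(list(merged))  # ['id', 'name', 'age'] -- each field kept once
```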
3
votes
1 answer

Spark Streaming application fails with KafkaException: String exceeds the maximum size or with IllegalArgumentException

TL;DR: My very simple Spark Streaming application fails in the driver with the "KafkaException: String exceeds the maximum size". I see the same exception in the executor but I also found somewhere down the executor's logs an…
3
votes
2 answers

How to control number of partition while reading data from Cassandra?

I use: Cassandra 2.1.12 (3 nodes), Spark 1.6 (3 nodes), Spark Cassandra Connector 1.6. I use tokens in Cassandra (not vnodes). I am writing a simple job that reads data from a Cassandra table and displays its count; the table has around 70…
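The connector exposes a split size that determines how much Cassandra data lands in one Spark partition; a configuration sketch (property name per Spark Cassandra Connector 1.6):

```shell
# Smaller split size => more Spark partitions when reading from Cassandra
--conf spark.cassandra.input.split.size_in_mb=64
```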