Questions tagged [apache-spark-1.6]
111 questions
Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
1 vote · 2 answers
Specify default value for rowsBetween and rangeBetween in Spark
I have a question concerning a window operation on a Spark 1.6 DataFrame.
Let's say I have the following table:
id | MONTH  | number
 1 | 201703 | 2
 1 | 201704 | 3
 1 | 201705 | 7
 1 | 201706 | 6
At the moment I'm using the rowsBetween function:
val window =…
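The snippet is cut off at the window definition; as a hedged sketch (in PySpark rather than the asker's Scala, with column names taken from the sample table), the usual workaround is to wrap the windowed expression in coalesce, since rowsBetween itself takes no default-value parameter:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame: the two rows before the current one; empty on a partition's first row.
w = Window.partitionBy("id").orderBy("MONTH").rowsBetween(-2, -1)

# coalesce supplies a fallback (here 0, a hypothetical choice) whenever the
# frame contributes no rows and the aggregate comes back null.
df2 = df.withColumn("prev_sum", F.coalesce(F.sum("number").over(w), F.lit(0)))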

RudyVerboven
1 vote · 1 answer
Oracle to Spark/Hive: how to convert use of the "greatest" function to a Spark 1.6 DataFrame
A table in Oracle has 37 columns, named year, month, d1, d2, ..., d34. The data in d1..d34 are all integers, and there is one more column called maxd which is blank.
For each row, I have to find the greatest value out of d1, d2, ..., d34 and put that in…
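A hedged sketch of the usual translation, assuming the table is already loaded as a DataFrame df (the post is truncated): Spark 1.5+ ships a greatest function that maps directly onto Oracle's.

from pyspark.sql import functions as F

# Column names d1..d34 as described in the question.
d_cols = ["d%d" % i for i in range(1, 35)]

# greatest() evaluates row-wise across columns, like Oracle's GREATEST.
df2 = df.withColumn("maxd", F.greatest(*[F.col(c) for c in d_cols]))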

oracletohive
1 vote · 1 answer
PySpark: Compute row minimum ignoring zeros and null values
I'd like to create a new column (v5) based on an existing subset of columns in the DataFrame.
Sample dataframe:
+---+---+---+---+
| v1| v2| v3| v4|
+---+---+---+---+
| 2| 4|7.0|4.0|
| 99| 0|2.0|0.0|
|189| 0|2.4|0.0|
+---+---+---+---+
providing…
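A hedged sketch of one common approach, assuming columns v1..v4 as in the sample: map zeros to null first, since least() already skips nulls when picking the row minimum.

from pyspark.sql import functions as F

cols = ["v1", "v2", "v3", "v4"]

# when() without otherwise() yields null, so zeros drop out of the comparison;
# least() then returns the smallest non-null value per row.
nonzero = [F.when(F.col(c) != 0, F.col(c)) for c in cols]
df2 = df.withColumn("v5", F.least(*nonzero))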

Mia21
1 vote · 1 answer
Programmatically specifying the schema in PySpark
I'm trying to create a DataFrame from an RDD, and I want to specify the schema explicitly. Below is the code snippet that I tried.
from pyspark.sql.types import StructField, StructType, LongType, StringType
stringJsonRdd_new = sc.parallelize(('{"id":…
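The JSON is truncated, so as a hedged sketch of the general pattern with hypothetical rows and field names: build a StructType and pass it to createDataFrame alongside the RDD.

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema and rows standing in for the truncated JSON.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
rdd = sc.parallelize([(1, "alpha"), (2, "beta")])
df = sqlContext.createDataFrame(rdd, schema)  # Spark 1.6 entry point
df.printSchema()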

Sumit
1 vote · 2 answers
Removing NULL, NaN, and empty strings from a PySpark DataFrame
I have a DataFrame in PySpark which contains empty strings, nulls, and NaNs.
I want to remove rows which have any of those. I tried the commands below, but nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the…
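A hedged sketch of why those calls appear to do nothing: na.drop() handles nulls and NaNs but not empty strings, so those have to be nulled out first (assuming the affected columns are strings).

from pyspark.sql import functions as F

# Empty or whitespace-only strings fail the condition and become null;
# na.drop then removes rows with any null or NaN.
cleaned = myDF.select([
    F.when(F.trim(F.col(c)) != "", F.col(c)).alias(c)
    for c in myDF.columns
]).na.drop(how="any")
cleaned.show()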

Sumit
1 vote · 1 answer
udf No TypeTag available for type string
I don't understand a behavior of Spark.
I create a UDF which returns an Integer, like below:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Show {
  def main(args: Array[String]): Unit = {
    val…

a.moussa
1 vote · 1 answer
Spark DataFrame insertInto Hive table fails since some of the staging part files are created with username mapr
I am using a Spark DataFrame to insert into a Hive table. Even though the application is submitted using the username 'myuser', some of the Hive staging part files get created with the username 'mapr', so the final write into the Hive table fails…
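The error is truncated; for reference, a minimal hedged sketch of the write path (table name hypothetical). The symptom described usually points at directory ownership on the Hive staging path rather than at the API call itself.

# If staging part files come out owned by 'mapr', check which user the
# executors actually run as and who owns the warehouse/staging directories.
df.write.insertInto("mydb.target_table")  # appends by default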

Shasankar
1 vote · 1 answer
Error while running PageRank and BFS functions on Graphframes in PySpark
I'm new to Spark, and am learning it on the Cloudera Distribution for Hadoop (CDH). I'm trying to execute the PageRank and BFS functions through a Jupyter Notebook, which was initiated using the following command:
pyspark --packages…
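The --packages coordinate is cut off; as a hedged sketch with a toy graph (vertex and edge data hypothetical), assuming the graphframes package version matches the Spark build:

from graphframes import GraphFrame

# Minimal vertex and edge DataFrames.
v = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = sqlContext.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)

pr = g.pageRank(resetProbability=0.15, maxIter=10)  # PageRank
paths = g.bfs("id = 'a'", "id = 'b'")               # breadth-first search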

Sasi
1 vote · 2 answers
Convert Excel files to CSV in Spark 1.x
Is there a tool to convert Excel files into CSV using Spark 1.x?
I got this issue when executing this tutorial:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x
Exception in thread "main" java.lang.NoClassDefFoundError:…

Ronald Segan
1 vote · 1 answer
Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark
So I have a variable data which is an RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a Dataset from the RDD.
I try the following (sc is my SparkContext):
import…

osk
1 vote · 0 answers
SparkSQL JDBC writer fails with "Cannot acquire locks error"
I'm trying to insert 50 million rows from a Hive table into a SQL Server table using the Spark SQL JDBC writer. Below is the line of code that I'm using to insert the data:
mdf1.coalesce(4).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE",…
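The call above is Scala; as a hedged PySpark rendering, the usual first mitigation for lock-acquisition errors is to cut the number of concurrent writers, and therefore open transactions, before tuning the database side (credentials hypothetical):

# coalesce(1) serializes inserts into a single stream: slower, but only one
# writer at a time competes for locks on the target table.
props = {"user": "...", "password": "..."}
mdf1.coalesce(1).write.mode("append").jdbc(connectionString, "dbo.TEST_TABLE", properties=props)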

sunny
1 vote · 0 answers
HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF
I am trying to use a custom HashMap implementation as a UserDefinedType instead of MapType in Spark. The code works fine in Spark 1.5.2 but gives java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to…

Izhar Ahmed
1 vote · 2 answers
flatMap doesn't preserve order when creating lists from PySpark DataFrame columns
I have a PySpark DataFrame df:
+---------+------------------+
|ceil_temp| test2|
+---------+------------------+
| -1|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6469640, 6531963]|
|…
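A hedged sketch of the usual fix, assuming the goal is the test2 values in original row order: flatMap emits in partition order, which is not guaranteed to match row order, so the ordering has to be carried explicitly.

# Tag each row with its position, sort on it, then flatten the test2 lists.
ordered_vals = (df.rdd
                .zipWithIndex()                   # (Row, original position)
                .sortBy(lambda pair: pair[1])     # restore row order
                .flatMap(lambda pair: pair[0].test2)
                .collect())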

Mia21
1 vote · 0 answers
Similar Java code for the following Scala code
I have trouble using concat_list on a Spark DataFrame in Spark 1.6, so I am planning to go with a UDAF for concatenating multiple row values into one single row with comma-separated values.
Here's the link for the original post; I need the same code in Java SE…

Yashwanth Kambala
1 vote · 1 answer
How to optimize Spark SQL operations on a large DataFrame?
I have a large Hive table (~9 billion records, ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input DataFrame itself takes ~11 minutes to…
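A hedged starting point, assuming repeated queries against the same table (table name hypothetical): cache once so the 45 GB scan is paid a single time, and enable ORC predicate pushdown, which Spark 1.6 leaves off by default.

# Let ORC skip stripes that cannot match query predicates.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

df = sqlContext.table("my_big_table")  # hypothetical table name
df.cache()                             # pay the full scan once
df.count()                             # materializes the cache for later queries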

aladeen