Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark use the tag [apache-spark].

111 questions
1
vote
2 answers

Specify default value for rowsBetween and rangeBetween in Spark

I have a question concerning a window operation on a Spark 1.6 DataFrame. Let's say I have the following table: id|MONTH |number 1 201703 2 1 201704 3 1 201705 7 1 201706 6 At the moment I'm using the rowsBetween function: val window =…
RudyVerboven
  • 1,204
  • 1
  • 14
  • 31
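A minimal sketch of the window spec the question is about, assuming `df` is the poster's DataFrame with the id/MONTH/number columns from the sample; the frame bounds are illustrative. Note that in Spark 1.6, window functions require a HiveContext-backed DataFrame.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Rolling sum over the current row and the three preceding rows, per id.
val window = Window.partitionBy("id").orderBy("MONTH").rowsBetween(-3, 0)
val result = df.withColumn("rolling_sum", sum("number").over(window))
```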
1
vote
1 answer

Oracle to Spark/Hive: how to convert the use of the "greatest" function to a Spark 1.6 dataframe

A table in Oracle has 37 columns, named year, month, d1, d2, ..., d34. The data in d1..d34 are all integers. There is one more column called maxd which is blank. For each row, I have to find the greatest value out of d1, d2, ..., d34 and put that in…
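Assuming the columns really are named d1 through d34, Spark's built-in `greatest` (available since 1.5) maps directly onto the Oracle function; a sketch, with `df` standing in for the loaded table:

```scala
import org.apache.spark.sql.functions.{col, greatest}

// Build the 34 column references programmatically, then take the row-wise maximum.
val dCols = (1 to 34).map(i => col(s"d$i"))
val withMax = df.withColumn("maxd", greatest(dCols: _*))
```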
1
vote
1 answer

PySpark: Compute row minimum ignoring zeros and null values

I'd like to create a new column (v5) based on an existing subset of columns in the dataframe. Sample dataframe: +---+---+---+---+ | v1| v2| v3| v4| +---+---+---+---+ | 2| 4|7.0|4.0| | 99| 0|2.0|0.0| |189| 0|2.4|0.0| +---+---+---+---+ providing…
Mia21
  • 119
  • 2
  • 10
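The question is PySpark, but the idea carries over to any binding: map zeros to null, then use `least`, which skips null values. A sketch in Scala against the v1..v4 columns from the sample (`df` assumed):

```scala
import org.apache.spark.sql.functions.{col, least, lit, when}

// Zeros become nulls, and least() ignores nulls when picking the row minimum.
val candidates = Seq("v1", "v2", "v3", "v4").map { c =>
  when(col(c) === 0, lit(null)).otherwise(col(c))
}
val result = df.withColumn("v5", least(candidates: _*))
```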
1
vote
1 answer

Programmatically specifying the schema in PySpark

I'm trying to create a DataFrame from an RDD and want to specify the schema explicitly. Below is the code snippet I tried: from pyspark.sql.types import StructField, StructType, LongType, StringType stringJsonRdd_new = sc.parallelize(('{"id":…
Sumit
  • 1,360
  • 3
  • 16
  • 29
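The question's snippet is PySpark; the pattern it is reaching for — an explicit StructType passed to createDataFrame — looks like this in Scala. The id/name fields echo the JSON visible in the excerpt, and the sample rows are made up:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Define the schema explicitly, then pair it with an RDD[Row].
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val rowRdd = sc.parallelize(Seq(Row(1L, "alice"), Row(2L, "bob")))
val df = sqlContext.createDataFrame(rowRdd, schema)
```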
1
vote
2 answers

Removing NULL, NaN, and empty space from a PySpark DataFrame

I have a dataframe in PySpark which contains empty space, Null, and NaN values. I want to remove rows which have any of those. I tried the commands below, but nothing seems to work. myDF.na.drop().show() myDF.na.drop(how='any').show() Below is the…
Sumit
  • 1,360
  • 3
  • 16
  • 29
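A plausible reason the commands appear to do nothing is that `na.drop()` only removes nulls and NaNs, not empty or whitespace-only strings, which need their own filter. A hedged sketch in Scala (the PySpark equivalent is analogous), with `myDF` from the question:

```scala
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Drop null/NaN rows first, then filter out blank strings column by column.
val stringCols = myDF.schema.fields.collect {
  case f if f.dataType == StringType => f.name
}
val cleaned = stringCols.foldLeft(myDF.na.drop()) { (df, c) =>
  df.filter(trim(col(c)) !== "")   // !== is the Spark 1.x inequality operator
}
```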
1
vote
1 answer

udf No TypeTag available for type string

I don't understand a behavior of Spark. I create a UDF which returns an Integer, like below: import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} object Show { def main(args: Array[String]): Unit = { val…
a.moussa
  • 2,977
  • 7
  • 34
  • 56
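In my experience this error usually means a lowercase `string` appears in a type annotation somewhere: Scala has no type `string`, so the UDF machinery fails to find a TypeTag for it. A sketch of a correctly typed UDF (`df` and the `name` column are hypothetical):

```scala
import org.apache.spark.sql.functions.udf

// `String` (capitalised) has a TypeTag; lowercase `string` does not exist as a type.
val strLen = udf((s: String) => s.length)
val result = df.withColumn("len", strLen(df("name")))
```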
1
vote
1 answer

Spark DataFrame insertInto Hive table fails since some of the staging part files are created with username mapr

I am using a Spark DataFrame to insert into a Hive table. Even though the application is submitted using the username 'myuser', some of the Hive staging part files get created with username 'mapr'. So the final write into the Hive table fails…
Shasankar
  • 672
  • 6
  • 16
1
vote
1 answer

Error while running PageRank and BFS functions on Graphframes in PySpark

I'm new to Spark, and am learning it on the Cloudera Distribution for Hadoop (CDH). I'm trying to execute the PageRank and BFS functions through a Jupyter Notebook, which was started using the following command: pyspark --packages…
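Errors at this stage often come down to a graphframes `--packages` coordinate built for a different Spark/Scala version than the cluster runs. For reference, a sketch of the two calls in the Scala API, which mirrors the PySpark one; `vertices` is assumed to be a DataFrame with an `id` column, `edges` one with `src` and `dst`, and the BFS expressions are illustrative:

```scala
import org.graphframes.GraphFrame

// Build the graph from two DataFrames, then run PageRank and BFS.
val g = GraphFrame(vertices, edges)
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
val paths = g.bfs.fromExpr("id = 'a'").toExpr("id = 'z'").run()
```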
1
vote
2 answers

Convert an Excel file to CSV in Spark 1.x

Is there a tool to convert Excel files into CSV using Spark 1.x? I got this issue when following this tutorial: https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x Exception in thread "main" java.lang.NoClassDefFoundError:…
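A NoClassDefFoundError at runtime typically means the HadoopOffice jar is missing from the driver/executor classpath rather than anything being wrong with the code. Assuming the dependency is fixed, the read-then-write looks roughly like this; the format and option names are taken from the HadoopOffice wiki and the spark-csv package, and may differ by version:

```scala
// Read the Excel file via the HadoopOffice data source (jar must be on the classpath).
val excel = sqlContext.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")      // cell-formatting locale, per the wiki
  .load("/data/input.xlsx")

// Write CSV with the databricks spark-csv package, the usual route on Spark 1.x.
excel.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/data/output.csv")
```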
1
vote
1 answer

Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark

So I have a variable data which is an RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a Dataset from the RDD. I tried the following, where sc is my SparkContext: import…
osk
  • 790
  • 2
  • 10
  • 31
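One low-friction route in 1.6 is to wrap each array in a case class, which gives the Dataset API a product encoder it can always resolve; a sketch, with `data` being the question's RDD[Array[String]] and `sqlContext` the usual 1.6 context:

```scala
// A product type sidesteps encoder lookup issues for raw Array[String].
case class Record(fields: Seq[String])

import sqlContext.implicits._   // brings the encoders toDS() needs into scope
val ds = data.map(a => Record(a.toSeq)).toDS()
```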
1
vote
0 answers

SparkSQL JDBC writer fails with "Cannot acquire locks error"

I'm trying to insert 50 million rows from a Hive table into a SQL Server table using the SparkSQL JDBC writer. Below is the line of code that I'm using to insert the data: mdf1.coalesce(4).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE",…
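"Cannot acquire locks" from SQL Server often points at lock escalation under several concurrent bulk inserts. One blunt but simple experiment is to serialise the write through a single partition so only one connection holds locks at a time; a sketch, where the credentials are placeholders and the single partition trades throughput for reduced contention:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "...")       // placeholder
props.setProperty("password", "...")   // placeholder

// One partition = one JDBC connection = no competing transactions on the table.
mdf1.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .jdbc(connectionString, "dbo.TEST_TABLE", props)
```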
1
vote
0 answers

HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF

I am trying to use a custom HashMap implementation as a UserDefinedType instead of MapType in Spark. The code works fine in Spark 1.5.2 but gives java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to…
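The UDT registration machinery is internal and changed between 1.5 and 1.6, so an alternative worth considering is a plain MapType buffer, which the public UDAF API supports directly. A sketch of that shape — a word-count-style aggregate, not the poster's original code:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Counts occurrences per key into a MapType buffer instead of a custom UDT.
class CountByKey extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("key", StringType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("counts", MapType(StringType, LongType)) :: Nil)
  def dataType: DataType = MapType(StringType, LongType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val m = buffer.getMap[String, Long](0)
    val k = input.getString(0)
    buffer(0) = m + (k -> (m.getOrElse(k, 0L) + 1L))
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    val m1 = b1.getMap[String, Long](0)
    val m2 = b2.getMap[String, Long](0)
    b1(0) = m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v)) }
  }
  def evaluate(buffer: Row): Any = buffer.getMap[String, Long](0)
}
```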
1
vote
2 answers

flatMap doesn't preserve order when creating lists from PySpark dataframe columns

I have a PySpark dataframe df: +---------+------------------+ |ceil_temp| test2| +---------+------------------+ | -1|[6397024, 6425417]| | 0|[6397024, 6425417]| | 0|[6397024, 6425417]| | 0|[6469640, 6531963]| |…
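Rather than relying on two separate flatMaps returning rows in the same order, it is safer to collect the columns together so each ceil_temp stays paired with its test2 list. The question is PySpark; the same idea in Scala, assuming `df` from the question with test2 holding an array of longs:

```scala
// One pass keeps the pairing; separate collects give no ordering guarantee.
val pairs = df.select("ceil_temp", "test2")
  .map(r => (r.getInt(0), r.getAs[Seq[Long]](1)))
  .collect()
val ceilTemps  = pairs.map(_._1)
val test2Lists = pairs.map(_._2)
```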
1
vote
0 answers

Similar Java code for the following Scala code

I have trouble using concat_list on a Spark DataFrame in Spark 1.6, so I am planning to go with a UDAF for concatenating multiple row values into one single row with comma-separated values. Here's the link! for the original post; I need the same code in Java SE…
Yashwanth Kambala
  • 412
  • 1
  • 5
  • 14
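Before porting a whole UDAF to Java, it may be worth checking whether the built-ins already cover this: in 1.6, `concat_ws` plus `collect_list` (the latter needs a HiveContext) produces a comma-separated string per group, and both are callable from Java as well. A sketch in Scala with hypothetical id/value columns:

```scala
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Collapse all values per id into one comma-separated string.
val merged = df.groupBy("id")
  .agg(concat_ws(",", collect_list(col("value"))).as("values"))
```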
1
vote
1 answer

How to optimize Spark SQL operations on a large data frame?

I have a large Hive table (~9 billion records, ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input data frame itself takes ~11 minutes to…
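There is no single answer without the query plan, but for ORC-backed Hive tables in 1.6 a few knobs usually come first: enable ORC predicate pushdown, select only the columns actually needed, and cache if the profiling runs several actions over the same data. A sketch with hypothetical table and column names:

```scala
import org.apache.spark.sql.functions.col

// ORC is columnar: pruning columns and pushing filters down cuts the bytes read.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val profiled = sqlContext.table("db.big_table")   // hypothetical table name
  .select("col1", "col2")                         // hypothetical columns
  .where(col("col1") > 0)

profiled.cache()
profiled.count()   // first action materialises the cache for later queries
```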