Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark use the tag [apache-spark].

111 questions
1
vote
2 answers

Specify default value for rowsBetween and rangeBetween in Spark

I have a question concerning a window operation on a Spark 1.6 DataFrame. Let's say I have the following table: id|MONTH |number 1 201703 2 1 201704 3 1 201705 7 1 201706 6 At the moment I'm using the rowsBetween function: val window =…
RudyVerboven
  • 1,204
  • 1
  • 14
  • 31
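A minimal sketch of the window spec the question is about, assuming `df` is the poster's DataFrame with the id/MONTH/number columns from the sample; the frame bounds are illustrative. Note that in Spark 1.6, window functions require a HiveContext-backed DataFrame.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Rolling sum over the current row and the three preceding rows, per id.
val window = Window.partitionBy("id").orderBy("MONTH").rowsBetween(-3, 0)
val result = df.withColumn("rolling_sum", sum("number").over(window))
```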
1
vote
1 answer

Oracle to Spark/Hive: how to convert the use of the "greatest" function to a Spark 1.6 dataframe

A table in Oracle has 37 columns, named year, month, d1, d2, ..., d34. The data in d1..d34 are all integers. There is one more column called maxd which is blank. For each row, I have to find the greatest value out of d1, d2, ..., d34 and put that in…
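Assuming the columns really are named d1 through d34, Spark's built-in `greatest` (available since 1.5) maps directly onto the Oracle function; a sketch, with `df` standing in for the loaded table:

```scala
import org.apache.spark.sql.functions.{col, greatest}

// Build the 34 column references programmatically, then take the row-wise maximum.
val dCols = (1 to 34).map(i => col(s"d$i"))
val withMax = df.withColumn("maxd", greatest(dCols: _*))
```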
1
vote
1 answer

PySpark: Compute row minimum ignoring zeros and null values

I'd like to create a new column (v5) based on an existing subset of columns in the dataframe. Sample dataframe: +---+---+---+---+ | v1| v2| v3| v4| +---+---+---+---+ | 2| 4|7.0|4.0| | 99| 0|2.0|0.0| |189| 0|2.4|0.0| +---+---+---+---+ providing…
Mia21
  • 119
  • 2
  • 10
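The question is PySpark, but the idea carries over to any binding: map zeros to null, then use `least`, which skips null values. A sketch in Scala against the v1..v4 columns from the sample (`df` assumed):

```scala
import org.apache.spark.sql.functions.{col, least, lit, when}

// Zeros become nulls, and least() ignores nulls when picking the row minimum.
val candidates = Seq("v1", "v2", "v3", "v4").map { c =>
  when(col(c) === 0, lit(null)).otherwise(col(c))
}
val result = df.withColumn("v5", least(candidates: _*))
```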
1
vote
1 answer

Programmatically specifying the schema in PySpark

I'm trying to create a DataFrame from an RDD and want to specify the schema explicitly. Below is the code snippet I tried: from pyspark.sql.types import StructField, StructType, LongType, StringType stringJsonRdd_new = sc.parallelize(('{"id":…
Sumit
  • 1,360
  • 3
  • 16
  • 29
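The question's snippet is PySpark; the pattern it is reaching for — an explicit StructType passed to createDataFrame — looks like this in Scala. The id/name fields echo the JSON visible in the excerpt, and the sample rows are made up:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Define the schema explicitly, then pair it with an RDD[Row].
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val rowRdd = sc.parallelize(Seq(Row(1L, "alice"), Row(2L, "bob")))
val df = sqlContext.createDataFrame(rowRdd, schema)
```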
1
vote
2 answers

Removing NULL, NaN, and empty space from a PySpark DataFrame

I have a dataframe in PySpark which contains empty space, Null, and NaN values. I want to remove rows which have any of those. I tried the commands below, but nothing seems to work. myDF.na.drop().show() myDF.na.drop(how='any').show() Below is the…
Sumit
  • 1,360
  • 3
  • 16
  • 29
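A plausible reason the commands appear to do nothing is that `na.drop()` only removes nulls and NaNs, not empty or whitespace-only strings, which need their own filter. A hedged sketch in Scala (the PySpark equivalent is analogous), with `myDF` from the question:

```scala
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Drop null/NaN rows first, then filter out blank strings column by column.
val stringCols = myDF.schema.fields.collect {
  case f if f.dataType == StringType => f.name
}
val cleaned = stringCols.foldLeft(myDF.na.drop()) { (df, c) =>
  df.filter(trim(col(c)) !== "")   // !== is the Spark 1.x inequality operator
}
```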
1
vote
1 answer

udf No TypeTag available for type string

I don't understand a behavior of Spark. I create a UDF which returns an Integer, like below: import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} object Show { def main(args: Array[String]): Unit = { val…
a.moussa
  • 2,977
  • 7
  • 34
  • 56
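In my experience this error usually means a lowercase `string` appears in a type annotation somewhere: Scala has no type `string`, so the UDF machinery fails to find a TypeTag for it. A sketch of a correctly typed UDF (`df` and the `name` column are hypothetical):

```scala
import org.apache.spark.sql.functions.udf

// `String` (capitalised) has a TypeTag; lowercase `string` does not exist as a type.
val strLen = udf((s: String) => s.length)
val result = df.withColumn("len", strLen(df("name")))
```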
1
vote
1 answer

Spark DataFrame insertInto Hive table fails since some of the staging part files are created with username mapr

I am using a Spark DataFrame to insert into a Hive table. Even though the application is submitted using the username 'myuser', some of the Hive staging part files get created with username 'mapr'. So the final write into the Hive table fails…
Shasankar
  • 672
  • 6
  • 16
1
vote
1 answer

Error while running PageRank and BFS functions on Graphframes in PySpark

I'm new to Spark, and am learning it on the Cloudera Distribution for Hadoop (CDH). I'm trying to execute the PageRank and BFS functions through a Jupyter Notebook, which was started using the following command: pyspark --packages…
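Errors at this stage often come down to a graphframes `--packages` coordinate built for a different Spark/Scala version than the cluster runs. For reference, a sketch of the two calls in the Scala API, which mirrors the PySpark one; `vertices` is assumed to be a DataFrame with an `id` column, `edges` one with `src` and `dst`, and the BFS expressions are illustrative:

```scala
import org.graphframes.GraphFrame

// Build the graph from two DataFrames, then run PageRank and BFS.
val g = GraphFrame(vertices, edges)
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
val paths = g.bfs.fromExpr("id = 'a'").toExpr("id = 'z'").run()
```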
1
vote
2 answers

Convert an Excel file to CSV in Spark 1.x

Is there a tool to convert Excel files into CSV using Spark 1.x? I got this issue when following this tutorial: https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x Exception in thread "main" java.lang.NoClassDefFoundError:…
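A NoClassDefFoundError at runtime typically means the HadoopOffice jar is missing from the driver/executor classpath rather than anything being wrong with the code. Assuming the dependency is fixed, the read-then-write looks roughly like this; the format and option names are taken from the HadoopOffice wiki and the spark-csv package, and may differ by version:

```scala
// Read the Excel file via the HadoopOffice data source (jar must be on the classpath).
val excel = sqlContext.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")      // cell-formatting locale, per the wiki
  .load("/data/input.xlsx")

// Write CSV with the databricks spark-csv package, the usual route on Spark 1.x.
excel.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/data/output.csv")
```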
1
vote
1 answer

Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark

So I have a variable data which is an RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a Dataset from the RDD. I tried the following, where sc is my SparkContext: import…
osk
  • 790
  • 2
  • 10
  • 31
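One low-friction route in 1.6 is to wrap each array in a case class, which gives the Dataset API a product encoder it can always resolve; a sketch, with `data` being the question's RDD[Array[String]] and `sqlContext` the usual 1.6 context:

```scala
// A product type sidesteps encoder lookup issues for raw Array[String].
case class Record(fields: Seq[String])

import sqlContext.implicits._   // brings the encoders toDS() needs into scope
val ds = data.map(a => Record(a.toSeq)).toDS()
```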
1
vote
0 answers

SparkSQL JDBC writer fails with "Cannot acquire locks error"

I'm trying to insert 50 million rows from a Hive table into a SQL Server table using the SparkSQL JDBC writer. Below is the line of code that I'm using to insert the data: mdf1.coalesce(4).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE",…
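"Cannot acquire locks" from SQL Server often points at lock escalation under several concurrent bulk inserts. One blunt but simple experiment is to serialise the write through a single partition so only one connection holds locks at a time; a sketch, where the credentials are placeholders and the single partition trades throughput for reduced contention:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "...")       // placeholder
props.setProperty("password", "...")   // placeholder

// One partition = one JDBC connection = no competing transactions on the table.
mdf1.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .jdbc(connectionString, "dbo.TEST_TABLE", props)
```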
1
vote
0 answers

HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF

I am trying to use a custom HashMap implementation as a UserDefinedType instead of MapType in Spark. The code works fine in Spark 1.5.2 but gives java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to…
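The UDT registration machinery is internal and changed between 1.5 and 1.6, so an alternative worth considering is a plain MapType buffer, which the public UDAF API supports directly. A sketch of that shape — a word-count-style aggregate, not the poster's original code:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Counts occurrences per key into a MapType buffer instead of a custom UDT.
class CountByKey extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("key", StringType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("counts", MapType(StringType, LongType)) :: Nil)
  def dataType: DataType = MapType(StringType, LongType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val m = buffer.getMap[String, Long](0)
    val k = input.getString(0)
    buffer(0) = m + (k -> (m.getOrElse(k, 0L) + 1L))
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    val m1 = b1.getMap[String, Long](0)
    val m2 = b2.getMap[String, Long](0)
    b1(0) = m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v)) }
  }
  def evaluate(buffer: Row): Any = buffer.getMap[String, Long](0)
}
```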
1
vote
2 answers

flatMap doesn't preserve order when creating lists from PySpark dataframe columns

I have a PySpark dataframe df: +---------+------------------+ |ceil_temp| test2| +---------+------------------+ | -1|[6397024, 6425417]| | 0|[6397024, 6425417]| | 0|[6397024, 6425417]| | 0|[6469640, 6531963]| |…
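Rather than relying on two separate flatMaps returning rows in the same order, it is safer to collect the columns together so each ceil_temp stays paired with its test2 list. The question is PySpark; the same idea in Scala, assuming `df` from the question with test2 holding an array of longs:

```scala
// One pass keeps the pairing; separate collects give no ordering guarantee.
val pairs = df.select("ceil_temp", "test2")
  .map(r => (r.getInt(0), r.getAs[Seq[Long]](1)))
  .collect()
val ceilTemps  = pairs.map(_._1)
val test2Lists = pairs.map(_._2)
```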
1
vote
0 answers

Similar Java code for the following Scala code

I have trouble using concat_list on a Spark DataFrame in Spark 1.6, so I am planning to go with a UDAF for concatenating multiple row values into one single row with comma-separated values. Here's the link! for the original post; I need the same code in Java SE…
Yashwanth Kambala
  • 412
  • 1
  • 5
  • 14
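Before porting a whole UDAF to Java, it may be worth checking whether the built-ins already cover this: in 1.6, `concat_ws` plus `collect_list` (the latter needs a HiveContext) produces a comma-separated string per group, and both are callable from Java as well. A sketch in Scala with hypothetical id/value columns:

```scala
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Collapse all values per id into one comma-separated string.
val merged = df.groupBy("id")
  .agg(concat_ws(",", collect_list(col("value"))).as("values"))
```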
1
vote
1 answer

How to optimize Spark SQL operations on a large data frame?

I have a large Hive table (~9 billion records, ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input data frame itself takes ~11 minutes to…
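There is no single answer without the query plan, but for ORC-backed Hive tables in 1.6 a few knobs usually come first: enable ORC predicate pushdown, select only the columns actually needed, and cache if the profiling runs several actions over the same data. A sketch with hypothetical table and column names:

```scala
import org.apache.spark.sql.functions.col

// ORC is columnar: pruning columns and pushing filters down cuts the bytes read.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val profiled = sqlContext.table("db.big_table")   // hypothetical table name
  .select("col1", "col2")                         // hypothetical columns
  .where(col("col1") > 0)

profiled.cache()
profiled.count()   // first action materialises the cache for later queries
```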