Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark use the tag [apache-spark].

464 questions
10
votes
1 answer

Avoid starting HiveThriftServer2 with created context programmatically

We are trying to use ThriftServer to query data from Spark temp tables, in Spark 2.0.0. First, we have created a SparkSession with Hive support enabled. Currently, we start ThriftServer with sqlContext like…
VladoDemcak
  • 4,893
  • 4
  • 35
  • 42
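For reference, Spark's hive-thriftserver module exposes HiveThriftServer2.startWithContext, which lets the Thrift server share an existing session's SQLContext. A minimal Scala sketch, assuming Hive support is enabled; the input path and view name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerFromSession {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession with Hive support so registered temp views are
    // visible to JDBC clients of the Thrift server.
    val spark = SparkSession.builder()
      .appName("thrift-from-session")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical input: register an in-memory temp view to query over JDBC.
    spark.read.json("/path/to/events.json").createOrReplaceTempView("events")

    // Start the Thrift server against this session's SQLContext instead of
    // launching a separate server process.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
```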
9
votes
1 answer

Efficiently running a "for" loop in Apache Spark so that execution is parallel

How can we parallelize a loop in Spark so that the processing is not sequential but parallel? To take an example - I have a csv file (called 'bill_item.csv') that contains the following data: …
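The usual answer here is to express the per-record work as DataFrame (or RDD) transformations rather than a driver-side loop, so Spark distributes it across executors. A minimal Scala sketch, assuming a header row and a hypothetical bill_id column in bill_item.csv:

```scala
import org.apache.spark.sql.SparkSession

object ParallelInsteadOfLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-loop").getOrCreate()

    // Read the file once; header/inferSchema and the column name are assumptions.
    val billItems = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("bill_item.csv")

    // Expressing the per-record work as a transformation lets Spark distribute
    // it across executors instead of iterating row by row on the driver.
    val perBill = billItems.groupBy("bill_id").count()
    perBill.show()
  }
}
```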
9
votes
1 answer

Spark job keeps showing TaskCommitDenied (Driver denied task commit)

Environment: We are using EMR, with Spark 2.1 and EMRFS. Process we are doing: We are running a PySpark job to join 2 Hive tables and create another Hive table based on this result using saveAsTable, storing it as ORC with…
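For context, the pipeline described above boils down to a join followed by saveAsTable in ORC format. A Scala sketch of that shape (the question itself uses PySpark; database, table and key names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object JoinAndSaveAsOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-and-save")
      .enableHiveSupport()
      .getOrCreate()

    // Join the two Hive tables on a shared key (names are placeholders).
    val joined = spark.table("db.table_a")
      .join(spark.table("db.table_b"), Seq("id"))

    // Persist the result as an ORC-backed Hive table.
    joined.write
      .format("orc")
      .mode("overwrite")
      .saveAsTable("db.joined_result")
  }
}
```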
9
votes
1 answer

SparkSession initialization error - Unable to use spark.read

I tried to create a standalone PySpark program that reads a csv and stores it in a Hive table. I have trouble configuring the SparkSession, conf and context objects. Here is my code: from pyspark import SparkConf, SparkContext from pyspark.sql…
Michail N
  • 3,647
  • 2
  • 32
  • 51
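In Spark 2.x the SparkSession builder replaces the separate SparkConf/SparkContext/SQLContext setup, and the same builder pattern exists in PySpark. A Scala sketch of the csv-to-Hive flow, with the path and table name as placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CsvToHiveTable {
  def main(args: Array[String]): Unit = {
    // One builder call replaces the old SparkConf / SparkContext / SQLContext trio.
    val spark = SparkSession.builder()
      .appName("csv-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical input path; header option assumed.
    val df = spark.read.option("header", "true").csv("/path/to/input.csv")

    // Store the DataFrame as a Hive table (placeholder name).
    df.write.mode("overwrite").saveAsTable("default.input_table")
  }
}
```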
9
votes
3 answers

Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala: I need the timestamp difference in milliseconds between two DataFrame column values. Value_1 = 06/13/2017 16:44:20.044 Value_2 = 06/13/2017 16:44:21.067 Data-types for…
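Since unix_timestamp() only has second precision, one common workaround is to parse the strings with a small UDF that keeps milliseconds. A Scala sketch using the sample values from the question:

```scala
import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TimestampDiffMillis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ts-diff-ms").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("06/13/2017 16:44:20.044", "06/13/2017 16:44:21.067")
    ).toDF("value_1", "value_2")

    // unix_timestamp() drops milliseconds, so parse with SimpleDateFormat instead.
    // A new formatter per call avoids sharing the non-thread-safe instance.
    val millis = udf { s: String =>
      new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSS").parse(s).getTime
    }

    df.withColumn("diff_ms", millis($"value_2") - millis($"value_1")).show(false)
  }
}
```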
9
votes
2 answers

Split dataset based on column values in spark

I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it can execute faster and reduce the usage of Java code. List lsts=…
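Two patterns are commonly suggested for this, sketched in Scala with placeholder paths: collect the distinct values and filter once per value, or, when the real goal is separate output files, do a single partitioned write.

```scala
import org.apache.spark.sql.SparkSession

object SplitByManufacturer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-by-column").getOrCreate()

    // Placeholder input; only the Manufacturer column name comes from the question.
    val df = spark.read.option("header", "true").csv("/path/to/products.csv")

    // Option 1: one filtered Dataset per distinct value (scans the data per value,
    // so cache df first if there are many values).
    val manufacturers = df.select("Manufacturer").distinct().collect().map(_.getString(0))
    val byManufacturer = manufacturers.map(m => m -> df.filter(df("Manufacturer") === m)).toMap

    // Option 2: if separate output files are the real goal, a single partitioned
    // write is usually much faster than filtering in a loop.
    df.write.partitionBy("Manufacturer").parquet("/path/to/output")
  }
}
```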
9
votes
3 answers

Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body { "code": "spark.sql(\"select * from test_table limit 10\")" } I would like an answer in the following…
matheusr
  • 567
  • 9
  • 29
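One common workaround (a sketch, not a built-in Livy feature) is to serialize the rows inside the statement itself with Dataset.toJSON, so the statement output already contains JSON strings; `spark` below is the session object Livy predefines:

```scala
// Body of the "code" field posted to /sessions/0/statements.
// toJSON turns each Row into a JSON string before it is printed as output.
spark.sql("select * from test_table limit 10").toJSON.collect().foreach(println)
```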
9
votes
4 answers

Spark 2.0.0 Error: PartitioningCollection requires all of its partitionings have the same numPartitions

I'm joining some DataFrames together in Spark and I keep getting the following error: PartitioningCollection requires all of its partitionings have the same numPartitions. It seems to happen after I join two DataFrames together that each seem…
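A workaround often suggested for this error (a sketch over placeholder inputs, not a guaranteed fix) is to repartition and materialize the intermediate DataFrame before the next join, so the planner no longer combines partitionings with different numPartitions:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionBeforeJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-before-join").getOrCreate()

    // Placeholder inputs and join key.
    val a = spark.read.parquet("/path/a")
    val b = spark.read.parquet("/path/b")
    val c = spark.read.parquet("/path/c")

    // Repartition and cache the intermediate result so the second join starts
    // from a single, consistent partitioning.
    val ab = a.join(b, Seq("id")).repartition(200).cache()
    ab.count() // force materialization

    ab.join(c, Seq("id")).write.parquet("/path/out")
  }
}
```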
8
votes
3 answers

How to pivot streaming dataset?

I am trying to pivot a Spark streaming dataset (structured streaming) but I get an AnalysisException (excerpt below). Could someone confirm that pivoting is indeed not supported in structured streams (Spark 2.0), and perhaps suggest alternative…
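When the pivot values are known in advance, an equivalent groupBy with conditional aggregation does work on a stream. A Scala sketch with a placeholder JSON source, schema, and made-up device names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object ManualPivotOnStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-pivot").getOrCreate()

    // Placeholder streaming source and schema.
    val schema = StructType(Seq(
      StructField("ts", TimestampType),
      StructField("device", StringType),
      StructField("value", DoubleType)))
    val events = spark.readStream.schema(schema).json("/path/to/stream/dir")

    // One aggregate column per expected pivot value, instead of pivot("device").
    val pivoted = events.groupBy(window(col("ts"), "1 minute"))
      .agg(
        sum(when(col("device") === "dev-0", col("value"))).as("dev_0"),
        sum(when(col("device") === "dev-1", col("value"))).as("dev_1"),
        sum(when(col("device") === "dev-2", col("value"))).as("dev_2"))

    pivoted.writeStream.outputMode("complete").format("console").start().awaitTermination()
  }
}
```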
8
votes
3 answers

Out of Memory Error when Reading large file in Spark 2.1.0

I want to use Spark to read a large (51GB) XML file (on an external HDD) into a dataframe (using the spark-xml plugin), do simple mapping/filtering, reorder it, and then write it back to disk as a CSV file. But I always get a…
Felipe
  • 11,557
  • 7
  • 56
  • 103
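A sketch of the read/transform/write pipeline described above, assuming the spark-xml package is on the classpath; the rowTag, column names and paths are placeholders. For a file this size, the usual levers are raising spark.driver.memory / spark.executor.memory and avoiding any collect() back to the driver:

```scala
import org.apache.spark.sql.SparkSession

object LargeXmlToCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xml-to-csv").getOrCreate()

    // spark-xml splits the file into DataFrame rows by the element named in rowTag.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("/mnt/external/huge.xml")

    // Simple filter, reorder, then write back out as CSV (all names are placeholders).
    df.filter(df("status") === "active")
      .sort("id")
      .write
      .option("header", "true")
      .csv("/mnt/external/out_csv")
  }
}
```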
8
votes
2 answers

How to traverse/iterate a Dataset in Spark Java?

I am trying to traverse a Dataset to do some string similarity calculations like Jaro-Winkler or cosine similarity. I convert my Dataset to a list of rows and then traverse it with a for statement, which is not an efficient Spark way to do it. So I am looking…
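A sketch of the usual alternative, in Scala for consistency with the other examples (the question targets the Java API, where the same idea uses MapFunction plus an Encoder): express the per-row similarity as a map over the Dataset so it runs on the executors rather than in a driver-side loop. The similarity function here is a toy stand-in, not Jaro-Winkler:

```scala
import org.apache.spark.sql.SparkSession

object SimilarityOverDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("similarity").getOrCreate()
    import spark.implicits._

    // Toy input pairs; in practice these would come from the real Dataset.
    val pairs = Seq(("martha", "marhta"), ("jones", "johnson")).toDS()

    // Placeholder similarity (shared-character ratio) standing in for
    // Jaro-Winkler or cosine similarity.
    val similarity = (a: String, b: String) =>
      a.toSet.intersect(b.toSet).size.toDouble / a.toSet.union(b.toSet).size

    // map() keeps the work distributed instead of collecting rows to a List first.
    pairs.map { case (a, b) => (a, b, similarity(a, b)) }.show()
  }
}
```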
8
votes
2 answers

How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in Spark (Scala)

I'm using Spark 2.0. I have a column of my dataframe containing a WrappedArray of WrappedArrays of Float. An example of a row would be: [[1.0 2.0 2.0][6.0 5.0 2.0][4.0 2.0 3.0]] I'm trying to transform this column into an Array[Array[Float]]. What I…
bobo32
  • 992
  • 2
  • 9
  • 21
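The pattern usually suggested for this is to read the nested arrays back as Seq[Seq[Float]] with getAs and convert each level explicitly. A Scala sketch reproducing the example row (the column name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object WrappedArrayToArrays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapped-array").getOrCreate()
    import spark.implicits._

    // One row holding the nested-array example from the question.
    val df = Seq(
      Seq(Seq(1.0f, 2.0f, 2.0f), Seq(6.0f, 5.0f, 2.0f), Seq(4.0f, 2.0f, 3.0f))
    ).toDF("features")

    // Spark returns array<array<float>> columns as WrappedArray[WrappedArray[Float]];
    // reading them as Seq[Seq[Float]] and calling toArray on each level converts them.
    val arrays: Array[Array[Array[Float]]] = df
      .select("features")
      .collect()
      .map(_.getAs[Seq[Seq[Float]]](0).map(_.toArray).toArray)

    arrays.head.foreach(inner => println(inner.mkString("[", ", ", "]")))
  }
}
```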
8
votes
1 answer

How to do non-random Dataset splitting on Apache Spark?

I know I can do random splitting with the randomSplit method: val splittedData: Array[Dataset[Row]] = preparedData.randomSplit(Array(0.5, 0.3, 0.2)) Can I split the data into consecutive parts with some 'nonRandomSplit method'? Apache Spark…
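There is no built-in nonRandomSplit; one sketch (assuming an ordering column to define "consecutive", here a placeholder id) is to number the rows with a window function and cut the index range proportionally. Note the unpartitioned window pulls all rows through a single partition, which is fine for modest data but not for very large Datasets:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object ConsecutiveSplit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consecutive-split").getOrCreate()
    import spark.implicits._

    // Placeholder data with an ordering column.
    val preparedData = spark.range(0, 100).toDF("id")

    val total = preparedData.count()
    val indexed = preparedData.withColumn("row_idx", row_number().over(Window.orderBy("id")))

    // Consecutive 50% / 30% / 20% slices of the ordered rows.
    val first  = indexed.filter($"row_idx" <= total * 0.5)
    val second = indexed.filter($"row_idx" > total * 0.5 && $"row_idx" <= total * 0.8)
    val third  = indexed.filter($"row_idx" > total * 0.8)

    Seq(first, second, third).foreach(ds => println(ds.count()))
  }
}
```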
8
votes
0 answers

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode; I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing. The message is one I've not seen before and implies that the workers are not…
null
  • 3,469
  • 7
  • 41
  • 90
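For context, EMR enables dynamic allocation by default, so idle executors are released after spark.dynamicAllocation.executorIdleTimeout; whether that is the actual failure here depends on the full logs. A sketch of pinning a fixed number of executors instead (the same settings can be passed as spark-submit --conf options; the instance count is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object FixedExecutors {
  def main(args: Array[String]): Unit = {
    // Disable dynamic allocation and request a fixed executor count on YARN/EMR.
    val spark = SparkSession.builder()
      .appName("fixed-executors")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("spark.executor.instances", "10")
      .getOrCreate()

    // ... application logic ...

    spark.stop()
  }
}
```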
8
votes
0 answers

Spark 2.0: Moving from RDD to Dataset

I want to adapt my Java Spark app (which actually uses RDDs for some calculations) to use Datasets instead of RDDs. I'm new to Datasets and not sure how to map each transformation to a corresponding Dataset operation. At the moment I map them like…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
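A Scala sketch of typical RDD-to-Dataset translations (the question uses the Java API, where the same calls take an explicit Encoder): a case class gives a typed Dataset, and map/filter/reduce carry over almost unchanged. The Reading class and its values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object RddToDataset {
  final case class Reading(sensor: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-dataset").getOrCreate()
    import spark.implicits._

    // Before: RDD-style transformations.
    val rdd = spark.sparkContext.parallelize(Seq(Reading("a", 1.0), Reading("b", 4.0)))
    val rddResult = rdd.filter(_.value > 2.0).map(_.value).reduce(_ + _)

    // After: the same logic on a typed Dataset.
    val ds = rdd.toDS()
    val dsResult = ds.filter(_.value > 2.0).map(_.value).reduce(_ + _)

    println(s"RDD: $rddResult, Dataset: $dsResult")
  }
}
```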