Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark use the tag [apache-spark].

464 questions
10
votes
1 answer

Avoid starting HiveThriftServer2 with created context programmatically

We are trying to use ThriftServer to query data from Spark temp tables, in Spark 2.0.0. First, we have created a SparkSession with Hive support enabled. Currently, we start ThriftServer with sqlContext like…
VladoDemcak
  • 4,893
  • 4
  • 35
  • 42
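For reference, Spark's hive-thriftserver module exposes HiveThriftServer2.startWithContext, which lets the Thrift server share an existing session's SQLContext. A minimal Scala sketch, assuming Hive support is enabled; the input path and view name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerFromSession {
  def main(args: Array[String]): Unit = {
    // Build a SparkSession with Hive support so registered temp views are
    // visible to JDBC clients of the Thrift server.
    val spark = SparkSession.builder()
      .appName("thrift-from-session")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical input: register an in-memory temp view to query over JDBC.
    spark.read.json("/path/to/events.json").createOrReplaceTempView("events")

    // Start the Thrift server against this session's SQLContext instead of
    // launching a separate server process.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
```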
9
votes
1 answer

Efficiently running a "for" loop in Apache Spark so that execution is parallel

How can we parallelize a loop in Spark so that the processing is not sequential but parallel? To take an example - I have a csv file (called 'bill_item.csv') that contains the following data: …
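The usual answer here is to express the per-record work as DataFrame (or RDD) transformations rather than a driver-side loop, so Spark distributes it across executors. A minimal Scala sketch, assuming a header row and a hypothetical bill_id column in bill_item.csv:

```scala
import org.apache.spark.sql.SparkSession

object ParallelInsteadOfLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-loop").getOrCreate()

    // Read the file once; header/inferSchema and the column name are assumptions.
    val billItems = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("bill_item.csv")

    // Expressing the per-record work as a transformation lets Spark distribute
    // it across executors instead of iterating row by row on the driver.
    val perBill = billItems.groupBy("bill_id").count()
    perBill.show()
  }
}
```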
9
votes
1 answer

Spark job keeps showing TaskCommitDenied (Driver denied task commit)

Environment: We are using EMR, with Spark 2.1 and EMRFS. Process we are doing: We are running a PySpark job to join 2 Hive tables and create another Hive table based on this result using saveAsTable, storing it as ORC with…
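For context, the pipeline described above boils down to a join followed by saveAsTable in ORC format. A Scala sketch of that shape (the question itself uses PySpark; database, table and key names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object JoinAndSaveAsOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-and-save")
      .enableHiveSupport()
      .getOrCreate()

    // Join the two Hive tables on a shared key (names are placeholders).
    val joined = spark.table("db.table_a")
      .join(spark.table("db.table_b"), Seq("id"))

    // Persist the result as an ORC-backed Hive table.
    joined.write
      .format("orc")
      .mode("overwrite")
      .saveAsTable("db.joined_result")
  }
}
```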
9
votes
1 answer

SparkSession initialization error - Unable to use spark.read

I tried to create a standalone PySpark program that reads a csv and stores it in a Hive table. I have trouble configuring the SparkSession, conf and context objects. Here is my code: from pyspark import SparkConf, SparkContext from pyspark.sql…
Michail N
  • 3,647
  • 2
  • 32
  • 51
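In Spark 2.x the SparkSession builder replaces the separate SparkConf/SparkContext/SQLContext setup, and the same builder pattern exists in PySpark. A Scala sketch of the csv-to-Hive flow, with the path and table name as placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CsvToHiveTable {
  def main(args: Array[String]): Unit = {
    // One builder call replaces the old SparkConf / SparkContext / SQLContext trio.
    val spark = SparkSession.builder()
      .appName("csv-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical input path; header option assumed.
    val df = spark.read.option("header", "true").csv("/path/to/input.csv")

    // Store the DataFrame as a Hive table (placeholder name).
    df.write.mode("overwrite").saveAsTable("default.input_table")
  }
}
```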
9
votes
3 answers

Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala: I need the timestamp difference in milliseconds between two DataFrame column values. Value_1 = 06/13/2017 16:44:20.044 Value_2 = 06/13/2017 16:44:21.067 Data-types for…
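Since unix_timestamp() only has second precision, one common workaround is to parse the strings with a small UDF that keeps milliseconds. A Scala sketch using the sample values from the question:

```scala
import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TimestampDiffMillis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ts-diff-ms").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("06/13/2017 16:44:20.044", "06/13/2017 16:44:21.067")
    ).toDF("value_1", "value_2")

    // unix_timestamp() drops milliseconds, so parse with SimpleDateFormat instead.
    // A new formatter per call avoids sharing the non-thread-safe instance.
    val millis = udf { s: String =>
      new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSS").parse(s).getTime
    }

    df.withColumn("diff_ms", millis($"value_2") - millis($"value_1")).show(false)
  }
}
```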
9
votes
2 answers

Split dataset based on column values in spark

I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it can execute faster and reduce the usage of Java code. List lsts=…
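Two patterns are commonly suggested for this, sketched in Scala with placeholder paths: collect the distinct values and filter once per value, or, when the real goal is separate output files, do a single partitioned write.

```scala
import org.apache.spark.sql.SparkSession

object SplitByManufacturer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-by-column").getOrCreate()

    // Placeholder input; only the Manufacturer column name comes from the question.
    val df = spark.read.option("header", "true").csv("/path/to/products.csv")

    // Option 1: one filtered Dataset per distinct value (scans the data per value,
    // so cache df first if there are many values).
    val manufacturers = df.select("Manufacturer").distinct().collect().map(_.getString(0))
    val byManufacturer = manufacturers.map(m => m -> df.filter(df("Manufacturer") === m)).toMap

    // Option 2: if separate output files are the real goal, a single partitioned
    // write is usually much faster than filtering in a loop.
    df.write.partitionBy("Manufacturer").parquet("/path/to/output")
  }
}
```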
9
votes
3 answers

Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body { "code": "spark.sql(\"select * from test_table limit 10\")" } I would like an answer in the following…
matheusr
  • 567
  • 9
  • 29
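One common workaround (a sketch, not a built-in Livy feature) is to serialize the rows inside the statement itself with Dataset.toJSON, so the statement output already contains JSON strings; `spark` below is the session object Livy predefines:

```scala
// Body of the "code" field posted to /sessions/0/statements.
// toJSON turns each Row into a JSON string before it is printed as output.
spark.sql("select * from test_table limit 10").toJSON.collect().foreach(println)
```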
9
votes
4 answers

Spark 2.0.0 Error: PartitioningCollection requires all of its partitionings have the same numPartitions

I'm joining some DataFrames together in Spark and I keep getting the following error: PartitioningCollection requires all of its partitionings have the same numPartitions. It seems to happen after I join two DataFrames together that each seem…
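A workaround often suggested for this error (a sketch over placeholder inputs, not a guaranteed fix) is to repartition and materialize the intermediate DataFrame before the next join, so the planner no longer combines partitionings with different numPartitions:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionBeforeJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-before-join").getOrCreate()

    // Placeholder inputs and join key.
    val a = spark.read.parquet("/path/a")
    val b = spark.read.parquet("/path/b")
    val c = spark.read.parquet("/path/c")

    // Repartition and cache the intermediate result so the second join starts
    // from a single, consistent partitioning.
    val ab = a.join(b, Seq("id")).repartition(200).cache()
    ab.count() // force materialization

    ab.join(c, Seq("id")).write.parquet("/path/out")
  }
}
```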
8
votes
3 answers

How to pivot streaming dataset?

I am trying to pivot a Spark streaming dataset (structured streaming) but I get an AnalysisException (excerpt below). Could someone confirm that pivoting is indeed not supported in structured streams (Spark 2.0), and perhaps suggest alternative…
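When the pivot values are known in advance, an equivalent groupBy with conditional aggregation does work on a stream. A Scala sketch with a placeholder JSON source, schema, and made-up device names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object ManualPivotOnStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("manual-pivot").getOrCreate()

    // Placeholder streaming source and schema.
    val schema = StructType(Seq(
      StructField("ts", TimestampType),
      StructField("device", StringType),
      StructField("value", DoubleType)))
    val events = spark.readStream.schema(schema).json("/path/to/stream/dir")

    // One aggregate column per expected pivot value, instead of pivot("device").
    val pivoted = events.groupBy(window(col("ts"), "1 minute"))
      .agg(
        sum(when(col("device") === "dev-0", col("value"))).as("dev_0"),
        sum(when(col("device") === "dev-1", col("value"))).as("dev_1"),
        sum(when(col("device") === "dev-2", col("value"))).as("dev_2"))

    pivoted.writeStream.outputMode("complete").format("console").start().awaitTermination()
  }
}
```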
8
votes
3 answers

Out of Memory Error when Reading large file in Spark 2.1.0

I want to use Spark to read a large (51GB) XML file (on an external HDD) into a dataframe (using the spark-xml plugin), do simple mapping/filtering, reorder it, and then write it back to disk as a CSV file. But I always get a…
Felipe
  • 11,557
  • 7
  • 56
  • 103
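A sketch of the read/transform/write pipeline described above, assuming the spark-xml package is on the classpath; the rowTag, column names and paths are placeholders. For a file this size, the usual levers are raising spark.driver.memory / spark.executor.memory and avoiding any collect() back to the driver:

```scala
import org.apache.spark.sql.SparkSession

object LargeXmlToCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xml-to-csv").getOrCreate()

    // spark-xml splits the file into DataFrame rows by the element named in rowTag.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("/mnt/external/huge.xml")

    // Simple filter, reorder, then write back out as CSV (all names are placeholders).
    df.filter(df("status") === "active")
      .sort("id")
      .write
      .option("header", "true")
      .csv("/mnt/external/out_csv")
  }
}
```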
8
votes
2 answers

How to traverse/iterate a Dataset in Spark Java?

I am trying to traverse a Dataset to do some string similarity calculations like Jaro-Winkler or cosine similarity. I convert my Dataset to a list of rows and then traverse it with a for statement, which is not an efficient Spark way to do it. So I am looking…
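A sketch of the usual alternative, in Scala for consistency with the other examples (the question targets the Java API, where the same idea uses MapFunction plus an Encoder): express the per-row similarity as a map over the Dataset so it runs on the executors rather than in a driver-side loop. The similarity function here is a toy stand-in, not Jaro-Winkler:

```scala
import org.apache.spark.sql.SparkSession

object SimilarityOverDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("similarity").getOrCreate()
    import spark.implicits._

    // Toy input pairs; in practice these would come from the real Dataset.
    val pairs = Seq(("martha", "marhta"), ("jones", "johnson")).toDS()

    // Placeholder similarity (shared-character ratio) standing in for
    // Jaro-Winkler or cosine similarity.
    val similarity = (a: String, b: String) =>
      a.toSet.intersect(b.toSet).size.toDouble / a.toSet.union(b.toSet).size

    // map() keeps the work distributed instead of collecting rows to a List first.
    pairs.map { case (a, b) => (a, b, similarity(a, b)) }.show()
  }
}
```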
8
votes
2 answers

How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in Spark (Scala)

I'm using Spark 2.0. I have a column of my dataframe containing a WrappedArray of WrappedArrays of Float. An example of a row would be: [[1.0 2.0 2.0][6.0 5.0 2.0][4.0 2.0 3.0]] I'm trying to transform this column into an Array[Array[Float]]. What I…
bobo32
  • 992
  • 2
  • 9
  • 21
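The pattern usually suggested for this is to read the nested arrays back as Seq[Seq[Float]] with getAs and convert each level explicitly. A Scala sketch reproducing the example row (the column name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object WrappedArrayToArrays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapped-array").getOrCreate()
    import spark.implicits._

    // One row holding the nested-array example from the question.
    val df = Seq(
      Seq(Seq(1.0f, 2.0f, 2.0f), Seq(6.0f, 5.0f, 2.0f), Seq(4.0f, 2.0f, 3.0f))
    ).toDF("features")

    // Spark returns array<array<float>> columns as WrappedArray[WrappedArray[Float]];
    // reading them as Seq[Seq[Float]] and calling toArray on each level converts them.
    val arrays: Array[Array[Array[Float]]] = df
      .select("features")
      .collect()
      .map(_.getAs[Seq[Seq[Float]]](0).map(_.toArray).toArray)

    arrays.head.foreach(inner => println(inner.mkString("[", ", ", "]")))
  }
}
```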
8
votes
1 answer

How to do non-random Dataset splitting on Apache Spark?

I know I can do random splitting with the randomSplit method: val splittedData: Array[Dataset[Row]] = preparedData.randomSplit(Array(0.5, 0.3, 0.2)) Can I split the data into consecutive parts with some 'nonRandomSplit method'? Apache Spark…
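There is no built-in nonRandomSplit; one sketch (assuming an ordering column to define "consecutive", here a placeholder id) is to number the rows with a window function and cut the index range proportionally. Note the unpartitioned window pulls all rows through a single partition, which is fine for modest data but not for very large Datasets:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object ConsecutiveSplit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("consecutive-split").getOrCreate()
    import spark.implicits._

    // Placeholder data with an ordering column.
    val preparedData = spark.range(0, 100).toDF("id")

    val total = preparedData.count()
    val indexed = preparedData.withColumn("row_idx", row_number().over(Window.orderBy("id")))

    // Consecutive 50% / 30% / 20% slices of the ordered rows.
    val first  = indexed.filter($"row_idx" <= total * 0.5)
    val second = indexed.filter($"row_idx" > total * 0.5 && $"row_idx" <= total * 0.8)
    val third  = indexed.filter($"row_idx" > total * 0.8)

    Seq(first, second, third).foreach(ds => println(ds.count()))
  }
}
```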
8
votes
0 answers

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode; I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing. The message is one I've not seen before and implies that the workers are not…
null
  • 3,469
  • 7
  • 41
  • 90
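For context, EMR enables dynamic allocation by default, so idle executors are released after spark.dynamicAllocation.executorIdleTimeout; whether that is the actual failure here depends on the full logs. A sketch of pinning a fixed number of executors instead (the same settings can be passed as spark-submit --conf options; the instance count is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object FixedExecutors {
  def main(args: Array[String]): Unit = {
    // Disable dynamic allocation and request a fixed executor count on YARN/EMR.
    val spark = SparkSession.builder()
      .appName("fixed-executors")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("spark.executor.instances", "10")
      .getOrCreate()

    // ... application logic ...

    spark.stop()
  }
}
```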
8
votes
0 answers

Spark 2.0: Moving from RDD to Dataset

I want to adapt my Java Spark app (which actually uses RDDs for some calculations) to use Datasets instead of RDDs. I'm new to Datasets and not sure how to map each transformation to a corresponding Dataset operation. At the moment I map them like…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
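A Scala sketch of typical RDD-to-Dataset translations (the question uses the Java API, where the same calls take an explicit Encoder): a case class gives a typed Dataset, and map/filter/reduce carry over almost unchanged. The Reading class and its values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object RddToDataset {
  final case class Reading(sensor: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-dataset").getOrCreate()
    import spark.implicits._

    // Before: RDD-style transformations.
    val rdd = spark.sparkContext.parallelize(Seq(Reading("a", 1.0), Reading("b", 4.0)))
    val rddResult = rdd.filter(_.value > 2.0).map(_.value).reduce(_ + _)

    // After: the same logic on a typed Dataset.
    val ds = rdd.toDS()
    val dsResult = ds.filter(_.value > 2.0).map(_.value).reduce(_ + _)

    println(s"RDD: $rddResult, Dataset: $dsResult")
  }
}
```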