Questions tagged [apache-spark-2.0]
464 questions

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark, use the tag [apache-spark].
0 votes, 1 answer
Apache Spark 2.0 - date_add function
I have a simple schema with a date and an int. I want to use date_add to add the int to the date.
scala> val ds1 = spark.read.option("inferSchema",true).csv("samp.csv")
ds1.printSchema();
root
|-- _c0: timestamp (nullable = true)
|-- _c1:…

coder AJ
0 votes, 1 answer
Spark 2.0: A named function inside mapGroups for sql.KeyValueGroupedDataset causes java.io.NotSerializableException
Anonymous functions work fine.
The following code sets up the problem:
import sparkSession.implicits._
val sparkSession = SparkSession.builder.appName("demo").getOrCreate()
val sc = sparkSession.sparkContext
case class DemoRow(keyId: Int, evenOddId:…

Y.G.
0 votes, 1 answer
Spark Java multithreading vs running individual Spark jobs
I am new to Spark and am trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: As part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially.
- First approach
All of the queries can…

user2895589
0 votes, 2 answers
Spark 2.0 CSV Error
I am upgrading to Spark 2 from 1.6 and am having an issue reading in CSV files. In Spark 1.6 I would have something like this to read in a CSV file.
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header",…

st33l3rf4n
0 votes, 0 answers
spark-sql - using nested query to filter data
I have a huge .csv file which has several columns, but the columns of importance to me are USER_ID (user identifier), DURATION (duration of call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile no.).
So what I am trying to do is : replace all null…

sensitive_piece_of_horseflesh
0 votes, 1 answer
Apache Spark isn't playing nice with Jersey dependency injection
I'm trying to use the com.github.sps.metrics.metrics-opentsdb library to log metrics from my spark job to my OpenTSDB server. I'm running into an issue where I get a strange NPE down in the jersey code that deals with EncodingFilters.
Here is the…

Hardy
0 votes, 1 answer
Is there any Google/AWS service to move data from Google Cloud Storage to S3?
In my use case, all Google-related app and ads data is stored in Google Cloud Storage, but my processing engine runs on Spark in the AWS cloud.
Can someone please help with how I can move this GCS data to S3 for processing?
Thank you in advance
0 votes, 1 answer
How to persist a DataFrame to a Hive table?
I use CentOS on the Cloudera QuickStart VM. I created an sbt-managed Spark application following the other question How to save DataFrame directly to Hive?.
build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies…

sdinesh94
0 votes, 1 answer
createOrReplaceTempView does not work on an empty DataFrame in PySpark 2.0.0
I am trying to define a SQL view on a PySpark DataFrame (2.0.0) and am getting errors like "Table or View Not found". What I am doing: 1. create an empty dataframe 2. load data from a different location into a temp dataframe 3. append the temp data frame…

braj
0 votes, 1 answer
Cassandra select query with multiple params
Using Cassandra 2.28, the Java connector 3, and Spark 2.0.
I am trying to write a simple query with multiple select params but am unable to get the syntax right.
A single param works:
CassandraJavaRDD rdd = javaFunc
…

Sam-T
0 votes, 2 answers
What is the behavior of transformations and actions in Spark?
We're performing some tests to evaluate the behavior of transformations and actions in Spark with Spark SQL. In our tests, first we conceive a simple dataflow with 2 transformations and 1 action:
LOAD (result: df_1) > SELECT ALL FROM df_1 (result:…

Brccosta
0 votes, 1 answer
Apache Spark join with dynamic re-partitioning
I'm trying to do a fairly straightforward join on two tables, nothing complicated. Load both tables, do a join and update columns, but it keeps throwing an exception.
I noticed the task is stuck on the last partition 199/200 and eventually crashes.…

Philip K. Adetiloye
0 votes, 0 answers
How to create two columns from a single column in a dataframe using pyspark
I have to transform a dataframe which looks like this:
+---------+------+
| Country|Status|
+---------+------+
|[AW,null]| 14|
|[UG,null]| 47|
|[CY,null]| 1324|
|[AO,null]| 20|
|[US,null]|325242|
|[KE,null]| 246|
|[DK,true]| …

Mukesh Jha
0 votes, 0 answers
How to transform a Dataset of a known type to one with a generic type
So I've got this example code where I have a Dataset[Event] which I would like to group based on a key of generic type computed on the fly.
import org.apache.spark.sql.{ Dataset, KeyValueGroupedDataset }
case class Event(id: Int, name:…

aa8y
0 votes, 0 answers
DataFrame save to Redshift from a Spark 2 job running on a Dataproc cluster stalls
I have a dataframe (Dataset) and want to save this dataframe to Redshift.
df.write()
.format("com.databricks.spark.redshift")
.option("url", url)
.option("dbtable", dbTable)
.option("tempdir", tempDir)
.mode("append")
…

Christian