Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark use the tag [apache-spark].

464 questions
0
votes
1 answer

Apache Spark 2.0 - date_add function

I have a simple schema with a date and an int. I want to use date_add to add the int to the date. scala> val ds1 = spark.read.option("inferSchema",true).csv("samp.csv") ds1.printSchema(); root |-- _c0: timestamp (nullable = true) |-- _c1:…
coder AJ
0
votes
1 answer

Spark 2.0: A named function inside mapGroups for sql.KeyValueGroupedDataset causes java.io.NotSerializableException

An anonymous function works fine. The following code sets up the problem: import sparkSession.implicits._ val sparkSession = SparkSession.builder.appName("demo").getOrCreate() val sc = sparkSession.sparkContext case class DemoRow(keyId: Int, evenOddId:…
Y.G.
0
votes
1 answer

Spark-java multithreading vs running individual spark jobs

I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop). Scenario: as part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially. - First approach: All of the queries can…
user2895589
0
votes
2 answers

Spark 2.0 CSV Error

I am upgrading to Spark 2 from 1.6 and am having an issue reading in CSV files. In Spark 1.6 I would have something like this to read in a CSV file: val df = sqlContext.read.format("com.databricks.spark.csv") .option("header",…
st33l3rf4n
0
votes
0 answers

spark-sql - using nested query to filter data

I have a huge .csv file with several columns, but the ones that matter to me are USER_ID (User Identifier), DURATION (Duration of Call), TYPE (Incoming or Outgoing), DATE, NUMBER (Mobile No.). So what I am trying to do is: replace all null…
0
votes
1 answer

Apache Spark isn't playing nice with Jersey dependency injection

I'm trying to use the com.github.sps.metrics.metrics-opentsdb library to log metrics from my Spark job to my OpenTSDB server. I'm running into an issue where I get a strange NPE down in the Jersey code that deals with EncodingFilters. Here is the…
Hardy
0
votes
1 answer

Is there any Google/AWS service to move data from Google store to S3?

In my use case, all Google-related app and ads data is going to be stored in Google store, but my processing engine runs on Spark in the AWS cloud. Can someone please help with how I can move this GS data to S3 for processing? Thank you in advance.
0
votes
1 answer

How to persist a DataFrame to a Hive table?

I use CentOS on the Cloudera QuickStart VM. I created an sbt-managed Spark application following the other question, How to save DataFrame directly to Hive?. build.sbt libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" libraryDependencies…
sdinesh94
0
votes
1 answer

createOrReplaceTempView does not work on an empty dataframe in PySpark 2.0.0

I am trying to define a SQL view on a PySpark dataframe (2.0.0) and am getting errors like "Table or View Not found". What I am doing: 1. Create an empty dataframe 2. Load data from a different location into a temp dataframe 3. Append the temp data frame…
braj
0
votes
1 answer

Cassandra select query multiple params

Using Cassandra 2.28, java-connector3, Spark 2.0. I am trying to write a simple query with multiple select params but am unable to get the syntax right. A single param works: CassandraJavaRDD rdd = javaFunc …
0
votes
2 answers

What is the behavior of transformations and actions in Spark?

We're performing some tests to evaluate the behavior of transformations and actions in Spark with Spark SQL. In our tests, first we conceive a simple dataflow with 2 transformations and 1 action: LOAD (result: df_1) > SELECT ALL FROM df_1 (result:…
Brccosta
0
votes
1 answer

Apache Spark join with dynamic re-partitioning

I'm trying to do a fairly straightforward join on two tables, nothing complicated. Load both tables, do a join and update columns but it keeps throwing an exception. I noticed the task is stuck on the last partition 199/200 and eventually crashes.…
0
votes
0 answers

How to create two columns from a single column in a dataframe using pyspark

I have to transform a dataframe which looks like this +---------+------+ | Country|Status| +---------+------+ |[AW,null]| 14| |[UG,null]| 47| |[CY,null]| 1324| |[AO,null]| 20| |[US,null]|325242| |[KE,null]| 246| |[DK,true]| …
0
votes
0 answers

How to transform a Dataset of a known type to one with a generic type

So I've got this example code where I have a Dataset[Event] which I would like to group based on a key of generic type computed on the fly. import org.apache.spark.sql.{ Dataset, KeyValueGroupedDataset } case class Event(id: Int, name:…
aa8y
0
votes
0 answers

Dataframe save to Redshift from Spark2 job running on dataproc cluster stalls

I have a dataframe (Dataset) and want to save this dataframe to Redshift. df.write() .format("com.databricks.spark.redshift") .option("url", url) .option("dbtable", dbTable) .option("tempdir", tempDir) .mode("append") …