Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26508 questions
7
votes
1 answer

How to prevent predicate pushdown?

Recently I was working with Spark and a JDBC data source. Consider the following snippet: val df = spark.read.options(...).format("jdbc").load(); val newDF = df.where(PRED) PRED is a list of predicates. If PRED is a simple predicate, like x = 10, the query…
T. Gawęda
  • 15,706
  • 4
  • 46
  • 61
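
One commonly suggested workaround is to materialise the DataFrame before filtering, so the predicate is evaluated by Spark itself instead of being compiled into the JDBC query. A minimal sketch, with placeholder connection options:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-pushdown").getOrCreate()

// Placeholder connection details for illustration only.
val jdbcOpts = Map(
  "url"      -> "jdbc:postgresql://localhost:5432/db",
  "dbtable"  -> "public.events",
  "user"     -> "spark",
  "password" -> "secret"
)

// Persisting the DataFrame before filtering means the predicate is applied
// to the cached data inside Spark rather than pushed into the JDBC query.
val df = spark.read.format("jdbc").options(jdbcOpts).load().cache()
val newDF = df.where("x = 10")
```

On Spark 2.4 and later the JDBC reader also accepts a pushDownPredicate option that can be set to false to disable pushdown at the source.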
7
votes
3 answers

How do I groupby and concat a list in a Dataframe Spark Scala

I have a dataframe with two columns with data as below +----+-----------------+ |acct| device| +----+-----------------+ | B| List(3, 4)| | C| List(3, 5)| | A| List(2, 6)| | B|List(3, 11, 4, 9)| | C| …
Babu
  • 861
  • 3
  • 13
  • 36
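
A sketch of one way to merge the per-row lists: explode each list into rows, then collect the values back per account (column names mirror the excerpt; the sample data is abbreviated):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("group-concat").getOrCreate()
import spark.implicits._

val df = Seq(
  ("B", Seq(3, 4)),
  ("C", Seq(3, 5)),
  ("A", Seq(2, 6)),
  ("B", Seq(3, 11, 4, 9))
).toDF("acct", "device")

// Explode each list into one row per element, then gather them back per account.
// collect_set drops duplicates; use collect_list to keep every occurrence.
val grouped = df
  .withColumn("device", explode($"device"))
  .groupBy("acct")
  .agg(collect_set("device").as("devices"))
```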
7
votes
2 answers

Spark SQL - Encoders for Tuple Containing a List or Array as an Element

Using Spark 2.2 + Java 1.8 I have two custom data types "Foo" and "Bar". Each one implements Serializable. 'Foo' has a one-to-many relationship with 'Bar', so their relationship is represented as a Tuple: Tuple2<Foo, List<Bar>> Typically, when I have…
HansGruber
  • 71
  • 1
  • 5
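
The question is asked for Java, but the idea can be sketched in Scala: when the built-in encoders cannot handle a tuple whose element is a collection of a custom type, a Kryo-based encoder is one fallback. Foo and Bar below are simplified stand-ins for the question's types:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Simplified stand-ins for the question's custom types.
case class Bar(id: Long) extends Serializable
case class Foo(name: String) extends Serializable

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()

// Kryo avoids the need for a built-in encoder for the nested collection,
// at the cost of storing each tuple as a single opaque binary column.
implicit val fooBarsEnc: Encoder[(Foo, Seq[Bar])] = Encoders.kryo[(Foo, Seq[Bar])]

val ds = spark.createDataset(Seq((Foo("f1"), Seq(Bar(1L), Bar(2L)))))
```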
7
votes
2 answers

PySpark.sql.filter not performing as it should

I am running into a problem when executing the code below: from pyspark.sql import functions as F from pyspark.sql import Row, HiveContext hc = HiveContext() rows1 = [Row(id1 = '2', id2 = '1', id3 = 'a'), Row(id1 = '3', id2 = '2', id3 =…
7
votes
3 answers

Saving a dataframe result value to a string variable?

I created a dataframe in Spark, and when I find the max date I want to save it to a variable. Just trying to figure out how to get the result, which is a string, and save it to a variable. code so far: sqlDF = spark.sql("SELECT MAX(date) FROM…
oharr
  • 163
  • 1
  • 3
  • 12
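
Since the aggregation yields a single row, one option is to bring that row to the driver and read the cell into an ordinary variable. A Scala sketch, assuming a registered view named "events" with a string date column (adjust the type accessor if the column is a DateType):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("max-date").getOrCreate()

// Assumed table/view name and column type for illustration.
val sqlDF = spark.sql("SELECT MAX(date) AS max_date FROM events")

// first() pulls the single aggregated row to the driver; the cell can then be
// read into a plain variable.
val maxDate: String = sqlDF.first().getAs[String]("max_date")
```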
7
votes
4 answers

How to add days (as values of a column) to date?

I have a problem with adding days (numbers) to date format columns in Spark. I know that there is a function date_add that takes two arguments - a date column and an integer: date_add(date startdate, tinyint/smallint/int days) I'd like to use a column…
Mrgr8m4
  • 477
  • 3
  • 9
  • 29
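
One way around the integer-literal restriction of the Scala date_add helper (on Spark 2.x) is to go through a SQL expression, where the second argument may be a column. A sketch with hypothetical column names start_date and days:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("date-add").getOrCreate()
import spark.implicits._

// Hypothetical data: a start date column and a column holding the days to add.
val df = Seq(("2018-01-01", 5), ("2018-02-10", 30)).toDF("start_date", "days")
  .withColumn("start_date", to_date($"start_date"))

// The SQL expression form accepts a column for the second argument even on
// Spark versions where functions.date_add only takes an integer literal.
val withEnd = df.withColumn("end_date", expr("date_add(start_date, days)"))
```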
7
votes
1 answer

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group. The data looks like: id time name 132 12 Lucy 132 10 John 132 15 Sam 78 11 Kate 78 7 Julia 78 2 Vivien 245 22 Tom I would like to get this: id time name 132 …
MLam
  • 161
  • 1
  • 2
  • 10
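
A sketch of two options, depending on whether a global ordering is needed or only an ordering within each id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sort-groups").getOrCreate()
import spark.implicits._

val df = Seq(
  (132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
  (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"), (245, 22, "Tom")
).toDF("id", "time", "name")

// A global sort by id then time yields rows grouped per id and ordered by time.
val sortedGlobally = df.orderBy($"id", $"time")

// If only the ordering within each id matters (e.g. before a per-group write),
// repartitioning by id and sorting within partitions avoids a full global sort.
val sortedWithinGroups = df.repartition($"id").sortWithinPartitions($"id", $"time")
```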
7
votes
3 answers

How to sort array of struct type in Spark DataFrame by particular field?

Given following code: import java.sql.Date import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ object SortQuestion extends App{ val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate() …
addmeaning
  • 1,358
  • 1
  • 13
  • 36
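
One approach that avoids UDFs over arrays of Row is to work through the typed Dataset API, where each array can be sorted by any struct field with ordinary Scala collections. The case classes below are hypothetical stand-ins for the question's schema:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema: each row carries an array of (fruit, weight) structs.
case class Item(fruit: String, weight: Int)
case class Basket(id: Long, items: Seq[Item])

val spark = SparkSession.builder().master("local[*]").appName("sort-struct-array").getOrCreate()
import spark.implicits._

val ds = Seq(
  Basket(1L, Seq(Item("pear", 5), Item("apple", 2), Item("plum", 9)))
).toDS()

// Sort each row's array by the chosen struct field inside a typed map.
val sorted = ds.map(b => b.copy(items = b.items.sortBy(_.weight)))
```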
7
votes
1 answer

Structured Streaming and Splitting nested data into multiple datasets

I'm working with Spark's Structured Streaming (2.2.1), using Kafka to receive data from sensors every 60 seconds. I'm having trouble wrapping my head around how to package this Kafka data to be able to process it correctly as it comes. I need to be…
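
A common pattern is to parse the Kafka value with a declared schema and then explode the nested batch into one row per reading. A sketch, where the schema, broker address, and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("sensor-stream").getOrCreate()
import spark.implicits._

// Assumed payload: one Kafka message carries a sensor id plus a batch of readings.
val schema = new StructType()
  .add("sensorId", StringType)
  .add("readings", ArrayType(
    new StructType().add("ts", LongType).add("value", DoubleType)))

val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "sensors")                       // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("msg"))
  // Flatten the nested batch into one row per reading for downstream processing.
  .select($"msg.sensorId", explode($"msg.readings").as("reading"))
  .select($"sensorId", $"reading.ts", $"reading.value")
```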
7
votes
1 answer

How to pushdown limit predicate for Cassandra when you use dataframes?

I have a large Cassandra table. I want to load only 50 rows from Cassandra. The following code val ds = sparkSession.read .format("org.apache.spark.sql.cassandra") .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace")) …
addmeaning
  • 1,358
  • 1
  • 13
  • 36
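
Whether a limit actually reaches Cassandra depends on the connector version, so inspecting the physical plan is the quickest sanity check. A sketch with placeholder keyspace and table names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cassandra-limit").getOrCreate()

// Placeholder keyspace/table; requires the spark-cassandra-connector on the classpath.
val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "aggregates", "keyspace" -> "my_keyspace"))
  .load()
  .limit(50)

// The extended plan shows whether the limit is applied at the source or only
// after a scan inside Spark.
ds.explain(true)
```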
7
votes
1 answer

Are the join types defined as constants somewhere accessible in Apache Spark?

I haven't found them after having a cursory glance at the Spark codebase. In most documentation and tutorial examples, people seem to be using 'naked' string literals to specify join types. Does Spark provide an object or class defining "leftouter",…
Tobias Roland
  • 1,182
  • 1
  • 13
  • 35
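
Spark parses these join types from plain strings, and the class that enumerates them lives in Spark's internal catalyst package rather than the public API, so a small local object of constants is one pragmatic way to avoid typos. A sketch:

```scala
// String literals accepted by Dataset.join; a local object, not an official Spark API.
object JoinTypes {
  val Inner      = "inner"
  val Cross      = "cross"
  val LeftOuter  = "left_outer"
  val RightOuter = "right_outer"
  val FullOuter  = "full_outer"
  val LeftSemi   = "left_semi"
  val LeftAnti   = "left_anti"
}

// Usage, assuming two DataFrames sharing an "id" column:
// left.join(right, Seq("id"), JoinTypes.LeftOuter)
```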
7
votes
1 answer

Pyspark- Subquery in a case statement

I am trying to run a subquery inside a case statement in Pyspark and it is throwing an exception. I am trying to create a new flag if the id in one table is present in a different table. Is this even possible in pyspark? temp_df=spark.sql("select *,…
kkumar
  • 173
  • 2
  • 5
  • 15
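
Spark SQL has historically only supported IN/EXISTS subqueries in the WHERE clause, so one common workaround is to express the flag as a left join instead of a subquery inside CASE WHEN. A Scala sketch with hypothetical tables main and lookup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("flag-join").getOrCreate()
import spark.implicits._

// Hypothetical tables: flag rows of `main` whose id also appears in `lookup`.
val main   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "payload")
val lookup = Seq(Tuple1(2), Tuple1(3)).toDF("id")

// Equivalent of CASE WHEN id IN (SELECT id FROM lookup) THEN 1 ELSE 0 END,
// expressed as a left join so no correlated subquery is needed.
val flagged = main
  .join(lookup.distinct().withColumn("flag", lit(1)), Seq("id"), "left")
  .withColumn("flag", coalesce($"flag", lit(0)))
```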
7
votes
1 answer

Does the SparkSQL Dataframe function explode preserve order?

I have a Scala spark DataFrame: df.select($"row_id", $"array_of_data").show +----------+--------------------+ | row_id | array_of_data | +----------+--------------------+ | 0 | [123, ABC, G12] | | 1 | [100, 410] | | …
Kyle Heuton
  • 9,318
  • 4
  • 40
  • 52
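
If downstream logic must not depend on implicit row ordering, posexplode is a sketch-worthy alternative: it emits each element's index as an explicit column. Sample data mirrors the excerpt:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("posexplode").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, Seq("123", "ABC", "G12")),
  (1, Seq("100", "410"))
).toDF("row_id", "array_of_data")

// posexplode keeps the element's position as a column, so later transformations
// can rely on "pos" rather than on the order rows happen to come out in.
val exploded = df.select($"row_id", posexplode($"array_of_data").as(Seq("pos", "data")))
```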
7
votes
1 answer

PySpark aggregation function for "any value"

I have a PySpark Dataframe with an A field, a few B fields that depend on A (A->B), and C fields that I want to aggregate for each A. For example: A | B | C ---------- A | 1 | 6 A | 1 | 7 B | 2 | 8 B | 2 | 4 I wish to group by A, present any of B…
Dimgold
  • 2,748
  • 5
  • 26
  • 49
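
When B is functionally dependent on A, first() is a reasonable "any value" aggregate, while C is aggregated normally. A sketch with the excerpt's sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("any-value").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1, 6), ("A", 1, 7), ("B", 2, 8), ("B", 2, 4)).toDF("A", "B", "C")

// first() returns an arbitrary (typically first-encountered) value per group,
// which is enough when B is constant within each A; C is summed as an example.
val result = df.groupBy($"A").agg(first($"B").as("B"), sum($"C").as("sum_C"))
```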
7
votes
4 answers

How to create a Row from a given case class?

Imagine that you have the following case classes: case class B(key: String, value: Int) case class A(name: String, data: B) Given an instance of A, how do I create a Spark Row? e.g. val a = A("a", B("b", 0)) val row = ??? NOTE: Given row I need to…
Marsellus Wallace
  • 17,991
  • 25
  • 90
  • 154
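
Since nested case classes map to nested Rows, one sketch is to build the Row structure directly from the instance:

```scala
import org.apache.spark.sql.Row

case class B(key: String, value: Int)
case class A(name: String, data: B)

val a = A("a", B("b", 0))

// Nested case classes correspond to nested Rows.
val row: Row = Row(a.name, Row(a.data.key, a.data.value))
```

Going through an Encoder (for example, spark.createDataset(Seq(a)).toDF().head() with spark.implicits._ in scope) is another option when the Row's schema should stay in sync with the case class automatically.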