Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

26508 questions
7
votes
1 answer

How to prevent predicate pushdown?

Recently I was working with Spark and a JDBC data source. Consider the following snippet: val df = spark.read.options(...).format("jdbc").load(); val newDF = df.where(PRED) PRED is a list of predicates. If PRED is a simple predicate, like x = 10, the query…
T. Gawęda
  • 15,706
  • 4
  • 46
  • 61
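
One commonly suggested workaround is to materialise the DataFrame before filtering, so the predicate is evaluated by Spark itself instead of being compiled into the JDBC query. A minimal sketch, with placeholder connection options:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-pushdown").getOrCreate()

// Placeholder connection details for illustration only.
val jdbcOpts = Map(
  "url"      -> "jdbc:postgresql://localhost:5432/db",
  "dbtable"  -> "public.events",
  "user"     -> "spark",
  "password" -> "secret"
)

// Persisting the DataFrame before filtering means the predicate is applied
// to the cached data inside Spark rather than pushed into the JDBC query.
val df = spark.read.format("jdbc").options(jdbcOpts).load().cache()
val newDF = df.where("x = 10")
```

On Spark 2.4 and later the JDBC reader also accepts a pushDownPredicate option that can be set to false to disable pushdown at the source.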
7
votes
3 answers

How do I groupby and concat a list in a Dataframe Spark Scala

I have a dataframe with two columns with data as below +----+-----------------+ |acct| device| +----+-----------------+ | B| List(3, 4)| | C| List(3, 5)| | A| List(2, 6)| | B|List(3, 11, 4, 9)| | C| …
Babu
  • 861
  • 3
  • 13
  • 36
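
A sketch of one way to merge the per-row lists: explode each list into rows, then collect the values back per account (column names mirror the excerpt; the sample data is abbreviated):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("group-concat").getOrCreate()
import spark.implicits._

val df = Seq(
  ("B", Seq(3, 4)),
  ("C", Seq(3, 5)),
  ("A", Seq(2, 6)),
  ("B", Seq(3, 11, 4, 9))
).toDF("acct", "device")

// Explode each list into one row per element, then gather them back per account.
// collect_set drops duplicates; use collect_list to keep every occurrence.
val grouped = df
  .withColumn("device", explode($"device"))
  .groupBy("acct")
  .agg(collect_set("device").as("devices"))
```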
7
votes
2 answers

Spark SQL - Encoders for Tuple Containing a List or Array as an Element

Using Spark 2.2 + Java 1.8 I have two custom data types "Foo" and "Bar". Each one implements Serializable. 'Foo' has a one-to-many relationship with 'Bar', so their relationship is represented as a Tuple: Tuple2<Foo, List<Bar>> Typically, when I have…
HansGruber
  • 71
  • 1
  • 5
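
The question is asked for Java, but the idea can be sketched in Scala: when the built-in encoders cannot handle a tuple whose element is a collection of a custom type, a Kryo-based encoder is one fallback. Foo and Bar below are simplified stand-ins for the question's types:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Simplified stand-ins for the question's custom types.
case class Bar(id: Long) extends Serializable
case class Foo(name: String) extends Serializable

val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()

// Kryo avoids the need for a built-in encoder for the nested collection,
// at the cost of storing each tuple as a single opaque binary column.
implicit val fooBarsEnc: Encoder[(Foo, Seq[Bar])] = Encoders.kryo[(Foo, Seq[Bar])]

val ds = spark.createDataset(Seq((Foo("f1"), Seq(Bar(1L), Bar(2L)))))
```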
7
votes
2 answers

PySpark.sql.filter not performing as it should

I am running into a problem when executing the code below: from pyspark.sql import functions as F from pyspark.sql import Row, HiveContext hc = HiveContext() rows1 = [Row(id1 = '2', id2 = '1', id3 = 'a'), Row(id1 = '3', id2 = '2', id3 =…
7
votes
3 answers

Saving a dataframe result value to a string variable?

I created a dataframe in Spark, and when I find the max date I want to save it to a variable. Just trying to figure out how to get the result, which is a string, and save it to a variable. code so far: sqlDF = spark.sql("SELECT MAX(date) FROM…
oharr
  • 163
  • 1
  • 3
  • 12
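
Since the aggregation yields a single row, one option is to bring that row to the driver and read the cell into an ordinary variable. A Scala sketch, assuming a registered view named "events" with a string date column (adjust the type accessor if the column is a DateType):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("max-date").getOrCreate()

// Assumed table/view name and column type for illustration.
val sqlDF = spark.sql("SELECT MAX(date) AS max_date FROM events")

// first() pulls the single aggregated row to the driver; the cell can then be
// read into a plain variable.
val maxDate: String = sqlDF.first().getAs[String]("max_date")
```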
7
votes
4 answers

How to add days (as values of a column) to date?

I have a problem with adding days (numbers) to date format columns in Spark. I know that there is a function date_add that takes two arguments - a date column and an integer: date_add(date startdate, tinyint/smallint/int days) I'd like to use a column…
Mrgr8m4
  • 477
  • 3
  • 9
  • 29
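
One way around the integer-literal restriction of the Scala date_add helper (on Spark 2.x) is to go through a SQL expression, where the second argument may be a column. A sketch with hypothetical column names start_date and days:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("date-add").getOrCreate()
import spark.implicits._

// Hypothetical data: a start date column and a column holding the days to add.
val df = Seq(("2018-01-01", 5), ("2018-02-10", 30)).toDF("start_date", "days")
  .withColumn("start_date", to_date($"start_date"))

// The SQL expression form accepts a column for the second argument even on
// Spark versions where functions.date_add only takes an integer literal.
val withEnd = df.withColumn("end_date", expr("date_add(start_date, days)"))
```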
7
votes
1 answer

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group. The data looks like: id time name 132 12 Lucy 132 10 John 132 15 Sam 78 11 Kate 78 7 Julia 78 2 Vivien 245 22 Tom I would like to get this: id time name 132 …
MLam
  • 161
  • 1
  • 2
  • 10
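
A sketch of two options, depending on whether a global ordering is needed or only an ordering within each id:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sort-groups").getOrCreate()
import spark.implicits._

val df = Seq(
  (132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
  (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"), (245, 22, "Tom")
).toDF("id", "time", "name")

// A global sort by id then time yields rows grouped per id and ordered by time.
val sortedGlobally = df.orderBy($"id", $"time")

// If only the ordering within each id matters (e.g. before a per-group write),
// repartitioning by id and sorting within partitions avoids a full global sort.
val sortedWithinGroups = df.repartition($"id").sortWithinPartitions($"id", $"time")
```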
7
votes
3 answers

How to sort array of struct type in Spark DataFrame by particular field?

Given following code: import java.sql.Date import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ object SortQuestion extends App{ val spark = SparkSession.builder().appName("local").master("local[*]").getOrCreate() …
addmeaning
  • 1,358
  • 1
  • 13
  • 36
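
One approach that avoids UDFs over arrays of Row is to work through the typed Dataset API, where each array can be sorted by any struct field with ordinary Scala collections. The case classes below are hypothetical stand-ins for the question's schema:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema: each row carries an array of (fruit, weight) structs.
case class Item(fruit: String, weight: Int)
case class Basket(id: Long, items: Seq[Item])

val spark = SparkSession.builder().master("local[*]").appName("sort-struct-array").getOrCreate()
import spark.implicits._

val ds = Seq(
  Basket(1L, Seq(Item("pear", 5), Item("apple", 2), Item("plum", 9)))
).toDS()

// Sort each row's array by the chosen struct field inside a typed map.
val sorted = ds.map(b => b.copy(items = b.items.sortBy(_.weight)))
```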
7
votes
1 answer

Structured Streaming and Splitting nested data into multiple datasets

I'm working with Spark's Structured Streaming (2.2.1), using Kafka to receive data from sensors every 60 seconds. I'm having trouble wrapping my head around how to package this Kafka data to be able to process it correctly as it comes. I need to be…
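
A common pattern is to parse the Kafka value with a declared schema and then explode the nested batch into one row per reading. A sketch, where the schema, broker address, and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("sensor-stream").getOrCreate()
import spark.implicits._

// Assumed payload: one Kafka message carries a sensor id plus a batch of readings.
val schema = new StructType()
  .add("sensorId", StringType)
  .add("readings", ArrayType(
    new StructType().add("ts", LongType).add("value", DoubleType)))

val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "sensors")                       // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("msg"))
  // Flatten the nested batch into one row per reading for downstream processing.
  .select($"msg.sensorId", explode($"msg.readings").as("reading"))
  .select($"sensorId", $"reading.ts", $"reading.value")
```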
7
votes
1 answer

How to pushdown limit predicate for Cassandra when you use dataframes?

I have a large Cassandra table. I want to load only 50 rows from Cassandra. The following code val ds = sparkSession.read .format("org.apache.spark.sql.cassandra") .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace")) …
addmeaning
  • 1,358
  • 1
  • 13
  • 36
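
Whether a limit actually reaches Cassandra depends on the connector version, so inspecting the physical plan is the quickest sanity check. A sketch with placeholder keyspace and table names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cassandra-limit").getOrCreate()

// Placeholder keyspace/table; requires the spark-cassandra-connector on the classpath.
val ds = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "aggregates", "keyspace" -> "my_keyspace"))
  .load()
  .limit(50)

// The extended plan shows whether the limit is applied at the source or only
// after a scan inside Spark.
ds.explain(true)
```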
7
votes
1 answer

Are the join types defined as constants somewhere accessible in Apache Spark?

I haven't found them after having a cursory glance at the Spark codebase. In most documentation and tutorial examples, people seem to be using 'naked' string literals to specify join types. Does Spark provide an object or class defining "leftouter",…
Tobias Roland
  • 1,182
  • 1
  • 13
  • 35
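
Spark parses these join types from plain strings, and the class that enumerates them lives in Spark's internal catalyst package rather than the public API, so a small local object of constants is one pragmatic way to avoid typos. A sketch:

```scala
// String literals accepted by Dataset.join; a local object, not an official Spark API.
object JoinTypes {
  val Inner      = "inner"
  val Cross      = "cross"
  val LeftOuter  = "left_outer"
  val RightOuter = "right_outer"
  val FullOuter  = "full_outer"
  val LeftSemi   = "left_semi"
  val LeftAnti   = "left_anti"
}

// Usage, assuming two DataFrames sharing an "id" column:
// left.join(right, Seq("id"), JoinTypes.LeftOuter)
```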
7
votes
1 answer

Pyspark- Subquery in a case statement

I am trying to run a subquery inside a case statement in Pyspark and it is throwing an exception. I am trying to create a new flag if the id in one table is present in a different table. Is this even possible in pyspark? temp_df=spark.sql("select *,…
kkumar
  • 173
  • 2
  • 5
  • 15
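
Spark SQL has historically only supported IN/EXISTS subqueries in the WHERE clause, so one common workaround is to express the flag as a left join instead of a subquery inside CASE WHEN. A Scala sketch with hypothetical tables main and lookup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("flag-join").getOrCreate()
import spark.implicits._

// Hypothetical tables: flag rows of `main` whose id also appears in `lookup`.
val main   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "payload")
val lookup = Seq(Tuple1(2), Tuple1(3)).toDF("id")

// Equivalent of CASE WHEN id IN (SELECT id FROM lookup) THEN 1 ELSE 0 END,
// expressed as a left join so no correlated subquery is needed.
val flagged = main
  .join(lookup.distinct().withColumn("flag", lit(1)), Seq("id"), "left")
  .withColumn("flag", coalesce($"flag", lit(0)))
```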
7
votes
1 answer

Does the SparkSQL Dataframe function explode preserve order?

I have a Scala spark DataFrame: df.select($"row_id", $"array_of_data").show +----------+--------------------+ | row_id | array_of_data | +----------+--------------------+ | 0 | [123, ABC, G12] | | 1 | [100, 410] | | …
Kyle Heuton
  • 9,318
  • 4
  • 40
  • 52
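
If downstream logic must not depend on implicit row ordering, posexplode is a sketch-worthy alternative: it emits each element's index as an explicit column. Sample data mirrors the excerpt:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("posexplode").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, Seq("123", "ABC", "G12")),
  (1, Seq("100", "410"))
).toDF("row_id", "array_of_data")

// posexplode keeps the element's position as a column, so later transformations
// can rely on "pos" rather than on the order rows happen to come out in.
val exploded = df.select($"row_id", posexplode($"array_of_data").as(Seq("pos", "data")))
```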
7
votes
1 answer

PySpark aggregation function for "any value"

I have a PySpark Dataframe with an A field, a few B fields that depend on A (A->B), and C fields that I want to aggregate for each A. For example: A | B | C ---------- A | 1 | 6 A | 1 | 7 B | 2 | 8 B | 2 | 4 I wish to group by A, present any of B…
Dimgold
  • 2,748
  • 5
  • 26
  • 49
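
When B is functionally dependent on A, first() is a reasonable "any value" aggregate, while C is aggregated normally. A sketch with the excerpt's sample data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("any-value").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1, 6), ("A", 1, 7), ("B", 2, 8), ("B", 2, 4)).toDF("A", "B", "C")

// first() returns an arbitrary (typically first-encountered) value per group,
// which is enough when B is constant within each A; C is summed as an example.
val result = df.groupBy($"A").agg(first($"B").as("B"), sum($"C").as("sum_C"))
```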
7
votes
4 answers

How to create a Row from a given case class?

Imagine that you have the following case classes: case class B(key: String, value: Int) case class A(name: String, data: B) Given an instance of A, how do I create a Spark Row? e.g. val a = A("a", B("b", 0)) val row = ??? NOTE: Given row I need to…
Marsellus Wallace
  • 17,991
  • 25
  • 90
  • 154
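
Since nested case classes map to nested Rows, one sketch is to build the Row structure directly from the instance:

```scala
import org.apache.spark.sql.Row

case class B(key: String, value: Int)
case class A(name: String, data: B)

val a = A("a", B("b", 0))

// Nested case classes correspond to nested Rows.
val row: Row = Row(a.name, Row(a.data.key, a.data.value))
```

Going through an Encoder (for example, spark.createDataset(Seq(a)).toDF().head() with spark.implicits._ in scope) is another option when the Row's schema should stay in sync with the case class automatically.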