Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7
votes
2 answers

How to dynamically slice an Array column in Spark?

Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an Integer column that has the number of elements I want to pick…
harppu
  • 384
  • 4
  • 13
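
For the dynamic-slice question above, a minimal PySpark sketch of one common workaround: in Spark 2.4 the Python slice() helper only accepts literal start/length arguments, so passing the per-row count through a SQL expression is a typical answer. The column names arr and n are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dynamic_slice").getOrCreate()

# Hypothetical data: an array column plus a per-row count of elements to keep.
df = spark.createDataFrame([([1, 2, 3, 4, 5], 2), ([10, 20, 30], 3)], ["arr", "n"])

# slice(arr, start, length) evaluated as SQL lets both arguments come from columns.
df.withColumn("sliced", F.expr("slice(arr, 1, n)")).show(truncate=False)
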
7
votes
2 answers

Spark: How to aggregate/reduce records based on time difference?

I have time series data in CSV from a vehicle with the following information: trip-id, timestamp, speed. The data looks like this:
trip-id | timestamp  | speed
001     | 1538204192 | 44.55
001     | 1538204193 | 47.20 <-- start of brake
001     |…
Shumail
  • 3,103
  • 4
  • 28
  • 35
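
A sketch of the usual window-function approach to this kind of gap-based grouping, assuming a new group starts whenever the gap to the previous row exceeds some threshold; the 5-second threshold and the aggregations are illustrative, not the asker's exact requirement.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("time_gap_groups").getOrCreate()

# Hypothetical rows shaped like the excerpt: trip_id, epoch-second timestamp, speed.
df = spark.createDataFrame(
    [("001", 1538204192, 44.55), ("001", 1538204193, 47.20), ("001", 1538204260, 12.00)],
    ["trip_id", "timestamp", "speed"],
)

w = Window.partitionBy("trip_id").orderBy("timestamp")
gap = 5  # assumed: a new group starts after more than 5 seconds of silence

(df.withColumn("prev_ts", F.lag("timestamp").over(w))
   .withColumn("new_group", F.when(F.col("prev_ts").isNull(), 1)
                             .when(F.col("timestamp") - F.col("prev_ts") > gap, 1)
                             .otherwise(0))
   .withColumn("group_id", F.sum("new_group").over(w))   # running sum = group id
   .groupBy("trip_id", "group_id")
   .agg(F.min("timestamp").alias("start_ts"),
        F.max("timestamp").alias("end_ts"),
        F.avg("speed").alias("avg_speed"))
   .show())
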
7
votes
1 answer

Run a sql query on a PySpark DataFrame

I am using Databricks and I have already loaded some DataTables. However, I have a complex SQL query that I want to run against these data tables, and I wonder if I could avoid translating it into PySpark. Is that possible? To give an example: In…
George Sotiropoulos
  • 1,864
  • 1
  • 22
  • 32
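
A small sketch of the standard route for the question above: register the loaded tables as temporary views and run the existing SQL through spark.sql(); the table and column names here are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_on_dataframes").getOrCreate()

# Placeholder DataFrames standing in for the tables already loaded in Databricks.
orders = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Temporary views make the DataFrames addressable from plain SQL, unchanged.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

spark.sql("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name
""").show()
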
7
votes
3 answers

How to save dataframe to Elasticsearch in PySpark?

I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df =…
Cyber_Tron
  • 299
  • 1
  • 6
  • 17
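
A hedged sketch of writing through the elasticsearch-hadoop connector, assuming its jar is supplied to the job (e.g. via --packages or --jars) and that the endpoint, index name, and id column below are placeholders for the real ones.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ES_indexer").getOrCreate()
df = spark.createDataFrame([(1, "first doc"), (2, "second doc")], ["id", "text"])

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "my-es-domain.example.com")  # hypothetical endpoint
   .option("es.port", "443")
   .option("es.nodes.wan.only", "true")             # common setting for hosted/AWS ES
   .option("es.mapping.id", "id")                   # use the id column as the document id
   .mode("append")
   .save("my_index/_doc"))                          # hypothetical index/type resource
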
7
votes
0 answers

Create a single schema dataframe when reading multiple csv files under a directory

I have thousands of CSV files that have similar but non-identical headers under a single directory. The structure is as follows: path/to/files/unique_parent_directory/*.csv. One csv file can be:
|Column_A|Column_B|Column_C|Column_D|
|V1      |V2 …
SaadK
  • 256
  • 2
  • 10
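
One possible sketch for merging files with differing headers: read each file separately, pad missing columns with nulls, and union everything by name. The target column list, the string cast, and the driver-local glob() listing are assumptions; on S3 or HDFS the file listing would need a different mechanism.

import glob
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("merge_csv_headers").getOrCreate()

target_columns = ["Column_A", "Column_B", "Column_C", "Column_D"]  # assumed superset
paths = glob.glob("path/to/files/unique_parent_directory/*.csv")   # driver-local listing

def align(frame):
    # Add any column the file lacks as nulls, then fix the column order.
    for c in target_columns:
        if c not in frame.columns:
            frame = frame.withColumn(c, F.lit(None).cast("string"))
    return frame.select(target_columns)

frames = [align(spark.read.option("header", "true").csv(p)) for p in paths]
merged = reduce(lambda a, b: a.unionByName(b), frames)
merged.show()
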
7
votes
4 answers

How to apply large python model to pyspark-dataframe?

I have: a large dataframe (Parquet format, 100,000,000 rows, 4.5 TB) that contains some data (features); several huge ML models (each one takes 5-15 GB of RAM); a Spark cluster (AWS EMR), typical node configuration is 8 CPU, 32 RAM, can be changed if…
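
A rough sketch of the mapPartitions pattern usually suggested for this: load the heavy model once per partition instead of once per row. load_my_model and model.predict are hypothetical stand-ins for the real model code (a dummy loader is included so the sketch runs).

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.appName("score_with_large_model").getOrCreate()

# Hypothetical input: an id plus features already assembled into a list column.
df = spark.createDataFrame([(1, [0.1, 0.2]), (2, [0.3, 0.4])], ["id", "features"])

def load_my_model(path):
    # Stand-in for the real 5-15 GB model load; returns a dummy object here.
    class Dummy:
        def predict(self, batch):
            return [sum(features) for features in batch]
    return Dummy()

def score_partition(rows):
    model = load_my_model("/mnt/models/big_model.bin")  # loads once per partition
    for row in rows:
        yield (row["id"], float(model.predict([row["features"]])[0]))

schema = StructType([StructField("id", LongType()), StructField("prediction", DoubleType())])
spark.createDataFrame(df.rdd.mapPartitions(score_partition), schema).show()
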
7
votes
1 answer

How to remove duplicates from a spark data frame while retaining the latest?

I'm using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the data frame, retaining the newest (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be…
lalatnayak
  • 160
  • 1
  • 6
  • 21
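
A sketch of the usual row_number() answer, with made-up column names: partition by the two dedup keys, order by the timestamp descending, and keep the first row of each group.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe_latest").getOrCreate()

# Hypothetical columns: two dedup keys plus the timestamp that defines "newest".
df = spark.createDataFrame([("a", "x", 1), ("a", "x", 5), ("b", "y", 2)], ["key1", "key2", "ts"])

w = Window.partitionBy("key1", "key2").orderBy(F.col("ts").desc())
(df.withColumn("rn", F.row_number().over(w))
   .filter(F.col("rn") == 1)   # keep only the newest row per key pair
   .drop("rn")
   .show())
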
7
votes
1 answer

select latest record from spark dataframe

I have a DataFrame that looks like this:
+-------+---------+
|email  |timestamp|
+-------+---------+
|x@y.com|        1|
|y@m.net|        2|
|z@c.org|        3|
|x@y.com|        4|
|y@m.net|        5|
|  ..   |       ..|
+-------+---------+
for each…
user468587
  • 4,799
  • 24
  • 67
  • 124
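
Rather than repeating the row_number() pattern from the previous question, here is a sketch of the max-of-struct trick: taking max over a struct whose first field is the timestamp picks the latest row per email. The extra payload column is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("latest_per_email").getOrCreate()

df = spark.createDataFrame(
    [("x@y.com", 1, "a"), ("y@m.net", 2, "b"), ("x@y.com", 4, "c")],
    ["email", "timestamp", "payload"],   # payload is a hypothetical extra column
)

# Structs compare field by field, so max(struct(timestamp, payload)) is the latest row.
latest = (df.groupBy("email")
            .agg(F.max(F.struct("timestamp", "payload")).alias("m"))
            .select("email", "m.timestamp", "m.payload"))
latest.show()
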
7
votes
2 answers

Filter only non-empty arrays in a Spark dataframe

How can I filter only non-empty arrays?
import org.apache.spark.sql.types.ArrayType
val arrayFields = secondDF.schema.filter(st => st.dataType.isInstanceOf[ArrayType])
val names = arrayFields.map(_.name)
Or is this code val…
Carlos
  • 357
  • 2
  • 3
  • 14
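
The question is in Scala, but the idea translates directly; a PySpark sketch, assuming "non-empty" means every array-typed column has size > 0 on a row (size() returns -1 for nulls, so those rows are excluded too).

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

spark = SparkSession.builder.appName("non_empty_arrays").getOrCreate()

df = spark.createDataFrame([(1, [1, 2]), (2, []), (3, None)], "id int, vals array<int>")

# Collect every array-typed column, mirroring the schema.filter in the excerpt.
array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# Keep rows where all array columns are non-empty.
cond = reduce(lambda a, b: a & b, [F.size(F.col(c)) > 0 for c in array_cols])
df.filter(cond).show()
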
7
votes
2 answers

Spark decimal type precision loss

I'm doing some testing of spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations but the example…
Jared
  • 2,904
  • 6
  • 33
  • 37
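
A small sketch that reproduces the effect described above: multiplying two decimal(38,18) columns forces Spark to derive a result type that still fits in 38 digits, so the scale is reduced and the product may be rounded. The values are arbitrary.

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("decimal_precision").getOrCreate()

df = spark.createDataFrame(
    [(Decimal("1234567.891234"), Decimal("1.123456"))],
    "amount decimal(38,18), rate decimal(38,18)",
)

product = df.select((F.col("amount") * F.col("rate")).alias("result"))
product.printSchema()            # shows the derived precision/scale of the product
product.show(truncate=False)
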
7
votes
1 answer

Concat multiple columns of a dataframe using pyspark

Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns) # ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard code the…
Amita Rawat
  • 153
  • 1
  • 2
  • 6
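
A sketch of one standard answer: unpack the list of names into Column objects so nothing is hard-coded; the underscore separator passed to concat_ws is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concat_from_list").getOrCreate()

df = spark.createDataFrame([("a", "b", "c")], ["col1", "col2", "col3"])
col_list = ["col1", "col2"]

# Build the Column list from the names, then concatenate with a separator.
df.withColumn("combined", F.concat_ws("_", *[F.col(c) for c in col_list])).show()
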
7
votes
2 answers

Multiple WHEN condition implementation in Pyspark

I have my T-SQL code below, which I've converted to PySpark, but it is giving me an error:
CASE WHEN time_on_site.eventaction = 'IN'
     AND time_on_site.next_action = 'OUT'
     AND time_on_site.timespent_sec < 72000 THEN 1 -- 20 hours
     WHEN…
Katelyn Raphael
  • 253
  • 2
  • 4
  • 16
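
A sketch of the chained when()/otherwise() translation, using the column names visible in the excerpt; the second branch and the output column name are invented, since the rest of the original CASE expression is truncated.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("case_when").getOrCreate()

# Hypothetical rows with the columns named in the excerpt.
time_on_site = spark.createDataFrame(
    [("IN", "OUT", 3600), ("IN", "OUT", 90000), ("OUT", "IN", 10)],
    ["eventaction", "next_action", "timespent_sec"],
)

flagged = time_on_site.withColumn(
    "flag",
    F.when((F.col("eventaction") == "IN")
           & (F.col("next_action") == "OUT")
           & (F.col("timespent_sec") < 72000), 1)   # 20 hours
     .when(F.col("eventaction") == "IN", 2)         # invented second branch
     .otherwise(0),
)
flagged.show()
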
7
votes
5 answers

Spark Advanced Window with dynamic last

Problem: Given time series data, which is a clickstream of user activity stored in Hive, the ask is to enrich the data with a session id using Spark. Session definition: a session expires after 1 hour of inactivity; a session remains active for a total…
Arghya Saha
  • 227
  • 1
  • 4
  • 17
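
A partial sketch covering only the inactivity rule (1 hour); the "maximum total session length" part of the definition needs an extra pass on top of this and is left out. Column names are assumed.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sessionize").getOrCreate()

# Hypothetical clickstream: user id plus event time in epoch seconds.
clicks = spark.createDataFrame([("u1", 0), ("u1", 1800), ("u1", 7200), ("u2", 100)],
                               ["user_id", "event_ts"])

w = Window.partitionBy("user_id").orderBy("event_ts")
inactivity = 3600  # 1 hour, per the session definition

sessions = (clicks.withColumn("prev_ts", F.lag("event_ts").over(w))
                  .withColumn("new_session",
                              F.when(F.col("prev_ts").isNull(), 1)
                               .when(F.col("event_ts") - F.col("prev_ts") > inactivity, 1)
                               .otherwise(0))
                  .withColumn("session_seq", F.sum("new_session").over(w))
                  .withColumn("session_id",
                              F.concat_ws("-", F.col("user_id"),
                                          F.col("session_seq").cast("string"))))
sessions.show()
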
7
votes
2 answers

Efficient string suffix detection

I am working with PySpark on a huge dataset, where I want to filter the data frame based on strings in another data frame. For example, dd =…
Sotos
  • 51,121
  • 6
  • 32
  • 66
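
A sketch of one common approach when the suffix frame is small: collect it to the driver and build a single OR-ed endswith() condition, which avoids a join against the large frame. Frame and column names are illustrative.

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("suffix_filter").getOrCreate()

dd = spark.createDataFrame([("foo_ab",), ("bar_cd",), ("baz_zz",)], ["value"])
suffixes = spark.createDataFrame([("_ab",), ("_cd",)], ["suffix"])

# Small suffix list -> collect it and turn it into one boolean expression.
suffix_list = [r["suffix"] for r in suffixes.collect()]
cond = reduce(lambda a, b: a | b, [F.col("value").endswith(s) for s in suffix_list])
dd.filter(cond).show()
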
7
votes
2 answers

Spark - How to add an element to an array of structs

Having this schema:
root
 |-- Elems: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Elem: integer (nullable = true)
 |    |    |-- Desc: string (nullable = true)
How can we add a new field like that?
root …
rvilla
  • 165
  • 1
  • 2
  • 10
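
A sketch using the transform() higher-order function (Spark 2.4+) to rebuild each struct with one more field; the new field's name and value are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("add_struct_field").getOrCreate()

# Data shaped like the schema in the excerpt: an array of structs with Elem and Desc.
df = spark.createDataFrame([([(1, "one"), (2, "two")],)],
                           "Elems array<struct<Elem:int, Desc:string>>")

# Rebuild every struct with the original fields plus a placeholder Extra field.
df = df.withColumn(
    "Elems",
    F.expr("transform(Elems, x -> named_struct('Elem', x.Elem, 'Desc', x.Desc, 'Extra', 'n/a'))"),
)
df.printSchema()
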