Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7
votes
2 answers

How to dynamically slice an Array column in Spark?

Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an Integer column that has the number of elements I want to pick…
harppu
  • 384
  • 4
  • 13
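
For the dynamic-slice question above, a minimal PySpark sketch of one common workaround: in Spark 2.4 the Python slice() helper only accepts literal start/length arguments, so passing the per-row count through a SQL expression is a typical answer. The column names arr and n are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dynamic_slice").getOrCreate()

# Hypothetical data: an array column plus a per-row count of elements to keep.
df = spark.createDataFrame([([1, 2, 3, 4, 5], 2), ([10, 20, 30], 3)], ["arr", "n"])

# slice(arr, start, length) evaluated as SQL lets both arguments come from columns.
df.withColumn("sliced", F.expr("slice(arr, 1, n)")).show(truncate=False)
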
7
votes
2 answers

Spark: How to aggregate/reduce records based on time difference?

I have time series data in CSV from a vehicle with the following information: trip-id, timestamp, speed. The data looks like this:
trip-id | timestamp  | speed
001     | 1538204192 | 44.55
001     | 1538204193 | 47.20 <-- start of brake
001     |…
Shumail
  • 3,103
  • 4
  • 28
  • 35
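
A sketch of the usual window-function approach to this kind of gap-based grouping, assuming a new group starts whenever the gap to the previous row exceeds some threshold; the 5-second threshold and the aggregations are illustrative, not the asker's exact requirement.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("time_gap_groups").getOrCreate()

# Hypothetical rows shaped like the excerpt: trip_id, epoch-second timestamp, speed.
df = spark.createDataFrame(
    [("001", 1538204192, 44.55), ("001", 1538204193, 47.20), ("001", 1538204260, 12.00)],
    ["trip_id", "timestamp", "speed"],
)

w = Window.partitionBy("trip_id").orderBy("timestamp")
gap = 5  # assumed: a new group starts after more than 5 seconds of silence

(df.withColumn("prev_ts", F.lag("timestamp").over(w))
   .withColumn("new_group", F.when(F.col("prev_ts").isNull(), 1)
                             .when(F.col("timestamp") - F.col("prev_ts") > gap, 1)
                             .otherwise(0))
   .withColumn("group_id", F.sum("new_group").over(w))   # running sum = group id
   .groupBy("trip_id", "group_id")
   .agg(F.min("timestamp").alias("start_ts"),
        F.max("timestamp").alias("end_ts"),
        F.avg("speed").alias("avg_speed"))
   .show())
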
7
votes
1 answer

Run a sql query on a PySpark DataFrame

I am using Databricks and I have already loaded some DataTables. However, I have a complex SQL query that I want to run against these data tables, and I wonder if I could avoid translating it into PySpark. Is that possible? To give an example: In…
George Sotiropoulos
  • 1,864
  • 1
  • 22
  • 32
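
A small sketch of the standard route for the question above: register the loaded tables as temporary views and run the existing SQL through spark.sql(); the table and column names here are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_on_dataframes").getOrCreate()

# Placeholder DataFrames standing in for the tables already loaded in Databricks.
orders = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Temporary views make the DataFrames addressable from plain SQL, unchanged.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

spark.sql("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.name
""").show()
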
7
votes
3 answers

How to save dataframe to Elasticsearch in PySpark?

I have a Spark dataframe that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df =…
Cyber_Tron
  • 299
  • 1
  • 6
  • 17
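
A hedged sketch of writing through the elasticsearch-hadoop connector, assuming its jar is supplied to the job (e.g. via --packages or --jars) and that the endpoint, index name, and id column below are placeholders for the real ones.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ES_indexer").getOrCreate()
df = spark.createDataFrame([(1, "first doc"), (2, "second doc")], ["id", "text"])

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "my-es-domain.example.com")  # hypothetical endpoint
   .option("es.port", "443")
   .option("es.nodes.wan.only", "true")             # common setting for hosted/AWS ES
   .option("es.mapping.id", "id")                   # use the id column as the document id
   .mode("append")
   .save("my_index/_doc"))                          # hypothetical index/type resource
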
7
votes
0 answers

Create a single schema dataframe when reading multiple csv files under a directory

I have thousands of CSV files that have similar but non-identical headers under a single directory. The structure is as follows: path/to/files/unique_parent_directory/*.csv. One csv file can be:
|Column_A|Column_B|Column_C|Column_D|
|V1      |V2 …
SaadK
  • 256
  • 2
  • 10
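
One possible sketch for merging files with differing headers: read each file separately, pad missing columns with nulls, and union everything by name. The target column list, the string cast, and the driver-local glob() listing are assumptions; on S3 or HDFS the file listing would need a different mechanism.

import glob
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("merge_csv_headers").getOrCreate()

target_columns = ["Column_A", "Column_B", "Column_C", "Column_D"]  # assumed superset
paths = glob.glob("path/to/files/unique_parent_directory/*.csv")   # driver-local listing

def align(frame):
    # Add any column the file lacks as nulls, then fix the column order.
    for c in target_columns:
        if c not in frame.columns:
            frame = frame.withColumn(c, F.lit(None).cast("string"))
    return frame.select(target_columns)

frames = [align(spark.read.option("header", "true").csv(p)) for p in paths]
merged = reduce(lambda a, b: a.unionByName(b), frames)
merged.show()
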
7
votes
4 answers

How to apply large python model to pyspark-dataframe?

I have: a large dataframe (Parquet format, 100,000,000 rows, 4.5 TB) that contains some data (features); several huge ML models (each one takes 5-15 GB of RAM); a Spark cluster (AWS EMR), typical node configuration is 8 CPU, 32 RAM, can be changed if…
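
A rough sketch of the mapPartitions pattern usually suggested for this: load the heavy model once per partition instead of once per row. load_my_model and model.predict are hypothetical stand-ins for the real model code (a dummy loader is included so the sketch runs).

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.appName("score_with_large_model").getOrCreate()

# Hypothetical input: an id plus features already assembled into a list column.
df = spark.createDataFrame([(1, [0.1, 0.2]), (2, [0.3, 0.4])], ["id", "features"])

def load_my_model(path):
    # Stand-in for the real 5-15 GB model load; returns a dummy object here.
    class Dummy:
        def predict(self, batch):
            return [sum(features) for features in batch]
    return Dummy()

def score_partition(rows):
    model = load_my_model("/mnt/models/big_model.bin")  # loads once per partition
    for row in rows:
        yield (row["id"], float(model.predict([row["features"]])[0]))

schema = StructType([StructField("id", LongType()), StructField("prediction", DoubleType())])
spark.createDataFrame(df.rdd.mapPartitions(score_partition), schema).show()
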
7
votes
1 answer

How to remove duplicates from a spark data frame while retaining the latest?

I'm using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the data frame, retaining the newest (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be…
lalatnayak
  • 160
  • 1
  • 6
  • 21
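
A sketch of the usual row_number() answer, with made-up column names: partition by the two dedup keys, order by the timestamp descending, and keep the first row of each group.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe_latest").getOrCreate()

# Hypothetical columns: two dedup keys plus the timestamp that defines "newest".
df = spark.createDataFrame([("a", "x", 1), ("a", "x", 5), ("b", "y", 2)], ["key1", "key2", "ts"])

w = Window.partitionBy("key1", "key2").orderBy(F.col("ts").desc())
(df.withColumn("rn", F.row_number().over(w))
   .filter(F.col("rn") == 1)   # keep only the newest row per key pair
   .drop("rn")
   .show())
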
7
votes
1 answer

select latest record from spark dataframe

I have a DataFrame that looks like this:
+-------+---------+
|email  |timestamp|
+-------+---------+
|x@y.com|        1|
|y@m.net|        2|
|z@c.org|        3|
|x@y.com|        4|
|y@m.net|        5|
|  ..   |       ..|
+-------+---------+
for each…
user468587
  • 4,799
  • 24
  • 67
  • 124
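
Rather than repeating the row_number() pattern from the previous question, here is a sketch of the max-of-struct trick: taking max over a struct whose first field is the timestamp picks the latest row per email. The extra payload column is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("latest_per_email").getOrCreate()

df = spark.createDataFrame(
    [("x@y.com", 1, "a"), ("y@m.net", 2, "b"), ("x@y.com", 4, "c")],
    ["email", "timestamp", "payload"],   # payload is a hypothetical extra column
)

# Structs compare field by field, so max(struct(timestamp, payload)) is the latest row.
latest = (df.groupBy("email")
            .agg(F.max(F.struct("timestamp", "payload")).alias("m"))
            .select("email", "m.timestamp", "m.payload"))
latest.show()
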
7
votes
2 answers

Filter only non-empty arrays in a Spark dataframe

How can I filter only non-empty arrays?
import org.apache.spark.sql.types.ArrayType
val arrayFields = secondDF.schema.filter(st => st.dataType.isInstanceOf[ArrayType])
val names = arrayFields.map(_.name)
Or is this code val…
Carlos
  • 357
  • 2
  • 3
  • 14
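
The question is in Scala, but the idea translates directly; a PySpark sketch, assuming "non-empty" means every array-typed column has size > 0 on a row (size() returns -1 for nulls, so those rows are excluded too).

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

spark = SparkSession.builder.appName("non_empty_arrays").getOrCreate()

df = spark.createDataFrame([(1, [1, 2]), (2, []), (3, None)], "id int, vals array<int>")

# Collect every array-typed column, mirroring the schema.filter in the excerpt.
array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# Keep rows where all array columns are non-empty.
cond = reduce(lambda a, b: a & b, [F.size(F.col(c)) > 0 for c in array_cols])
df.filter(cond).show()
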
7
votes
2 answers

Spark decimal type precision loss

I'm doing some testing of spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations but the example…
Jared
  • 2,904
  • 6
  • 33
  • 37
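
A small sketch that reproduces the effect described above: multiplying two decimal(38,18) columns forces Spark to derive a result type that still fits in 38 digits, so the scale is reduced and the product may be rounded. The values are arbitrary.

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("decimal_precision").getOrCreate()

df = spark.createDataFrame(
    [(Decimal("1234567.891234"), Decimal("1.123456"))],
    "amount decimal(38,18), rate decimal(38,18)",
)

product = df.select((F.col("amount") * F.col("rate")).alias("result"))
product.printSchema()            # shows the derived precision/scale of the product
product.show(truncate=False)
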
7
votes
1 answer

Concat multiple columns of a dataframe using pyspark

Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns) # ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard code the…
Amita Rawat
  • 153
  • 1
  • 2
  • 6
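
A sketch of one standard answer: unpack the list of names into Column objects so nothing is hard-coded; the underscore separator passed to concat_ws is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concat_from_list").getOrCreate()

df = spark.createDataFrame([("a", "b", "c")], ["col1", "col2", "col3"])
col_list = ["col1", "col2"]

# Build the Column list from the names, then concatenate with a separator.
df.withColumn("combined", F.concat_ws("_", *[F.col(c) for c in col_list])).show()
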
7
votes
2 answers

Multiple WHEN condition implementation in Pyspark

I have my T-SQL code below, which I've converted to PySpark, but it is giving me an error:
CASE WHEN time_on_site.eventaction = 'IN'
     AND time_on_site.next_action = 'OUT'
     AND time_on_site.timespent_sec < 72000 THEN 1 -- 20 hours
     WHEN…
Katelyn Raphael
  • 253
  • 2
  • 4
  • 16
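
A sketch of the chained when()/otherwise() translation, using the column names visible in the excerpt; the second branch and the output column name are invented, since the rest of the original CASE expression is truncated.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("case_when").getOrCreate()

# Hypothetical rows with the columns named in the excerpt.
time_on_site = spark.createDataFrame(
    [("IN", "OUT", 3600), ("IN", "OUT", 90000), ("OUT", "IN", 10)],
    ["eventaction", "next_action", "timespent_sec"],
)

flagged = time_on_site.withColumn(
    "flag",
    F.when((F.col("eventaction") == "IN")
           & (F.col("next_action") == "OUT")
           & (F.col("timespent_sec") < 72000), 1)   # 20 hours
     .when(F.col("eventaction") == "IN", 2)         # invented second branch
     .otherwise(0),
)
flagged.show()
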
7
votes
5 answers

Spark Advanced Window with dynamic last

Problem: Given time series data, which is a clickstream of user activity stored in Hive, the ask is to enrich the data with a session id using Spark. Session definition: a session expires after 1 hour of inactivity; a session remains active for a total…
Arghya Saha
  • 227
  • 1
  • 4
  • 17
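
A partial sketch covering only the inactivity rule (1 hour); the "maximum total session length" part of the definition needs an extra pass on top of this and is left out. Column names are assumed.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sessionize").getOrCreate()

# Hypothetical clickstream: user id plus event time in epoch seconds.
clicks = spark.createDataFrame([("u1", 0), ("u1", 1800), ("u1", 7200), ("u2", 100)],
                               ["user_id", "event_ts"])

w = Window.partitionBy("user_id").orderBy("event_ts")
inactivity = 3600  # 1 hour, per the session definition

sessions = (clicks.withColumn("prev_ts", F.lag("event_ts").over(w))
                  .withColumn("new_session",
                              F.when(F.col("prev_ts").isNull(), 1)
                               .when(F.col("event_ts") - F.col("prev_ts") > inactivity, 1)
                               .otherwise(0))
                  .withColumn("session_seq", F.sum("new_session").over(w))
                  .withColumn("session_id",
                              F.concat_ws("-", F.col("user_id"),
                                          F.col("session_seq").cast("string"))))
sessions.show()
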
7
votes
2 answers

Efficient string suffix detection

I am working with PySpark on a huge dataset, where I want to filter the data frame based on strings in another data frame. For example, dd =…
Sotos
  • 51,121
  • 6
  • 32
  • 66
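
A sketch of one common approach when the suffix frame is small: collect it to the driver and build a single OR-ed endswith() condition, which avoids a join against the large frame. Frame and column names are illustrative.

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("suffix_filter").getOrCreate()

dd = spark.createDataFrame([("foo_ab",), ("bar_cd",), ("baz_zz",)], ["value"])
suffixes = spark.createDataFrame([("_ab",), ("_cd",)], ["suffix"])

# Small suffix list -> collect it and turn it into one boolean expression.
suffix_list = [r["suffix"] for r in suffixes.collect()]
cond = reduce(lambda a, b: a | b, [F.col("value").endswith(s) for s in suffix_list])
dd.filter(cond).show()
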
7
votes
2 answers

Spark - How to add an element to an array of structs

Having this schema:
root
 |-- Elems: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Elem: integer (nullable = true)
 |    |    |-- Desc: string (nullable = true)
How can we add a new field like that?
root …
rvilla
  • 165
  • 1
  • 2
  • 10
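
A sketch using the transform() higher-order function (Spark 2.4+) to rebuild each struct with one more field; the new field's name and value are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("add_struct_field").getOrCreate()

# Data shaped like the schema in the excerpt: an array of structs with Elem and Desc.
df = spark.createDataFrame([([(1, "one"), (2, "two")],)],
                           "Elems array<struct<Elem:int, Desc:string>>")

# Rebuild every struct with the original fields plus a placeholder Extra field.
df = df.withColumn(
    "Elems",
    F.expr("transform(Elems, x -> named_struct('Elem', x.Elem, 'Desc', x.Desc, 'Extra', 'n/a'))"),
)
df.printSchema()
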