Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7 votes, 2 answers

Does spark read the same file twice, if two stages are using the same DataFrame?

The following code reads the same CSV twice even though only one action is called. End-to-end runnable example: import pandas as pd import numpy as np df1 = pd.DataFrame(np.arange(1_000).reshape(-1,1)) df1.index =…
figs_and_nuts
  • 4,870
  • 2
  • 31
  • 56
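
A minimal sketch of one common mitigation, assuming the double read comes from two downstream stages scanning the same source: cache the DataFrame after the first read (the CSV path below is hypothetical).

    # Hypothetical sketch: cache the DataFrame so later stages reuse the scanned
    # data instead of re-reading the source CSV.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True).cache()

    total_rows = df.count()          # first action materializes the cache
    sample = df.limit(10).collect()  # later actions read from the cached data

Whether caching actually removes the second file scan depends on the physical plan, so treat this as a sketch rather than a guaranteed fix.
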
7 votes, 3 answers

Converting timestamp to epoch milliseconds in pyspark

I have a dataset like the one below: epoch_seconds eq_time 1636663343887 2021-11-12 02:12:23 Now I am trying to convert the eq_time to epoch seconds, which should match the value of the first column, but am unable to do so. Below is my…
whatsinthename
  • 1,828
  • 20
  • 59
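
A rough sketch of one way to do this, assuming eq_time can be cast to a timestamp and that the session time zone (spark.sql.session.timeZone) matches the zone the epoch values were generated in:

    # Sketch: cast the timestamp to double (epoch seconds with fraction),
    # then scale to milliseconds. Column names follow the question.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2021-11-12 02:12:23.887",)], ["eq_time"])

    df = df.withColumn("eq_ts", F.to_timestamp("eq_time"))
    df = df.withColumn("epoch_millis",
                       (F.col("eq_ts").cast("double") * 1000).cast("long"))
    df.show(truncate=False)
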
7 votes, 2 answers

AWS EMR: Pyspark: Rdd: mappartitions: Could not find valid SPARK_HOME while searching: Spark closures

I have a PySpark job that runs without any issues locally, but when it runs on the AWS cluster it gets stuck when it reaches the code below. The job processes just 100 records. "some_function" posts data to a website…
7 votes, 2 answers

Why does joining structurally identical dataframes give different results?

Update: the root issue was a bug which was fixed in Spark 3.2.0. The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for dataframes which would return…
ZygD
  • 22,092
  • 39
  • 79
  • 102
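
A minimal sketch of the alias workaround the asker mentions, with made-up column names; qualifying columns through the aliases disambiguates lineage when both sides of the join derive from the same DataFrame:

    # Sketch: alias both sides of a self-join and reference columns through
    # the aliases (names here are illustrative, not from the question).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    left, right = df.alias("l"), df.alias("r")
    joined = (left.join(right, F.col("l.id") == F.col("r.id"), "inner")
                  .select(F.col("l.id"), F.col("r.val")))
    joined.show()
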
7 votes, 2 answers

Difference between repartition(1) and coalesce(1)

In our project we are using repartition(1) to write data into a table. I am interested to know why coalesce(1) cannot be used here, because repartition is a costly operation compared to coalesce. I know repartition distributes data evenly across…
nagraj036
  • 165
  • 1
  • 6
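
A small illustrative sketch of the usual distinction, with assumed output paths: repartition(1) adds a shuffle so upstream stages keep their parallelism, while coalesce(1) merges partitions without a shuffle and can collapse the whole plan onto a single task.

    # Sketch: both produce a single output file, but the execution differs.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # Full shuffle into one partition; upstream work stays parallel.
    df.repartition(1).write.mode("overwrite").parquet("/tmp/out_repartition")

    # No shuffle; upstream work may be squeezed onto one task.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/out_coalesce")
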
7 votes, 1 answer

Dataframe Checkpoint Example Pyspark

I read about checkpoint and it looks great for my needs, but I couldn't find a good example of how to use it. My questions are: Should I specify the checkpoint dir? Is it possible to do it like this: df.checkpoint() Are there any optional params…
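
A minimal sketch of how checkpointing is typically wired up, assuming a local checkpoint directory; the directory does need to be set on the SparkContext before df.checkpoint() is called.

    # Sketch: set a checkpoint directory, then checkpoint a DataFrame to
    # truncate its lineage (the directory path is an assumption).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    df = spark.range(100)
    df_cp = df.checkpoint()   # eager by default: materializes immediately
    print(df_cp.count())
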
7 votes, 5 answers

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck. %sql ALTER TABLE…
chaitra k
  • 371
  • 1
  • 4
  • 18
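
One workaround that is often suggested (a sketch, not necessarily the only route) is to rewrite the table with the column cast, since an in-place ALTER COLUMN from BIGINT to STRING is not a supported type change; the table and column names below are placeholders and Delta Lake is assumed to be available.

    # Sketch: read the Delta table, cast the column, and overwrite the table
    # while allowing the schema to change. Names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.table("my_delta_table")
    df = df.withColumn("my_col", F.col("my_col").cast("string"))

    (df.write
       .format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .saveAsTable("my_delta_table"))
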
7 votes, 5 answers

How to use variables in SQL queries?

In SQL Server we can declare variables like declare @sparksql='', but what alternative can be used in Spark SQL, so that we don't need to hard-code any values/queries/strings?
Shrince
  • 101
  • 1
  • 1
  • 3
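
A simple, hedged alternative when the query is issued from PySpark: build the SQL string around Python variables and pass it to spark.sql (the table and column names below are placeholders). Other options, such as substitution via SET, exist but depend on the environment.

    # Sketch: parameterize a Spark SQL query from Python variables.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    threshold = 100
    query = f"SELECT * FROM sales WHERE amount > {threshold}"
    spark.sql(query).show()
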
7 votes, 0 answers

Is there an Enum type in PySpark?

I just wondered if there is an EnumType in PySpark/Spark. I want to add constraints on StringType columns (or other types as well) so that they allow only certain values in my DataFrame's schema.
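
Spark does not ship an EnumType, so one rough sketch of a substitute is to validate a StringType column against an allowed set after the data is loaded (the column name and values below are made up):

    # Sketch: enforce an "enum-like" constraint by rejecting unexpected values.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("purple",)], ["color"])

    allowed = ["red", "green", "blue"]
    bad = df.filter(~F.col("color").isin(allowed))
    if bad.count() > 0:
        raise ValueError("color column contains values outside the allowed set")
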
7 votes, 1 answer

How to use Pyspark equivalent for reset_index() in python

I'd like to know the PySpark equivalent of the reset_index() command used in pandas. When using the default command, data.reset_index(), I get the error: "'DataFrame' object has no attribute 'reset_index'".
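
A sketch of a common substitute, given that Spark DataFrames have no pandas-style index to reset: add an explicit row-number column over some ordering (the ordering column below is an assumption).

    # Sketch: emulate a sequential index with row_number over a window.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20)], ["key", "val"])

    # Note: a global window pulls all rows into one partition.
    w = Window.orderBy("key")
    df_indexed = df.withColumn("index", F.row_number().over(w) - 1)
    df_indexed.show()
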
7 votes, 1 answer

Pyspark to_timestamp with timezone

I am trying to convert datetime strings with timezone to timestamp using to_timestamp. Sample dataframe: df = spark.createDataFrame([("a", '2020-09-08 14:00:00.917+02:00'), ("b", '2020-09-08 14:00:00.900+01:00')], …
Christian Sloper
  • 7,440
  • 3
  • 15
  • 28
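
A sketch assuming Spark 3.x datetime patterns, where SSS is the millisecond fraction and XXX matches an ISO-8601 offset such as +02:00 (the parsed value is then rendered in the session time zone):

    # Sketch: parse offset-bearing strings with an explicit pattern.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", "2020-09-08 14:00:00.917+02:00"),
         ("b", "2020-09-08 14:00:00.900+01:00")],
        ["id", "ts_str"])

    df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSXXX"))
    df.show(truncate=False)
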
7 votes, 2 answers

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

I have a spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a “score” (so there are 16 rows per time frame). For each time…
Lauren Leder
  • 276
  • 1
  • 3
  • 15
7 votes, 3 answers

Fetch week start date and week end date from Date

I need to fetch the week start date and week end date from a given date, taking into account that the week starts on Sunday and ends on Saturday. I referred to this post, but it takes Monday as the starting day of the week. Is there any inbuilt function in…
ben
  • 1,404
  • 8
  • 25
  • 43
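
A sketch using dayofweek, which in Spark returns 1 for Sunday: subtracting (dayofweek - 1) days lands on the Sunday that starts the week, and adding 6 more gives the Saturday that ends it.

    # Sketch: derive Sunday-to-Saturday week boundaries from a date column.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = (spark.createDataFrame([("2021-11-12",)], ["dt"])
               .withColumn("dt", F.to_date("dt")))

    df = df.withColumn("week_start", F.expr("date_sub(dt, dayofweek(dt) - 1)"))
    df = df.withColumn("week_end", F.date_add("week_start", 6))
    df.show()   # 2021-11-12 -> week_start 2021-11-07, week_end 2021-11-13
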
7 votes, 1 answer

What is the difference between spark.table("TABLE A") and spark.read.("TABLE A")

Question as in the title: I am learning Spark SQL, but I can't get a good understanding of the difference between them. Thanks.
Sean
  • 87
  • 1
  • 6
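
Assuming the second call in the title means spark.read.table, a small sketch showing that both return a DataFrame backed by the same catalog table; spark.table is essentially a shorthand, while spark.read.table goes through the DataFrameReader.

    # Sketch: both lookups resolve the same table from the catalog.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.range(5).write.mode("overwrite").saveAsTable("demo_table")

    df1 = spark.table("demo_table")
    df2 = spark.read.table("demo_table")
    assert df1.schema == df2.schema
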
7 votes, 2 answers

PySpark - pass a value from another column as the parameter of spark function

I have a spark dataframe which looks like this, where expr is a SQL/Hive filter expression. +-----------------------------------------+ |expr |var1 |var2 | +-------------------------+---------+-----+ |var1 > 7 |9…
UtkarshSahu
  • 93
  • 2
  • 10