Questions tagged [apache-spark-sql]

Apache Spark SQL is a tool for "SQL and structured data processing" on Spark, a fast and general-purpose cluster computing system. It can be used to retrieve data from Hive, Parquet, etc., and to run SQL queries over existing RDDs and Datasets.

Apache Spark SQL is a tool that brings native support for SQL to Apache Spark. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.


26508 questions
7 votes, 2 answers

Does spark read the same file twice, if two stages are using the same DataFrame?

The following code reads the same CSV twice even though only one action is called. End-to-end runnable example: import pandas as pd import numpy as np df1 = pd.DataFrame(np.arange(1_000).reshape(-1,1)) df1.index =…
figs_and_nuts
  • 4,870
  • 2
  • 31
  • 56
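
A minimal sketch of one common mitigation, assuming the double read comes from two downstream stages scanning the same source: cache the DataFrame after the first read (the CSV path below is hypothetical).

    # Hypothetical sketch: cache the DataFrame so later stages reuse the scanned
    # data instead of re-reading the source CSV.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True).cache()

    total_rows = df.count()          # first action materializes the cache
    sample = df.limit(10).collect()  # later actions read from the cached data

Whether caching actually removes the second file scan depends on the physical plan, so treat this as a sketch rather than a guaranteed fix.
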
7 votes, 3 answers

Converting timestamp to epoch milliseconds in pyspark

I have a dataset like the one below: epoch_seconds eq_time 1636663343887 2021-11-12 02:12:23 Now I am trying to convert the eq_time to epoch seconds, which should match the value of the first column, but am unable to do so. Below is my…
whatsinthename
  • 1,828
  • 20
  • 59
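
A rough sketch of one way to do this, assuming eq_time can be cast to a timestamp and that the session time zone (spark.sql.session.timeZone) matches the zone the epoch values were generated in:

    # Sketch: cast the timestamp to double (epoch seconds with fraction),
    # then scale to milliseconds. Column names follow the question.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("2021-11-12 02:12:23.887",)], ["eq_time"])

    df = df.withColumn("eq_ts", F.to_timestamp("eq_time"))
    df = df.withColumn("epoch_millis",
                       (F.col("eq_ts").cast("double") * 1000).cast("long"))
    df.show(truncate=False)
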
7 votes, 2 answers

AWS EMR: Pyspark: Rdd: mappartitions: Could not find valid SPARK_HOME while searching: Spark closures

I have a PySpark job that runs without any issues locally, but when it runs on the AWS cluster it gets stuck when it reaches the code below. The job processes just 100 records. "some_function" posts data to a website…
7 votes, 2 answers

Why does joining structurally identical dataframes give different results?

Update: the root issue was a bug which was fixed in Spark 3.2.0. The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for dataframes which would return…
ZygD
  • 22,092
  • 39
  • 79
  • 102
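
A minimal sketch of the alias workaround the asker mentions, with made-up column names; qualifying columns through the aliases disambiguates lineage when both sides of the join derive from the same DataFrame:

    # Sketch: alias both sides of a self-join and reference columns through
    # the aliases (names here are illustrative, not from the question).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    left, right = df.alias("l"), df.alias("r")
    joined = (left.join(right, F.col("l.id") == F.col("r.id"), "inner")
                  .select(F.col("l.id"), F.col("r.val")))
    joined.show()
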
7 votes, 2 answers

Difference between repartition(1) and coalesce(1)

In our project we are using repartition(1) to write data into a table. I am interested to know why coalesce(1) cannot be used here, because repartition is a costly operation compared to coalesce. I know repartition distributes data evenly across…
nagraj036
  • 165
  • 1
  • 6
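
A small illustrative sketch of the usual distinction, with assumed output paths: repartition(1) adds a shuffle so upstream stages keep their parallelism, while coalesce(1) merges partitions without a shuffle and can collapse the whole plan onto a single task.

    # Sketch: both produce a single output file, but the execution differs.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # Full shuffle into one partition; upstream work stays parallel.
    df.repartition(1).write.mode("overwrite").parquet("/tmp/out_repartition")

    # No shuffle; upstream work may be squeezed onto one task.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/out_coalesce")
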
7 votes, 1 answer

Dataframe Checkpoint Example Pyspark

I read about checkpoint and it looks great for my needs, but I couldn't find a good example of how to use it. My questions are: Should I specify the checkpoint dir? Is it possible to do it like this: df.checkpoint() Are there any optional params…
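
A minimal sketch of how checkpointing is typically wired up, assuming a local checkpoint directory; the directory does need to be set on the SparkContext before df.checkpoint() is called.

    # Sketch: set a checkpoint directory, then checkpoint a DataFrame to
    # truncate its lineage (the directory path is an assumption).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    df = spark.range(100)
    df_cp = df.checkpoint()   # eager by default: materializes immediately
    print(df_cp.count())
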
7 votes, 5 answers

Change the datatype of a column in delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but no luck. %sql ALTER TABLE…
chaitra k
  • 371
  • 1
  • 4
  • 18
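
One workaround that is often suggested (a sketch, not necessarily the only route) is to rewrite the table with the column cast, since an in-place ALTER COLUMN from BIGINT to STRING is not a supported type change; the table and column names below are placeholders and Delta Lake is assumed to be available.

    # Sketch: read the Delta table, cast the column, and overwrite the table
    # while allowing the schema to change. Names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.table("my_delta_table")
    df = df.withColumn("my_col", F.col("my_col").cast("string"))

    (df.write
       .format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .saveAsTable("my_delta_table"))
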
7 votes, 5 answers

How to use variables in SQL queries?

In SQL Server we can declare variables like declare @sparksql='', but what alternative can be used in Spark SQL, so that we don't need to hard-code any values/queries/strings?
Shrince
  • 101
  • 1
  • 1
  • 3
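
A simple, hedged alternative when the query is issued from PySpark: build the SQL string around Python variables and pass it to spark.sql (the table and column names below are placeholders). Other options, such as substitution via SET, exist but depend on the environment.

    # Sketch: parameterize a Spark SQL query from Python variables.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    threshold = 100
    query = f"SELECT * FROM sales WHERE amount > {threshold}"
    spark.sql(query).show()
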
7 votes, 0 answers

Is there an Enum type in PySpark?

I just wondered if there is an EnumType in PySpark/Spark. I want to add constraints on StringType columns (or other types as well) so that they allow only certain values in my DataFrame's schema.
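
Spark does not ship an EnumType, so one rough sketch of a substitute is to validate a StringType column against an allowed set after the data is loaded (the column name and values below are made up):

    # Sketch: enforce an "enum-like" constraint by rejecting unexpected values.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("purple",)], ["color"])

    allowed = ["red", "green", "blue"]
    bad = df.filter(~F.col("color").isin(allowed))
    if bad.count() > 0:
        raise ValueError("color column contains values outside the allowed set")
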
7 votes, 1 answer

How to use Pyspark equivalent for reset_index() in python

I'd like to know the PySpark equivalent of the reset_index() command used in pandas. When using the default command, data.reset_index(), I get the error: "'DataFrame' object has no attribute 'reset_index'".
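
A sketch of a common substitute, given that Spark DataFrames have no pandas-style index to reset: add an explicit row-number column over some ordering (the ordering column below is an assumption).

    # Sketch: emulate a sequential index with row_number over a window.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20)], ["key", "val"])

    # Note: a global window pulls all rows into one partition.
    w = Window.orderBy("key")
    df_indexed = df.withColumn("index", F.row_number().over(w) - 1)
    df_indexed.show()
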
7 votes, 1 answer

Pyspark to_timestamp with timezone

I am trying to convert datetime strings with timezone to timestamp using to_timestamp. Sample dataframe: df = spark.createDataFrame([("a", '2020-09-08 14:00:00.917+02:00'), ("b", '2020-09-08 14:00:00.900+01:00')], …
Christian Sloper
  • 7,440
  • 3
  • 15
  • 28
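
A sketch assuming Spark 3.x datetime patterns, where SSS is the millisecond fraction and XXX matches an ISO-8601 offset such as +02:00 (the parsed value is then rendered in the session time zone):

    # Sketch: parse offset-bearing strings with an explicit pattern.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", "2020-09-08 14:00:00.917+02:00"),
         ("b", "2020-09-08 14:00:00.900+01:00")],
        ["id", "ts_str"])

    df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSXXX"))
    df.show(truncate=False)
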
7 votes, 2 answers

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

I have a spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a “score” (so there are 16 rows per time frame). For each time…
Lauren Leder
  • 276
  • 1
  • 3
  • 15
7 votes, 3 answers

Fetch week start date and week end date from Date

I need to fetch the week start date and week end date from a given date, taking into account that the week starts on Sunday and ends on Saturday. I referred to this post, but it takes Monday as the starting day of the week. Is there any inbuilt function in…
ben
  • 1,404
  • 8
  • 25
  • 43
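
A sketch using dayofweek, which in Spark returns 1 for Sunday: subtracting (dayofweek - 1) days lands on the Sunday that starts the week, and adding 6 more gives the Saturday that ends it.

    # Sketch: derive Sunday-to-Saturday week boundaries from a date column.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = (spark.createDataFrame([("2021-11-12",)], ["dt"])
               .withColumn("dt", F.to_date("dt")))

    df = df.withColumn("week_start", F.expr("date_sub(dt, dayofweek(dt) - 1)"))
    df = df.withColumn("week_end", F.date_add("week_start", 6))
    df.show()   # 2021-11-12 -> week_start 2021-11-07, week_end 2021-11-13
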
7 votes, 1 answer

What is the difference between spark.table("TABLE A") and spark.read.("TABLE A")

Question as in the title: I am learning Spark SQL, but I can't get a good understanding of the difference between them. Thanks.
Sean
  • 87
  • 1
  • 6
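
Assuming the second call in the title means spark.read.table, a small sketch showing that both return a DataFrame backed by the same catalog table; spark.table is essentially a shorthand, while spark.read.table goes through the DataFrameReader.

    # Sketch: both lookups resolve the same table from the catalog.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.range(5).write.mode("overwrite").saveAsTable("demo_table")

    df1 = spark.table("demo_table")
    df2 = spark.read.table("demo_table")
    assert df1.schema == df2.schema
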
7 votes, 2 answers

PySpark - pass a value from another column as the parameter of spark function

I have a spark dataframe which looks like this, where expr is a SQL/Hive filter expression. +-----------------------------------------+ |expr |var1 |var2 | +-------------------------+---------+-----+ |var1 > 7 |9…
UtkarshSahu
  • 93
  • 2
  • 10