Questions tagged [apache-spark-2.3]
39 questions
0
votes
0 answers
Use a correlated subquery in PySpark SQL
Tab1 Columns [F,S,E]
F1 S1 R
F1 S2 R2
F1 S3 R1
F2 S1 R2
F2 S4 R4
F1 S4 R
Tab2 Columns [F,S]
F1 S1
F1 S3
F2 S1
F2 S4
Take rows from Tab1 only if the F->S relation is present in Tab2.
RESULT Columns [F,S,E]
F1 S1 R
F1 S3 R1
F2 S1 R2
F2 S4 R4
I have the query…

saahil shah
- 1
- 3
0
votes
2 answers
Read specific file from multiple .gz file in Spark
I am trying to read a file with a specific name that exists inside multiple .gz archives within a folder. For example:
D:/sample_datasets/gzfiles
|- my_file_1.tar.gz
   |- my_file_1.tar
      |- file1.csv
      |- file2.csv
      |- file3.csv
…

Neeleshkumar S
- 746
- 11
- 19
0
votes
1 answer
create new column in pyspark dataframe using existing columns
I am trying to work with PySpark dataframes and would like to know how I can create and populate a new column using existing columns.
Let's say I have a dataframe that looks like this:
+-----+---+---+
| _1| _2| _3|
+-----+---+---+
|x1-y1| 3|…

Shashank BR
- 65
- 1
- 6
0
votes
1 answer
Repartitioning a pyspark dataframe fails and how to avoid the initial partition size
I'm trying to tune Spark performance by partitioning a Spark dataframe. Here is the code:
file_path1 = spark.read.parquet(*paths[:15])
df = file_path1.select(columns) \
.where((func.col("organization") == organization))
df…

SarahData
- 769
- 1
- 12
- 38
0
votes
1 answer
Casting string like "[1, 2, 3]" to array
Pretty straightforward. I have an array-like column encoded as a string (varchar) and want to cast it to array (so I can then explode it and manipulate the elements in "long" format).
The two most natural approaches don't seem to work:
-- just…

MichaelChirico
- 33,841
- 14
- 113
- 198
0
votes
1 answer
Unable to query/select data that was inserted through Spark SQL
I am trying to insert data into a Hive Managed table that has a partition.
Show create table output for reference.
+--------------------------------------------------------------------------------------------------+--+
| …
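A frequent cause of this symptom with partitioned Hive-managed tables is that partition directories written by Spark are not registered in the Hive metastore, so other clients see no data. A hedged sketch of the usual remedies (the table name is a placeholder; whether these apply depends on how the insert was done):

```sql
-- Register partition directories that exist on disk but not in the metastore
MSCK REPAIR TABLE my_partitioned_table;

-- Allow dynamic-partition inserts from Spark SQL / Hive
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```

If the data was written with the DataFrame writer rather than `INSERT INTO`, calling `spark.catalog.refreshTable("my_partitioned_table")` on the reading side is also worth trying.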

rajusem
- 79
- 7
0
votes
1 answer
How to build Zeppelin 0.8.0 with Spark 2.3.2 built in
I want to build Zeppelin 0.8.0 with Spark 2.3.2 built in, and run it against the same version of Spark running remotely, without setting SPARK_HOME, so that I do not need a Spark installation on the Zeppelin node. I have tried the build…
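For reference, a Zeppelin 0.8 source build with an embedded Spark is typically driven by Maven profiles along these lines (a sketch only — profile names should be verified against the `README` of the actual 0.8.0 checkout, since they change between branches):

```
mvn clean package -DskipTests -Pbuild-distr \
    -Pspark-2.3 -Dspark.version=2.3.2 -Phadoop-2.7
```

With the embedded Spark, the interpreter's `master` property (e.g. `spark://host:7077` or `yarn-client`) selects the remote cluster instead of `local[*]`.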

AlphaWolf
- 319
- 1
- 3
- 12
0
votes
1 answer
Sharing data across executors in Apache spark
My Spark project (written in Java) needs to access the results of SELECT queries against different tables from the executors.
One solution to this problem is:
I create a tempView
select the required columns
convert the DataFrame to a Map using forEach
pass that map as…

A Learner
- 157
- 1
- 5
- 16
-1
votes
1 answer
Spark shuffle disk spill increase when upgrading versions
When upgrading from spark 2.3 to spark 2.4.3, I saw a 20-30% increase in the amount of shuffle disk spill one of my stages generated.
The same code is being executed in both environments.
All configurations are identical between both environments.

Barak Freiman
- 21
- 3