Questions tagged [apache-spark-2.3]

39 questions
0
votes
0 answers

Use correlated subquery in PySpark SQL

Tab1 columns [F, S, E]:
  F1 S1 R
  F1 S2 R2
  F1 S3 R1
  F2 S1 R2
  F2 S4 R4
  F1 S4 R
Tab2 columns [F, S]:
  F1 S1
  F1 S3
  F2 S1
  F2 S4
Take rows from Tab1 only if the F->S relation is present in Tab2. Result columns [F, S, E]:
  F1 S1 R
  F1 S3 R
  F2 S4 R4
I have the query…
0
votes
2 answers

Read specific file from multiple .gz file in Spark

I am trying to read a file with a specific name that exists inside multiple .gz files within a folder. For example:
  D:/sample_datasets/gzfiles
  |-my_file_1.tar.gz
    |-my_file_1.tar
      |-file1.csv
      |-file2.csv
      |-file3.csv
…
0
votes
1 answer

create new column in pyspark dataframe using existing columns

I am trying to work with PySpark DataFrames and I would like to know how I can create and populate a new column using existing columns. Let's say I have a DataFrame that looks like this:
  +-----+---+---+
  |   _1| _2| _3|
  +-----+---+---+
  |x1-y1|  3|…
0
votes
1 answer

Repartitioning a PySpark DataFrame fails: how to avoid the initial partition size?

I'm trying to tune Spark's performance by partitioning a Spark DataFrame. Here is the code:
  file_path1 = spark.read.parquet(*paths[:15])
  df = file_path1.select(columns) \
      .where((func.col("organization") == organization))
  df…
SarahData
0
votes
1 answer

Casting string like "[1, 2, 3]" to array

Pretty straightforward. I have an array-like column encoded as a string (varchar) and want to cast it to array (so I can then explode it and manipulate the elements in "long" format). The two most natural approaches don't seem to work: -- just…
MichaelChirico
0
votes
1 answer

Unable to query/select data that was inserted through Spark SQL

I am trying to insert data into a Hive managed table that has a partition. SHOW CREATE TABLE output included for reference. …
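A frequent cause of partitioned inserts through Spark SQL not being visible afterwards is Hive's dynamic-partition settings. If that is the case here, the session usually needs something like the following configuration sketch (the table and column names in the INSERT are hypothetical, and a live Hive metastore is assumed):

```python
from pyspark.sql import SparkSession

# Hive support plus dynamic-partition settings; requires a Hive metastore.
spark = (SparkSession.builder
         .enableHiveSupport()
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .getOrCreate())

# Hypothetical partitioned insert; `dt` is the partition column.
spark.sql("INSERT INTO TABLE db.tbl PARTITION (dt) SELECT id, name, dt FROM staging")
```

If the files land on disk but queries still return nothing, refreshing the metastore's view of the partitions (e.g. `MSCK REPAIR TABLE db.tbl`) is the other usual suspect.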
0
votes
1 answer

How to build Zeppelin 0.8.0 with Spark 2.3.2 built in

I want to build Zeppelin 0.8.0 with Spark 2.3.2 built in, and run it against the same version of Spark running non-locally, without setting SPARK_HOME, so that I do not need a Spark installation on the Zeppelin node. I have tried the build…
AlphaWolf
  • 319
  • 1
  • 3
  • 12
0
votes
1 answer

Sharing data across executors in Apache spark

My Spark project (written in Java) needs to access SELECT query results from different tables across executors. One solution to this problem is: create a tempView; select the required columns; using forEach, convert the DataFrame to a Map; pass that map as…
-1
votes
1 answer

Spark shuffle disk spill increase when upgrading versions

When upgrading from Spark 2.3 to Spark 2.4.3, I saw a 20-30% increase in the amount of shuffle disk spill one of my stages generated. The same code is being executed in both environments, and all configurations are identical between them.