Questions tagged [pyspark-schema]

68 questions
0
votes
1 answer

Azure Synapse PySpark - Load Schema from a Schema Definition File

I have several different datasets in a data lake (in JSON format). This is landing data from an ingestion process. I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to…
LordRofticus
  • 181
  • 1
  • 2
  • 13
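A minimal sketch of one approach, assuming the definition file stores a Spark schema serialized as JSON (e.g. the output of an earlier df.schema.json()); the paths are hypothetical and spark is the session the Synapse notebook provides:

```python
import json
from pyspark.sql.types import StructType

# Hypothetical location of a schema definition produced via df.schema.json()
schema_path = "/mnt/landing/schemas/dataset_a.json"

with open(schema_path) as f:
    schema = StructType.fromJson(json.load(f))

# Apply the schema explicitly instead of relying on inference
df = spark.read.schema(schema).json("/mnt/landing/dataset_a/")
df.write.mode("overwrite").parquet("/mnt/staging/dataset_a/")
```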
0
votes
2 answers

Multiple rows in spark dataframe using schema

This is my schema: my_schema = StructType([ StructField('uid', StringType(), True), StructField('test_id', StringType(), True), StructField("struct_ids", ArrayType( StructType([ StructField("st", IntegerType(),…
Blue Clouds
  • 7,295
  • 4
  • 71
  • 112
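For reference, a runnable sketch of that schema with multiple rows; everything past the truncated excerpt (the second struct field and the sample data) is assumed:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, IntegerType)

my_schema = StructType([
    StructField("uid", StringType(), True),
    StructField("test_id", StringType(), True),
    StructField("struct_ids", ArrayType(StructType([
        StructField("st", IntegerType(), True),
        StructField("en", IntegerType(), True),  # assumed second field
    ])), True),
])

rows = [
    ("u1", "t1", [(1, 2), (3, 4)]),
    ("u2", "t2", [(5, 6)]),
]
df = spark.createDataFrame(rows, schema=my_schema)
df.show(truncate=False)
```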
0
votes
1 answer

Pyspark - How can I reset a value and increment in an array of structs?

Still new to PySpark, but I have a PySpark dataframe that I am trying to manipulate. The data consists of users logging into a device, and I'm creating sessions for each user. I want to reset the id across the array of structs to 0 and increment the…
meepmepp
  • 21
  • 2
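A hedged sketch of one way to do this with the higher-order transform function (Spark 3.1+), whose lambda receives each element together with its 0-based index; the column and field names here are assumptions:

```python
from pyspark.sql import functions as F

# Assumes an array column "sessions" of structs with an "id" field and
# one other field, "device". The index i rewrites id as 0, 1, 2, ...
df2 = df.withColumn(
    "sessions",
    F.transform(
        "sessions",
        lambda s, i: F.struct(i.alias("id"), s["device"].alias("device")),
    ),
)
```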
0
votes
0 answers

Retrieving timestamp data from kafka using pyspark

I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column. Here is my timestamp sample, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…
Ali Moayed
  • 33
  • 5
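A sketch of the usual cause: seven fractional digits plus a zone offset exceed what the default parser accepts, so an explicit pattern is needed (Spark 3 datetime patterns; the fraction is truncated to Spark's microsecond precision):

```python
from pyspark.sql import functions as F

# Pattern matched to samples like 2023-06-18T14:49:11.8545562+03:30:
# seven 'S' for the fraction, 'XXX' for the +03:30 offset
parsed = df.withColumn(
    "CreationAt",
    F.to_timestamp("CreationAt", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX"),
)
```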
0
votes
1 answer

Parsing JSON data using PySpark. explode returns nulls

I have a problem using PySpark. There is stream data generated by Kafka, and I'm supposed to parse it using Spark. The JSON format is like…
Ali Moayed
  • 33
  • 5
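A sketch of the common pattern, with an assumed payload shape: from_json silently returns null for any record that does not match the supplied schema, which is the usual reason a subsequent explode appears to produce nulls, so validate the schema against a raw sample first. kafka_df stands in for the stream read from Kafka:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Assumed payload shape: {"events": [{"name": ...}, ...]}
payload_schema = StructType([
    StructField("events", ArrayType(StructType([
        StructField("name", StringType(), True),
    ])), True),
])

parsed = (kafka_df
          .select(F.col("value").cast("string").alias("json"))
          .select(F.from_json("json", payload_schema).alias("data"))
          .select(F.explode("data.events").alias("event")))
```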
0
votes
1 answer

Comparing extracted dataframe schema to target schema

I'm working on some data governance that will take the schema of an extracted query and validate it before doing any transformations on it. It's broken into two classes: one class that will extract the data and another that will validate the data…
BloodKid01
  • 111
  • 14
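A minimal sketch of the validation half, assuming both sides are StructType objects (df.schema for the extracted query, target_schema built by hand or loaded elsewhere):

```python
def diff_schemas(actual, expected):
    """Report field-level differences between two StructTypes."""
    actual_fields = {f.name: f.dataType for f in actual.fields}
    expected_fields = {f.name: f.dataType for f in expected.fields}
    missing = expected_fields.keys() - actual_fields.keys()
    extra = actual_fields.keys() - expected_fields.keys()
    mismatched = {n: (actual_fields[n], expected_fields[n])
                  for n in actual_fields.keys() & expected_fields.keys()
                  if actual_fields[n] != expected_fields[n]}
    return missing, extra, mismatched

missing, extra, mismatched = diff_schemas(df.schema, target_schema)
```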
0
votes
0 answers

SparkException: Python worker failed to connect back

When executing from within my Jupyter Notebook some cells containing Spark commands (e.g., some DataFrame.show() methods or some spark.sql select commands involving 6-million-row DataFrames), I get the following sequence of messages…
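Not a guaranteed fix, but this error frequently means the executors spawn a different Python interpreter than the driver (especially on Windows); pinning both to the notebook's interpreter before the session is created is a common workaround:

```python
import os
import sys

# Point workers and driver at the same interpreter as this notebook
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook").getOrCreate()
```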
0
votes
0 answers

Pyspark dataframe insertion into oracle table

I have an issue with a PySpark dataframe u_final below: when I show the dataframe it looks correct, but when I insert into the table, the insertion shows me a different dataframe with less data, and as you can see I have nothing between the show command…
sghiar
  • 1
  • 3
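A hedged sketch of one mitigation: Spark re-evaluates the plan lazily at write time, so a dataframe built with non-deterministic steps (generated ids, sampling, current timestamps) can differ between show() and the JDBC insert; materializing it first keeps the two consistent. The connection options are placeholders:

```python
# Force evaluation so show() and the insert see the same data
u_final.persist()
u_final.count()

(u_final.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")  # placeholder
    .option("dbtable", "TARGET_TABLE")                         # placeholder
    .option("user", "app_user")                                # placeholder
    .option("password", "***")
    .option("driver", "oracle.jdbc.OracleDriver")
    .mode("append")
    .save())
```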
0
votes
1 answer

File Not Found Error while reading from S3 - PySpark

I am trying to read a .csv file on S3 into a PySpark dataframe in Glue. However, it keeps failing with the "AnalysisException: Path does not exist: s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv" error. I have verified the path and the…
marie20
  • 723
  • 11
  • 30
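A sketch for narrowing this down, assuming the Glue job's role can list the bucket; the usual culprit is a key that differs slightly (prefix, timestamp, extension) from the one passed to Spark:

```python
import boto3

# List what actually exists under the prefix, from the same role Glue uses
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="kp-landing-dev", Prefix="input/kp/kp/")
for obj in resp.get("Contents", []):
    print(obj["Key"])

# Then read with the exact key printed above
df = spark.read.csv(
    "s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv",
    header=True,
)
```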
0
votes
0 answers

Pyspark read existing parquet file with new schema returns NULL content

First, I read a CSV file and save it as a Parquet-format dataframe. After that, when I read this Parquet data file with my new custom schema, the returned dataframe has the new schema but all NULL content. Compared with the old schema, I only changed the column…
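This matches how Parquet resolves columns: by name rather than position, so a read schema containing a renamed column comes back all-NULL. A sketch of the safer route, with assumed column names, is to read with the stored names and then rename or cast:

```python
from pyspark.sql import functions as F

df = spark.read.parquet("/data/out.parquet")       # placeholder path
df = df.withColumnRenamed("old_name", "new_name")  # assumed names

# For a type change, cast instead of re-declaring the read schema
df = df.withColumn("amount", F.col("amount").cast("double"))  # assumed column
```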
0
votes
1 answer

Pyspark: Adding row/column with single value of row counts

I have a pyspark dataframe that I'd like to get the row count for. Once I get the row count, I'd like to add it to the top left corner of the data frame, as shown below. I've tried creating the row first and doing a union on the empty row and the…
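A sketch of one way to build that union; it assumes every column can be cast to string so the count row and the data rows share a schema:

```python
from pyspark.sql import functions as F

n = df.count()
str_df = df.select([F.col(c).cast("string") for c in df.columns])

# One row: the count in the first column, blanks elsewhere
count_row = spark.createDataFrame(
    [(str(n),) + ("",) * (len(df.columns) - 1)], str_df.schema
)
result = count_row.union(str_df)
```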
0
votes
1 answer

Pyspark dataframe dynamic select clause error

I am trying to apply a dynamically built select clause to a PySpark dataframe, and I keep getting an error saying 'cannot resolve ... given input columns: [value]'. split_col = split(df[column_name], delimiter) file = open(schema_file, 'r') data =…
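A sketch of the likely fix: the error means the select references names that only exist after the split results are aliased, so the split column has to be indexed positionally first. The field names and delimiter are assumptions standing in for whatever schema_file contains:

```python
from pyspark.sql import functions as F

field_names = ["id", "name", "city"]      # assumed contents of schema_file
split_col = F.split(F.col("value"), ",")  # assumed delimiter

df2 = df.select([split_col.getItem(i).alias(name)
                 for i, name in enumerate(field_names)])
```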
0
votes
1 answer

Querying a big data table using PySpark

I have two tables that I'm working with using PySpark. File 1: Schema: CustomerName:STRING, DOB:STRING, UIN:STRING, MailID:STRING, PhoneNumber:LONG, City:STRING, State:STRING, LivingStatus:STRING, PinCode:STRING, LoanAmount:LONG Sample Data: Sakshi,…
EdwardFunnyHands
  • 103
  • 2
  • 11
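For a file of this shape, a sketch using an explicit DDL schema string, which skips inference over a large file and keeps the LONG columns numeric; the path is a placeholder:

```python
schema = ("CustomerName STRING, DOB STRING, UIN STRING, MailID STRING, "
          "PhoneNumber LONG, City STRING, State STRING, LivingStatus STRING, "
          "PinCode STRING, LoanAmount LONG")

df = spark.read.csv("s3://bucket/file1.csv", schema=schema, header=True)
df.createOrReplaceTempView("customers")
spark.sql("SELECT City, SUM(LoanAmount) FROM customers GROUP BY City").show()
```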
0
votes
1 answer

Pyspark trimming all columns in a dataframe to 100 characters

I am reading a CSV file with 350 columns; all columns are type string. After reading into the dataframe, I want to substring all the column values read from the CSV file to at most 100 characters (positions 1 to 100) while writing to a Delta table. Can someone kindly…
Wasim Syed
  • 11
  • 1
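A sketch of the usual approach: a single select that applies substring(1, 100) to every column while preserving the names; the table name is a placeholder:

```python
from pyspark.sql import functions as F

trimmed = df.select([F.substring(F.col(c), 1, 100).alias(c)
                     for c in df.columns])

trimmed.write.format("delta").mode("append").saveAsTable("staging_table")
```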
0
votes
1 answer

Reading multiple files using pyspark with same columns but different ordering

Suppose I have two files. file0.txt: field1 field2 / 1 2 / 1 2. file1.txt: field2 field1 / 2 1 / 2 1. Now, if I write: spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show() the following…
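A sketch of the standard workaround: read the files separately and align columns by name with unionByName, rather than letting one inferred schema apply positionally across both files:

```python
df0 = spark.read.csv("./file0.txt", sep=",", header=True, inferSchema=True)
df1 = spark.read.csv("./file1.txt", sep=",", header=True, inferSchema=True)

# Matches columns by name, so the differing order is harmless
combined = df0.unionByName(df1)
combined.show()
```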