Questions tagged [pyspark-schema]
68 questions
0
votes
1 answer
Azure Synapse PySpark - Load Schema from a Schema Definition File
I have several different datasets on a datalake (in JSON format). This is landing data from an ingestion process.
I am using PySpark notebook to load the data from Landing to Staging where it will be in Parquet files. Part of this process is to…

LordRofticus
- 181
- 1
- 2
- 13
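
A minimal sketch of one way to do this, assuming the definition file holds a schema serialized with df.schema.json(); the paths and dataset name below are hypothetical, and spark is the session the Synapse notebook provides.

import json
from pyspark.sql.types import StructType

# Rebuild the StructType from the stored JSON definition
with open("/mnt/landing/schemas/my_dataset.json") as f:
    schema = StructType.fromJson(json.load(f))

# Read the landing JSON with the explicit schema, write Parquet to staging
df = spark.read.schema(schema).json("/mnt/landing/my_dataset/")
df.write.mode("overwrite").parquet("/mnt/staging/my_dataset/")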
0
votes
2 answers
Multiple rows in spark dataframe using schema
This is my schema
my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField("struct_ids", ArrayType(
        StructType([
            StructField("st", IntegerType(),…

Blue Clouds
- 7,295
- 4
- 71
- 112
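
A sketch of feeding multiple rows through a nested schema like the one above; the sample values and the single "st" field are assumptions, since the excerpt is truncated.

from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, IntegerType)

my_schema = StructType([
    StructField('uid', StringType(), True),
    StructField('test_id', StringType(), True),
    StructField('struct_ids', ArrayType(StructType([
        StructField('st', IntegerType(), True),
    ])), True),
])

data = [
    ('u1', 't1', [(1,), (2,)]),  # each tuple becomes one struct in the array
    ('u2', 't2', [(3,)]),
]
spark.createDataFrame(data, my_schema).show(truncate=False)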
0
votes
1 answer
Pyspark - How can I reset a value and increment in an array of structs?
Still new to pyspark, but I have a pyspark dataframe that I am trying to manipulate. The data consists of users logging into a device, and I'm creating sessions for each user. I want to reset the id in the array of structs to 0 and increment the…

meepmepp
- 21
- 2
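
A minimal sketch of the renumbering, assuming a 'sessions' array-of-structs column with 'id' and 'login_time' fields; all names here are hypothetical. The higher-order transform function exposes the element index, which restarts at 0 on every row.

from pyspark.sql import functions as F

df = df.withColumn(
    'sessions',
    F.expr("transform(sessions, (s, i) -> struct(i AS id, s.login_time AS login_time))")
)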
0
votes
0 answers
Retrieving timestamp data from kafka using pyspark
I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column.
Here is a sample timestamp, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…

Ali Moayed
- 33
- 5
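
A sketch of parsing that sample, assuming Spark 3 datetime patterns: read CreationAt as a string, then convert with an explicit pattern, since the seven-digit fraction and the +03:30 offset defeat the default parser.

from pyspark.sql import functions as F

df = df.withColumn(
    'CreationAt',
    F.to_timestamp('CreationAt', "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX")
)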
0
votes
1 answer
Parsing JSON data using PySpark: explode returns nulls
I have a problem using PySpark.
There is stream data generated by Kafka, and I'm supposed to parse it using Spark.
The JSON format is like…

Ali Moayed
- 33
- 5
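
A minimal sketch of the usual failure mode: from_json yields null whenever its schema does not match the incoming JSON, and exploding a null array then produces nothing. The payload schema and the kafka_df name are assumptions.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

payload_schema = StructType([
    StructField('CreationAt', StringType(), True),
])

parsed = (kafka_df
          .select(F.col('value').cast('string').alias('json'))
          .select(F.from_json('json', payload_schema).alias('data'))
          .select('data.*'))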
0
votes
1 answer
Comparing extracted dataframe Schema to targeted schema
I'm working on some data governance that will take the schema of an extracted query and validate it before doing any transformations on it. It's broken up into two classes: one class that extracts the data and another that validates the data…

BloodKid01
- 111
- 14
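
A sketch of one field-by-field validation, comparing names and types while deliberately ignoring nullability; pass df.schema and the target StructType.

def schemas_match(actual, expected):
    # Compare (name, dataType) pairs; nullability often differs harmlessly
    actual_fields = {(f.name, f.dataType) for f in actual.fields}
    expected_fields = {(f.name, f.dataType) for f in expected.fields}
    return actual_fields == expected_fields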
0
votes
0 answers
SparkException: Python worker failed to connect back
When executing from within my Jupyter Notebook some cells containing Spark commands (e.g., some DataFrame.show() calls or some spark.sql select commands involving 6-million-row DataFrames), I get the following sequence of message…

Antonio Piemontese
- 107
- 5
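
On local (especially Windows) setups this error frequently means the driver and the workers resolve different Python interpreters. A sketch of the common fix, applied before the session starts:

import os, sys

# Point both driver and workers at the interpreter running the notebook
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()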
0
votes
0 answers
Pyspark dataframe insertion into oracle table
I have an issue with a pyspark dataframe u_final below: when I show the dataframe it looks correct, but when I insert into the table, the insertion gives me a different dataframe with less data, and as you can see I have nothing between the show command…

sghiar
- 1
- 3
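
One common explanation, offered here as an assumption: the plan is recomputed between show() and the insert, so any non-deterministic step can produce different rows each time. Persisting pins down what the write sees; the JDBC options below are placeholders.

# Materialize u_final once so show() and the insert see the same data
u_final.persist()
u_final.count()

(u_final.write
    .format('jdbc')
    .option('url', 'jdbc:oracle:thin:@//host:1521/service')
    .option('dbtable', 'MY_TABLE')
    .option('user', 'user')
    .option('password', 'password')
    .mode('append')
    .save())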
0
votes
1 answer
File Not Found Error while reading from S3 - PySpark
I am trying to read a .csv file on S3 into a PySpark dataframe in Glue. However, it keeps failing with an "AnalysisException: Path does not exist: s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv" error. I have verified the path and the…

marie20
- 723
- 11
- 30
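
A minimal sketch for checking what actually exists under that prefix, since this AnalysisException usually means the exact key (case included) differs; bucket and prefix are copied from the error message above.

import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='kp-landing-dev', Prefix='input/kp/kp/')
for obj in resp.get('Contents', []):
    print(obj['Key'])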
0
votes
0 answers
Pyspark read existing parquet file with new schema return NULL content
First, I read a csv file and save it in parquet format.
After that, when I read this parquet file with my new custom schema, the returned dataframe has the new schema but all NULL content.
Compared with the old schema, I only changed column…

Cao Phuong
- 1
- 1
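
A sketch of the usual workaround: parquet resolves columns by name, so a custom schema with renamed columns reads back as all NULLs. Read with the stored names, then rename; old_col, new_col, and the path are placeholders.

df = spark.read.parquet('/path/to/data.parquet')
df = df.withColumnRenamed('old_col', 'new_col')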
0
votes
1 answer
Pyspark: Adding row/column with single value of row counts
I have a pyspark dataframe that I'd like to get the row count for. Once I get the row count, I'd like to add it to the top left corner of the data frame, as shown below.
I've tried creating the row first and doing a union on the empty row and the…

drymolasses
- 73
- 6
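
A sketch of one way to do it, assuming every column can be cast to string: count first, then union a row carrying the count in the first column.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

n = df.count()
as_str = df.select([F.col(c).cast('string').alias(c) for c in df.columns])
str_schema = StructType([StructField(c, StringType(), True) for c in df.columns])
count_row = spark.createDataFrame(
    [tuple([str(n)] + [None] * (len(df.columns) - 1))], str_schema)
result = count_row.unionByName(as_str)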
0
votes
1 answer
Pyspark dataframe dynamic select clause error
I am trying to apply a select clause dynamically to a pyspark dataframe, and I keep getting an error saying 'cannot resolve ... given input columns: [value];;
split_col = split(df[column_name], delimiter)
file = open(schema_file, 'r')
data =…

OhMoh24
- 41
- 6
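
A sketch of the usual cause: after reading a raw single-column source, the frame only contains value, so the names from the schema file must be created from the split result before they can be selected. The delimiter and column names below are placeholders.

from pyspark.sql import functions as F

delimiter = ','
columns = ['uid', 'name', 'city']  # e.g. parsed from schema_file
split_col = F.split(df['value'], delimiter)
df = df.select([split_col.getItem(i).alias(c) for i, c in enumerate(columns)])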
0
votes
1 answer
Querying a big data table using Py-spark
I have two tables that I'm working with in PySpark
File 1:
Schema: CustomerName:STRING, DOB:STRING, UIN:STRING, MailID:STRING, PhoneNumber:LONG, City:STRING, State:STRING, LivingStatus:STRING, PinCode:STRING, LoanAmount:LONG
Sample Data:
Sakshi,…

EdwardFunnyHands
- 103
- 2
- 11
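
A sketch of reading File 1 with the schema above as a DDL string instead of inferring it, then querying through Spark SQL; the path and the aggregate are illustrative only.

schema1 = ('CustomerName STRING, DOB STRING, UIN STRING, MailID STRING, '
           'PhoneNumber LONG, City STRING, State STRING, LivingStatus STRING, '
           'PinCode STRING, LoanAmount LONG')
df1 = spark.read.csv('file1.csv', header=True, schema=schema1)
df1.createOrReplaceTempView('customers')
spark.sql('SELECT City, SUM(LoanAmount) FROM customers GROUP BY City').show()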
0
votes
1 answer
Pyspark trimming all columns in a dataframe to 100 characters
I am reading a csv file with 350 columns; all columns are type string. After reading into the DataFrame I want to substring all the column values read from the csv file to at most 100 characters (positions 1 through 100) while writing to a delta table.
Can someone kindly…

Wasim Syed
- 11
- 1
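
A minimal sketch: substring every column to its first 100 characters before the write, which is safe here because all 350 columns are strings; the output path is a placeholder.

from pyspark.sql import functions as F

trimmed = df.select([F.substring(F.col(c), 1, 100).alias(c) for c in df.columns])
trimmed.write.format('delta').mode('overwrite').save('/path/to/delta')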
0
votes
1 answer
Reading multiple files using pyspark with same columns but different ordering
Suppose I have two files.
file0.txt
field1,field2
1,2
1,2
file1.txt
field2,field1
2,1
2,1
Now, if I write:
spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show()
the following…
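
With a single multi-path read, the first file's header is applied positionally to every file, so a file whose columns are ordered differently comes back scrambled. A sketch of one fix is to read each file separately and combine by name:

from functools import reduce
from pyspark.sql import DataFrame

paths = ['./file0.txt', './file1.txt']
frames = [spark.read.csv(p, sep=',', header=True, inferSchema=True) for p in paths]
df = reduce(DataFrame.unionByName, frames)
df.show()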