Questions tagged [pyspark-schema]

68 questions
0
votes
1 answer

Azure Synapse PySpark - Load Schema from a Schema Definition File

I have several different datasets in a data lake (in JSON format). This is landing data from an ingestion process. I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to…
LordRofticus
  • 181
  • 1
  • 2
  • 13
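A minimal sketch of one approach, assuming the definition file stores a Spark schema serialized as JSON (e.g. the output of an earlier df.schema.json()); the paths are hypothetical and spark is the session the Synapse notebook provides:

```python
import json
from pyspark.sql.types import StructType

# Hypothetical location of a schema definition produced via df.schema.json()
schema_path = "/mnt/landing/schemas/dataset_a.json"

with open(schema_path) as f:
    schema = StructType.fromJson(json.load(f))

# Apply the schema explicitly instead of relying on inference
df = spark.read.schema(schema).json("/mnt/landing/dataset_a/")
df.write.mode("overwrite").parquet("/mnt/staging/dataset_a/")
```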
0
votes
2 answers

Multiple rows in spark dataframe using schema

This is my schema: my_schema = StructType([ StructField('uid', StringType(), True), StructField('test_id', StringType(), True), StructField("struct_ids", ArrayType( StructType([ StructField("st", IntegerType(),…
Blue Clouds
  • 7,295
  • 4
  • 71
  • 112
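For reference, a runnable sketch of that schema with multiple rows; everything past the truncated excerpt (the second struct field and the sample data) is assumed:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, IntegerType)

my_schema = StructType([
    StructField("uid", StringType(), True),
    StructField("test_id", StringType(), True),
    StructField("struct_ids", ArrayType(StructType([
        StructField("st", IntegerType(), True),
        StructField("en", IntegerType(), True),  # assumed second field
    ])), True),
])

rows = [
    ("u1", "t1", [(1, 2), (3, 4)]),
    ("u2", "t2", [(5, 6)]),
]
df = spark.createDataFrame(rows, schema=my_schema)
df.show(truncate=False)
```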
0
votes
1 answer

Pyspark - How can I reset a value and increment in an array of structs?

Still new to PySpark, but I have a PySpark dataframe that I am trying to manipulate. The data consists of users logging into a device, and I'm creating sessions for each user. I want to reset the id across the array of structs to 0 and increment the…
meepmepp
  • 21
  • 2
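A hedged sketch of one way to do this with the higher-order transform function (Spark 3.1+), whose lambda receives each element together with its 0-based index; the column and field names here are assumptions:

```python
from pyspark.sql import functions as F

# Assumes an array column "sessions" of structs with an "id" field and
# one other field, "device". The index i rewrites id as 0, 1, 2, ...
df2 = df.withColumn(
    "sessions",
    F.transform(
        "sessions",
        lambda s, i: F.struct(i.alias("id"), s["device"].alias("device")),
    ),
)
```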
0
votes
0 answers

Retrieving timestamp data from kafka using pyspark

I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column. Here is my timestamp sample, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…
Ali Moayed
  • 33
  • 5
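A sketch of the usual cause: seven fractional digits plus a zone offset exceed what the default parser accepts, so an explicit pattern is needed (Spark 3 datetime patterns; the fraction is truncated to Spark's microsecond precision):

```python
from pyspark.sql import functions as F

# Pattern matched to samples like 2023-06-18T14:49:11.8545562+03:30:
# seven 'S' for the fraction, 'XXX' for the +03:30 offset
parsed = df.withColumn(
    "CreationAt",
    F.to_timestamp("CreationAt", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX"),
)
```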
0
votes
1 answer

Parsing JSON data using PySpark. explode returns nulls

I have a problem using PySpark. There is stream data generated by Kafka, and I'm supposed to parse it using Spark. The JSON format is like…
Ali Moayed
  • 33
  • 5
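A sketch of the common pattern, with an assumed payload shape: from_json silently returns null for any record that does not match the supplied schema, which is the usual reason a subsequent explode appears to produce nulls, so validate the schema against a raw sample first. kafka_df stands in for the stream read from Kafka:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Assumed payload shape: {"events": [{"name": ...}, ...]}
payload_schema = StructType([
    StructField("events", ArrayType(StructType([
        StructField("name", StringType(), True),
    ])), True),
])

parsed = (kafka_df
          .select(F.col("value").cast("string").alias("json"))
          .select(F.from_json("json", payload_schema).alias("data"))
          .select(F.explode("data.events").alias("event")))
```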
0
votes
1 answer

Comparing extracted dataframe schema to target schema

I'm working on some data governance that will take the schema of an extracted query and validate it before doing any transformations on it. It's broken into two classes: one class that will extract the data and another that will validate the data…
BloodKid01
  • 111
  • 14
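A minimal sketch of the validation half, assuming both sides are StructType objects (df.schema for the extracted query, target_schema built by hand or loaded elsewhere):

```python
def diff_schemas(actual, expected):
    """Report field-level differences between two StructTypes."""
    actual_fields = {f.name: f.dataType for f in actual.fields}
    expected_fields = {f.name: f.dataType for f in expected.fields}
    missing = expected_fields.keys() - actual_fields.keys()
    extra = actual_fields.keys() - expected_fields.keys()
    mismatched = {n: (actual_fields[n], expected_fields[n])
                  for n in actual_fields.keys() & expected_fields.keys()
                  if actual_fields[n] != expected_fields[n]}
    return missing, extra, mismatched

missing, extra, mismatched = diff_schemas(df.schema, target_schema)
```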
0
votes
0 answers

SparkException: Python worker failed to connect back

When executing from within my Jupyter Notebook some cells containing Spark commands (e.g., some DataFrame.show() methods or some spark.sql select commands involving 6-million-row DataFrames), I get the following sequence of messages…
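Not a guaranteed fix, but this error frequently means the executors spawn a different Python interpreter than the driver (especially on Windows); pinning both to the notebook's interpreter before the session is created is a common workaround:

```python
import os
import sys

# Point workers and driver at the same interpreter as this notebook
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook").getOrCreate()
```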
0
votes
0 answers

Pyspark dataframe insertion into oracle table

I have an issue with a PySpark dataframe u_final below: when I show the dataframe it looks correct, but when I insert into the table, the insertion shows me a different dataframe with less data, and as you can see I have nothing between the show command…
sghiar
  • 1
  • 3
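A hedged sketch of one mitigation: Spark re-evaluates the plan lazily at write time, so a dataframe built with non-deterministic steps (generated ids, sampling, current timestamps) can differ between show() and the JDBC insert; materializing it first keeps the two consistent. The connection options are placeholders:

```python
# Force evaluation so show() and the insert see the same data
u_final.persist()
u_final.count()

(u_final.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")  # placeholder
    .option("dbtable", "TARGET_TABLE")                         # placeholder
    .option("user", "app_user")                                # placeholder
    .option("password", "***")
    .option("driver", "oracle.jdbc.OracleDriver")
    .mode("append")
    .save())
```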
0
votes
1 answer

File Not Found Error while reading from S3 - PySpark

I am trying to read a .csv file on S3 into a PySpark dataframe in Glue. However, it keeps failing with the "AnalysisException: Path does not exist: s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv" error. I have verified the path and the…
marie20
  • 723
  • 11
  • 30
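A sketch for narrowing this down, assuming the Glue job's role can list the bucket; the usual culprit is a key that differs slightly (prefix, timestamp, extension) from the one passed to Spark:

```python
import boto3

# List what actually exists under the prefix, from the same role Glue uses
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="kp-landing-dev", Prefix="input/kp/kp/")
for obj in resp.get("Contents", []):
    print(obj["Key"])

# Then read with the exact key printed above
df = spark.read.csv(
    "s3://kp-landing-dev/input/kp/kp/export_incr_20230611183316.csv",
    header=True,
)
```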
0
votes
0 answers

Pyspark read existing parquet file with new schema returns NULL content

First, I read a CSV file and save it as a Parquet-format dataframe. After that, when I read this Parquet data file with my new custom schema, the returned dataframe has the new schema but all NULL content. Compared with the old schema, I only changed the column…
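This matches how Parquet resolves columns: by name rather than position, so a read schema containing a renamed column comes back all-NULL. A sketch of the safer route, with assumed column names, is to read with the stored names and then rename or cast:

```python
from pyspark.sql import functions as F

df = spark.read.parquet("/data/out.parquet")       # placeholder path
df = df.withColumnRenamed("old_name", "new_name")  # assumed names

# For a type change, cast instead of re-declaring the read schema
df = df.withColumn("amount", F.col("amount").cast("double"))  # assumed column
```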
0
votes
1 answer

Pyspark: Adding row/column with single value of row counts

I have a pyspark dataframe that I'd like to get the row count for. Once I get the row count, I'd like to add it to the top left corner of the data frame, as shown below. I've tried creating the row first and doing a union on the empty row and the…
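A sketch of one way to build that union; it assumes every column can be cast to string so the count row and the data rows share a schema:

```python
from pyspark.sql import functions as F

n = df.count()
str_df = df.select([F.col(c).cast("string") for c in df.columns])

# One row: the count in the first column, blanks elsewhere
count_row = spark.createDataFrame(
    [(str(n),) + ("",) * (len(df.columns) - 1)], str_df.schema
)
result = count_row.union(str_df)
```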
0
votes
1 answer

Pyspark dataframe dynamic select clause error

I am trying to apply a dynamically built select clause to a PySpark dataframe, and I keep getting an error saying 'cannot resolve ... given input columns: [value]'. split_col = split(df[column_name], delimiter) file = open(schema_file, 'r') data =…
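A sketch of the likely fix: the error means the select references names that only exist after the split results are aliased, so the split column has to be indexed positionally first. The field names and delimiter are assumptions standing in for whatever schema_file contains:

```python
from pyspark.sql import functions as F

field_names = ["id", "name", "city"]      # assumed contents of schema_file
split_col = F.split(F.col("value"), ",")  # assumed delimiter

df2 = df.select([split_col.getItem(i).alias(name)
                 for i, name in enumerate(field_names)])
```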
0
votes
1 answer

Querying a big data table using PySpark

I have two tables that I'm working with using PySpark. File 1: Schema: CustomerName:STRING, DOB:STRING, UIN:STRING, MailID:STRING, PhoneNumber:LONG, City:STRING, State:STRING, LivingStatus:STRING, PinCode:STRING, LoanAmount:LONG Sample Data: Sakshi,…
EdwardFunnyHands
  • 103
  • 2
  • 11
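For a file of this shape, a sketch using an explicit DDL schema string, which skips inference over a large file and keeps the LONG columns numeric; the path is a placeholder:

```python
schema = ("CustomerName STRING, DOB STRING, UIN STRING, MailID STRING, "
          "PhoneNumber LONG, City STRING, State STRING, LivingStatus STRING, "
          "PinCode STRING, LoanAmount LONG")

df = spark.read.csv("s3://bucket/file1.csv", schema=schema, header=True)
df.createOrReplaceTempView("customers")
spark.sql("SELECT City, SUM(LoanAmount) FROM customers GROUP BY City").show()
```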
0
votes
1 answer

Pyspark trimming all columns in a dataframe to 100 characters

I am reading a CSV file with 350 columns; all columns are type string. After reading into the dataframe, I want to substring all the column values read from the CSV file to at most 100 characters (positions 1 to 100) while writing to a Delta table. Can someone kindly…
Wasim Syed
  • 11
  • 1
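A sketch of the usual approach: a single select that applies substring(1, 100) to every column while preserving the names; the table name is a placeholder:

```python
from pyspark.sql import functions as F

trimmed = df.select([F.substring(F.col(c), 1, 100).alias(c)
                     for c in df.columns])

trimmed.write.format("delta").mode("append").saveAsTable("staging_table")
```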
0
votes
1 answer

Reading multiple files using pyspark with same columns but different ordering

Suppose I have two files. file0.txt: field1 field2 / 1 2 / 1 2. file1.txt: field2 field1 / 2 1 / 2 1. Now, if I write: spark.read.csv(["./file0.txt", "./file1.txt"], sep=',', header=True, inferSchema=True).show() the following…
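A sketch of the standard workaround: read the files separately and align columns by name with unionByName, rather than letting one inferred schema apply positionally across both files:

```python
df0 = spark.read.csv("./file0.txt", sep=",", header=True, inferSchema=True)
df1 = spark.read.csv("./file1.txt", sep=",", header=True, inferSchema=True)

# Matches columns by name, so the differing order is harmless
combined = df0.unionByName(df1)
combined.show()
```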