Questions tagged [pyspark-schema]
68 questions
0
votes
0 answers
Is there a way in Spark SQL to use the mergeSchema option for Parquet files?
I have a parquet table for which I get an error:
FileReadException: Error while reading file dbfs:/mnt/gold/catalog.parquet/part-00120-tid-1146522170304013652-7e167102-3a27-46d7-b674-901496f37d84-353-1-c000.snappy.parquet.
Parquet column cannot be…

DejanS
- 96
- 9
0
votes
0 answers
Convert some specific columns that have 0 and 1 values in Kafka messages to False and True in PySpark
Requirement
We are consuming messages from Kafka using PySpark. Some keys in these JSON messages carry values such as 0 and 1.
The requirement is to convert these 0s and 1s to False and True while…

tall-e.stark
- 23
- 4
0
votes
2 answers
Read a nested json string and explode into multiple columns in pyspark
I want to parse a JSON request and create multiple columns out of it in pyspark as follows:
{
  "ID": "abc123",
  "device": "mobile",
  "Ads": [
    {
      "placement": "topright",
      "Adlist": [
        {
          "name": "ad1",
          …

Gingerbread
- 1,938
- 8
- 22
- 36
0
votes
1 answer
Pyspark: Compare Column Values across different dataframe
We are planning to do the following:
compare two dataframes, add values into the first dataframe based on the comparison, and then group by to produce the combined data.
We are using pyspark dataframe and the following are our dataframes.
Dataframe1:
| Manager |…

frp farhan
- 445
- 5
- 19
0
votes
0 answers
How to resolve org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow?
While trying to read a file using PySpark I'm getting this error:
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 60459493. To avoid this, increase spark.kryoserializer.buffer.max value.
Here is…

JG1
- 1
- 2
0
votes
1 answer
Commas within a field in a file using PySpark
my data file contains column values that include commas
teledyne.com', 'Teledyne Technologies is a leading provider of sophisticated electronic components, instruments & communications products, including defense electronics, data acquisition &…

romi
- 5
- 2
0
votes
1 answer
Weird behaviour in Pyspark dataframe
I have the following pyspark dataframe that contains two fields, ID and QUARTER:
pandas_df = pd.DataFrame({"ID":[1, 2, 3,4, 5, 3,5,6,3,7,2,6,8,9,1,7,5,1,10],"QUARTER":[1, 1, 1, 1, 1,2,2,2,3,3,3,3,3,4,4,5,5,5,5]})
spark_df =…

Abdessamad139
- 325
- 4
- 16
0
votes
0 answers
How does PySpark allow columns with special characters?
The dataframe df_problematic in PySpark has the following columns:
+------------+-----------+------------+
|sepal@length|sepal.width|petal_length|
+------------+-----------+------------+
| 5.1| 3.5| 1.4|
| 4.9| …

Uylenburgh
- 1,277
- 4
- 20
- 46
0
votes
0 answers
pyspark stream from kafka topic with avro format returns null dataframe
I have a topic in Avro format and I want to read it as a stream in PySpark, but the output is null. My data looks like this:
{
  "ID": 559,
  "DueDate": 1676362642000,
  "Number": 1,
  "__deleted": "false"
}
and the schema in the schema registry…

Anna b
- 5
- 3
0
votes
1 answer
Spark incorrectly interprets data type from CSV as Double when string ends with 'd'
There is a CSV with a column ID (format: 8 digits & "D" at the end).
When reading the CSV with .option("inferSchema", "true"), it returns the data type as double and trimmed the…

Tracy Ng
- 1
0
votes
1 answer
AttributeError: 'DataFrameWriter' object has no attribute 'schema'
I would like to write a Spark DataFrame with a fixed schema.
I'm trying this:
from pyspark.sql.types import StructType, IntegerType, DateType, DoubleType, StructField
my_schema = StructType([
    StructField("seg_gs_eur_am", DoubleType()),
    …

Enrique Benito Casado
- 1,914
- 1
- 20
- 40
0
votes
1 answer
Read excel file in a directory using pyspark
Hi,
I am trying to read an Excel file in a directory using PySpark, but I am getting a FileNotFound error.
env_path = 'dbfs:/mnt'
raw = 'dev/raw/work1'
path = env_path + raw
file_path = path + '/'
objects = dbutils.fs.ls(file_path)
for file_name in objects:
    if…

workpyspark
- 23
- 3
0
votes
1 answer
Read multiple CSVs with different headers into one single dataframe
I have a few CSV files where some files might have some matching columns and some have altogether different columns.
For Example file 1 has the following columns:
['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt',…

Ankit Tyagi
- 175
- 2
- 17
0
votes
0 answers
I have a sample JSON where the value of each key is its data type as a string, which I want to read and save into a PySpark dataframe
Below is a piece of sample json schema.
I want my PySpark dataframe to read netWorthOfTheCompany as a column with float as its data type.
But currently, when I read the JSON schema and save it in a dataframe, print(df.dtypes) prints string as it…

Aziz Shaikh
- 1
- 1
0
votes
0 answers
PySpark: distinct records from the string column, considering null values in groupby
I have a dataframe like the following:
rdd = sc.parallelize([(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),(22,'fl1.variant, fl2.neither,fl3.control','xxx','yyy','NULL'),
(22,'fl1.variant,…

shaa
- 17
- 6