Questions tagged [pyspark-schema]

68 questions
0 votes · 0 answers

Is there a way in Spark SQL to use the mergeSchema option for a Parquet file?

I have a parquet table for which I get an error: FileReadException: Error while reading file dbfs:/mnt/gold/catalog.parquet/part-00120-tid-1146522170304013652-7e167102-3a27-46d7-b674-901496f37d84-353-1-c000.snappy.parquet. Parquet column cannot be…
DejanS · 96 · 9
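
A likely direction, sketched rather than confirmed: Parquet schema merging can be enabled per read via the DataFrame reader, or session-wide so it also applies to Spark SQL. The table root path below is taken from the error message.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # DataFrame API: merge the schemas of all part files at read time
    df = (spark.read
          .option("mergeSchema", "true")
          .parquet("dbfs:/mnt/gold/catalog.parquet"))

    # Spark SQL equivalent: the session config applies to SQL reads as well
    spark.sql("SET spark.sql.parquet.mergeSchema=true")
    spark.sql("SELECT * FROM parquet.`dbfs:/mnt/gold/catalog.parquet`").show()
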
0 votes · 0 answers

Convert some specific columns that have 0 and 1 values in Kafka messages to False and True in PySpark

Requirement: We are consuming messages from Kafka using PySpark. In these JSON messages, some keys have values such as 0 and 1. The requirement is to convert these 0s and 1s to False and True while…
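
One plausible approach, assuming the JSON has already been parsed into a DataFrame (the column names here are hypothetical): casting an integer column to boolean maps 0 to False and non-zero to True.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, 1), (1, 0)], ["is_active", "is_deleted"])

    # cast() turns 0/1 integers into False/True booleans
    for c in ["is_active", "is_deleted"]:
        df = df.withColumn(c, col(c).cast("boolean"))

    df.show()
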
0 votes · 2 answers

Read a nested JSON string and explode it into multiple columns in PySpark

I want to parse a JSON request and create multiple columns out of it in PySpark as follows: { "ID": "abc123", "device": "mobile", "Ads": [ { "placement": "topright", "Adlist": [ { "name": "ad1", …
Gingerbread · 1,938 · 8 · 22 · 36
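
A sketch of the usual from_json + explode pattern, based on the structure shown above; the fields past "name" are truncated in the question, so the schema here is deliberately partial.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, explode, col
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("ID", StringType()),
        StructField("device", StringType()),
        StructField("Ads", ArrayType(StructType([
            StructField("placement", StringType()),
            StructField("Adlist", ArrayType(StructType([
                StructField("name", StringType()),
            ]))),
        ]))),
    ])

    raw = spark.createDataFrame(
        [('{"ID":"abc123","device":"mobile","Ads":[{"placement":"topright",'
          '"Adlist":[{"name":"ad1"}]}]}',)], ["json"])

    # parse the string, then explode each nested array one level at a time
    flat = (raw.select(from_json("json", schema).alias("r"))
            .select("r.ID", "r.device", explode("r.Ads").alias("ad"))
            .select("ID", "device", col("ad.placement"),
                    explode("ad.Adlist").alias("item"))
            .select("ID", "device", "placement", col("item.name")))
    flat.show()
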
0 votes · 1 answer

PySpark: Compare column values across different dataframes

We are planning to do the following: compare two dataframes, add values into the first dataframe based on the comparison, and then group by to get the combined data. We are using PySpark dataframes and the following are our dataframes. Dataframe1: | Manager |…
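
The question's tables are truncated, so this is only a sketch of the common pattern it describes (join on the shared key, derive a value from the comparison, then group); column names other than Manager are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, col, sum as sum_

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([("Alice", 10), ("Bob", 20)], ["Manager", "value"])
    df2 = spark.createDataFrame([("Alice", 10), ("Bob", 99)], ["Manager", "value"])

    # join on the key, flag matching rows, then aggregate per Manager
    combined = (df1.join(df2.withColumnRenamed("value", "value2"), "Manager")
                .withColumn("matched",
                            when(col("value") == col("value2"), 1).otherwise(0))
                .groupBy("Manager")
                .agg(sum_("matched").alias("matches")))
    combined.show()
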
0 votes · 0 answers

How to resolve org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow?

While trying to read a file using PySpark I'm getting this error: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 60459493. To avoid this, increase spark.kryoserializer.buffer.max value. Here is…
JG1 · 1 · 2
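
The error message names its own fix; a minimal sketch of applying it when building the session (512m is illustrative; it just needs to exceed the ~60 MB the error reports):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             .config("spark.kryoserializer.buffer.max", "512m")  # default is 64m
             .getOrCreate())
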
0 votes · 1 answer

Commas within a field in a file using PySpark

My data file contains column values that include commas: teledyne.com', 'Teledyne Technologies is a leading provider of sophisticated electronic components, instruments & communications products, including defense electronics, data acquisition &…
romi · 5 · 2
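
A hedged sketch, assuming the comma-bearing fields are wrapped in quotes as the sample suggests (the data shown appears to use single quotes); the file path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("quote", "'")   # quote character wrapping fields with commas
          .option("escape", "'")
          .csv("/path/to/data.csv"))
    df.show()
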
0 votes · 1 answer

Weird behaviour in a PySpark dataframe

I have the following PySpark dataframe that contains two fields, ID and QUARTER: pandas_df = pd.DataFrame({"ID":[1, 2, 3,4, 5, 3,5,6,3,7,2,6,8,9,1,7,5,1,10],"QUARTER":[1, 1, 1, 1, 1,2,2,2,3,3,3,3,3,4,4,5,5,5,5]}) spark_df =…
Abdessamad139 · 325 · 4 · 16
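
The excerpt cuts off mid-assignment; completing the setup it shows is an assumption, but the usual construction would be:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pandas_df = pd.DataFrame({
        "ID": [1, 2, 3, 4, 5, 3, 5, 6, 3, 7, 2, 6, 8, 9, 1, 7, 5, 1, 10],
        "QUARTER": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5]})
    spark_df = spark.createDataFrame(pandas_df)   # hypothetical completion
    spark_df.show()
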
0 votes · 0 answers

How does PySpark allow columns with special characters?

The dataframe df_problematic in PySpark has the following columns: +------------+-----------+------------+ |sepal@length|sepal.width|petal_length| +------------+-----------+------------+ | 5.1| 3.5| 1.4| | 4.9| …
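
A short sketch of both usual answers: backticks let you reference names containing dots or other special characters, and renaming sidesteps the problem entirely.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df_problematic = spark.createDataFrame(
        [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4)],
        ["sepal@length", "sepal.width", "petal_length"])

    # backticks are required for "sepal.width": without them Spark parses
    # the dot as struct-field access
    df_problematic.select(col("`sepal@length`"), col("`sepal.width`")).show()

    # or normalize the names once up front
    clean = df_problematic.toDF(*[c.replace("@", "_").replace(".", "_")
                                  for c in df_problematic.columns])
    clean.show()
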
0 votes · 0 answers

PySpark stream from Kafka topic with Avro format returns null dataframe

I have a topic in Avro format and I want to read it as a stream in PySpark but the output is null. My data is like this: { "ID": 559, "DueDate": 1676362642000, "Number": 1, "__deleted": "false" } and the schema in the Schema Registry…
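
A hedged sketch of the most common cause: messages produced through the Confluent Schema Registry carry a 5-byte header (magic byte plus schema id) that Spark's from_avro does not expect, so decoding yields nulls until the header is skipped. The topic name and bootstrap servers are placeholders, and the schema string is reconstructed from the sample record.

    # requires the spark-avro and spark-sql-kafka packages on the classpath
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr
    from pyspark.sql.avro.functions import from_avro

    spark = SparkSession.builder.getOrCreate()

    avro_schema = """{"type":"record","name":"rec","fields":[
      {"name":"ID","type":"long"},
      {"name":"DueDate","type":"long"},
      {"name":"Number","type":"int"},
      {"name":"__deleted","type":"string"}]}"""

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "my_topic")
              .load()
              # skip the 5-byte Confluent header before decoding
              .select(from_avro(expr("substring(value, 6, length(value) - 5)"),
                                avro_schema).alias("data"))
              .select("data.*"))
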
0 votes · 1 answer

Spark incorrectly interprets data type from CSV as Double when a string ends with 'd'

There is a CSV with a column ID (format: 8 digits & "D" at the end). When reading the CSV with .option("inferSchema", "true"), it returns the data type as double and trimmed the…
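
A minimal sketch of the usual fix: skip inferSchema and declare the ID column as a string, since the inference step reads a trailing "D" as a double-literal suffix. The file path is hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([StructField("ID", StringType())])
    df = (spark.read
          .option("header", "true")
          .schema(schema)   # explicit schema: "12345678D" stays a string
          .csv("/path/to/file.csv"))
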
0 votes · 1 answer

AttributeError: 'DataFrameWriter' object has no attribute 'schema'

I would like to write a Spark DataFrame with a fixed schema. I am trying this: from pyspark.sql.types import StructType, IntegerType, DateType, DoubleType, StructField my_schema = StructType([ StructField("seg_gs_eur_am", DoubleType()), …
Enrique Benito Casado · 1,914 · 1 · 20 · 40
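
.schema() exists on DataFrameReader, not on DataFrameWriter, which is why the call fails. A sketch of one way to get the intended effect: make the DataFrame conform to the schema first, then write it (the output path is hypothetical).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.types import StructType, StructField, DoubleType

    spark = SparkSession.builder.getOrCreate()

    my_schema = StructType([StructField("seg_gs_eur_am", DoubleType())])
    df = spark.createDataFrame([("1.5",)], ["seg_gs_eur_am"])

    # cast every column to the type declared in my_schema...
    conformed = df.select(*[col(f.name).cast(f.dataType)
                            for f in my_schema.fields])

    # ...then write; the writer inherits the DataFrame's schema
    conformed.write.mode("overwrite").parquet("/tmp/out")
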
0 votes · 1 answer

Read Excel files in a directory using PySpark

Hi, I am trying to read Excel files in a directory using PySpark but I am getting a FileNotFound error. env_path='dbfs:/mnt' raw='dev/raw/work1' path=env_path+raw file_path=path+'/' objects = dbutils.fs.ls(file_path) for file_name in objects: if…
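
Two hedged observations: env_path + raw concatenates to 'dbfs:/mntdev/raw/work1' (no slash between the parts), which alone would explain a file-not-found error; and Spark has no built-in Excel reader, so a library such as com.crealytics:spark-excel must be attached to the cluster. A sketch under those assumptions:

    # Databricks-style sketch; dbutils is only available on Databricks
    env_path = "dbfs:/mnt"
    raw = "/dev/raw/work1"        # note the leading slash
    file_path = env_path + raw + "/"

    for f in dbutils.fs.ls(file_path):
        if f.name.endswith(".xlsx"):
            df = (spark.read
                  .format("com.crealytics.spark.excel")  # spark-excel library
                  .option("header", "true")
                  .load(f.path))
            df.show()
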
0 votes · 1 answer

Read multiple CSVs with different headers into one single dataframe

I have a few CSV files where some files might have some matching columns and some have altogether different columns. For example, file 1 has the following columns: ['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt',…
Ankit Tyagi · 175 · 2 · 17
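
A minimal sketch using unionByName with allowMissingColumns (available since Spark 3.1), which fills columns absent from one file with nulls; the paths are hypothetical.

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    paths = ["/data/file1.csv", "/data/file2.csv", "/data/file3.csv"]
    dfs = [spark.read.option("header", "true").csv(p) for p in paths]

    # columns missing on either side are filled with nulls
    merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
    merged.show()
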
0 votes · 0 answers

I have a sample JSON where a key's data type is given as its value in string format, which I want to read and save to a PySpark dataframe

Below is a piece of a sample JSON schema. I want my PySpark dataframe to read netWorthOfTheCompany as a column and float as its data type. But currently when I read the JSON schema and save it in a dataframe & print(df.dtypes) it prints as string as it…
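
A hedged sketch of one way to read it: load the type-name mapping with plain json, translate names like "float" into Spark types, and build a StructType from the result. Only the one field from the question is shown; the rest of the mapping is an assumption.

    import json
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, FloatType,
                                   StringType, IntegerType)

    spark = SparkSession.builder.getOrCreate()

    # hypothetical mapping from type names in the JSON to Spark types
    type_map = {"float": FloatType(), "string": StringType(),
                "int": IntegerType()}

    schema_json = '{"netWorthOfTheCompany": "float"}'
    fields = [StructField(name, type_map[type_name])
              for name, type_name in json.loads(schema_json).items()]
    schema = StructType(fields)

    df = spark.createDataFrame([], schema)
    print(df.dtypes)   # [('netWorthOfTheCompany', 'float')]
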
0 votes · 0 answers

PySpark: Distinct records from the string column considering Null values in groupby

I have a dataframe like the following: rdd = sc.parallelize([(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),(22,'fl1.variant, fl2.neither,fl3.control','xxx','yyy','NULL'), (22,'fl1.variant,…
shaa · 17 · 6
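
A hedged sketch of one reading of the question: split the comma-separated flag string, explode it, map the literal string 'NULL' to a real null, and collect the distinct flags per group. Column names are assumptions since the question shows only raw tuples.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, trim, col, when, collect_set

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(22, 'fl1.variant,fl2.variant,fl3.control', 'xxx', 'yyy', 'zzz'),
         (22, 'fl1.variant, fl2.neither,fl3.control', 'xxx', 'yyy', 'NULL')],
        ["id", "flags", "c1", "c2", "c3"])

    result = (df
              .withColumn("c3", when(col("c3") == "NULL", None)
                          .otherwise(col("c3")))                  # real nulls
              .withColumn("flag", explode(split("flags", ",")))   # one row per flag
              .withColumn("flag", trim(col("flag")))              # strip stray spaces
              .groupBy("id")
              .agg(collect_set("flag").alias("distinct_flags")))
    result.show(truncate=False)
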