Questions tagged [pyspark-pandas]
131 questions
0
votes
1 answer
PySpark: applying ODM mapping at column level
I have the two data frames below and I would like to apply a similar condition and return the values as PySpark data frames.
df1.show()
+---+-------+--------+
|id |tr_type|nominal |
+---+-------+--------+
|1 |K |2.0 |
|2 |ZW |7.0 |
|3 |V…

santhosh
- 39
- 1
- 5
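A minimal sketch of one way to apply such a mapping, using a literal map built from a Python dict; the code-to-description entries in tr_type_map are hypothetical, not taken from the question.

from itertools import chain
from pyspark.sql import functions as F

# Hypothetical code-to-value mapping; replace with the real ODM mapping.
tr_type_map = {"K": "purchase", "ZW": "transfer", "V": "sale"}
mapping = F.create_map(*chain.from_iterable(
    (F.lit(k), F.lit(v)) for k, v in tr_type_map.items()
))
df1 = df1.withColumn("tr_type_desc", F.element_at(mapping, F.col("tr_type")))
df1.show()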
0
votes
0 answers
What is the difference between PySpark and pyspark.pandas?
I'm trying to migrate some code from pandas to PySpark. pyspark.pandas comes along as an easily maintainable solution. I want to make the code as efficient as possible, so my question is:
Is there any difference between pyspark and pyspark.pandas? If so,…

ABaron
- 124
- 7
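A small side-by-side sketch of the two APIs on the same task; the file path and column names are made up for illustration.

from pyspark.sql import SparkSession, functions as F
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# Native PySpark: Spark SQL semantics, lazy DataFrame API.
sdf = spark.read.parquet("/data/sales.parquet")
out1 = sdf.filter(F.col("amount") > 0).groupBy("region").agg(F.sum("amount"))

# pyspark.pandas: pandas-like API that runs on Spark under the hood.
pdf = ps.read_parquet("/data/sales.parquet")
out2 = pdf[pdf["amount"] > 0].groupby("region")["amount"].sum()

Both run distributed on Spark; pyspark.pandas mainly trades some control (and occasionally some performance, e.g. where it must maintain a pandas-style index) for the familiar pandas surface.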
0
votes
0 answers
PySpark pandas: converting Excel to a Delta table fails
I am using the pyspark.pandas read_excel function to import data and saving the result in the metastore using to_table. It works fine if format='parquet'. However, the job hangs if format='delta'. The cluster idles after creating the Parquet files and does not…

Lorenzo Cazador
- 71
- 1
- 7
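For reference, a sketch of the pipeline the question describes; the path and table names are placeholders.

import pyspark.pandas as ps

psdf = ps.read_excel("/mnt/raw/input.xlsx", sheet_name=0)
psdf.to_table("my_db.my_table_pq", format="parquet", mode="overwrite")  # reported to work
psdf.to_table("my_db.my_table", format="delta", mode="overwrite")       # reported to hang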
0
votes
1 answer
How to cast Date column from string to datetime in pyspark/python?
I have a date column whose datatype is inferred as string in PySpark:
Mon Oct 17 15:57:48 EST 2022
How can I cast the string datatype to datetime?

Anos
- 57
- 8
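A minimal sketch using to_timestamp with a Java-style pattern matching "Mon Oct 17 15:57:48 EST 2022". On Spark 3.x the three-letter zone name may require the legacy parser, so the config line is a hedged assumption; the column name date_str is also made up.

from pyspark.sql import functions as F

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = df.withColumn(
    "date_ts",
    F.to_timestamp(F.col("date_str"), "EEE MMM dd HH:mm:ss zzz yyyy"),
)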
0
votes
1 answer
Extend the range given by a list, from a column full of such lists, in PySpark
I need to extend a range from its given start number to its end number; for example, if I have [1,4] I need the output [1,2,3,4].
I have been trying to use this code block as the logic; however, I am unable to make it dynamic. When I pass many lists into it…
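One dynamic way to do this is F.sequence, which builds the full range from the array's endpoints per row; the column name rng is an assumption.

from pyspark.sql import functions as F

# [1, 4] -> [1, 2, 3, 4], computed from the first and last elements
df = df.withColumn("expanded", F.sequence(F.col("rng")[0], F.col("rng")[1]))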
0
votes
1 answer
How to add a trailer row carrying the row count to a PySpark data frame
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
…

RickyS
- 13
- 5
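A sketch of one way to append such a trailer, reusing the question's data list; the column names and the trailer layout (a literal marker with the count in the last slot) are assumptions.

df = spark.createDataFrame(data, ["firstname", "lastname", "gender", "salary"])
trailer = spark.createDataFrame([("TRAILER", "", "", df.count())], df.columns)
df_with_trailer = df.unionByName(trailer)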
0
votes
2 answers
Index with groupby PySpark
I'm trying to translate the pandas code below to PySpark, but I'm having trouble with these two points:
Is there an index in a Spark DataFrame?
How can I group by level=0 like that?
I didn't find anything good in the documentation. If you have a…

Robert
- 63
- 6
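pyspark.pandas does keep a pandas-style index, but if groupby(level=0) is not accepted, a common workaround is to move the index into a column first; a sketch on made-up data:

import pyspark.pandas as ps

psdf = ps.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "a", "b", "b"])
out = psdf.reset_index().groupby("index").sum()  # same effect as level=0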
0
votes
0 answers
PySpark: convert an array-of-JSON column to a pandas list-of-dict column
I have a PySpark DataFrame that I want to convert into a pandas DataFrame; however, an array-of-JSON column gets converted into a string in pandas.
my_df = (
    spark
    .createDataFrame(
        pd.DataFrame([['Scott', 50], ['Jeff', 45], ['Thomas',…

3nomis
- 1,175
- 1
- 9
- 30
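One workaround sketch: serialize the array column to a JSON string on the Spark side with to_json, then parse it back into Python objects after toPandas(). The column name attributes is an assumption.

import json
from pyspark.sql import functions as F

pdf = my_df.withColumn("attributes", F.to_json("attributes")).toPandas()
pdf["attributes"] = pdf["attributes"].apply(json.loads)  # list of dicts again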
0
votes
0 answers
pyspark.pandas.read_delta() rows jumbled
I have created a CSV file and read it back.
The CSV created has...
1 to 100 in column 0
101 to 200 in column 1
201 to 300 in column 2
301 to 400 in column 3
401 to 500 in column 4
Reading it with read_csv returns the rows in the correct order.
Later…
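Worth noting: Delta tables carry no row-order guarantee, so read_delta may legitimately return rows in a different order than the CSV. A sketch of restoring order via the index, assuming one was preserved on write:

import pyspark.pandas as ps

psdf = ps.read_delta("/tmp/my_delta_table")
psdf = psdf.sort_index()  # or sort_values(<explicit id column>) if one was written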
0
votes
0 answers
Why does the toPandas() function in PySpark throw a connection error if the size of the file is too large?
I have been trying to apply the toPandas() function to a file that is 5 GB in size, and I keep getting a connection refused error.
ConnectionRefusedError Traceback (most recent call…

user460567
- 133
- 9
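toPandas() collects the entire dataset onto the driver, so a 5 GB input can exhaust driver memory and drop the connection. A sketch of the usual mitigations; the memory figure is an arbitrary example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "16g")  # headroom for the collected data
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow transfer
    .getOrCreate()
)
pdf = spark.read.parquet("/data/big_file.parquet").toPandas()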
0
votes
1 answer
Is there a way to group by a lambda function in pyspark.pandas?
I originally used the code below with a standard pandas df and switched to a pyspark.pandas df once the data grew. I've been unable to make this groupby work on the pyspark.pandas df. I've also tried to replicate it on a Spark df using Spark functions,…

cmb66
- 1
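If the groupby will not accept a callable directly, one workaround is to materialize the lambda's result as a column first; the bucketing lambda below is a made-up example.

import pyspark.pandas as ps

psdf = ps.DataFrame({"name": ["alice", "adam", "bob"], "x": [1, 2, 3]})
psdf["key"] = psdf["name"].apply(lambda s: s[0])  # formerly the groupby lambda
out = psdf.groupby("key")["x"].sum()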
0
votes
1 answer
PicklingError: Could not serialize object (happens only for large datasets)
Context: I am using pyspark.pandas in a Databricks Jupyter notebook.
What I have tested:
I do not get any error if:
I run my code on 300 rows of data.
I simply replicate the dataset 2 times (600 rows by pd.concat).
I get an error if:
I simply…

newbie101
- 65
- 7
0
votes
2 answers
Trying to iterate over a PySpark data frame without using spark_df.collect()
Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect(). I have tried the foreach and map methods; is there any other way to iterate?
df.foreach(lambda x: print(x)) and
def func1(x):
    firstname = x.firstname
…

Jeevan Kande
- 33
- 4
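Besides foreach and map, toLocalIterator() streams rows to the driver one partition at a time, avoiding a full collect():

for row in df.toLocalIterator():
    print(row.firstname)  # column name taken from the question's snippet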
0
votes
1 answer
Function to take a list of Spark DataFrames, convert them to pandas, then write CSVs
import pyspark
dfs=[df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df1,df12,df13,df14,df15]
for x in dfs:
    y = x.toPandas()
    y.to_csv("D:/data")
This is what I wrote, but I actually want the function to take this list and convert…

XFlawless
- 15
- 4
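A sketch of the requested function; as written above, every frame overwrites the same path, so unique filenames are generated here. The output directory is a placeholder.

import os

def spark_dfs_to_csv(dfs, out_dir="D:/data"):
    for i, sdf in enumerate(dfs, start=1):
        pdf = sdf.toPandas()
        pdf.to_csv(os.path.join(out_dir, f"df_{i}.csv"), index=False)

spark_dfs_to_csv(dfs)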
0
votes
1 answer
Create rows based on a column
I want to create rows based on a column.
For example, I have the following data frame.
| lookup_name | alt_name | inventory | location |
|-------------|----------|-----------|----------|
| Honda       | Car      | 1         | au       |
| Apple …

Lance
- 768
- 7
- 21
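The question is truncated, but "create rows based on a column" often means unpivoting columns into rows; a hedged sketch using the stack() SQL expression (inventory is cast so both value columns share a type):

from pyspark.sql import functions as F

df_long = df.select(
    "lookup_name",
    "alt_name",
    F.expr("stack(2, 'inventory', cast(inventory as string), "
           "'location', location) as (field, value)"),
)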