Questions tagged [pyspark-pandas]

131 questions
0 votes, 1 answer

PySpark: applying ODM mapping on column level

I have the 2 data frames below and I would like to apply a similar condition and return the values in PySpark data frames.
df1.show()
+---+-------+--------+
|id |tr_type|nominal |
+---+-------+--------+
|1  |K      |2.0     |
|2  |ZW     |7.0     |
|3  |V…
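A minimal sketch of the usual column-level mapping pattern, assuming the goal is to translate tr_type codes through a lookup (the mapping values here are hypothetical stand-ins for the real ODM rules):

```python
from itertools import chain
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "K", 2.0), (2, "ZW", 7.0), (3, "V", 3.0)],
    ["id", "tr_type", "nominal"],
)

# Hypothetical code-to-label mapping; substitute the real ODM rules here.
mapping = {"K": "credit", "ZW": "transfer", "V": "debit"}

# Build a MapType literal and look each tr_type up in it.
mapping_expr = F.create_map(*[F.lit(x) for x in chain(*mapping.items())])
df1.withColumn("tr_label", mapping_expr[F.col("tr_type")]).show()
```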
0 votes, 0 answers

What is the difference between PySpark and pyspark.pandas?

I'm trying to migrate some code from pandas to PySpark. pyspark.pandas looks like an easily maintainable solution. I want to make the code as efficient as possible, so my question is: is there any difference between pyspark and pyspark.pandas? If so,…
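Both APIs run on the same Spark engine; pyspark.pandas (the pandas API on Spark) mainly differs in syntax and in pandas-style semantics such as an index. A small sketch showing the same aggregation in both, plus the conversion between them:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "val"])

# Native PySpark API:
sdf.groupBy("key").agg(F.sum("val").alias("val")).show()

# pandas API on Spark: same engine, pandas-style syntax plus an index.
psdf = sdf.pandas_api()             # Spark DataFrame -> pandas-on-Spark
print(psdf.groupby("key")["val"].sum())
sdf_again = psdf.to_spark()         # and back to a plain Spark DataFrame
```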
ABaron
0 votes, 0 answers

PySpark pandas converting Excel to Delta Table Failed

I am using the pyspark.pandas read_excel function to import data and saving the result to the metastore using to_table. It works fine with format='parquet'. However, the job hangs with format='delta'. The cluster idles after creating the parquets and does not…
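A hedged sketch of one workaround, assuming a Databricks-style environment with Delta available; the path and table name are hypothetical, and routing the write through the Spark DataFrame API is a suggestion rather than a confirmed fix:

```python
import pyspark.pandas as ps

# Hypothetical path and table name.
psdf = ps.read_excel("/mnt/raw/input.xlsx", sheet_name=0)

# Per the question this works with format='parquet' but hangs with 'delta':
# psdf.to_table("analytics.my_table", format="delta", mode="overwrite")

# Possible workaround: route the write through the Spark DataFrame API.
(psdf.to_spark()
     .write.format("delta")
     .mode("overwrite")
     .saveAsTable("analytics.my_table"))
```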
0 votes, 1 answer

How to cast Date column from string to datetime in pyspark/python?

I have a date column whose datatype is inferred as string in PySpark: Mon Oct 17 15:57:48 EST 2022. How can I cast this string to datetime?
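A minimal sketch using to_timestamp with a matching pattern; note that Spark 3's default parser can reject three-letter zone names like EST, so enabling the legacy parser policy here is an assumption about the environment:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Spark 3's default parser can reject zone names like EST; the legacy
# policy falls back to the old SimpleDateFormat behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Mon Oct 17 15:57:48 EST 2022",)], ["date_str"])
df = df.withColumn("ts", F.to_timestamp("date_str", "EEE MMM dd HH:mm:ss zzz yyyy"))
df.show(truncate=False)
```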
0 votes, 1 answer

Extend a range given as a list, from a column full of such lists, in PySpark

I need to extend a range from its given start number to its end number; for example, if I have [1,4] I need [1,2,3,4] as output. I have been trying to use this code block as the logic; however, I am unable to make it dynamic. When I pass many lists in it…
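A minimal sketch using the built-in sequence function, which expands the [start, stop] bounds per row and so stays dynamic (the sample frame is hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 4],), ([7, 10],)], ["range"])

# sequence(start, stop) expands the bounds into the full inclusive range,
# so it works row by row for any number of input lists.
df.withColumn(
    "expanded", F.sequence(F.col("range")[0], F.col("range")[1])
).show(truncate=False)
```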
0 votes, 1 answer

How to add a trailer row containing the row count to a PySpark data frame

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()
data = [('James','Smith','M',3000),
        ('Anna','Rose','F',4100),
        …
```
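A hedged sketch of one way to append such a trailer: count the rows, build a one-row frame, and union it on. Which columns the marker and count land in is an assumption; adjust to the target file spec:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
df = spark.createDataFrame(
    [("James", "Smith", "M", 3000), ("Anna", "Rose", "F", 4100)],
    ["first", "last", "gender", "salary"],
)

# One-row trailer carrying the record count in the last column; the
# trailer's column layout is an assumption about the spec.
trailer = spark.createDataFrame([("TRAILER", "", "", df.count())], df.schema.names)
df.unionByName(trailer).show()
```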
0 votes, 2 answers

Index with groupby PySpark

I'm trying to translate the below pandas code to PySpark, but I'm having trouble with these two points: is there an index in a Spark DataFrame? And how can I group on level=0 like that? I didn't find anything helpful in the documentation. If you have a…
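A plain Spark DataFrame has no index, while pandas-on-Spark frames do keep one; their groupby may not accept level= (an assumption about the current API), so one workaround is to surface the index as a column first:

```python
import pyspark.pandas as ps

# pandas-on-Spark frames keep an index, so the pandas-style grouping can
# be reproduced by turning that index into a regular column.
psdf = ps.DataFrame({"val": [1, 2, 3, 4]}, index=["a", "a", "b", "b"])
out = psdf.reset_index().groupby("index").sum()   # same result as level=0
print(out)
```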
Robert
0 votes, 0 answers

PySpark: convert an array-of-JSON column to a pandas list-of-dicts column

I have a pyspark dataframe that I want to convert into a pandas dataframe; however, I have an array-of-JSON column that gets converted into a string in pandas.
```python
my_df = (
    spark
    .createDataFrame(
        pd.DataFrame([['Scott', 50], ['Jeff', 45], ['Thomas',…
```
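A minimal sketch of one round-trip that keeps the structure: serialise the column to a JSON string on the Spark side, then decode it back into Python lists of dicts after toPandas(). The sample schema is hypothetical:

```python
import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame with an array-of-JSON-objects column.
sdf = spark.createDataFrame(
    [("Scott", [{"a": 1}, {"a": 2}])],
    "name string, payload array<map<string,int>>",
)

# Serialise on the Spark side, then decode back to lists of dicts in pandas.
pdf = sdf.withColumn("payload", F.to_json("payload")).toPandas()
pdf["payload"] = pdf["payload"].apply(json.loads)
print(pdf.loc[0, "payload"])   # [{'a': 1}, {'a': 2}]
```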
3nomis
0 votes, 0 answers

pyspark.pandas.read_delta() rows jumbled

I have created a CSV file and read it back. The CSV created has:
1 to 100 in column 0
101 to 200 in column 1
201 to 300 in column 2
301 to 400 in column 3
401 to 500 in column 4
Reading with read_csv returns the rows in perfect order. Later…
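Distributed reads generally do not guarantee row order, so Delta partitions can come back in any order. A hedged sketch (the path is hypothetical) of restoring the original order from the index after reading:

```python
import pyspark.pandas as ps

# Hypothetical path. Partitions are scanned in no guaranteed order, so
# restore the original order from the index after reading.
psdf = ps.read_delta("/tmp/my_delta_table").sort_index()
```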
0 votes, 0 answers

Why does the toPandas() function in PySpark throw a connection error if the size of the file is too large?

I have been trying to apply the toPandas() function to a file that is 5GB in size and I keep getting a connection refused error. ConnectionRefusedError Traceback (most recent call…
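toPandas() collects the entire dataset onto the driver, so a 5GB frame can exhaust driver memory and crash the JVM, which then surfaces as a refused connection. A sketch of the usual mitigations; the memory values are illustrative only:

```python
from pyspark.sql import SparkSession

# Illustrative values only; spark.driver.memory takes effect only if set
# before the driver JVM starts (e.g. via spark-submit or cluster config).
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "16g")
    .config("spark.driver.maxResultSize", "8g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
# Often the better fix: aggregate or sample in Spark first, and only
# call toPandas() on a result small enough for the driver.
```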
0 votes, 1 answer

Is there a way to group by a lambda function in pyspark pandas?

I originally used the below code to work with a standard pandas df, and switched to a pyspark pandas df once the data grew. I've been unable to make this groupby work on the pyspark pandas df. I've also tried to replicate it on a spark df using spark functions,…
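In pandas, a callable passed to groupby is applied to the index; pandas-on-Spark may not accept a callable there (an assumption about the current API), but computing the key into a column first is an equivalent formulation. The bucketing rule below is hypothetical:

```python
import pyspark.pandas as ps

# Hypothetical data; the bucketing rule stands in for whatever the
# original lambda computed from the index.
psdf = ps.DataFrame({"val": [1, 2, 3, 4]}, index=[10, 11, 20, 21])
tmp = psdf.reset_index()
tmp["bucket"] = tmp["index"] // 10        # key the lambda would have produced
print(tmp.groupby("bucket")["val"].sum())
```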
cmb66
0 votes, 1 answer

PicklingError: Could not serialize object (happens only for large datasets)

Context: I am using pyspark.pandas in a Databricks Jupyter notebook. What I have tested: I do not get any error if:
- I run my code on 300 rows of data.
- I simply replicate the dataset 2 times (600 rows by pd.concat).
I get an error if: I simply…
newbie101
0 votes, 2 answers

Trying to iterate over pyspark data frame without using spark_df.collect()

Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect(), and I have tried the foreach and map methods. Is there any other way to iterate? df.foreach(lambda x: print(x)) and def func1(x): firstname=x.firstname …
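A minimal sketch of one further option, toLocalIterator(), which streams one partition at a time to the driver instead of collecting everything at once (the sample frame is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("James", 30), ("Anna", 25)], ["firstname", "age"])

# toLocalIterator() streams one partition at a time to the driver, so the
# whole frame never has to fit in driver memory the way collect() requires.
for row in df.toLocalIterator():
    print(row.firstname, row.age)
```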
0 votes, 1 answer

Function to take a list of Spark dataframes and convert to pandas, then CSV

```python
import pyspark
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10,
       df11, df12, df13, df14, df15]
for x in dfs:
    y = x.toPandas()
    y.to_csv("D:/data")
```
This is what I wrote, but I actually want the function to take this list and convert…
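As written, every iteration overwrites the same path. A hedged sketch of a function that gives each frame its own file; the file names and output directory are hypothetical:

```python
import os

def dfs_to_csv(dfs, out_dir="D:/data"):
    """Write each Spark dataframe in dfs to its own CSV under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for i, sdf in enumerate(dfs, start=1):
        sdf.toPandas().to_csv(os.path.join(out_dir, f"df{i}.csv"), index=False)

# dfs_to_csv([df1, df2, ..., df15])   # called with the question's list
```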
XFlawless
0 votes, 1 answer

Create Rows based on Column

I want to create a row based on a column. For example, I have the following data frame:
| lookup_name | alt_name | inventory | location |
|-------------|----------|-----------|----------|
| Honda       | Car      | 1         | au       |
| Apple …
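One reading of the (truncated) question is to emit a row per name column while keeping the rest of the record; a hedged sketch using the stack generator under that assumption:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Honda", "Car", 1, "au")],
    ["lookup_name", "alt_name", "inventory", "location"],
)

# stack() unpivots the two name columns into one row each, carrying the
# remaining columns along.
df.select(
    F.expr("stack(2, 'lookup_name', lookup_name, 'alt_name', alt_name) "
           "as (name_type, name)"),
    "inventory", "location",
).show()
```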
Lance