Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

120 questions
1
vote
0 answers

"SparkException: Job aborted" when Koalas writes to Azure blob storage

I am using Koalas (pandas API on Apache Spark) to write a dataframe out to a mounted Azure blob storage. When calling the df.to_csv API, Spark throws an exception and aborts the job. Only a few of the stages seem to fail with the following…
bramb
  • 213
  • 2
  • 14
0
votes
0 answers

Why does ps.merge() give a different result from pd.merge()?

I have a dataframe, billplan, which is a pyspark.pandas dataframe. I have code where I convert it to a pandas dataframe and do a pd.merge: # TS_BILLPLAN : PREV & PREV_PREV attributes billplan = billplan.to_pandas() billplan =…
Ee Ann Ng
  • 109
  • 1
  • 8
0
votes
1 answer

Solving a system of multi-variable equations using PySpark on Databricks

Any suggestions, help, or references are most welcome for the problem statement below. I am performing big data analysis on data that is currently stored on Azure. The actual implementation is more complex than the set of equations provided…
0
votes
0 answers

Using Koalas, how do I save to an external table?

I have the code below to save a Koalas dataframe to an ORC table. How can I modify it to save to an EXTERNAL table? df.reset_index().to_orc( f"/corporativo/mydatabase/mytable", mode="overwrite", partition_cols=["year", "month"] )
neves
  • 33,186
  • 27
  • 159
  • 192
0
votes
0 answers

How to save a Koalas dataframe to ORC using ZLIB compression?

Koalas dataframes (the pandas API on Spark) have a to_orc method to save in the ORC format. How do I call it so that it saves the output compressed with the ZLIB method?
neves
  • 33,186
  • 27
  • 159
  • 192
0
votes
0 answers

Can I do a groupby and shift on a Date column using Pandas on Spark API, similar to how I can in Pandas?

In pandas, I can run the following code: contract['PREV_END'] = contract.groupby('SUBSCR_NO').END.shift(1) But using the pandas API on Spark, I get this error: AnalysisException: cannot resolve 'isnan(lag(CON_END, 1, NULL) OVER (PARTITION BY SUBSCR_NO…
Ee Ann Ng
  • 109
  • 1
  • 8
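For reference, the plain pandas behaviour the question describes can be reproduced with a minimal sketch like the one below. The column names SUBSCR_NO, END and PREV_END come from the question; the rows are made-up sample data, and nothing here attempts to reproduce or fix the Spark-side AnalysisException.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
contract = pd.DataFrame({
    "SUBSCR_NO": [1, 1, 1, 2, 2],
    "END": pd.to_datetime([
        "2021-01-31", "2021-06-30", "2021-12-31",
        "2020-03-31", "2020-09-30",
    ]),
})

# Within each subscriber, shift END down by one row so every row
# sees the END value of the previous row of the same group.
contract["PREV_END"] = contract.groupby("SUBSCR_NO")["END"].shift(1)

print(contract)
```

The first row of each SUBSCR_NO group has no predecessor, so its PREV_END is NaT.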
0
votes
0 answers

koalas: does it have PARTITION BY + ROW_COUNT()?

I'm trying to use Koalas to process my dataframes. Does it have rolling window functions over partitions? Something like PARTITION BY and ROW_NUMBER() in Hive or Postgres?
Felix
  • 3,351
  • 6
  • 40
  • 68
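As a point of comparison, a ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) style numbering can be sketched in plain pandas with groupby plus cumcount. The dept and salary columns and their values are hypothetical. Koalas documents a GroupBy.cumcount method with the same name, but whether it honours a preceding sort on a distributed dataframe is an assumption to verify, so the sketch below uses pandas only.

```python
import pandas as pd

# Hypothetical data: number the rows inside each dept, ordered by salary.
df = pd.DataFrame({
    "dept": ["a", "a", "b", "a", "b"],
    "salary": [30, 10, 50, 20, 40],
})

# The sort plays the role of ORDER BY, the groupby the role of PARTITION BY.
ordered = df.sort_values(["dept", "salary"])

# cumcount() numbers rows from 0 within each group, so add 1
# to match the 1-based numbering of ROW_NUMBER().
ordered["row_number"] = ordered.groupby("dept").cumcount() + 1

print(ordered)
```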
0
votes
1 answer

Facing issues installing koalas for Python version 3.8.10 (AttributeError: module 'numpy' has no attribute 'bool')

According to this document: https://koalas.readthedocs.io/en/latest/getting_started/install.html. System info: numpy 1.24.3, koalas 1.8.2, pyspark 3.4.0, Python 3.8.10. I am facing an issue when trying to read a CSV file: import databricks.koalas as…
Sam777
  • 15
  • 6
0
votes
2 answers

The method `pd.groupby.GroupBy.prod()` is not implemented yet

I have a database with two columns: name (str) and probability (float). I am running this command: df[['name','probability']].groupby('name').prod() on a Databricks (runtime 7.3) notebook and df is a pyspark.pandas dataframe. The error I get…
Qarolina
  • 1
  • 1
0
votes
1 answer

How to pivot a string column using the pandas API on Spark

I am attempting to convert some code my organization uses from pandas dataframes to pandas API on Spark dataframes. We have run into a problem when we try to convert our pivot functions, where the pandas API on Spark does not allow pivot operations on…
MacMixer13
  • 73
  • 1
  • 8
0
votes
0 answers

Series object error in koalas using count vectorizer

I am new to Spark and am trying to run a count vectorizer on a Koalas dataframe, but I get an error with this code. Koalas uses the pandas API, so I tried to run this count vectorizer code but got an error - 'Series' object has no attribute…
0
votes
0 answers

Koalas: ValueError: not enough values to unpack (expected 3, got 2)

I am doing a simple dfq.head() on a Koalas dataframe but got the error below. I know this is not related to what my data looks like but rather to the versions of the libraries I am using, but I can't figure out the issue. This is my spark…
heinistic
  • 731
  • 2
  • 8
  • 16
0
votes
2 answers

group by in pandas API on spark

I have a pandas dataframe below, data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year':…
code_bug
  • 355
  • 1
  • 12
0
votes
1 answer

Pass a python variable to SQL query with koalas

I am using a Databricks notebook and I would like to pass several Python variables to a SQL query using koalas.sql. Here is a simplified example of what I am trying to do. import databricks.koalas as ks query = """ SELECT * FROM…
Qarolina
  • 1
  • 1
0
votes
1 answer

Index position in koalas

I have a Koalas dataframe and I am trying to find the index value of a specific record, but I keep getting the error "TypeError: 'Int64Index' object is not subscriptable". Below is the code I tried. kdf = ks.DataFrame({ 'id':[1,2,3,4], …
Nikesh
  • 47
  • 6