Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

120 questions
1 vote, 0 answers

Unclear why I'm getting a TypeError: str object is not callable

I have a Koalas / Pandas-on-Spark dataframe named df. When I run the expression below, I get "TypeError: str object is not callable": df[~(df.time.eq('00:00:00').groupby(df.vehicle_id).transform('sum')>=2)]. When I check the datatypes of both columns I…
sampeterson
1 vote, 0 answers

Rolling window working in pandas but not in koalas

I have a rolling window computation that works in pandas but not in koalas, and I am wondering why: import pandas as pd import databricks.koalas as ks Timestamp = pd.Timestamp df = pd.DataFrame([[Timestamp('2022-05-18 18:10:50.021831300'),…
Lei
1 vote, 1 answer

Pandas on Spark 3.2 - NLP.pipe - pd.Series.__iter__() is not implemented

I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point: df_info is of type pyspark.pandas; nlp is defined as: nlp = spacy.load('es_core_news_sm',…
Alejandro
1 vote, 0 answers

How to find memory usage for a koalas dataframe

I am trying to do some memory profiling on an Azure Databricks job. This job uses a Python script that relies heavily on koalas dataframes for analysis. I want to analyze which dataframes or objects are taking up the most memory but koalas and…
MacMixer13
1 vote, 2 answers

Join two dataframes on the values present in a specific column in the name_data dataframe using koalas

I am trying to join the two dataframes shown below on the code column values present in the name_data dataframe. I expect a resulting dataframe which would only have the rows from the…
Anna
1 vote, 1 answer

Azure Databricks - reading tables with koalas

I am quite new to Databricks, and I am trying to do some basic data exploration with koalas. When I log into Databricks, under DATA I see 2 main tabs, DATABASE TABLES and DBFS. I managed to read csv files as koalas dataframes…
1 vote, 1 answer

PandasNotImplementedError after converting a pandas dataframe to a Koalas dataframe

I am facing a small issue in my code logic. I am converting a line of code that uses a pandas dataframe to use a Koalas dataframe, and I get the following error during execution. # Error Message PandasNotImplementedError: The…
Anna
1 vote, 1 answer

PySpark dataframe transformations performance

I recently started working with PySpark (before that I worked with pandas). I want to understand how Spark executes and optimizes transformations on a dataframe. Can I apply transformations one by one, reusing a single variable for the dataframe? #creating…
Ando23
1 vote, 1 answer

Understanding the jars in pyspark

I'm new to Spark and my understanding is this: jars are like a bundle of Java code files. Each library that I install that internally uses Spark (or PySpark) has its own jar files that need to be available with both driver and executors in order for…
figs_and_nuts
1 vote, 2 answers

min() function doesn't work on koalas.DataFrame columns of date types

I created the following dataframe: import pandas as pd import databricks.koalas as ks df = ks.DataFrame( {'Date1': pd.date_range('20211101', '20211110', freq='1D'), 'Date2': pd.date_range('20201101', '20201110',…
Eran
1 vote, 1 answer

How to use UDFs with pandas on pyspark groupby?

I am struggling to use pandas UDFs on pandas on pyspark. Can you please help me understand how this is to be achieved? Below is my attempt: import pyspark from pyspark.sql import SparkSession from pyspark.sql.functions import pandas_udf from pyspark…
figs_and_nuts
1 vote, 0 answers

Does plotting with Koalas using TopN have any statistical meaning?

I was going through the source code of Koalas, trying to get a handle on how they actually achieve plotting large datasets. It turns out that they use either sampling or TopN - selecting a given number of records. I understand the meaning of…
1 vote, 1 answer

Adding a new column to an existing Koalas Dataframe results in NaNs

I am trying to add a new column to my existing Koalas dataframe, but the values turn into NaNs as soon as the new column is added. I am not sure what's going on here; could anyone give me some pointers? Here's the code: import databricks.koalas as…
ShellZero
1 vote, 1 answer

Set NOT NULL columns in koalas to_table

When I create a Delta table I can set some columns to be NOT NULL: CREATE TABLE [db_name.]table_name [(col_name1 col_type1 [NOT NULL], ...)] USING DELTA. Is there any way to set non-null columns with koalas.to_table?
kismsu
1 vote, 1 answer

How to create a new column by validating 2 or more conditions in Koalas

I created the column "Turno" on df3 using 3 validations to classify rows into "Turno_PM", "Turno_AM" or "N/A", but I want to know if there is an easier way to reach the same result, like a for loop with if/elif/else or something like that. Here…