Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.


120 questions
1
vote
2 answers

How to get number of groups in a groupby object in koalas?

How can I get the number of groups in a groupby object in Koalas? In pandas we can use ngroups, but this attribute is not implemented yet in Koalas. Suppose the groupby object is called dfgroup. Any ideas?
Ousen92i
  • 137
  • 1
  • 8
1
vote
1 answer

Is there a better solution than dt.weekofyear?

Is there a better solution than df['weekofyear'] = df['date'].dt.weekofyear? The problem with this solution is that, sometimes, the days after the last week of year n but before the first week of year n+1 are counted as week 1 and not as…
Ousen92i
  • 137
  • 1
  • 8
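
One possible reading of the week-numbering problem above, offered as a hedged sketch rather than the accepted answer: ISO 8601 week numbers belong to an ISO year that can differ from the calendar year around 1 January, so pairing each week with its ISO year removes the ambiguity. The sketch uses only `datetime.date.isocalendar()` from Python's standard library; the helper name `iso_year_week` and the sample dates are illustrative assumptions, not taken from the question.

```python
from datetime import date


def iso_year_week(d: date) -> tuple:
    """Return (ISO year, ISO week) for a date (hypothetical helper)."""
    iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return (iso[0], iso[1])


# Illustrative dates near a year boundary (assumed, not from the question).
samples = [date(2019, 12, 29), date(2019, 12, 30), date(2021, 1, 1)]
labels = [iso_year_week(d) for d in samples]

for d, label in zip(samples, labels):
    print(d.isoformat(), label)
```

Grouping on the (ISO year, ISO week) pair instead of the week number alone keeps late-December days labelled week 1 apart from the early-January days of the same calendar year. How the ISO year is obtained inside a Koalas or pandas DataFrame depends on the library version, so that part is left open here.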
1
vote
1 answer

HIVE JDBC Connection Using Pyspark returns Column names as row values

I am using PySpark to connect to Hive and fetch some data. The issue is that every returned row contains the column names as its values. The column names themselves are correct; only the row values are wrong. Here is my…
1
vote
2 answers

How to calculate an average stock price depending on periods

I am trying to calculate the average opening price for a stock over different periods (week, month, year). Here you can see part of my df (987 rows in the complete df). Firstly, I am trying to calculate the average opening…
Ousen92i
  • 137
  • 1
  • 8
1
vote
1 answer

PandasNotImplementedError: Using nested np.where() in a Koalas DataFrame returns error

I am converting code written with pandas to Koalas, but I am running into an error when using numpy's where: import pandas as pd import numpy as np import databricks.koalas as ks data = {'credit': [123.23, 23423.56, 0, 0], 'debit': [0, 0, 234.21,…
Whitewater
  • 297
  • 2
  • 12
1
vote
1 answer

How to change the value in a Koalas DataFrame based on a condition

I am using Koalas and I want to change the value of a column based on a condition. In pandas I can do that using: import pandas as pd df_test = pd.DataFrame({ 'a': [1,2,3] ,'b': ['one','two','three']}) df_test2 = pd.DataFrame({ 'c':…
J.C Guzman
  • 1,192
  • 3
  • 16
  • 40
1
vote
1 answer

Sum null values using Koalas

What is a good way to count all null / NaN values in a DataFrame when using Koalas? Stated another way, how might I return a list of total null value counts by column? I am trying to avoid converting the DataFrame to Spark or pandas if…
1
vote
2 answers

Ffill and interpolate koalas dataframe

Is it possible to interpolate and ffill different columns in a Koalas DataFrame, something like this? %%spark -s sparkenv2 kdf = ks.DataFrame({ 'id':[1,2,3,4], 'A': [None, 3, None, None], 'B': [2, 4, None, 3], 'C': [99, None, None,…
Zeus
  • 1,496
  • 2
  • 24
  • 53
1
vote
1 answer

Koalas applymap moving all data to a single partition

I need to do an element-wise operation on a Koalas DataFrame. For that I use the Koalas applymap method. During execution, Koalas moves all data to one partition and then applies the operation. As a result, the performance of the job is very…
Grzegorz
  • 1,268
  • 11
  • 11
1
vote
1 answer

Databricks Koalas fails importing parquet file

I ran into an error when importing a Parquet file from Azure Data Lake into Databricks. I tried other ways, such as importing the Parquet file as a Spark DataFrame, which succeeded, but when I converted the Spark DF to a Koalas DF, it gave the same error. I also tried to…
MiRe Y.
  • 57
  • 8
1
vote
0 answers

Configure pyspark standalone to run executors by users

I had an issue writing a Parquet file using PySpark (Koalas) with a standalone cluster. The error I encountered was java.io.IOException: Could not rename file. I figured out from here that it was because the driver was run by the user, and executor processes…
Matthew Son
  • 1,109
  • 8
  • 27
1
vote
1 answer

Impossible to import koalas in scala notebook

It seems basic, but despite what I see on the Databricks website, nothing works on my side. I have installed the koalas package on my cluster, but when I try to import the package in my Scala notebook, I have an issue. command-3313152839336470:1: error: not found:…
Matthieu K
  • 25
  • 1
  • 7
1
vote
0 answers

Unable to load a JSON file in koalas, getting connection refused error

Problem description: I tried to load a JSON file using Koalas, but it throws a connection refused error. Can someone please help me figure out the issue, or tell me if I am missing anything here? Package versions: PySpark: '2.4.3' koalas:…
Naga Budigam
  • 689
  • 1
  • 10
  • 26
1
vote
1 answer

PySpark Cannot calculate column wise standard deviation in Koalas DataFrame

I have a Koalas DataFrame in PySpark. I want to calculate the column-wise standard deviation. I have tried doing: df2['x_std'] = df2[['x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10','x_11', 'x_12']].std(axis = 1) I get the…
K. K.
  • 552
  • 1
  • 11
  • 20
1
vote
1 answer

Do I need to install Koalas on every node of my Spark cluster or just on the master node?

At Spark+AI Summit I discovered Koalas, which brings pandas to Spark. As far as I know, if I need to map a third-party function over a Spark DataFrame, I have to install the package on every node of my Spark cluster. Is this the same for Koalas? Or I…
Yuan JI
  • 2,927
  • 2
  • 20
  • 29