Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

120 questions
0
votes
1 answer

how to convert pandas dataframe to koalas with mixed data types

I'm using Azure Databricks to convert a pandas DataFrame into a Koalas DataFrame... kdf = ks.DataFrame(pdf) This results in the error message "an integer is required (got type str)". I tried adding a dtype of str to force the Koalas DataFrame to…
Laura Baker
  • 497
  • 5
  • 14
0
votes
1 answer

Spark/Koalas implementation of pandas resample('D') method

I have a Spark dataframe that needs to be ffilled. The size of the dataframe is large (>100 million rows). I'm able to achieve what I want using pandas as shown below. new_df = df_pd.set_index('someDateColumn') \ .groupby(['Column1',…
svn
  • 159
  • 10
0
votes
1 answer

How to run a pandas-Koalas program using spark-submit (Windows)?

I have a pandas DataFrame (sample program) that I converted to a Koalas DataFrame, and now I want to execute it on a Spark cluster (Windows standalone). When I run spark-submit --master local hello.py from the command prompt, I get the error ModuleNotFoundError: No module…
Kumar Prvn
  • 31
  • 1
  • 7
0
votes
1 answer

code containing .iloc not working with Koalas dataframe

koalas_img = ks.read_spark_io(path="/mnt/databricks/demo/CarClassification/cars_train/009*.jpg", format="binaryfile") koalas_img.shape Out[16]: (100, 4) type(koalas_img) Out[17]: databricks.koalas.frame.DataFrame koalas_img.columns Out[18]:…
0
votes
2 answers

Databricks Koalas: use for loop to create new columns with conditions and dynamically name the new column based on the old column names

Example dataset: kdf = ks.DataFrame({"power_1": [50, 100, 150, 120, 18], "power_2": [50, 150, 150, 120, 18], "power_3": [60, 100, 150, 120, 18], "power_4": [150, 90, 150, 120, 18], …
MiRe Y.
  • 57
  • 8
0
votes
1 answer

koalas groupby -> apply returns 'cannot insert "key", already exists'

I've been struggling with this issue and haven't been able to solve it. I have the following dataframe: import databricks.koalas as ks x = ks.DataFrame.from_records( {'ds': {0: Timestamp('2018-10-06 00:00:00'), 1: Timestamp('2017-06-08 00:00:00'), …
0
votes
0 answers

Error Importing Koalas in PyCharm using Databricks Connector (Python 3.7)

I see the following error when I attempt to import koalas from databricks. I am using pyspark v2.4.5 and I'm able to successfully connect to my Spark cluster. It seems that using python 3.5 and connecting to Databricks Runtime 5.x works. I created a…
0
votes
1 answer

What is the Koalas equivalent of the pandas explode() function?

I would like to explode a Koalas column containing lists of values into multiple columns. When I am trying to use df.explode() as documented here, I am getting the AttributeError: 'DataFrame' object has no attribute 'explode'. I know Koalas is a…
DataBach
  • 1,330
  • 2
  • 16
  • 31
0
votes
1 answer

Koalas data frame column operation

I have a Koalas DataFrame with approximately 6 million rows. I need to read every row in the DataFrame, extract the values of each row, and then do a lookup in a list (the list has 30K elements). If it is…
0
votes
1 answer

how to include a parameter into a mask or where function in a Koalas dataframe

I have a Koalas dataframe running in Azure Databricks, let's say: import databricks.koalas as pd df = pd.DataFrame({'category': ['A', 'A', 'B'], 'col1': [1, 2, 3], 'col2': [4, 5, 6]}, …
RobertoST
  • 61
  • 5
0
votes
1 answer

How to change the colors when calling koalas.hist() with RGB values

I have a koalas dataframe. I would like to plot a histogram but I would like to change the color with an RGB tuple (r,g,b). How can I alter the code below to do this? import databricks.koalas as ks import pandas as pd import numpy as np import…
DeeeeRoy
  • 467
  • 2
  • 5
  • 13
0
votes
0 answers

Not able to perform operations on koalas dataframes

import databricks.koalas as ks df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]}) df.columns = ['x', 'y', 'z1'] df['x2'] = df.x + df.x print(df) Unable to get any output when run in a Jupyter Notebook. The code keeps running with a…
-1
votes
1 answer

Pandas to Koalas does not solve spark.rpc.message.maxSize exceeded error

I have an existing Databricks job that relies heavily on pandas, and the code snippet below gives the error "org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 101059:0 was 1449948615 bytes, which exceeds max allowed:…
Nikesh
  • 47
  • 6
-1
votes
2 answers

How to convert np.where() while converting pandas to koalas?

I was converting some pandas Series and DataFrames to Koalas for scalability. Where I was using np.where(), I tried to pass a Koalas DataFrame in the same way I had previously passed a pandas DataFrame, but I got an error an…
-2
votes
1 answer

PySpark based approach to inline regex matching like Pandas

I have a code snippet that works great in pandas; however, my data size is quite large and pandas consumes a lot of memory. This is why I am looking for a solution based on either PySpark or Koalas, since both are Spark-based and highly scalable.…
T3J45
  • 717
  • 3
  • 12
  • 32