Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

120 questions
0 votes, 0 answers

concat in koalas dataframe fails with error "Job aborted due to stage failure"

I am running a PySpark notebook in an Azure Databricks environment on an autoscaling cluster (2 to 32). I have two dataframes, df1 and df2, and I am concatenating them with the code below using Pandas. df1 -> 9 columns and around 11 million records. df2 -> exactly…
Nikesh
0 votes, 1 answer

Koalas Dataframe read_csv reads null column as not null

I am loading a sample CSV file using koalas and I see some weird behavior. The file has a blank column, area_code, which looks like this. As you can see, it is a blank column: all the rows for this column are blank. When I read the file…
KrazzyNefarious
0 votes, 1 answer

how to change individual value in koalas dataframe?

I'm looking to update a single value in a koalas dataframe by passing its positional index. I have tried using iat but keep running into errors. For example, if I try df.iat[2,3]=5, I get an error saying 'iAtIndexer' does not support item…
0 votes, 1 answer

DataFrame Styler Object HTML Does Not Render Correctly in Amazon SES Email

I am working in a Databricks Python notebook. I can currently render a DataFrame Styler object inside the notebook successfully, and it looks like the following: I now want to email the above as a report inside the body of an email. I followed this…
mmTmmR
0 votes, 0 answers

Unique ID Column Returns Multiple Values

I have a Koalas dataframe with what I would expect to be a unique ID column ("index"), created by resetting the index on a dataframe using breakdf2 = breakdf.reset_index(). I believe I can prove that breakdf2 only has one record for index = 0…
L. Taylor
0 votes, 1 answer

Update Multiple Column Values with Koalas

In Pandas I can use the following to select rows for 4 dates and then change each date's value in a second column, power_table_2: df.loc[df['date'].isin(['2022-03-14','2022-03-22','2022-04-01','2022-04-05']),'power_table_2'] =…
mmTmmR
0 votes, 1 answer

AttributeError: module 'databricks.koalas' has no attribute 'DateOffset'

I am replacing the pandas library with the Koalas library in my Python repo in VS Code, but the Koalas module does not seem to have a DateOffset() similar to the one pandas has. I tried this: import databricks.koalas as ks kdf["date_col_2"] =…
user19930511
0 votes, 0 answers

Unable to read parquet file into koalas dataframe

I am replacing the pandas API with the Koalas API in my project. I am trying to read a parquet file from a location but I am getting the error below. import databricks.koalas as ks kdf = ks.read_parquet(path, columns=column_names) kdf =…
user19930511
0 votes, 1 answer

Iterate over two different dataframes efficiently on a specific column and store only the common rows

I have two dataframes as shown below. import databricks.koalas as ks input_data = ks.DataFrame({'code':['123a', '345b', '678c'], 'id':[1, 2, 3]}) my_data = ks.DataFrame({'code':['123a', '12a', '678c'], 'id':[7, 8, 9], 'stype':['A',…
Anna
0 votes, 2 answers

Check if two dataframes have the same values in the column using .isin in koalas dataframe

I am having a small issue comparing two dataframes, which are detailed below. Both dataframes are in koalas. import databricks.koalas as ks mini_team_df_1 = ks.DataFrame(['0000340b'], columns =…
Anna
0 votes, 1 answer

TypeError: 'module' object is not callable for time on Koalas dataframe

I am facing a small issue with a line of code that I am converting from pandas to Koalas. Note: I am executing my code in Databricks. The following line is the pandas code: input_data['t_avail'] = np.where(input_data['purchase_time'] != time(0,…
Laura Smith
0 votes, 0 answers

date type hints in koalas

Say I want to run this code with type hints: def foo(df): """A very simple function which only add 3 days to one of the dataframe's datetime columns. """ df['time'] = df['col1'] + pd.Timedelta('3D') return df # Creating a dummy…
Eran
0 votes, 0 answers

PySpark Cosine Similarity between two vectors of TF-IDF values the Cosine Similarity using SparseMatrix + koalas or Pandas API on Spark

I am trying to implement this Name Matching Cosine Similarity approach/function get_matches_df in pyspark and pandas_on_spark (koalas) and am struggling to optimize this function (I am trying to avoid toPandas() conversion for dataframes because will…
n1tk
0 votes, 0 answers

Index names must be exactly matched currently

I am trying to add a koalas dataframe to an entitySet. Here is the code for it: subset_kdf_fp_eta_gt_prd.spark.print_schema() root |-- booking_code: string (nullable = true) |-- order_id: string (nullable = true) |-- restaurant_id: string (nullable…
Mohit Jain
0 votes, 0 answers

Different results using Pandas vs Koalas notna()

dtype={"ColA": str} ---------------------------------------------- use_koalas: True df: ColA ColB ColC 0 A 0 0.00 1 None 1 12.30 2 C 2 22.20 3 D 1 …