I have some text data:
data1
id | comment | title |
---|---|---|
user_A | good | a file name |
user_B | a better way is… | is there some good sugg? |
user_C | a another way is… | is there some good sugg? |
user_C | I have been using Pandas for a long time, so I… | a book |
You can use `pd.read_clipboard()` to replicate it.
data2
userid | title |
---|---|
user_X | is there some good sugg? |
user_Y | a great idea… |
user_Z | a file name |
user_W | a book |
desired output
uid | comment | title | uid |
---|---|---|---|
user_A | good | a file name | user_Z |
user_B | a better way is… | is there some good sugg? | user_X |
user_C | a another way is… | is there some good sugg? | user_X |
user_C | I have been using Pandas for a long time, so I… | a book | user_W |
An easy way is to merge on `title`. In pandas:
dataall = pd.merge(
    data1, data2,
    on='title',
    how='left',
)
But it's memory-expensive.
The shape of data1 is (2942087, 7) (sometimes it has more than three times as many rows), and the shape of data2 is (47516640, 4).
I have 32 GB of RAM, but that is not enough.
I also tried polars:
dataall = data1.join(
    data2,
    on='title',
    how='left',
)
An error occurs:
Canceled future for execute_request message before replies were done
I have also tried the `is_in` function in polars and encoding the text as numbers. Both are fast, but I don't know how to use them to produce the desired output.
Is there an efficient and feasible way to do this with pandas/polars/numpy?
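To make the "encoding the text as numbers" idea concrete, here is a minimal plain-Python sketch of what I have in mind, using only the sample rows above. The helper names `encode_titles` and `left_join_on_title` are hypothetical, and the sketch assumes each title appears at most once in data2, which holds for the sample but which I have not verified for the real data.

```python
# Sample rows copied from the tables above.
# data1 rows are (id, comment, title); data2 rows are (userid, title).
data1 = [
    ("user_A", "good", "a file name"),
    ("user_B", "a better way is…", "is there some good sugg?"),
    ("user_C", "a another way is…", "is there some good sugg?"),
    ("user_C", "I have been using Pandas for a long time, so I…", "a book"),
]
data2 = [
    ("user_X", "is there some good sugg?"),
    ("user_Y", "a great idea…"),
    ("user_Z", "a file name"),
    ("user_W", "a book"),
]


def encode_titles(titles):
    """Give each distinct title a small integer code (hypothetical helper)."""
    codes = {}
    for title in titles:
        codes.setdefault(title, len(codes))
    return codes


def left_join_on_title(left, right):
    """Left-join `left` to `right` on title via integer codes."""
    codes = encode_titles(title for _, title in right)
    # Assumption: each title occurs at most once in `right`;
    # if it occurs more than once, the last userid wins.
    code_to_user = {codes[title]: userid for userid, title in right}
    joined = []
    for uid, comment, title in left:
        code = codes.get(title)  # None when the title is not in `right`
        joined.append((uid, comment, title, code_to_user.get(code)))
    return joined


result = left_join_on_title(data1, data2)
for row in result:
    print(row)
```

This never builds an intermediate joined table, so memory use is roughly one dictionary over the distinct titles in data2. What I don't know is how to express the same lookup efficiently in pandas or polars at the real data size.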
Edit (2022-5-24 16:00:10): after the suggestion by @ritchie46, I tried the following:
import polars as pl
pl.Config.set_global_string_cache()
data1 = pl.read_parquet('data1.parquet.gzip').lazy()
data2 = pl.read_parquet('data2.parquet.gzip').lazy()
data1 = data1.with_column(pl.col('source_post_title').cast(pl.Categorical))
data2 = data2.with_column(pl.col('source_post_title').cast(pl.Categorical))
dataall = data1.join(
    data2,
    on='source_post_title',
    how='left',
).collect()
The code seems to run for a while and then fails with:
Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.
Is this because my processor is too weak?
My CPU is an i7-10850H.