I am trying to replicate some data preprocessing that I did in pandas in TensorFlow Transform.
I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not clear to me how to reproduce the same data manipulation there. Let's look at the two main operations: JOIN `dataset a` and `dataset b` to produce `c`, and `group by col1` on `dataset c`. This would be quite straightforward in pandas, but how would I do it in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
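For concreteness, here is a minimal pandas sketch of the two operations. Only `col1` comes from the description above; the join column `key`, the aggregated column `value`, the sample rows, and the choice of `sum` as the aggregation are placeholders assumed for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV-backed datasets a and b.
a = pd.DataFrame({"key": [1, 2, 3], "col1": ["x", "x", "y"]})
b = pd.DataFrame({"key": [1, 2, 3], "value": [10.0, 5.0, 1.0]})

# JOIN dataset a and dataset b to produce c.
c = a.merge(b, on="key", how="inner")

# Group by col1 on dataset c (sum is an assumed aggregation).
agg = c.groupby("col1", as_index=False)["value"].sum()
print(agg)
```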

DarioB
1 Answer
You can use the Beam DataFrames API to do the join and other preprocessing just as you would in pandas. You can then use `to_pcollection` to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.
For top-level functions (such as `merge`), you need to do `from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd` and use `beam_pd.func(...)` in place of `pd.func(...)`.

robertwb
- Thanks for the answer. I have checked the Beam DataFrame API and it looks great, but I couldn't find an example anywhere of how to actually join the dataframes. Can you provide or link to one? Thanks – DarioB Mar 29 '22 at 15:06
- https://pandas.pydata.org/docs/user_guide/merging.html – robertwb Mar 29 '22 at 16:58
- That's for merging pandas DataFrames; how would you merge a Beam DataFrame? Would you still use the pandas library? Will Beam distribute it across a cluster? – DarioB Mar 29 '22 at 17:03
- pandas `merge` only accepts DataFrames; if I try to merge it, I get `TypeError: Can only merge Series or DataFrame objects, a was passed` – DarioB Mar 29 '22 at 17:28