How to use pandas in Apache Beam?

Question

How can I use pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. The Apache Beam documentation is also hard to follow. I looked but could not find any pandas integration in Apache Beam. Can anyone point me to the relevant documentation?

And if pandas dataframes cannot be used in Apache Beam, why are they mentioned in the GCP documentation? – Nagesh Singh Chauhan, Feb 15 '18 at 12:00

score 14 · Accepted Answer · answered Feb 17 '18 at 03:48

There's some confusion going on here.

pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.

It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.

That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.

jkff

Piping pandas into a PCollection would be a great feature! – eilalan Feb 19 '18 at 02:47
@eilalan : Exactly, this is what I'm looking for. Can you please show this with some sample code? I'm new to Apache Beam and just want to get an idea of how to pipe pandas into a PCollection. – Nagesh Singh Chauhan Feb 23 '18 at 05:42
@jkff : I understand what you have written, but my question remains the same: how do I use pandas in Apache Beam? **I need sample code that demonstrates using pandas with a PCollection.** – Nagesh Singh Chauhan Feb 23 '18 at 05:45
I'm sorry, I don't understand what you're asking. – jkff Feb 23 '18 at 06:05
@jkff : Can you provide sample code that shows a PCollection and pandas used together? I didn't find any source code that properly demonstrates this. I know pandas would be used inside a DoFn, but how do I pass arguments from ParDo, and how do I write the method that performs a pandas operation? I hope I'm clear now. Thanks again. – Nagesh Singh Chauhan Feb 23 '18 at 11:47
That's clearer, thanks. Before I give an example, let me double-check: you're *not* looking to process the contents of a PCollection as if it were a dataframe using pandas, and you're *not* looking to process the rows of a pandas dataframe in parallel using Beam as if it were a PCollection, correct? (Neither of these is possible.) – jkff Feb 23 '18 at 17:38
Well, if I can create a dataframe using pandas, I'll do concatenation, groupBy on multiple columns, aggregation of measures, etc. – Nagesh Singh Chauhan Feb 24 '18 at 04:10
Yes, but you can do that without Beam, right? What role do you want Beam to play? – jkff Feb 24 '18 at 05:38
@jkff : I can do all of these operations (concatenation, groupBy on multiple columns, aggregation of measures) using pandas. I have to implement them on Beam because that is what my client wants. – Nagesh Singh Chauhan Feb 26 '18 at 06:17
Beam supports similar operations, but as mentioned above, does not provide a pandas-like interface for them. If this is what your client is requesting, you'll need to implement it yourself on top of the Beam API. – jkff Feb 26 '18 at 07:05
FWIW, this is now being implemented as part of Beam. – robertwb Nov 16 '20 at 20:09

score 7 · Answer 2 · answered Nov 16 '20 at 20:10

In addition to using pandas directly from DoFns, Beam now has an API for manipulating PCollections as DataFrames. See https://s.apache.org/simpler-python-pipelines-2020 for more details.

robertwb

See also https://beam.apache.org/documentation/dsls/dataframes/overview/ – TheNeuralBit Jan 21 '22 at 23:51

score 2 · Answer 3 · edited Mar 20 '18 at 07:08

pandas is supported in the Dataflow SDK for Python 2.x. As of writing, workers have pandas v0.18.1 pre-installed, so you should not have any issues with that. Stack Overflow does not accept questions that ask the community to point you to external documentation or tutorials, so you should first try an implementation yourself and then come back with more information about what is and isn't working, and what you achieved before running into an error.

In any case, if what you want to achieve is a left join, maybe you can also have a look at the CoGroupByKey transform type, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections with a common key type. In that same page, you will be able to find some examples, which use CoGroupByKey and ParDo to join the contents of several data objects.
