
I'm a Jr. Data Scientist and I'm trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.

                                      [...]
    def send_to_bq(self, df):
        result = []
        for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
            data_dict = {
                "processing_timestamp": str(row["processing_timestamp"]),
                "id": row["id"],
                "embeddings_vector": [str(x) for x in row["vectors"]],
            }
            result.append(data_dict)
                                      [...]

Our DataFrame has the following structure:

           id                                               name  \
0  3498001704  roupa natal flanela animais estimacao traje ma...   

                                             vectors  \
0  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...   

        processing_timestamp  
0 2021-10-26 23:48:57.315275

Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:

  1. I can use apply
  2. I can vectorize it through Pandas Series (better than apply)
  3. I can vectorize it through NumPy (better than Pandas vectorization)
  4. I can use Swifter - which starts from the apply method and then picks the best strategy for you among Dask, Ray, and vectorization

But I don't know how I can transform my code for those solutions.

Can anyone help me demonstrating a solution for my code? One is enough, but if someone could show more than one solution would be really educational for this matter.

Any help I will be more than grateful!

3 Answers


So basically you convert everything to string and then transform your DataFrame into a list of dicts.

For the second part, there is a pandas method, to_dict. For the first part, I would use astype, and apply only for the per-element conversion:

# Vectorized cast of the whole timestamp column to string
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# apply is only needed for the per-row list-of-floats conversion
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# to_dict('records') produces one dict per row, like your loop did
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict('records')

A bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
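As a quick end-to-end sanity check, here is a minimal sketch with made-up sample data modeled on the question's schema (the values are illustrative, not the asker's real data):

```python
import numpy as np
import pandas as pd

# Hypothetical one-row DataFrame mirroring the question's columns
df = pd.DataFrame({
    "id": [3498001704],
    "name": ["roupa natal flanela"],
    "vectors": [np.array([0.4021441, 0.45425776])],
    "processing_timestamp": pd.to_datetime(["2021-10-26 23:48:57.315275"]),
})

# Vectorized string cast for the timestamp column
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# Per-row conversion of the float array into a list of strings
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# One dict per row, ready to send to BigQuery
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict("records")
print(result)
```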

maow
  • It worked perfectly. You understood correctly although I didn't post a data sample (I need to anonymize it first). I needed to transform everything to string and this is the better solution, using astype for timestamp, apply with lambda function for a list of vectors (actually an array) and using to_dict with 'records' so it can iterate similarly to iterrows. – Guilherme Giuliano Nicolau Oct 26 '21 at 16:42

You can use agg:

>>> df.agg({'id': str, 'vectors': lambda v: [str(i) for i in v], 
            'processing_timestamp': str}).to_dict('records')

[{'id': '3498001704',
  'vectors': ['0.4021441', '0.45425776', '0.3963987', '0.23765437'],
  'processing_timestamp': '2021-10-26 23:48:57.315275'}]
Corralien

You can use pandas.DataFrame methods to convert it to other types, such as DataFrame.to_dict().

Amir Aref