
I'm a Jr. Data Scientist and I'm trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.

                                      [...]
    def send_to_bq(self, df):
        result = []
        for i, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
            data_dict = {
                "processing_timestamp": str(row["processing_timestamp"]),
                "id": row["id"],
                "embeddings_vector": [str(x) for x in row["vectors"]],
            }
            result.append(data_dict)
                                      [...]

Our DataFrame has the following structure:

           id                                               name  \
0  3498001704  roupa natal flanela animais estimacao traje ma...   

                                             vectors  \
0  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...   

        processing_timestamp  
0 2021-10-26 23:48:57.315275

Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:

  1. I can use apply
  2. I can vectorize it through Pandas Series (better than apply)
  3. I can vectorize it through NumPy (better than Pandas vectorization)
  4. I can use Swifter - which starts from the apply method and then picks the best strategy for you among Dask, Ray, and vectorization

But I don't know how I can transform my code for those solutions.

Can anyone help me demonstrating a solution for my code? One is enough, but if someone could show more than one solution would be really educational for this matter.

Any help I will be more than grateful!

3 Answers


So basically you convert everything to string and then transform your DataFrame into a list of dicts.

For the second part, there is a pandas method, to_dict. For the first part, I would use astype, and apply only for the per-element conversion:

# Vectorized cast of the whole timestamp column to string
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# apply is only needed for the per-row list-of-floats conversion
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# to_dict('records') produces one dict per row, like your loop did
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict('records')

A bit hard to test without sample data, but hopefully this helps ;) Also, as I did with the lambda function, you could basically wrap your entire loop body inside an apply, but that would create far too many temporary dictionaries to be fast.
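As a quick end-to-end sanity check, here is a minimal sketch with made-up sample data modeled on the question's schema (the values are illustrative, not the asker's real data):

```python
import numpy as np
import pandas as pd

# Hypothetical one-row DataFrame mirroring the question's columns
df = pd.DataFrame({
    "id": [3498001704],
    "name": ["roupa natal flanela"],
    "vectors": [np.array([0.4021441, 0.45425776])],
    "processing_timestamp": pd.to_datetime(["2021-10-26 23:48:57.315275"]),
})

# Vectorized string cast for the timestamp column
df["processing_timestamp"] = df["processing_timestamp"].astype(str)
# Per-row conversion of the float array into a list of strings
df["embeddings_vector"] = df["vectors"].apply(lambda row: [str(x) for x in row])
# One dict per row, ready to send to BigQuery
result = df[["id", "embeddings_vector", "processing_timestamp"]].to_dict("records")
print(result)
```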

maow
  • It worked perfectly. You understood correctly although I didn't post a data sample (I need to anonymize it first). I needed to transform everything to string and this is the better solution, using astype for timestamp, apply with lambda function for a list of vectors (actually an array) and using to_dict with 'records' so it can iterate similarly to iterrows. – Guilherme Giuliano Nicolau Oct 26 '21 at 16:42

You can use agg:

>>> df.agg({'id': str, 'vectors': lambda v: [str(i) for i in v], 
            'processing_timestamp': str}).to_dict('records')

[{'id': '3498001704',
  'vectors': ['0.4021441', '0.45425776', '0.3963987', '0.23765437'],
  'processing_timestamp': '2021-10-26 23:48:57.315275'}]
Corralien

You can use pandas.DataFrame methods to convert it to other types, such as DataFrame.to_dict().

Amir Aref