I'm a Jr. Data Scientist trying to solve a problem that may be simple for experienced programmers. I'm dealing with Big Data on GCP and I need to optimize my code.
[...]
def send_to_bq(self, df):
    result = []
    # Build one JSON-serializable dict per row for the BigQuery payload
    for _, row in df[["id", "vectors", "processing_timestamp"]].iterrows():
        data_dict = {
            "processing_timestamp": str(row["processing_timestamp"]),
            "id": row["id"],
            # cast each embedding value to str so the payload is serializable
            "embeddings_vector": [str(x) for x in row["vectors"]],
        }
        result.append(data_dict)
[...]
Our DataFrame has the following pattern:
   id          name                                               vectors                                             processing_timestamp
0  3498001704  roupa natal flanela animais estimacao traje ma...  [0.4021441, 0.45425776, 0.3963987, 0.23765437,...  2021-10-26 23:48:57.315275
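For reference, here is how a minimal one-row DataFrame with the same columns could be constructed; the values are just the truncated ones shown above, so treat them as placeholders:

import pandas as pd

# Placeholder one-row DataFrame matching the real schema. The values are
# copied from the truncated output above and are illustrative only.
df = pd.DataFrame(
    {
        "id": [3498001704],
        "name": ["roupa natal flanela animais estimacao traje ma..."],
        "vectors": [[0.4021441, 0.45425776, 0.3963987, 0.23765437]],
        "processing_timestamp": [pd.Timestamp("2021-10-26 23:48:57.315275")],
    }
)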
Using iterrows on a DataFrame is too slow. I've been studying alternatives and I know that:
- I can use apply
- I can vectorize it through Pandas Series (better than apply)
- I can vectorize it through NumPy (better than Pandas vectorization)
- I can use Swifter, which uses the apply method and then decides the best solution for you among vectorization, Dask, and Ray
But I don't know how to transform my code to use any of those approaches.
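The furthest I've gotten is this rough sketch of the apply version; row_to_dict is a helper name I made up, and I haven't verified that it's correct or any faster:

# My unverified attempt at the apply approach: move the per-row dict
# construction into a helper and let apply(axis=1) collect the results.
def row_to_dict(row):
    return {
        "processing_timestamp": str(row["processing_timestamp"]),
        "id": row["id"],
        "embeddings_vector": [str(x) for x in row["vectors"]],
    }

result = df[["id", "vectors", "processing_timestamp"]].apply(row_to_dict, axis=1).tolist()

From what I've read, apply with axis=1 still loops row by row in Python, so I suspect it won't be much faster than iterrows, which is why I'm also interested in the vectorized versions.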
Can anyone help me by demonstrating a solution for my code? One is enough, but if someone could show more than one, it would be really educational.
I'd be more than grateful for any help!