
I'm quite new to vaex ;)

Problem: I'm importing a large number of log files into vaex, each as a single string converted to lowercase. After that I calculate the length of each string and store it in the column size. For every string I also calculate the most frequent digram and store it in topdigram.

Now I would like to replace the most frequent digram in each string with another letter, row by row, since the digram differs from row to row.

Is there any way to implement this with str.replace? Or do I need a completely new implementation, using multiprocessing to at least get parallelization?

 #  InputFile      Contents                                  lower                                       size    topdigram  compressed    compressedsize
  0  HDFS_1_46.log  123456789                                 123456789                                     10           12  error         error
  1  HDFS_2_42.log  File 2: 222222222222222222222222          file 2: 222222222222222222222222              33           22  error         error
  2  HDFS_3_10.log  File 3: 33333333333333333333333333333333  file 3: 33333333333333333333333333333333      41           33  error         error
  3  HDFS_4_25.log  File 4: 444444444444444444444444444444    file 4: 444444444444444444444444444444        39           44  error         error
  4  HDFS_5_6.log   File 5: 5555555555555555555555555555      file 5: 5555555555555555555555555555          37           55  error         error
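
For reference, the topdigram column can be understood as the most frequent pair of adjacent characters, counted with overlap. The following is a minimal plain-Python sketch of that step, not the actual code behind the table above; `top_digram` is a hypothetical helper name, and ties are assumed to go to the digram seen first.

```python
from collections import Counter

def top_digram(text):
    """Return the most frequent pair of adjacent characters ('' if text is too short)."""
    # zip(text, text[1:]) yields every overlapping pair of neighbouring characters.
    counts = Counter(a + b for a, b in zip(text, text[1:]))
    if not counts:
        return ""
    # most_common(1) returns the first-seen digram when counts are tied.
    return counts.most_common(1)[0][0]

rows = ["123456789", "file 2: " + "2" * 24]
digrams = [top_digram(r) for r in rows]
print(digrams)  # ['12', '22']
```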

1 Answer


I don't know how performant this would be, but this appears to work:

import vaex

# Example data that you provided
df = vaex.from_dict({'lower': ['123456789', '222222222222222222222222', '33333333333333333333333333333333'],
                     'topdigram': ['12', '22', '33']
                    })

# Create a custom function that does the replacing
# This is needed since the value you want to replace changes from row to row
def my_replace(x, old, new='x'):
    return x.replace(old, new)

# Use apply, because the value to replace differs per row.
# Ideally you want to avoid apply, but I don't know how in this case.
# At least you get multiprocessing by default with the vaex apply!
df['new_string'] = df.apply(my_replace, arguments=(df.lower, df.topdigram))
print(df)
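
To sanity-check the result without vaex, the same function can be run over plain Python lists. This is only a sketch of the per-row logic, using the example values from above. Note that str.replace substitutes non-overlapping occurrences from left to right, so 24 twos become 12 placeholders. It also assumes the placeholder 'x' does not already occur in the logs; otherwise the replacement could not be reversed.

```python
def my_replace(x, old, new="x"):
    # Replace every non-overlapping occurrence of `old` in `x` with `new`.
    return x.replace(old, new)

lower = ["123456789", "2" * 24, "3" * 32]
topdigram = ["12", "22", "33"]

# Pair each string with its own digram, which is what apply does row by row.
new_string = [my_replace(s, d) for s, d in zip(lower, topdigram)]
new_size = [len(s) for s in new_string]

print(new_string[0])  # x3456789
print(new_size)       # [8, 12, 16]
```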
Joco