
I am doing some EDA on the PUBG data from the Kaggle competition. I would like to convert the common game modes into the standard forms Solo, Duo, Squad, Flare, and Crash.

Here is a list of unique values:

{'flaretpp', 'crashtpp', 'squad-fpp', 'duo-fpp', 'crashfpp', 'normal-squad',
'normal-squad-fpp', 'normal-duo-fpp', 'normal-duo', 'normal-solo', 'squad',
'duo', 'solo-fpp', 'solo', 'normal-solo-fpp', 'flarefpp'}

I basically want to remove the "normal-", "-fpp", "fpp", and "tpp" substrings from the values.

I have some code that works but is very slow (there are approximately 4.5M rows). Is there a faster or better way to do this?

for i in range(len(data['matchType'])):
    data['matchType'][i] = data['matchType'][i].replace('normal-','')
    data['matchType'][i] = data['matchType'][i].replace('-fpp','')
    data['matchType'][i] = data['matchType'][i].replace('tpp','')
    data['matchType'][i] = data['matchType'][i].replace('fpp','')
eyllanesc
theotheraussie
  • I presume you have the data in a file somewhere, so instead of loading the entire file into memory and scanning over it 4 separate times to do the replacements, I'd say iterate through the file line by line, writing to a new file and performing any changes on a line by line basis, which should reduce memory consumption. – akkatracker Nov 04 '18 at 06:47

1 Answer


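Load your data into a pandas Series and do it with a single vectorized call. Pass regex=True explicitly, because pandas 2.0 and later treat the pattern as a literal string by default:

mymode.str.replace(r'normal-|-fpp|fpp|tpp', '', regex=True)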

Using your example data, that gives you:

0     flare
1     crash
2     squad
3       duo
4     crash
5     squad
6     squad
7       duo
8       duo
9      solo
10    squad
11      duo
12     solo
13     solo
14     solo
15    flare
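
To apply the same idea to the DataFrame from the question, something along these lines should work. This is a minimal sketch, not a benchmarked solution: the small `data` frame below is a made-up stand-in for the real 4.5M-row data, and the `mapping` approach is an alternative I am assuming may be faster when there are only a handful of distinct values, since each one is cleaned only once.

```python
import re

import pandas as pd

# Toy stand-in for the real frame; `data` and `matchType` are the names used in the question.
data = pd.DataFrame(
    {"matchType": ["flaretpp", "crashfpp", "squad-fpp", "normal-duo-fpp", "normal-solo", "solo", "duo"]}
)

pattern = r"normal-|-fpp|fpp|tpp"

# Option 1: one vectorized regex replacement over the whole column.
cleaned = data["matchType"].str.replace(pattern, "", regex=True)

# Option 2: clean each distinct value once, then look every row up in the resulting dict.
mapping = {mode: re.sub(pattern, "", mode) for mode in data["matchType"].unique()}
mapped = data["matchType"].map(mapping)

# Assign the result back to the column instead of writing element by element.
data["matchType"] = cleaned
```

Both options avoid the Python-level loop and the chained `data['matchType'][i] = ...` assignment, which is the main source of the slowdown in the original code.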
John Zwinck