I am working on a data analytics project to analyze some Spotify data. I am at a data cleaning phase, and dealing with a string value column. Basically, I have series of song names that are structured like this.
df_spot['name'].head(10)
0 Piano Concerto No. 3 in D Minor, Op. 30: III. ...
1 Clancy Lowered the Boom
2 Gati Bali
3 Danny Boy
4 When Irish Eyes Are Smiling
5 Gati Mardika
6 The Wearing of the Green
7 Morceaux de fantaisie, Op. 3: No. 2, Prélude i...
8 La Mañanita - Remasterizado
9 Il Etait Syndiqué
Name: name, dtype: object
What I want to do with this column is to separate each of the words into single rows, take the top 10 most appeared word, and replace the column with binary value whether the song has a word that are contained in that top 10 word. (Of course, I will not consider words like "a", "are", "is", or numbers ...etc) To start with, I made a new data frame that only contains name of the track, and remove some of the unnecessary words, and also insert colon as a separator. Here is what I did.
df_words = df_words.str.replace(' ', ',')
df_words = df_words.str.replace(' ', ',')
df_words = df_words.str.replace('.', ',')
df_words = df_words.str.replace(':', ',')
df_words = df_words.str.replace('-', ',')
df_words = df_words.str.replace("'", ',')
df_words = df_words.str.replace('"', ',')
df_words = df_words.str.replace('?', ',')
df_words = df_words.str.replace('!', ',')
df_words = df_words.str.replace('(', ',')
df_words = df_words.str.replace(')', ',')
df_words = df_words.str.replace('[', ',')
df_words = df_words.str.replace(']', ',')
df_words = df_words.str.replace('&', ',')
df_words = df_words.str.replace('/', ',')
df_words = df_words.str.replace('1', ',')
df_words = df_words.str.replace('2', ',')
df_words = df_words.str.replace('3', ',')
df_words = df_words.str.replace('4', ',')
df_words = df_words.str.replace('5', ',')
df_words = df_words.str.replace('6', ',')
df_words = df_words.str.replace('7', ',')
df_words = df_words.str.replace('8', ',')
df_words = df_words.str.replace('9', ',')
df_words = df_words.str.replace('0', ',')
This approach ended up inserting multiple commas in a row between each row, since a song name can have multiple characters to be replaced in a row. So then here is my first question. Are there any better approach to do what I am trying to achieve? Also, are there ways to do so in less repetitive and more reproducible manner than above code? My second question is, after all the words are perfectly separated with commas, what are some of the way that I can expand each words into separate list/vector elements so that I can count the number of times each words appear in the data?