
I am working on a data analytics project analyzing some Spotify data. I am in the data cleaning phase, dealing with a string-valued column. Basically, I have a series of song names structured like this:

df_spot['name'].head(10)

0    Piano Concerto No. 3 in D Minor, Op. 30: III. ...
1                              Clancy Lowered the Boom
2                                            Gati Bali
3                                            Danny Boy
4                          When Irish Eyes Are Smiling
5                                         Gati Mardika
6                             The Wearing of the Green
7    Morceaux de fantaisie, Op. 3: No. 2, Prélude i...
8                          La Mañanita - Remasterizado
9                                    Il Etait Syndiqué
Name: name, dtype: object

What I want to do with this column is separate each name into individual words, take the 10 most frequent words, and replace the column with a binary value indicating whether the song name contains one of those top 10 words. (Of course, I will not consider words like "a", "are", "is", or numbers, etc.) To start, I made a new data frame containing only the track names, removed some unnecessary characters, and inserted commas as separators. Here is what I did.

df_words = df_words.str.replace(' ', ',')
df_words = df_words.str.replace('  ', ',')
df_words = df_words.str.replace('.', ',')
df_words = df_words.str.replace(':', ',')
df_words = df_words.str.replace('-', ',')
df_words = df_words.str.replace("'", ',')
df_words = df_words.str.replace('"', ',')
df_words = df_words.str.replace('?', ',')
df_words = df_words.str.replace('!', ',')
df_words = df_words.str.replace('(', ',')
df_words = df_words.str.replace(')', ',')
df_words = df_words.str.replace('[', ',')
df_words = df_words.str.replace(']', ',')
df_words = df_words.str.replace('&', ',')
df_words = df_words.str.replace('/', ',')
df_words = df_words.str.replace('1', ',')
df_words = df_words.str.replace('2', ',')
df_words = df_words.str.replace('3', ',')
df_words = df_words.str.replace('4', ',')
df_words = df_words.str.replace('5', ',')
df_words = df_words.str.replace('6', ',')
df_words = df_words.str.replace('7', ',')
df_words = df_words.str.replace('8', ',')
df_words = df_words.str.replace('9', ',')
df_words = df_words.str.replace('0', ',')

This approach ended up inserting multiple commas in a row, since a song name can contain several consecutive characters that get replaced. So here is my first question: is there a better approach to achieve what I am trying to do? And is there a less repetitive, more reproducible way to write the code above? My second question is, once all the words are properly separated by commas, how can I expand the words into separate list/vector elements so that I can count how many times each word appears in the data?

  • You want to tokenize. Check out some NLP resources first. Also take a look at the huggingface tokenizer library: https://huggingface.co/transformers/tokenizer_summary.html. Take a look at some of the standard NLP libraries: nltk, spacy, etc. – CypherX Jan 21 '21 at 01:16

2 Answers


You can use .isalpha() to check whether a character is a letter (.isalnum() checks whether it is a letter or a digit).

With that, you can split each string into words, iterate through the words, and keep only the characters that are letters:

import pandas as pd

df = pd.DataFrame({'song': ['Clancy Lowered the Boom', 'Gati Bali', 'Danny Boy',
                            'Piano Concerto No. 3 in D Minor, Op. 30: III.']})

# create column to add words later
df['words'] = None

# iterate through each song title
for i, song in enumerate(df['song']):

    # split the title into a list of words (items in between the spaces)
    words = song.split(" ")

    lettersOnly = []

    # iterate through each word in the song title
    for word in words:
        # only keep the characters that are letters
        lettersOnly += ["".join(char for char in word if char.isalpha())]

    # add the list to the correct dataframe cell (row, col)
    df.iloc[i, 1] = lettersOnly


# print(df)
'''
    song                                            words
0   Clancy Lowered the Boom                         [Clancy, Lowered, the, Boom]
1   Gati Bali                                       [Gati, Bali]
2   Danny Boy                                       [Danny, Boy]
3   Piano Concerto No. 3 in D Minor, Op. 30: III.   [Piano, Concerto, No, , in, D, Minor, Op, , III]
'''
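As a side note (not part of the original answer), the same extraction can be sketched in one vectorized call, assuming the titles live in a pandas Series as above; the regex here is my choice, matching runs of letters (including accented ones) and dropping digits and punctuation:

```python
import pandas as pd

# same sample titles as above
df = pd.DataFrame({'song': ['Clancy Lowered the Boom', 'Gati Bali', 'Danny Boy',
                            'Piano Concerto No. 3 in D Minor, Op. 30: III.']})

# [^\W\d_]+ matches word characters that are not digits or underscores,
# i.e. letters only; unlike the loop above, all-digit tokens like "3"
# simply vanish instead of leaving empty strings in the list
df['words'] = df['song'].str.findall(r'[^\W\d_]+')
```

This avoids the explicit loops entirely, at the cost of requiring some regex knowledge.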
– Colin

I would advise against creating another DataFrame, because it seems like unnecessary effort for this task. Instead, I would highly recommend Python's built-in collections.Counter data structure. It is highly optimized and a perfect fit for your task; you can read about it in the documentation.

A Counter object maintains a running tally of every element passed to it and can efficiently return the n most common entries with counter.most_common(n). It returns a list of tuples that looks like this:

[('Op.', 2), ('Gati', 2), ('Piano', 1)]
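As a minimal illustration of that behaviour (the sample words here are made up for demonstration):

```python
from collections import Counter

c = Counter()
# update() accepts any iterable and increments the tally for each element
c.update(['Gati', 'Piano'])
c.update(['Gati'])

# most_common(n) returns the n highest counts as (element, count) tuples
print(c.most_common(2))   # [('Gati', 2), ('Piano', 1)]
```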

Here is a sample implementation that achieves what you asked for using Counter. I have commented each line for your reference.

songs = ['Piano Concerto No. 3 in D Minor, Op. 30: III. ...', 'Clancy Lowered the Boom', 'Gati Bali', 'Danny Boy', 'When Irish Eyes Are Smiling', 'Gati Mardika', 'The Wearing of the Green', 'Morceaux de fantaisie, Op. 3: No. 2, Prélude i...', 'La Mañanita - Remasterizado', 'Il Etait Syndiqué']

#Importing the collection
from collections import Counter

#words to ignore
disregard_words = ["a", "are", "is", "1", "2", "","etc", "the", "No."]

word_counter = Counter()

for song in songs:
    #Splitting the song title into words using spaces as a delimiter and filtering out common words listed in the array above
    words = [word for word in song.split(" ") if word not in disregard_words]
    #Passing the list of words to the counter to keep count
    word_counter.update(words)

#Returns a list of tuples, which contain the word and the count
#Ex. ('Gati', 2), ('Piano', 1), ('Concerto', 1)
top_words = word_counter.most_common(10)

#We only need the words, not the number.
top_words = [word[0] for word in top_words]

#Function to check whether a song title contains any of the top words
def containTopWords(song):
    for word in top_words:
        #str.find returns the index of a substring if it is present, and -1 if the substring is not found
        if song.find(word) != -1:
            return True
    return False

#Uses the apply function to run the check on each row and return a new column with the results
#This column can be added to the existing df
df_spot["containTopWords"] = df_spot["name"].apply(containTopWords)

Essentially, this code loops through your rows, splits each song title into its individual words, and filters out the irrelevant ones. It passes the filtered words into the counter, which tallies them and returns the most common. Finally, it uses the pandas apply function to create a new column with a boolean indicating whether the song contains one of the most commonly used words.
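One caveat (my addition, not part of the answer above): str.find matches substrings, so a top word like "Boom" would also match "Boomerang". A sketch that matches whole tokens only, assuming the same top_words list produced above:

```python
# example top-10 list; assumed here for illustration
top_words = ['the', 'Gati']

def contains_top_word(song):
    # compare whole space-separated tokens instead of substrings
    return any(word in song.split(" ") for word in top_words)

print(contains_top_word('Gati Bali'))   # True
print(contains_top_word('Weathering'))  # False: 'the' is only a substring
```

Whether this matters depends on your data; for a quick binary feature, the substring version may be good enough.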