Python's csv module is a wonderful library, but using it for a simpler task can be overkill.
This particular case, to me, is a classic example where using the csv module may overcomplicate things.
To me,
- just iterating through the file,
- splitting each line on the last comma and keeping the part before it,
- then splitting that part on whitespace,
- converting each word to lower case,
- stripping punctuation and digits from the ends of each word,
- and collecting the result in a set comprehension

is a linear, straightforward approach.
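The steps above can also be spelled out as an explicit loop, one step per line. This is only a sketch: the function name `unique_words` and the two-line in-memory sample are made up for illustration, and the `if word` check is an extra that the comprehension further down does not have.

```python
from string import digits, punctuation

def unique_words(lines):
    """Collect the distinct, normalised words found before the last comma of each line."""
    remove_set = digits + punctuation
    words = set()
    for line in lines:                     # iterate through the file (or any iterable of lines)
        text = line.rsplit(",", 1)[0]      # keep everything before the last comma
        for word in text.split():          # split on whitespace
            # lower-case, then trim digits/punctuation from both ends
            word = word.lower().strip(remove_set)
            if word:                       # assumption: skip tokens that end up empty
                words.add(word)
    return words

# Hypothetical sample: two lines in the same shape as the file shown below.
sample = ['Lorem Ipsum is simply dummy "text" of the ,0\n', "Lorem Ipsum.,14\n"]
print(sorted(unique_words(sample)))
# ['dummy', 'ipsum', 'is', 'lorem', 'of', 'simply', 'text', 'the']
```

Because the function takes any iterable of lines, an open file object can be passed to it directly.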
An example run with the following file content:
Lorem Ipsum is simply dummy "text" of the ,0
printing and typesetting; industry. Lorem,1
Ipsum has been the industry's standard ,2
dummy text ever since the 1500s, when an,3
unknown printer took a galley of type and,4
scrambled it to make a type specimen ,5
book. It has survived not only five ,6
centuries, but also the leap into electronic,7
typesetting, remaining essentially unch,8
anged. It was popularised in the 1960s with ,9
the release of Letraset sheets conta,10
ining Lorem Ipsum passages, and more rec,11
ently with desktop publishing software like,12
!!Aldus PageMaker!! including versions of,13
Lorem Ipsum.,14
>>> from string import digits, punctuation
>>> remove_set = digits + punctuation
>>> with open("test.csv") as fin:
...     words = {word.lower().strip(remove_set) for line in fin
...              for word in line.rsplit(",", 1)[0].split()}
...
>>> words
set(['and', 'pagemaker', 'passages', 'sheets', 'galley', 'text', 'is', 'in', 'it', 'anged', 'an', 'simply', 'type', 'electronic', 'was', 'publishing', 'also', 'unknown', 'make', 'since', 'when', 'scrambled', 'been', 'desktop', 'to', 'only', 'book', 'typesetting', 'rec', "industry's", 'has', 'ever', 'into', 'more', 'printer', 'centuries', 'dummy', 'with', 'specimen', 'took', 'but', 'standard', 'five', 'survived', 'leap', 'not', 'lorem', 'a', 'ipsum', 'essentially', 'unch', 'conta', 'like', 'ining', 'versions', 'of', 'industry', 'ently', 'remaining', 's', 'printing', 'letraset', 'popularised', 'release', 'including', 'the', 'aldus', 'software'])
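Note that `str.strip` only trims characters from the two ends of a word, which is why "industry's" keeps its apostrophe and "1500s," comes out as "s" in the result above. If you would rather delete digits and punctuation wherever they occur, something like the following might do. It is a sketch that assumes Python 3 (the three-argument `str.maketrans`), and `clean` and `DELETE_TABLE` are names I made up.

```python
from string import digits, punctuation

# Translation table that deletes every digit and punctuation character.
DELETE_TABLE = str.maketrans("", "", digits + punctuation)

def clean(word):
    """Lower-case a word and delete digits/punctuation anywhere inside it."""
    return word.lower().translate(DELETE_TABLE)

print(clean("industry's"))                                # industrys
print("industry's".lower().strip(digits + punctuation))   # industry's
print(repr(clean("1500,")))                               # ''
```

With this variant a token made up only of digits and punctuation becomes an empty string, so you would probably want to filter those out before building the set.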