
I have a CSV file that looks like:

Lorem ipsum dolor sit amet , 12:01
consectetuer adipiscing elit, sed , 12:02

etc...

It is quite a large file (approx. 10,000 rows). I would like to get the total vocabulary size of all the rows of text together. That is, ignoring the second column (the time), lowercasing everything, and then counting the number of distinct words.

Issues: 1) how to separate each word within each row, and 2) how to lowercase everything and remove non-alphabetical characters.

So far I have the following code:

import csv
with open('/Users/file.csv', 'rb') as file:
    vocabulary = []
    i = 0
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        for word in row:
            if word in vocabulary:
                continue
            else:
                vocabulary.append(word)
                i = i +1
print i

Thank you for your help!

Julia

2 Answers


Python's csv module is a wonderful library, but using it for simpler tasks can be overkill. This particular case, to me, is a classic example where using the csv module may overcomplicate things.

To me,

  • just iterating through the file,
  • splitting each line on its last comma and keeping the first part,
  • then splitting that remaining text on whitespace,
  • converting each word to lower case,
  • stripping out all the punctuation and digits,
  • and collecting the results with a set comprehension

is a linear, straightforward approach.

An example run with the following file content:

Lorem Ipsum is simply dummy "text" of the ,0
printing and typesetting; industry. Lorem,1
 Ipsum has been the industry's standard ,2
dummy text ever since the 1500s, when an,3
 unknown printer took a galley of type and,4
 scrambled it to make a type specimen ,5
book. It has survived not only five ,6
centuries, but also the leap into electronic,7
typesetting, remaining essentially unch,8
anged. It was popularised in the 1960s with ,9
the release of Letraset sheets conta,10
ining Lorem Ipsum passages, and more rec,11
ently with desktop publishing software like,12
 !!Aldus PageMaker!! including versions of,13
Lorem Ipsum.,14

>>> from string import digits, punctuation
>>> remove_set = digits + punctuation
>>> with open("test.csv") as fin:
    words = {word.lower().strip(remove_set) for line in fin
         for word in line.rsplit(",",1)[0].split()}


>>> words
set(['and', 'pagemaker', 'passages', 'sheets', 'galley', 'text', 'is', 'in', 'it', 'anged', 'an', 'simply', 'type', 'electronic', 'was', 'publishing', 'also', 'unknown', 'make', 'since', 'when', 'scrambled', 'been', 'desktop', 'to', 'only', 'book', 'typesetting', 'rec', "industry's", 'has', 'ever', 'into', 'more', 'printer', 'centuries', 'dummy', 'with', 'specimen', 'took', 'but', 'standard', 'five', 'survived', 'leap', 'not', 'lorem', 'a', 'ipsum', 'essentially', 'unch', 'conta', 'like', 'ining', 'versions', 'of', 'industry', 'ently', 'remaining', 's', 'printing', 'letraset', 'popularised', 'release', 'including', 'the', 'aldus', 'software'])
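
Since the question ultimately asks for the vocabulary size rather than the words themselves, len(words) on the resulting set gives that number directly (68 for the sample run, going by the set just printed):

>>> len(words)
68
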
Abhijit
  • Thank you for your help, however sometimes there are commas within the line of text that I am interested in... – Julia Jan 21 '13 at 16:04
  • As Abhijit mentions, you are much better off using the set type to handle the deduplicating for you. The array method you're using is pretty slow, order n^2 I believe. At the least you could add the words as keys to a dictionary, which is much faster due to hashing. Sets are the way to go though as they do the same thing and are meant for this. – Binary Phile Jan 21 '13 at 16:17
  • I think you forgot the "remove non-alphabetical characters" part of what the OP wanted, although that, too, can be done without `re`. – martineau Jan 21 '13 at 16:29
  • @martineau: Yes, I missed the requirement to "remove non-alphabetical characters" and definitely using `str.strip`, this can be achieved. Also modified the code, to consider the fact that the text may have comma within. – Abhijit Jan 21 '13 at 16:51
  • +1 The use of `set`s makes it better than linear, although it's still a straight forward approach. ;-) – martineau Jan 21 '13 at 17:10
  • @martineau: I learned something new today `` – Abhijit Jan 21 '13 at 17:13
  • Yeah, the default syntax-highlighting is for Python code (because of the question's tag), which doesn't always make sense for _everything_ marked-up as code. – martineau Jan 21 '13 at 17:24

You have pretty much what you need. One missing point is lowercase conversion, which can simply be done with word.lower().

Another thing you're missing is splitting the text into words. You should use .split() for this task, which by default splits on every whitespace character, i.e., spaces, tabs, etc.
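
For example, combining the two on the question's first line of text (interactive session; time column already removed):

>>> "Lorem ipsum dolor sit amet".lower().split()
['lorem', 'ipsum', 'dolor', 'sit', 'amet']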

One problem you will have is distinguishing between commas within the text and the comma that separates the columns. Maybe don't use the csv reader, but simply read each line, remove the time, and then split the rest into words.

import re

with open('/Users/file.csv', 'rb') as file:
    for line in file:
        line = re.sub(" , [0-2][0-9]:[0-5][0-9]", "", line)  # remove the trailing " , HH:MM" time column
        line = re.sub(r'[,!.?"]', "", line)                   # remove punctuation characters
        words = [w.lower() for w in line.split()]
        for word in words:
            ...

If you want to remove other characters, include them in the second regular expression. If performance matters to you, you should compile the two regular expressions once, before the for loop.
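
A minimal sketch of that, assuming Python 2 as in the question (hence the 'rb' mode and the print statement) and reusing the two patterns from above; the variable names are only illustrative, and collecting the lowercased words into a set along the way also gives the distinct-word count the question asks for:

import re

time_re = re.compile(" , [0-2][0-9]:[0-5][0-9]")  # compiled once: the " , HH:MM" time column
punct_re = re.compile(r'[,!.?"]')                 # compiled once: punctuation to remove

vocabulary = set()                                # a set keeps each word only once
with open('/Users/file.csv', 'rb') as f:
    for line in f:
        line = time_re.sub("", line)              # strip the time column
        line = punct_re.sub("", line)             # strip punctuation
        vocabulary.update(w.lower() for w in line.split())

print len(vocabulary)                             # the vocabulary size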

Thorsten Kranz