I have a defaultdict nested three levels deep that will be used later for trigram counts.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

Then I have a for loop that goes through a document and counts each letter, as well as each bigram and trigram:

counts[letter1][letter2][letter3] = counts[letter1][letter2][letter3] + 1

I want to add another layer so that I can specify whether each letter is a consonant or a vowel.

I want to compute my bigram and trigram counts over the two classes, consonant and vowel, instead of over every letter of the alphabet, but I do not know how to do this.

Nissa
Katie Tetzloff
  • can you provide your current code? – mitoRibo Feb 17 '17 at 00:22
  • I'm not sure I understand your question... how does not simply adding another "layer" to your defaultdict solve the problem? What exactly do you not know how to approach? – juanpa.arrivillaga Feb 17 '17 at 00:34
  • Good god, what did `+=` do to you that you hate it so much? Not using it (especially here) is both slower and ridiculously verbose/redundant compared to: `counts[letter1][letter2][letter3] += 1` – ShadowRanger Feb 17 '17 at 01:59

2 Answers

I'm not sure exactly what you want to do, but I think the nested dict approach is less clean than a flat dict keyed by the combined string of characters (i.e. d['ab'] instead of d['a']['b']). The code below also checks whether each bigram/trigram consists only of vowels, only of consonants, or a mixture of the two.

CODE:

from collections import defaultdict


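def all_ngrams(text, n):
    # Slide a window of length n over the text
    ngrams = [text[ind:ind+n] for ind in range(len(text)-(n-1))]
    # Drop windows that span a word boundary (contain a space)
    ngrams = [ngram for ngram in ngrams if ' ' not in ngram]
    return ngrams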


counts = defaultdict(int)
text = 'hi hello hi this is hii hello'
vowels = 'aeiouyAEIOUY'
consonants = 'bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ'

for n in [2, 3]:
    for ngram in all_ngrams(text, n):
        # Classify the n-gram by the kind of letters it contains
        if all(let in vowels for let in ngram):
            print(ngram+' is all vowels')

        elif all(let in consonants for let in ngram):
            print(ngram+' is all consonants')

        else:
            print(ngram+' is a mixture of vowels/consonants')

        # Bigrams and trigrams share one flat dict, keyed by the n-gram string
        counts[ngram] += 1

print(counts)

OUTPUT:

hi is a mixture of vowels/consonants
he is a mixture of vowels/consonants
el is a mixture of vowels/consonants
ll is all consonants
lo is a mixture of vowels/consonants
hi is a mixture of vowels/consonants
th is all consonants
hi is a mixture of vowels/consonants
is is a mixture of vowels/consonants
is is a mixture of vowels/consonants
hi is a mixture of vowels/consonants
ii is all vowels
he is a mixture of vowels/consonants
el is a mixture of vowels/consonants
ll is all consonants
lo is a mixture of vowels/consonants
hel is a mixture of vowels/consonants
ell is a mixture of vowels/consonants
llo is a mixture of vowels/consonants
thi is a mixture of vowels/consonants
his is a mixture of vowels/consonants
hii is a mixture of vowels/consonants
hel is a mixture of vowels/consonants
ell is a mixture of vowels/consonants
llo is a mixture of vowels/consonants
defaultdict(<type 'int'>, {'el': 2, 'his': 1, 'thi': 1, 'ell': 2, 'lo': 2, 'll': 2, 'ii': 1, 'hi': 4, 'llo': 2, 'th': 1, 'hel': 2, 'hii': 1, 'is': 2, 'he': 2})
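
If the goal is to count consonant/vowel patterns rather than the letters themselves, the same flat-dict idea can be keyed by a pattern string such as 'CV' or 'CVC'. The following is only a rough sketch: the helper names cv_pattern and cv_ngram_counts are made up for illustration, 'y' is treated as a vowel as in the code above, and the text is assumed to contain only letters and spaces.

```python
from collections import Counter

# Assumption: 'y' counts as a vowel, matching the vowels string used above.
VOWELS = set("aeiouy")


def cv_pattern(ngram):
    """Map each letter to 'V' (vowel) or 'C' (anything else), e.g. 'hel' -> 'CVC'."""
    return "".join("V" if ch.lower() in VOWELS else "C" for ch in ngram)


def cv_ngram_counts(text, n):
    """Count consonant/vowel patterns of length n, never crossing word boundaries."""
    counts = Counter()
    for word in text.split():
        # range() is empty when the word is shorter than n
        for i in range(len(word) - n + 1):
            counts[cv_pattern(word[i:i + n])] += 1
    return counts


text = 'hi hello hi this is hii hello'
bigram_counts = cv_ngram_counts(text, 2)
trigram_counts = cv_ngram_counts(text, 3)
print(bigram_counts)
print(trigram_counts)
```

With only two classes there are at most 4 bigram keys and 8 trigram keys, so the resulting tables stay small regardless of the size of the document.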
mitoRibo

Assuming that you need to keep counts for sequences of vowels and consonants, you could simply keep a separate map alongside the per-letter one.

If you have a function is_vowel(letter) that returns True if the letter is a vowel and False if it's a consonant, you could do this:

vc_counts[is_vowel(letter1)][is_vowel(letter2)][is_vowel(letter3)] += 1
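
Neither is_vowel nor vc_counts is defined above, so here is one possible way to fill them in. This is a minimal sketch under a few assumptions: 'y' is treated as a consonant, non-alphabetic characters are skipped, trigrams do not cross word boundaries, and the wrapper function count_vc_trigrams is a name invented for this example.

```python
from collections import defaultdict

# Assumption: 'y' is treated as a consonant; adjust the set if you need otherwise.
VOWELS = set("aeiou")


def is_vowel(letter):
    """Return True if the letter is a vowel, False otherwise."""
    return letter.lower() in VOWELS


def count_vc_trigrams(text):
    """Count trigrams by class; each nesting level is keyed by True (vowel) or False (consonant)."""
    vc_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for word in text.split():
        letters = [ch for ch in word if ch.isalpha()]
        # zip over three offset views yields every run of three consecutive letters
        for letter1, letter2, letter3 in zip(letters, letters[1:], letters[2:]):
            vc_counts[is_vowel(letter1)][is_vowel(letter2)][is_vowel(letter3)] += 1
    return vc_counts


vc_counts = count_vc_trigrams('hi hello hi this is hii hello')
# Number of consonant-vowel-consonant trigrams, e.g. 'hel' and 'his'
print(vc_counts[False][True][False])
```

Because the keys are booleans, the lookup reads as a pattern: vc_counts[False][True][False] is the consonant-vowel-consonant count. The existing per-letter counts map can be updated in the same loop if both tables are needed.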
cjungel