
Given a vocabulary dict {'A': 3, 'B': 4, 'C': 5, 'AB': 6} and a sentence that should be segmented: ABCAB.

I need to create all possible segmentations of this sentence, such as [['A', 'B', 'C', 'A', 'B'], ['A', 'B', 'C', 'AB'], ['AB', 'C', 'AB'], ['AB', 'C', 'A', 'B']].

That's what I have:

def find_words(sentence):   
    for i in range(len(sentence)):

        for word_length in range(1, max_word_length + 1):

            word = sentence[i:i+word_length]
            print(word)

            if word not in test_dict:
                continue

            if i + word_length <= len(sentence):
                if word.startswith(sentence[0]) and word not in words and word not in ''.join(words):
                    words.append(word)
                else:
                    continue

                next_position = i + word_length

                if next_position >= len(sentence):
                    continue
                else:
                    find_ngrams(sentence[next_position:])

    return words

But it returns only one list.

I was also looking for something useful in itertools but couldn't find anything obviously applicable. Might've missed it, though.


3 Answers


Try all possible prefixes and recursively do the same for the rest of the sentence.

VOC = {'A', 'B', 'C', 'AB'}  # could be a dict

def parse(snt):
    if snt == '': 
        yield []
    for w in VOC:
        if snt.startswith(w):
            for rest in parse(snt[len(w):]):
                yield [w] + rest

print(list(parse('ABCAB')))

# [['AB', 'C', 'AB'], ['AB', 'C', 'A', 'B'],
# ['A', 'B', 'C', 'AB'], ['A', 'B', 'C', 'A', 'B']]
  • I have a new problem: I have a huge dict (6MB) and thousands of sentences (30MB) which should be parsed. Do you think this method will need long time to process? Cuz I waited for over 8h yesterday and it was still not finished. @VPfB – muc777 May 13 '18 at 07:20
  • @Y.River Some optimisation is surely possible, but otherwise I don't know of a different, more efficient approach. I'd suggest measuring the time for a few average sentences; then you could estimate the time needed to process thousands of sentences. You could also add some kind of counter to monitor the progress. – VPfB May 13 '18 at 12:47
  • Yep, i will try it. Thank u! – muc777 May 13 '18 at 20:31
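Regarding the performance question in the comments: one optimisation worth trying is memoising the recursion, since every suffix of the sentence is otherwise re-parsed once per path that reaches it. A minimal sketch (assuming the same small vocabulary; results are returned as tuples so they can be cached):

```python
from functools import lru_cache

VOC = {'A', 'B', 'C', 'AB'}

@lru_cache(maxsize=None)
def parse_memo(snt):
    """Return every segmentation of snt as a tuple of word-tuples."""
    results = []
    if snt == '':
        results.append(())  # one way to segment the empty string
    for w in VOC:
        if snt.startswith(w):
            # Each segmentation of the remainder extends to one of snt.
            for rest in parse_memo(snt[len(w):]):
                results.append((w,) + rest)
    return tuple(results)

print(parse_memo('ABCAB'))
```

Note that memoisation only reduces repeated work; when a sentence genuinely has exponentially many segmentations, listing them all is unavoidably slow, so measuring on a few average sentences (as suggested above) is still the right first step.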

Although not the most efficient solution, this should work:

from itertools import product

dic = {'A': 3, 'B': 4, 'C': 5, 'AB': 6}
sentence = 'ABCAB'
choices = list(dic.keys())
prod = []

# A segmentation can use at most len(sentence) words.
for a in range(1, len(sentence) + 1):
    prod += list(product(choices, repeat=a))

# Keep only the word sequences that spell out the sentence.
result = list(filter(lambda x: ''.join(x) == sentence, prod))
print(result)

# prints [('AB', 'C', 'AB'), ('A', 'B', 'C', 'AB'), ('AB', 'C', 'A', 'B'), ('A', 'B', 'C', 'A', 'B')]
  • Thank you, but what I need is to segment a sentence. The dictionary will be quite large and offers many different words, only some of which will be used. @ninesalt – muc777 May 13 '18 at 10:25

Use itertools.permutations to generate all unique orderings of the words.

import itertools

d = {'A': 3, 'B': 4, 'C': 5, 'AB': 6}

l = list(d.keys())

print(list(itertools.permutations(l)))

[('A', 'B', 'C', 'AB'), ('A', 'B', 'AB', 'C'), ('A', 'C', 'B', 'AB'), ('A', 'C', 'AB', 'B'), ('A', 'AB', 'B', 'C'), ('A', 'AB', 'C', 'B'), ('B', 'A', 'C', 'AB'), ('B', 'A', 'AB', 'C'), ('B', 'C', 'A', 'AB'), ('B', 'C', 'AB', 'A'), ('B', 'AB', 'A', 'C'), ('B', 'AB', 'C', 'A'), ('C', 'A', 'B', 'AB'), ('C', 'A', 'AB', 'B'), ('C', 'B', 'A', 'AB'), ('C', 'B', 'AB', 'A'), ('C', 'AB', 'A', 'B'), ('C', 'AB', 'B', 'A'), ('AB', 'A', 'B', 'C'), ('AB', 'A', 'C', 'B'), ('AB', 'B', 'A', 'C'), ('AB', 'B', 'C', 'A'), ('AB', 'C', 'A', 'B'), ('AB', 'C', 'B', 'A')]
