
I have a UTF-8 encoded Unicode text file (non-English), shown below.

unicode textfile

So I declared the encoding as UTF-8 in my Python script and read the file into Python.

# -*- coding: utf-8 -*-

I tokenized the text into sentences on "." and got a list of sentences.

sentence list

Now I need to compare each sentence against another Unicode word list and find out whether any of those words appear in it.

This is my code, but it only shows the first match.

for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print sentence

EDIT:

Finally, I noticed there is an invalid Unicode character in the source text files. It is described here: Tokenizing unicode using nltk
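
For anyone hitting the same symptom, here is a minimal sketch of how one stray invisible character can hide a match. It assumes the offending character is something like a byte-order mark (U+FEFF) or a zero-width space (U+200B) glued to a word; the character in my files may differ. The sample data and the helper name `strip_invisible` are made up for illustration, and the code is written for Python 3.

```python
# Characters to drop. This set is an assumption; adjust it to whatever
# actually appears in your data. Zero-width joiners (U+200C, U+200D) are
# deliberately left alone because some scripts rely on them.
INVISIBLE = {"\ufeff", "\u200b"}

def strip_invisible(text):
    """Remove the characters listed in INVISIBLE from text."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

pronouns = {"aa", "bb"}
# The first sentence starts with a byte-order mark,
# the third with a zero-width space.
sentences = ["\ufeffaa xx yy", "bb zz", "\u200baa ww"]

# Without cleaning, "\ufeffaa" is not equal to "aa", so it is missed.
raw_matches = [s for s in sentences if any(w in pronouns for w in s.split())]

# After cleaning, every sentence starts with a recognisable pronoun.
clean_matches = [
    s for s in map(strip_invisible, sentences)
    if any(w in pronouns for w in s.split())
]

print(len(raw_matches), len(clean_matches))
```

Only the untouched sentence matches before cleaning, which looks exactly like "only the first match is identified".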

ChamingaD
  • There is nothing obvious in what you are showing us. Have you confirmed how many items are in sentences? Is there more than one pronoun in the haystack? – jwpfox Jul 15 '13 at 18:14
  • As you can see, there are 6 sentences and a set of pronouns. Each of those sentences starts with one pronoun from the list, so it is supposed to show all the sentences one by one. – ChamingaD Jul 15 '13 at 18:25
  • You are probably better off just using `sentence.split()`. Some alphabets may have whitespace characters that do not match `" "`. – llb Jul 15 '13 at 18:27
  • Also note that three sentences start with the same pronoun. – ChamingaD Jul 15 '13 at 18:43
  • Sorry, I am not being clear enough. You have a list named 'sentences' and you think it has 6 items in it because your eyeballs tell you there are 6 sentences in your data. But perhaps the program is not seeing things the same way your eyes are seeing things. If you try `print('len(sentences) = ' + str(len(sentences)))`, what is the output? Do you actually have the 6 sentences you expect in the list? I am touching on the same issues of encoding and the like that @KarTo is, wisely, suggesting need to be explored as a possible source of your problems. – jwpfox Jul 16 '13 at 04:51

1 Answer


I tried to reproduce your problem, but I get the expected result. Maybe the problem is in the encoding or in your list of pronouns.

pronouns = ['aa','bb','cc']

sentences = ['aa dkdje asdf aesr','bb asersada','cc ase aser sa sa c ','aa saef sf se s', 'aa','bb']

for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print (sentence)

The output of the code was:

aa dkdje asdf aesr
bb asersada
cc ase aser sa sa c 
aa saef sf se s
aa
bb
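
If your real data behaves differently, one way to test the encoding theory is to print the escaped form of the words in every sentence that fails to match. The following is only a sketch under assumptions: it targets Python 3, the sample data is invented, and `find_matches` is a hypothetical helper, not something from your code. It also uses `split()` with no argument, as suggested in the comments, so that any Unicode whitespace separates words.

```python
def find_matches(sentences, pronouns):
    """Return the sentences that contain at least one word from pronouns."""
    wanted = set(pronouns)
    return [s for s in sentences if any(w in wanted for w in s.split())]

pronouns = ['aa', 'bb', 'cc']
# The second sentence uses a no-break space (U+00A0) instead of " ";
# the third starts with a byte-order mark (U+FEFF).
sentences = ['aa dkdje asdf', 'bb\u00a0asersada', '\ufeffcc ase']

matched = find_matches(sentences, pronouns)

for sentence in sentences:
    if sentence not in matched:
        # ascii() escapes every non-ASCII character, so hidden ones show up.
        print(ascii(sentence.split()))
```

Here the no-break space is handled by `split()`, while the sentence with the byte-order mark is reported with the hidden character visible as an escape in its first word.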

Hope this is helpful.

KarTo