-2

I'm trying to search two lists for common strings. The first list is a file with text, while the second is a list of words, each preceded by its logarithmic probability. To match, a word not only needs to be in both lists but must also have a certain minimal log probability (for instance, between -2,123456 and 0,000000; that is, from negative 2 increasing up to 0). The tab-separated list can look like:

-0.962890   dog
-1.152454   lol
-2.050454   cat


I got stuck doing something like this:

common = []
for i in list1:
    if i in list2 and re.search("\-[0-1]\.[\d]+", list2):
        common.append(i)


The idea to simply preprocess the list to remove lines under a certain threshold is valid of course, but since both the word and its probability are on the same line, isn’t a condition also possible? (Regexps aren’t necessary, but for comparison, solutions both with and without them would be interesting.)

EDIT: own answer to this question below.

Россарх
  • 2
    How do you read the second list? Is it one list of strings? If so, you should split it (for example into two lists, or a list of tuples (log, str)) – Mel Dec 20 '17 at 08:37
  • 2
    Also you should use `and` instead of `&` as the latter is a bitwise operator. – mij Dec 20 '17 at 08:39
  • 2
    Please show an example input and expected output; it's hard to see what exactly you're looking for here. – RoadRunner Dec 20 '17 at 08:44
  • 1
    you mean `-2.123456` and `0.0` right? not `,` as in *separator comma*. – Ma0 Dec 20 '17 at 08:45
  • 1
    Why don't you first edit the second list to remove all those entries that have probabilities outside the desired range and then simply search for common entries? Simplify your life (and in this case, the code would be more efficient too). – Ma0 Dec 20 '17 at 08:47
  • 2
    @Ev.Kounis some countries use `,` as a decimal separator instead of `.`, not everyone is from the US – Mel Dec 20 '17 at 08:47
  • 3
    @Mel How about we stick to whatever Python is using to understand us all better? In Python `-2,123456` is a tuple no matter where in the world the code is typed. – Ma0 Dec 20 '17 at 08:49

3 Answers

1

Assuming your list, my_list, contains strings such as "-0.744342 dog", and my_word_list is a list of strings, how about this:

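import re

THRESHOLD = -2.123456  # example lower bound, taken from the question

# Dictionary keyed by word, so each membership test is O(1) on average
worddict = dict(map(lambda x: (x, True), my_word_list))
for item in my_list:
    # strip() removes the trailing newline, which would otherwise yield an empty third part
    parts = re.split(r"\s+", item.strip())
    if len(parts) != 2:
        raise ValueError("Wrong data format")
    logvalue, word = float(parts[0]), parts[1]
    if logvalue > THRESHOLD and word in worddict:
        print("Hurrah, a match!")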

Note the first line, which makes a dictionary out of your list of words. If you don't do this, you'll waste a lot of time doing linear searches through your word list, causing your algorithm to have a time complexity of O(n*m), while the solution I propose is much closer to O(n+m), n being the number of lines in my_list, m being the number of words in my_word_list.
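
The same idea can be sketched in a self-contained form, using the sample lines from the question. The function name `find_matches`, the sample word list, and the threshold of -2.0 are assumptions made for illustration, and a `set` stands in for the dictionary since only membership matters here.

```python
def find_matches(prob_lines, words, threshold):
    """Return the words from prob_lines that exceed threshold and also occur in words."""
    wanted = set(words)  # hashed lookup: O(1) on average per membership test
    matches = []
    for line in prob_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: skip blank or malformed lines instead of raising
        logvalue, word = float(parts[0]), parts[1]
        if logvalue > threshold and word in wanted:
            matches.append(word)
    return matches


# Sample entries copied from the question (tab-separated)
sample = ["-0.962890\tdog", "-1.152454\tlol", "-2.050454\tcat"]
print(find_matches(sample, ["cat", "dog", "fish"], -2.0))
```

Building the set costs one pass over the word list and the loop costs one pass over the probability lines, which is where the roughly O(n+m) behaviour comes from.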

Pascal
1

Here's my solution without using regex. First create a dictionary of the words that are within the accepted range, then check whether each word in the text is in the dict.

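word_dict = {}

with open('probability.txt', 'r') as prob, open('text.txt', 'r') as textfile:
    # Keep only the words whose log probability lies inside the accepted range
    for line in prob:
        fields = line.split()
        # Skip blank or malformed lines instead of failing on fields[0]
        if len(fields) == 2 and -2.123456 < float(fields[0]) < 0:
            word_dict[fields[1]] = fields[0]

    # Report every word of the text that passed the filter
    for line in textfile:
        for word in line.split():
            if word in word_dict:
                print('MATCH, {}: {}'.format(word, word_dict[word]))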
mij
0

Answering my own question after hours of trial and error, and after reading tips from here and there. It turns out I was thinking in the right direction from the start, but needed to separate word detection from pattern matching, and instead combine the latter with log probability checking. The approach is to create a temporary list of the items with the needed log probability, and then compare that against the text file.

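    import re

    common = []
    prob = []
    loga, rithmus = -9.87, -0.01  # lower and upper log-probability bounds

    # Match whole "<log prob> <word>" entries, not just the number,
    # so that the word is still available after the range check
    for i in re.findall(r"-\d\.\d+\s+\S+", list2):
        value, word = i.split()
        if loga < float(value) < rithmus:
            prob.append(word)

    # Compare whole words, not substrings of one joined string
    for i in list1:
        if i in prob:
            common.append(i)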
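
For comparison, here is a minimal sketch of a regex variant that captures the number and the word together. It assumes `list2` holds the raw text of the probability file and `list1` is a list of words; the helper name `words_in_range`, the sample data, and the bounds are made up for illustration. The pattern only matches negative values, so an entry of exactly 0.000000 would need a different pattern.

```python
import re

# One entry per line: a negative decimal, whitespace, then the word
ENTRY = re.compile(r"^(-\d+\.\d+)\s+(\S+)$", re.MULTILINE)


def words_in_range(text, low, high):
    """Return the set of words whose log probability lies strictly between low and high."""
    # With two groups, findall yields (value, word) tuples
    return {word for value, word in ENTRY.findall(text) if low < float(value) < high}


# Hypothetical inputs built from the sample in the question
list2 = "-0.962890\tdog\n-1.152454\tlol\n-2.050454\tcat\n"
list1 = ["the", "dog", "chased", "the", "cat"]

allowed = words_in_range(list2, -2.0, -0.01)
common = [w for w in list1 if w in allowed]
print(common)
```

Keeping the allowed words in a set avoids substring matches (such as "do" inside "dog") and keeps each lookup cheap.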
Россарх