2

I accidentally forgot to include a return statement in my function, and it just finished running after 10 hours, so I do not want to run it again. Is there a way I can access the wordlist inside this function?

def rm_foreign_chars(corpus):
    wordlist=[]
    for text in corpus:
        for sentence in sent_tokenize(text):
            wordlist.append(sentence)
            for word in word_tokenize(sentence):
                for c in symbols:

                    if c in word:
                        if sentence in wordlist:
                            wordlist.remove(sentence)
                            break

symbols is a string of symbols: symbols = '฿‑‒–—―‖†‡•‰⁰⁷⁸₂₣℃™→↔∆∙≤⋅─■□▪►▼●◦◾★☎☺♣♦✓✔❖❗➡⠀ⱻ�ₒ'

mojbius
  • 3
    Nope. It's gone. – user2357112 Nov 19 '20 at 04:13
  • 1
    Fix your bug and start rerunning. – user2357112 Nov 19 '20 at 04:13
  • 1
    rip list, this is why you test your functions before implementation! – Ironkey Nov 19 '20 at 04:14
  • 2
    Once the function completes, local variables are garbage collected, so probably not. For future reruns, instead of adding every sentence to the list and then removing it if it contains some symbol in any of its words, why not simply write every sentence that doesn't have the symbol in it to some file? No risk of losing results, even halfway through and less complication, so it should be a lot faster. – Grismar Nov 19 '20 at 04:15
  • 3
    You've got a really convoluted and inefficient way of filtering out sentences with symbols. It's doing a lot of unnecessary work. Clean that up, and the next run should be a lot faster than 10 hours. – user2357112 Nov 19 '20 at 04:16
  • 1
    To speed it up, your `break` should actually break out of the second `for` too, so you'll need to add an extra flag for that. Also, it would be better to only add the sentence once the tests are cleared, rather than adding then deleting. – Ken Y-N Nov 19 '20 at 04:19
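
The suggestions from Grismar and Ken Y-N above could be sketched roughly as follows. This is a minimal sketch, not the asker's code: `naive_sentences`, `has_symbol`, and `write_clean_sentences` are hypothetical names, the regex splitter is only a stand-in for NLTK's `sent_tokenize`, the symbol string is shortened, and the output path is illustrative. Each sentence is checked first and written to a file only if it passes, so partial results survive a crash.

```python
import re

# Shortened stand-in for the question's `symbols` string (assumption).
SYMBOLS = "฿†‡•™→★"


def naive_sentences(text):
    """Very rough sentence splitter; a stand-in for nltk's sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def has_symbol(sentence, symbols=SYMBOLS):
    """True if any unwanted symbol occurs anywhere in the sentence."""
    return any(c in sentence for c in symbols)


def write_clean_sentences(corpus, out_path, split=naive_sentences):
    """Write each symbol-free sentence to out_path as soon as it is found."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in corpus:
            for sentence in split(text):
                if not has_symbol(sentence):
                    out.write(sentence + "\n")
                    out.flush()  # hand the line to the OS right away
                    kept += 1
    return kept
```

Nothing is ever added and then removed, so there is no list to search, and the file on disk holds everything found up to the point of any failure.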

2 Answers

4

Unfortunately, there is no way to access wordlist from outside the function without resorting to some really hacky methods and munging around in memory. Instead, we can focus on making your function faster. This is what I came up with:

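from nltk.tokenize import sent_tokenize, word_tokenize

def rm_foreign_chars(corpus):
    wordlist=[]
    for text in corpus:
        for sentence in sent_tokenize(text):
            # keep the sentence only if none of its words contains a symbol
            if not any(c in word for word in word_tokenize(sentence) for c in symbols):
                wordlist.append(sentence)
    return wordlist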

You can also make wordlist a global variable. The only reason I suggest this is how long the function runs (27 minutes is still a long time): if the function fails before completion, you can still get something from wordlist.

def rm_foreign_chars(corpus):
    global wordlist
    for text in corpus:
        for sentence in sent_tokenize(text):
            if not any(c in word for word in word_tokenize(sentence) for c in symbols):
                wordlist.append(sentence)
    return wordlist

wordlist=[]

rm_foreign_chars(...)
# use wordlist here
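
A variation on the same idea that avoids `global` is to let the caller own the list and pass it in, so whatever was collected before a failure stays reachable. This is only a sketch: `collect_clean`, its `out` parameter, and `toy_split` are hypothetical, with `toy_split` standing in for `sent_tokenize`, and the check looks at the whole sentence rather than at each token (a simplification).

```python
def collect_clean(corpus, out, split, symbols):
    """Append symbol-free sentences to the caller-owned list `out`."""
    for text in corpus:
        for sentence in split(text):
            # Simplification: test the whole sentence, not each token.
            if not any(c in sentence for c in symbols):
                out.append(sentence)
    return out


def toy_split(text):
    """Toy stand-in for sent_tokenize: one sentence per line."""
    if not isinstance(text, str):
        raise TypeError("corpus items must be strings")
    return text.splitlines()


results = []
try:
    # The third item is deliberately bad, to simulate a failure mid-run.
    collect_clean(["good one\nbad ™ one", "also fine", None], results,
                  toy_split, "™")
except TypeError:
    pass  # the run failed, but `results` still holds the partial output

print(results)
```

Here `results` still contains the two clean sentences gathered before the simulated failure.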
smac89
  • 1
    thank you, it went from 10 hours to 27 minutes! was the overhead caused by the remove() function? – mojbius Nov 19 '20 at 06:36
  • @mojbius probably a combination of the remove and `if sentence in wordlist`. Glad it worked out well for you – smac89 Nov 19 '20 at 08:10
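
To expand on the exchange above: in the original loop, both `sentence in wordlist` and `wordlist.remove(sentence)` scan the list from the start, so their cost grows with every sentence kept, whereas the rewritten version only appends. A further simplification, offered here as an assumption and not as part of the answer, is to test the sentence's characters against a set of symbols directly, on the premise that a symbol inside any token also appears in the sentence itself. `SYMBOL_SET` and `is_clean` are hypothetical names, and the symbol string is shortened.

```python
# Shortened stand-in for the question's `symbols` string (assumption).
SYMBOL_SET = frozenset("฿†‡•™→★")


def is_clean(sentence, symbol_set=SYMBOL_SET):
    """True if the sentence shares no character with symbol_set."""
    # isdisjoint walks the sentence once; no word tokenizing is needed.
    return symbol_set.isdisjoint(sentence)


sentences = ["Plain text.", "Costs 5฿.", "Arrow → here", "Still fine!"]
clean = [s for s in sentences if is_clean(s)]
print(clean)
```

Whether this matches the tokenized check exactly depends on the tokenizer leaving those characters unchanged, so it is worth verifying on a small sample first.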
2

There is no way to do this without returning the list. An alternative would be to create a class that contains the function and stores the list as an attribute of self.

class Characters:
    def __init__(self, corpus):
        self.corpus = corpus
        self.wordlist = []

    def foreign_chars(self):
        pass
        # Function code goes here
        # Be sure to replace corpus and wordlist
        # With their respective self attributes

chars = Characters(corpus)  # __init__ requires the corpus argument
chars.foreign_chars()
words = chars.wordlist

Do refer to the other answers and comments to optimize your code.
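
One way the placeholder method might be filled in, as a sketch under assumptions: the `symbols` and `split` constructor parameters are hypothetical additions (the class above takes only `corpus`), and `split` defaults to a toy line splitter in place of `sent_tokenize`. Because results accumulate on `self.wordlist`, they remain available even though the method returns nothing.

```python
class Characters:
    def __init__(self, corpus, symbols, split=str.splitlines):
        self.corpus = corpus
        self.symbols = symbols
        self.split = split  # stand-in for nltk's sent_tokenize
        self.wordlist = []

    def foreign_chars(self):
        """Collect symbol-free sentences on self.wordlist; returns None."""
        for text in self.corpus:
            for sentence in self.split(text):
                if not any(c in sentence for c in self.symbols):
                    self.wordlist.append(sentence)


chars = Characters(["first line\nsecond ™ line", "third line"], symbols="™•")
chars.foreign_chars()
words = chars.wordlist
print(words)
```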

Jacob Lee