read from txt file and divide words

Question

I would like to create a program in Python that reads a txt file given as input by the user. Then I would like the program to separate the words as shown in the example below:

At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties.

At the time
the time of
time of his
of his accession
his accession the ...

I also want the program to save these in a different file. Any ideas?

What part of writing a small program for this are you struggling with? — Zac Taylor, Jan 26 '19 at 20:53
Your question is not clear. "Any ideas?" is far too vague. Please be more specific. Also, what work have you done on this problem so far, and just where are you stuck? This information helps others identify where you have difficulties and helps them write answers appropriate to your experience level. You also need to state exactly what the difficulty is, what you expected, what you got, and any error messages. — Rory Daulton, Jan 26 '19 at 21:07

motyzk · Answer 1 · 2019-01-27T01:17:38.433

You did not specify the format in which you want to save the text to the other file. Assuming you want one triplet per line, this would do:

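def only_letters(word):
    # Keep ASCII letters only; punctuation, digits and accented characters are dropped.
    return ''.join(c for c in word if 'a' <= c <= 'z' or 'A' <= c <= 'Z')

with open('input.txt') as f, open('output.txt', 'w') as w:
    s = f.read()
    # Skip tokens that become empty once non-letters are stripped (e.g. "1772").
    words = [cleaned for cleaned in map(only_letters, s.split()) if cleaned]
    # Every run of three consecutive words, in order.
    triplets = [words[i:i + 3] for i in range(len(words) - 2)]
    for triplet in triplets:
        w.write(' '.join(triplet) + '\n')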
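If the window size or the file names need to vary, the same idea can be wrapped in small functions. This is only a sketch under a few assumptions: the names `ngrams`, `clean` and `write_ngrams` are made up for illustration, the files are assumed to be UTF-8, and punctuation is trimmed from the ends of each token with `str.strip` instead of filtering to ASCII letters, so accented letters and digits survive.

```python
import string

def ngrams(words, n=3):
    """Return every run of n consecutive words, in order."""
    # range() is empty when there are fewer than n words, so short input is safe.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def clean(text):
    """Split on whitespace and trim punctuation from both ends of each token."""
    tokens = (token.strip(string.punctuation) for token in text.split())
    return [token for token in tokens if token]  # drop tokens that were only punctuation

def write_ngrams(in_path, out_path, n=3):
    """Write one n-gram per line (hypothetical helper; paths are up to the caller)."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for gram in ngrams(clean(src.read()), n):
            dst.write(" ".join(gram) + "\n")
```

Calling `write_ngrams("input.txt", "output.txt")` should behave like the snippet above for plain English text; passing `n=2` would give pairs instead of triplets.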

score 0 · Accepted Answer · answered Jan 26 '19 at 21:51

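You can try this. Note that it raises StopIteration if the file has fewer than two words, and writes nothing if it has exactly two.

def get_words():
    # Yield every word in the file with commas and periods removed.
    with open("file.txt", "r") as f:
        # split() with no argument handles newlines and repeated spaces.
        for word in f.read().split():
            yield word.replace(",", "").replace(".", "")

with open("output.txt", "w") as f:
    it = get_words()
    # Prime the window with a placeholder plus the first two words;
    # the placeholder is dropped as soon as the third word arrives.
    current = [""] + [next(it) for _ in range(2)]
    for word in it:
        current = current[1:] + [word]
        f.write(" ".join(current) + "\n")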
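One way to avoid the short-input failure is to let `collections.deque` manage the window. The sketch below is an illustration, not part of the original answer: `sliding_windows` and `write_windows` are hypothetical names, and the default file names are simply the ones used above. It reads the file line by line, so windows can span line breaks and the whole file never has to be in memory.

```python
from collections import deque

def sliding_windows(lines, size=3):
    """Yield each run of `size` consecutive words across all lines."""
    window = deque(maxlen=size)  # the oldest word falls out automatically
    for line in lines:
        for raw in line.split():
            word = raw.replace(",", "").replace(".", "")
            if not word:
                continue  # the token was only punctuation
            window.append(word)
            if len(window) == size:
                yield " ".join(window)

def write_windows(in_path="file.txt", out_path="output.txt", size=3):
    """Write one window per line; nothing is written for very short input."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for window in sliding_windows(src, size):
            dst.write(window + "\n")
```

With fewer than `size` words the generator simply yields nothing, so no exception handling is needed.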

score 0 · Answer 3 · answered Jan 26 '19 at 23:05

My understanding is that you want to generate n-grams, which is a common step in text vectorization for NLP. Here is a simple implementation:

from sklearn.feature_extraction.text import CountVectorizer

string = ["At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties."]
# Change ngram_range to get n-grams of other lengths, e.g. (2, 3) for bigrams and trigrams.
# stop_words is left at its default (None) so that words like "the" and "of" are kept.
vectorizer = CountVectorizer(encoding='utf-8', ngram_range=(3, 3))

X = vectorizer.fit_transform(string)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())

This gives you a list of n-grams of length 3, lowercased and sorted alphabetically, so the original order is lost:

['accession the swedish', 'at the time', 'between rival parties', 'bitterly divided between', 'but was bitterly', 'divided between rival', 'held more power', 'his accession the', 'monarchy but was', 'more power than', 'of his accession', 'power than the', 'riksdag held more', 'swedish riksdag held', 'than the monarchy', 'the monarchy but', 'the swedish riksdag', 'the time of', 'time of his', 'was bitterly divided']
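If the original order matters, a plain-Python version can produce the same kind of n-grams in the order they first appear. This is a sketch that does not use scikit-learn: `ordered_ngrams` is a hypothetical name, and the regular expression is only an approximation of the vectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import re

def ordered_ngrams(text, n=3):
    """Lowercase, tokenize, and return unique n-grams in first-seen order."""
    # Assumption: tokens are runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(grams))
```

Writing the result to a file is then a matter of joining the list with newlines.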
