I have built a number of spiders that scrape news articles from different websites, and I have an API that converts the text to audio clips. I need a framework or Python tools to clean up the article text first. The cleanup includes:
- removing anything related to the source
- removing dates in any format
- removing URLs
- expanding acronyms, e.g. changing CEO to chief executive officer
- removing special characters and fixing typos
- making sure each sentence is still written correctly after all the edits
- using previously edited articles as a reference for new articles
I am using Python, NLTK, and re, but it's exhausting: each time I think I have covered all the cases, I find new ones to add, and I feel stuck in an infinite loop.
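To make the question concrete, here is a minimal sketch of what such a rule-based pipeline tends to look like. It is illustrative only and not the actual code: the acronym table, the regex patterns (which cover just two date shapes), and all function names are assumptions made up for this example.

```python
import re

# Hypothetical acronym table; a real one would be much larger.
ACRONYMS = {
    "CEO": "chief executive officer",
    "CFO": "chief financial officer",
}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Only two illustrative date shapes: 12/05/2021 (or 12-05-21) and "May 12, 2021".
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DATE_RE = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"
    rf"|\b{MONTHS}\s+\d{{1,2}},?\s+\d{{4}}\b"
)

# Whole-word match on any key of the acronym table.
ACRONYM_RE = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")

# Anything that is not a word character, whitespace, or basic punctuation.
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?'\"()-]")


def remove_urls(text):
    return URL_RE.sub(" ", text)


def remove_dates(text):
    return DATE_RE.sub(" ", text)


def expand_acronyms(text):
    return ACRONYM_RE.sub(lambda m: ACRONYMS[m.group(1)], text)


def remove_special_chars(text):
    return SPECIAL_RE.sub("", text)


def tidy_whitespace(text):
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return text.strip()


# Each rule is one step; the steps run in order.
STEPS = [remove_urls, remove_dates, expand_acronyms,
         remove_special_chars, tidy_whitespace]


def clean_article(text):
    for step in STEPS:
        text = step(text)
    return text


sample = "The CEO spoke on May 12, 2021. Read more at https://example.com/story"
print(clean_article(sample))
```

On this sample the sketch prints "The chief executive officer spoke on. Read more at". The URL, date, and acronym rules each work, but the result contains dangling fragments ("spoke on." and "Read more at"), which is exactly the kind of case that forces yet another rule.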
Any suggestions?