what you are doing wrong is, calling a string method (lower()
) for a list (in your case, data)
data = [line.strip() for line in open('corpus.txt', 'r')]
what you should do after getting lines as list entry is
texts = [[words for words in sentences.lower().split()] for sentences in data]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^*********^^^^^^^^^^^^^^^^^^^^^^*********^^^^
#you should call lower on iter. value - in our case it is "sentences"
this will give you list of lists. each list contains the lowercased words form lines.
$ tail -n 10 corpus.txt
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = [line.strip() for line in open('corpus.txt', 'r')]
>>> texts = [[words for words in sentences.lower().split()] for sentences in data]
>>> texts[:5]
[['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution']]
>>>
sure you can flatten or just keep as it is.
>>> flattened = reduce(lambda x,y: x+y, texts)
>>> flattened[:30]
['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a']
>>>