6

I have a list of 10k words in a text file like so:

G15 KDN C30A Action Standard Air Brush Air Dilution

I am trying to convert them into lowercase tokens with this code, for subsequent processing with Gensim:

data = [line.strip() for line in open("C:\corpus\TermList.txt", 'r')]
texts = [[word for word in data.lower().split()] for word in data]

and I get the following traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-84-33bbe380449e> in <module>()
      1 data = [line.strip() for line in open("C:\corpus\TermList.txt", 'r')]
----> 2 texts = [[word for word in data.lower().split()] for word in data]
      3 
AttributeError: 'list' object has no attribute 'lower'

Any suggestions on what I am doing wrong and how to correct it would be greatly appreciated!!! Thank you!!

tom

4 Answers

21

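Try:

# Raw string so the backslashes in the Windows path are taken literally
with open(r"C:\corpus\TermList.txt", 'r') as f:
    data = [line.strip() for line in f]
texts = [[word.lower() for word in text.split()] for text in data]

You were trying to apply .lower() to data, which is a list.
.lower() is a string method, so it has to be called on each string inside the list.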

epattaro
    Thank you!!! It worked perfectly. Now I understand what I was doing wrong. I am new to python. – tom Jan 24 '17 at 13:41
2

You need

texts = [[word.lower() for word in line.split()] for line in data]

For each line in data (the outer [... for line in data]), this code builds a list of lowercase words (the inner [word.lower() for word in line.split()]). Each line is a str holding a sequence of space-separated words. line.split() turns that sequence into a list, and word.lower() converts each word to lowercase.
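
The nested comprehension can also be unrolled into explicit loops, which may make the roles of line and word easier to see. This is only a sketch: sample_data is a made-up stand-in for the lines read from the file, not the asker's actual data.

```python
# Hypothetical stand-in for the stripped lines read from the file.
sample_data = ["G15 KDN C30A Action Standard", "Air Brush Air Dilution"]

texts = []
for line in sample_data:              # each line is a str
    tokens = []
    for word in line.split():         # split() breaks the line on whitespace
        tokens.append(word.lower())   # lower() is called on a str, not on the list
    texts.append(tokens)

print(texts)
```

The result has the same shape as the one-line version: one inner list of tokens per input line.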

kvorobiev
0

What you are doing wrong is calling a string method (lower()) on a list (in your case, data).

data = [line.strip() for line in open('corpus.txt', 'r')]

Once you have the lines as list entries, what you should do is:

texts = [[words for words in sentences.lower().split()] for sentences in data]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^*********^^^^^^^^^^^^^^^^^^^^^^*********^^^^
# call lower() on the iteration variable, which in our case is "sentences"

This gives you a list of lists. Each inner list contains the lowercased words from one line.

$ tail -n 10 corpus.txt 
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution
G15 KDN C30A Action Standard Air Brush Air Dilution


$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = [line.strip() for line in open('corpus.txt', 'r')]
>>> texts = [[words for words in sentences.lower().split()] for sentences in data]
>>> texts[:5]
[['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution'], ['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution']]
>>> 

You can then flatten the result into a single list, or keep it as it is.

>>> flattened = reduce(lambda x,y: x+y, texts)
>>> flattened[:30]
['g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a', 'action', 'standard', 'air', 'brush', 'air', 'dilution', 'g15', 'kdn', 'c30a']
>>> 
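
The session above runs on Python 2.7, where reduce is a builtin. On Python 3 it has to be imported from functools, and itertools.chain.from_iterable is a common alternative for flattening. A small sketch, with texts hard-coded here purely for illustration:

```python
from itertools import chain

# Illustrative data: a list of token lists, as produced by the comprehension.
texts = [["g15", "kdn", "c30a"], ["action", "standard"], ["air", "brush"]]

# chain.from_iterable walks the inner lists in order without building
# intermediate lists the way repeated x + y concatenation does.
flattened = list(chain.from_iterable(texts))
print(flattened)
```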
marmeladze
0

To convert the strings in a list to lowercase, loop over the list and call lower() on each item:

>>> words = ["PYTHON", "PROGRAMMING"]
>>> type(words)  # a list, which has no lower() method of its own
>>> for i in words:
...     print(i.lower())
...

Output:

python
programming
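
If the lowercased words are needed as a new list rather than printed, a list comprehension is one way to collect them. A minimal sketch using the same two example words (the name lowered is just illustrative):

```python
words = ["PYTHON", "PROGRAMMING"]

# Build a new list; the original list is left unchanged.
lowered = [w.lower() for w in words]
print(lowered)
```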

Eric Aya
Viraj Wadate