readlines() on an xlsx file does not work properly

Question

The goal is sentiment classification. The steps are to open 3 xlsx files, read them, process them with gensim.doc2vec methods, and classify with SGDClassifier. I am just trying to reproduce this code on doc2vec, using Python 2.7.

with open('C:/doc2v/trainpos.xlsx','r') as infile:
    pos_reviews = infile.readlines()
with open('C:/doc2v/trainneg.xlsx','r') as infile:
    neg_reviews = infile.readlines()
with open('C:/doc2v/unsup.xlsx','r') as infile:
    unsup_reviews = infile.readlines()

But it turned out that the resulting lists are not what I expected:

print 'length of pos_reviews is %s' % len(pos_reviews)
>>> length of pos_reviews is 1

The files contain 18, 1221 and 2203 rows respectively. I expected the lists to have the same number of elements.

The next step is to concatenate all the sentences.

y = np.concatenate((np.ones(len(pos_reviews)), np.zeros(len(neg_reviews))))
x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos_reviews, neg_reviews)), y, test_size=0.2)

This leads to a situation where x_train and x_test are lists of sentences, as they should be, while

y_train = [0.]
y_test = [1.]

After this split, every sentence gets a label:

def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized
x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
unsup_reviews = labelizeReviews(unsup_reviews, 'UNSUP')

As stated in the numpy documentation, the arrays should be equal in size. But when I reduce the bigger files to 18 lines, nothing changes. As far as I could find on the forum, no one has had a similar error. I've racked my brain over what went wrong and how to fix it. Thanks for your help!

score 1 · Accepted Answer · answered Sep 01 '16 at 14:23


Generally, you can't read Microsoft Excel files as text files using methods like readlines or read. An .xlsx file is a ZIP archive of XML documents, so its rows do not correspond to lines of text. You should either convert the files to another format first (a good choice is .csv, which can be read by the csv module) or use a dedicated Python module such as pyexcel or openpyxl to read .xlsx files directly.
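As a rough illustration of both options, here is a minimal sketch, not a definitive implementation. The helper names `read_reviews_csv` and `read_reviews_xlsx` are made up for this example, and it assumes Python 3, one review per row, and the review text in the first column. The xlsx variant also assumes the third-party openpyxl package is installed and that the reviews sit on the active sheet.

```python
import csv


def read_reviews_csv(csv_path, column=0):
    """Return a list with one review string per non-empty row of a CSV file."""
    reviews = []
    # newline='' lets the csv module handle line breaks inside quoted cells.
    with open(csv_path, 'r', encoding='utf-8', newline='') as infile:
        for row in csv.reader(infile):
            # Skip blank rows and rows that are too short to hold the column.
            if len(row) > column and row[column].strip():
                reviews.append(row[column].strip())
    return reviews


def read_reviews_xlsx(xlsx_path, column=0):
    """Return a list with one review string per non-empty row of an .xlsx sheet."""
    # Third-party dependency (pip install openpyxl); imported here so the
    # CSV helper above still works when openpyxl is not installed.
    from openpyxl import load_workbook

    workbook = load_workbook(xlsx_path, read_only=True)
    sheet = workbook.active  # assumption: the reviews are on the active sheet
    reviews = []
    for row in sheet.iter_rows(values_only=True):
        cell = row[column] if len(row) > column else None
        # Empty cells come back as None; skip them and whitespace-only cells.
        if cell is not None and str(cell).strip():
            reviews.append(str(cell).strip())
    workbook.close()
    return reviews
```

Both helpers return an ordinary Python list, so something like `pos_reviews = read_reviews_xlsx('C:/doc2v/trainpos.xlsx')` should yield one element per row, and `len(pos_reviews)` can then be compared against the row count in the file.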


Stanislav Ivanov


I'm surprised, because in the book [link](https://automatetheboringstuff.com/), which has been cited more than once on this site, the readlines() method was advised for working with Excel files. – Talka Sep 02 '16 at 10:34
Could you then tell me which module's method returns an object of the list type, so that the list.append method could be applied later? I edited the code to add the labelizing function. – Talka Sep 03 '16 at 15:06
@Talka [This code](http://pastebin.com/rzi57bhE) should work for `python3` with the `xlrd` module. The «Automate the Boring Stuff...» book describes the `openpyxl` module, which works only with MS Office 2007 files (`.xlsx`). – Stanislav Ivanov Sep 05 '16 at 15:40
Thanks for the code. I used the openpyxl methods described in [link](https://www.getdatajoy.com/learn/Read_and_Write_Excel_Tables_from_Python), though I still don't understand how this article ([link](https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis#disqus_thread)) was written with 'readlines'. – Talka Sep 06 '16 at 03:07
Argh! I just read that article once more and saw that the author opened a txt file, which is fine for `readlines`. Where were my eyes... – Talka Sep 06 '16 at 03:17
