Vectorizing Files using sklearn

Question

I am trying to read 100 training files and vectorize them using sklearn. Each file contains words representing system calls. Once the files are vectorized, I would like to print the vectors out. My first attempt was the following:

import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import numpy.linalg as LA

trainingdataDir = 'C:\data\Training data'

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()

    return data 

train_set = [readfile()]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray

However, this only returns the vector for the last file. I concluded that the print call should be placed in the for loop, so my second attempt was:

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()
    trainVectorizerArray = vectorizer.fit_transform(data).toarray()
    print 'Fit Vectorizer to train set', trainVectorizerArray

However, this does not return anything. Could you please assist me with this? Why am I not able to see the vectors being printed out?

score 0 · Accepted Answer · answered Oct 22 '15 at 09:02

The problem was that the list of documents passed to the vectorizer was empty. I managed to vectorize the set of 100 files by opening each file, reading its contents, and appending them to a list. That list is then passed to the 'tfidf_vectorizer':

import re
import os
import sys
import numpy as np
import numpy.linalg as LA
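from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer  # used below
from sklearn.metrics.pairwise import cosine_similarity  # used for the cosine scores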

trainingdataDir = 'C:\\data\\Training data'

tfidf_vectorizer = TfidfVectorizer()

transformer = TfidfTransformer()
def readfile(trainingdataDir):
    train_set = []
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles): 
            # Read the whole file as one document; the with-block closes it afterwards.
            with open(trainingfiles, 'r') as data:
                train_set.append(data.read())
    return train_set 


tfidf_matrix_train = tfidf_vectorizer.fit_transform(readfile(trainingdataDir))
print 'Fit Vectorizer to train set',tfidf_matrix_train
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)
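
To make the cause of the original symptom concrete, here is a minimal sketch in Python 3 (the code in this thread uses Python 2 print statements). It uses only the standard library. The helper names `read_last_only` and `read_all`, the temporary directory, and the three sample files are hypothetical and exist only for illustration. The first helper mirrors the first attempt, where `data` is overwritten on every pass through the loop. The second collects one string per file, which is the shape a vectorizer expects.

```python
import os
import tempfile


def read_last_only(directory):
    """Mirrors the first attempt: `data` is overwritten on every pass."""
    data = None
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                data = handle.read()
    return data


def read_all(directory):
    """Collects one string per file instead of keeping only the last one."""
    documents = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                documents.append(handle.read())
    return documents


# Build a throwaway directory holding three made-up system-call traces.
demo_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "open read close",
    "b.txt": "open write close",
    "c.txt": "fork exec wait",
}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)

last_only = read_last_only(demo_dir)
all_docs = read_all(demo_dir)

print(last_only)      # a single string: the contents of one file only
print(len(all_docs))  # one entry per file
```

A list like `all_docs` is what should be handed to `fit_transform`, giving one row per file. Note also that the second attempt in the question, as shown, defines `readfile()` but never calls it, which would explain why nothing is printed.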
