I am trying to read 100 training files and vectorize them using sklearn. Each file contains words representing system calls. Once the files are vectorized, I would like to print the vectors. My first attempt was the following:
import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import numpy.linalg as LA
trainingdataDir = r'C:\data\Training data'
def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
            data = open(trainingfiles, "rb").read()
    return data
train_set = [readfile()]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
However, this only returns the vector for the last file. I concluded that the print statement should go inside the for loop, so my second attempt was:
def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
            data = open(trainingfiles, "rb").read()
            trainVectorizerArray = vectorizer.fit_transform(data).toarray()
            print 'Fit Vectorizer to train set', trainVectorizerArray
However, this does not print anything. Could you please help me with this? Why am I not seeing the vectors printed out?
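
For comparison, here is a minimal sketch of one way to get one vector per file. It rests on a few assumptions: each file is plain text, all files should share a single vocabulary, and the helper names `read_files` and `vectorize` are hypothetical (they are not part of sklearn). The idea is to collect the contents of every file in a list instead of overwriting `data` on each iteration, and then call `fit_transform` once on that list. In the second snippet as posted, `readfile()` is defined but never called, which would explain the missing output. Also, `fit_transform` expects an iterable of documents, so passing it a single string raises a `ValueError`.

```python
import os


def read_files(directory):
    """Return the text of every regular file in `directory`, one string per file."""
    documents = []
    # sorted() keeps the row order of the resulting matrix predictable.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                documents.append(handle.read())
    return documents


def vectorize(documents):
    """Fit one CountVectorizer on all documents and return (vocabulary, counts)."""
    # Imported here so read_files() can be used without scikit-learn installed.
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    # One row per document, one column per term in the shared vocabulary.
    counts = vectorizer.fit_transform(documents).toarray()
    return vectorizer.vocabulary_, counts
```

It could be used as `vocabulary, counts = vectorize(read_files(trainingdataDir))`, after which printing `counts` shows one row per training file. Note that the default `CountVectorizer` tokenizer ignores tokens shorter than two characters, so if the system calls are written as single-digit numbers, a custom `token_pattern` would probably be needed.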