I have the following dataset, which contains malware categories and their corresponding API calls. The API call column contains a string of comma-separated call names. Based on those strings, I need a classifier that can assign each sample to its category. Here is a sample of the dataset:
Class APIcall
virus LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,NtCreateSection,….
trojan NtOpenFile,NtCreateSection,NtClose,LdrLoadDll,……….
worm LdrLoadDll,LdrGetProcedureAddress,LdrGetProcedureAddress,LdrGetProcedureAddress…
I have managed to train a Gaussian naive Bayes classifier with the code below:
# split into train and test
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
data_train, data_test, labels_train, labels_test = train_test_split(
data.ApiCall,
data.Malware,
test_size=0.25,
random_state=42)
print (data_train[:10])
### text vectorization--go from strings to lists of numbers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
data_train_transformed = vectorizer.fit_transform(data_train)
data_test_transformed = vectorizer.transform(data_test)
print (data_train_transformed[:10])
# slim the data for training and testing
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(data_train_transformed, labels_train)
data_train_transformed = selector.transform(data_train_transformed).toarray()
data_test_transformed = selector.transform(data_test_transformed).toarray()
print (data_train_transformed[:10])
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = GaussianNB()
clf.fit(data_train_transformed, labels_train)
predictions = clf.predict(data_test_transformed)
print (accuracy_score(labels_test, predictions))
This seems to work. However, what I need is to first generate n-grams (e.g. 4-grams and 5-grams) from each API call sequence, so that classification is based on n-grams of API calls rather than on individual API calls. Your help is highly appreciated. Thank you.