How to classify data based on n-grams

Question

I have the following dataset, which consists of malware categories and their corresponding API calls. The API call column contains a string of comma-separated API names. Based on those strings, I need a classifier that can assign each sample to its category. Here is a sample of the dataset:

Class   APIcall
virus   LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,NtCreateSection,….
trojan  NtOpenFile,NtCreateSection,NtClose,LdrLoadDll,……….
worm    LdrLoadDll,LdrGetProcedureAddress,LdrGetProcedureAddress,LdrGetProcedureAddress…

I have managed to use a Naive Bayes classifier with the code below:

# split into train and test
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
data_train, data_test, labels_train, labels_test = train_test_split(
    data.ApiCall,
    data.Malware, 
    test_size=0.25, 
    random_state=42)

print (data_train[:10])

### text vectorization--go from strings to lists of numbers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
data_train_transformed = vectorizer.fit_transform(data_train)
data_test_transformed  = vectorizer.transform(data_test)

print (data_train_transformed[:10])


# slim the data for training and testing
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(data_train_transformed, labels_train)
data_train_transformed = selector.transform(data_train_transformed).toarray()
data_test_transformed  = selector.transform(data_test_transformed).toarray()

print (data_train_transformed[:10])


from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()
clf.fit(data_train_transformed, labels_train)
predictions = clf.predict(data_test_transformed)

print (accuracy_score(labels_test, predictions))

This seems to work. But what I need is to first generate n-grams (e.g. 4-grams and 5-grams) of the API calls, so that classification is based on n-grams rather than just the individual API calls. Your help is highly appreciated. Thank you.

Do you mean `TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(5, 5))`? If you set ngram_range there, the features in data_train_transformed and data_test_transformed will be the 5-grams with their corresponding TF-IDF values. See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). - mkaran, Jun 07 '17 at 15:28
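To make the `ngram_range` suggestion concrete, here is a minimal sketch of what the vectorizer produces for 4-grams and 5-grams. The two sample strings are hypothetical rows written in the same comma-separated shape as the dataset above, not real data. `lowercase=False` is an added assumption so the API names keep their casing, and `max_df` is left out only because it is not meaningful on a two-row toy set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical API-call strings, shaped like the APIcall column in the question.
samples = [
    "LdrGetDllHandle,LdrGetProcedureAddress,LdrGetDllHandle,NtCreateSection,NtClose",
    "NtOpenFile,NtCreateSection,NtClose,LdrLoadDll,LdrGetProcedureAddress",
]

# The default tokenizer splits on the commas, so each API name is one token.
# ngram_range=(4, 5) then emits every run of 4 and of 5 consecutive tokens.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    ngram_range=(4, 5),
    lowercase=False,  # assumption: keep the original casing of API names
)
features = vectorizer.fit_transform(samples)

# vocabulary_ maps each n-gram (tokens joined by a space) to its column index.
for ngram in sorted(vectorizer.vocabulary_):
    print(ngram)

print(features.shape)  # (number of samples, number of distinct n-grams)
```

In the question's code, the same `ngram_range` argument would go into the existing `TfidfVectorizer(...)` call, and the rest of the pipeline (feature selection, `GaussianNB`) would then run on n-gram features instead of single API names.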
