
I am trying to produce bigrams from the hexdump of a malware file so that I can relate different malware files to one another based on their bigram counts. I am using Counter, zip, and slice to get the result, but I get an error instead. I would be glad if someone could help me out.

import binascii
import re
import collections
try:
    from itertools import izip as zip
except ImportError: # will be 3.x series
    pass
try:
    from itertools import islice as slice
except ImportError: # will be 3.x series
    pass
with open('path', 'rb') as f:
    for chunk in iter(lambda: f.read(), b''):
        s=binascii.hexlify(chunk)
        print(collections.Counter(zip(s),slice(s,1,None)))

The result should look like Counter({(4d5a):200,(5a76):120,(7635):1000...}), but instead I am getting this error:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-110-d99ed11a1260> in <module>
      3     for chunk in iter(lambda: f.read(), b''):
      4         s=binascii.hexlify(chunk)
----> 5         print(collections.Counter(zip(s),slice(s,1,None)))
      6 

~\Anaconda3\lib\collections\__init__.py in __init__(*args, **kwds)
    562         self, *args = args
    563         if len(args) > 1:
--> 564             raise TypeError('expected at most 1 arguments, got %d' % len(args))
    565         super(Counter, self).__init__()
    566         self.update(*args, **kwds)

TypeError: expected at most 1 arguments, got 2

1 Answer

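import binascii
import collections
import pathlib

# Path to the sample to analyse
malware = pathlib.Path.home() / 'Desktop' / 'Malware' / 'HWID_4_0_6YMBWX.exe'
# Returns True if the file is present (the value is only displayed in an interactive session)
malware.exists()

with open(malware, 'rb') as fh:
    data = fh.read()

def find_ngrams(data, n):
    # Hex-encode the bytes, then slide an n-character window over the hex text
    s = binascii.hexlify(data).decode()
    return zip(*[s[i:] for i in range(n)])

x = find_ngrams(data, 2)

# Join each tuple of characters into a single string key
output = dict()
for ngram, count in collections.Counter(x).items():
    output[''.join(ngram)] = count
# Sort by count, most frequent first
i = sorted(output.items(), key=lambda x: x[1], reverse=True)

print(i)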

Output (truncated):

[('00', 31198), ('ff', 14938), ('40', 11669), ('8b', 11537), ('06', 11360), ('20', 11340), ('08', 11144)......
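
For reference, the TypeError in the question appears to come from a misplaced parenthesis: `zip(s)` is closed before `slice(s,1,None)`, so `Counter` receives two positional arguments instead of one. A minimal sketch of the corrected call, assuming Python 3 (built-in `zip`, `itertools.islice`) and a short hypothetical byte string in place of a real file:

```python
import binascii
import collections
from itertools import islice

# Hypothetical sample standing in for the file contents (not taken from a real sample).
chunk = b"MZ\x90\x00MZ"
s = binascii.hexlify(chunk).decode()  # hex text of the chunk

# Both iterables go inside zip(), so Counter receives a single argument.
pairs = collections.Counter(zip(s, islice(s, 1, None)))
print(pairs.most_common(3))
```

Like `find_ngrams(data, 2)` above, this counts pairs of adjacent hex characters, not pairs of bytes.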
Utkonos
  • Made a bunch of edits. I think this is what you need. – Utkonos Feb 17 '19 at 03:12
  • The above is from http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/ – Utkonos Feb 17 '19 at 04:00
  • Please let me know what you will do to compare those counts across many files. I'm very interested. – Utkonos Feb 17 '19 at 05:40
  • Thank you so much for the help, but I think this is not what I want. The hexdump created from the binary (for example, 4d5a90000300000004000000ff....) should be sliced and zipped so that I can count the bigrams of the binary, such as {4d5a}:100,{5a90}:300,{9000}:100. Once I produce bigrams for different binaries, I can check how one binary relates to the others, which in the future will help me to distinguish benign binaries from malicious ones. I will let you know if I find a solution. – Shubham Kalsi Feb 18 '19 at 00:33
  • I think you just need to change the 2 to a 4. The 2 counts characters, but you want 4d5a, so you want 4 characters, right? – Utkonos Feb 18 '19 at 01:15
  • I went and built a bunch of code around the above and am now working on various similarity algorithms: [SimilarityResult(near=0.5267333984375, far=57.467529296875, grams=4, simtype=), SimilarityResult(near=404496, far=3766192, grams=4, simtype=), SimilarityResult(near=0.1981453128933405, far=0.11489986201073397, grams=4, simtype=)] – Utkonos Feb 18 '19 at 01:17
  • The bigrams you want are actually quads when looking at it as text, which the `.decode()` does. – Utkonos Feb 18 '19 at 01:18
  • let me know if this is incorrect, but it appears to work for me. – Utkonos Feb 18 '19 at 01:19
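
One caveat with passing 4 to `find_ngrams`: the window slides one hex character at a time, so roughly half of the resulting keys straddle byte boundaries (for example `d5a9`). If the goal is the byte-level bigrams described in the comments (`4d5a`, `5a90`, `9000`), one possible approach is to pair the raw bytes first and format them as hex afterwards. The sketch below assumes Python 3; `byte_bigrams` is a hypothetical helper name, and the sample is the hex prefix quoted in the comments, not a full file:

```python
import collections

def byte_bigrams(data):
    """Count adjacent byte pairs, keyed by their 4-character hex form."""
    # zip(data, data[1:]) pairs each byte with its successor, so every
    # key starts on a byte boundary.
    return collections.Counter(
        "{:02x}{:02x}".format(a, b) for a, b in zip(data, data[1:])
    )

# Hypothetical sample: the hex prefix quoted in the comments.
sample = bytes.fromhex("4d5a90000300000004000000ff")
counts = byte_bigrams(sample)
print(counts.most_common(3))
```

For a real file, `data = fh.read()` from the answer can be passed to `byte_bigrams` directly, and the resulting `Counter` objects from different files can then be compared.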