What machine learning model should I use?

Question

I'm currently making a machine learning model for a student project, and I'm still deciding what model I should use. Here's the brief I was given:

Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2014. Some portion of the attacks have not been attributed to a particular terrorist group. Use attack type, weapons used, description of the attack, etc. to build a model that can predict what group may have been responsible for an incident.

The data frame has:

134 columns, about 100,000 rows
many of the columns have missing values
I've only been given 5 days to submit my final work, so I can't spend a prolonged period training the model

I'm leaning towards a backpropagation neural network, as I believe it can handle the missing values, though a random forest might also be viable given the limited time I have to train it. I've done a lot of research on the pros and cons of common ML models, but any additional advice would be greatly appreciated.

score 2 · Accepted Answer · answered Nov 29 '18 at 05:04

It would be easier to answer this question if you tried several candidate methods and described why they don't suffice, but here's one place to start. If you didn't have access to a computer and someone gave you this table and asked you to qualitatively describe how terrorism works, you might notice very quickly, say, that the Irish Republican Army doesn't operate in Afghanistan and that only ISIS is involved in attacks that kill more than 1000 people (let's stipulate). These two observations are akin to how the decision trees in a random forest split on categorical and continuous features, respectively.

The point is that your brain gravitates towards something like a random forest when trying to qualitatively describe the fundamental reality behind data like this. (Multiple splits would look like: well, there was no terrorism in America before 1991, and after 1991 most terrorist attacks in America have involved groups X, Y, and Z, and so forth.) A corollary is that you will have a lot to say about what your trained random forest is telling you, where it fails, and why it fails there.
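To make the split analogy concrete, here is a minimal hand-rolled sketch of how a single tree node might score one categorical test and one continuous test using Gini impurity. The toy rows, the labels "A"/"B"/"C", and the helper names `gini` and `split_impurity` are all made up for illustration; this is not how any particular library implements it.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, test):
    """Weighted Gini impurity after partitioning the rows with a boolean test."""
    left = [y for x, y in zip(rows, labels) if test(x)]
    right = [y for x, y in zip(rows, labels) if not test(x)]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up toy incidents: (country, fatalities), with a made-up group label each.
rows = [
    ("Ireland", 2), ("Ireland", 0), ("Afghanistan", 5),
    ("Afghanistan", 40), ("Iraq", 1200), ("Iraq", 3),
]
labels = ["A", "A", "B", "B", "C", "B"]

before = gini(labels)
# Categorical test: "did the incident happen in Ireland?"
by_country = split_impurity(rows, labels, lambda r: r[0] == "Ireland")
# Continuous test: "were there more than 1000 fatalities?"
by_deaths = split_impurity(rows, labels, lambda r: r[1] > 1000)

print(f"impurity before any split: {before:.3f}")
print(f"after country == 'Ireland': {by_country:.3f}")
print(f"after fatalities > 1000:    {by_deaths:.3f}")
```

Both tests lower the impurity, and a tree would pick whichever test lowers it most at each node. A forest repeats this over many trees, each grown on a resampled copy of the data with a random subset of features considered at each split.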

If you use a neural network, without knowing a lot about the details of how it works, you might end up mindlessly tuning things until something seems to work and have no idea what to say about how well it works for various situations or which features are informative.

Why not use a random forest, find out where it does and does not work, contemplate the result, and iterate on that?
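A rough sketch of that loop, assuming scikit-learn and pandas are available. The data here is synthetic and the column names (`region`, `attack_type`, `n_killed`) and group labels are placeholders, not the actual GTD schema; swap in the real frame and target column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in for the real table; column names are hypothetical.
df = pd.DataFrame({
    "region": rng.choice(["r1", "r2", "r3"], size=n),
    "attack_type": rng.choice(["bombing", "armed_assault"], size=n),
    "n_killed": rng.poisson(3, size=n),
})

# The made-up label depends mostly on region, with about 10% of rows overwritten as noise.
group = df["region"].map({"r1": "group_x", "r2": "group_y", "r3": "group_z"})
noise = rng.random(n) < 0.1
group = group.where(~noise, "group_x")

# One-hot encode the categorical columns so the forest gets a numeric matrix.
X = pd.get_dummies(df, columns=["region", "attack_type"])
X_train, X_test, y_train, y_test = train_test_split(
    X, group, test_size=0.25, random_state=0, stratify=group
)

forest = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

# Where does it work and where does it fail? Per-class precision and recall.
print(classification_report(y_test, forest.predict(X_test)))

# Which features is it leaning on? Impurity-based importances, largest first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The per-class report shows which groups the model confuses, and the importances give a first hint about which columns drive the predictions. Both are starting points for the next iteration rather than final answers.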

Wow @arra, you reaffirmed a lot of my thoughts and built a very strong case for using a random forest. I'll admit most of my experience with NNs has been with "out of the box" NNs, so it probably wouldn't be feasible to become experienced enough to build a powerful NN in such a short time. I also think you're right that random forests seem very intuitive and suitable for this dataset. My only concern is how they will handle missing values, as I believe they require complete information to perform effectively. Thanks for your response. - Uncle_Timothy, Nov 29 '18 at 05:20
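On the missing-values concern in the comment above, one common workaround is to impute before fitting and keep a flag that records which values were missing. The sketch below assumes scikit-learn and pandas; the columns `n_killed` and `weapon` and the labels are invented for illustration. Some recent scikit-learn releases may accept NaN in tree-based models directly, but that is version-dependent, so explicit imputation is assumed here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Tiny made-up frame with gaps in one numeric and one categorical column.
X = pd.DataFrame({
    "n_killed": [0.0, 2.0, np.nan, 15.0, np.nan, 1.0],
    "weapon": ["explosive", np.nan, "firearm", "explosive", "firearm", np.nan],
})
y = ["g1", "g2", "g2", "g1", "g2", "g2"]

# Numeric column: fill with the median and append a 0/1 "was missing" indicator.
# On real data, fit the imputer on the training split only to avoid leakage.
imputer = SimpleImputer(strategy="median", add_indicator=True)
num = imputer.fit_transform(X[["n_killed"]])

# Categorical column: treat "missing" as its own category, then one-hot encode.
cat = pd.get_dummies(X["weapon"].fillna("missing"), prefix="weapon")

features = np.hstack([num, cat.to_numpy(dtype=float)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(features, y)
print(features.shape)
```

Keeping the indicator column lets the forest split on the fact that a value was missing, which can itself be informative when values are not missing at random.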
