3

I am trying to implement a neural network for email spam detection. I have a neural network that solves the XOR problem, and I want to adapt that network for my purpose and use ba. It's accessible here: https://github.com/trentsartain/Neural-Network

I downloaded a database of spam and ham emails in text format for training the network, so I have some training sets. But my question is:

What should the inputs to that neural network be?

Thanks for every comment! :)

Amir
user2095405
  • 3
    There's so much prior research about this... Search Google Scholar for papers discussing the various signals useful in spam detection, then extract those signals from the text and feed them into your ANN. – Johannes Rudolph Jan 07 '16 at 21:18

2 Answers

2

The short answer: the inputs will be your emails, both spam and ham.

The longer answer, at a very basic level: assume your emails are free of unusual characters. Imagine a vector in which each element represents one of the words that appear across those emails.
For each email, you create one of those vectors, and for each element you calculate the frequency of the corresponding word in that email.
These vectors, one per email, are your inputs.
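
As a rough illustration of the vectors described above, here is a minimal Python sketch. The function names (`tokenize`, `build_vocabulary`, `vectorize`), the tokenization rule, and the three sample emails are all made up for this example; a real setup would build the vocabulary from the full training set.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails):
    # One vector position per distinct word seen in any email.
    return sorted({word for email in emails for word in tokenize(email)})

def vectorize(email, vocabulary):
    # Relative frequency of each vocabulary word within this one email.
    counts = Counter(tokenize(email))
    total = sum(counts.values()) or 1  # avoid division by zero for empty emails
    return [counts[word] / total for word in vocabulary]

# Made-up sample data, only to show the shapes involved.
emails = ["Win money now", "Meeting at noon", "Win a free prize now"]
vocabulary = build_vocabulary(emails)
vectors = [vectorize(email, vocabulary) for email in emails]

print(len(vocabulary))  # number of network inputs
print(vectors[0])
```

Each vector has the same length as the vocabulary, so that length would be the size of the network's input layer.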

That's the basic idea. You can then refine it by applying stemming, using tf-idf instead of plain frequency, and bringing in other input features (from the email headers, for example).
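
To make the tf-idf refinement concrete, here is a small sketch using one common variant, term frequency times `log(N / df)`. Other weightings exist, and the name `tfidf_vectors` and the sample data are assumptions for illustration. Stemming and header features are left out, and the emails are assumed to be tokenized already.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_emails):
    # tokenized_emails: list of word lists, one per email (assumed pre-tokenized).
    n = len(tokenized_emails)
    vocabulary = sorted({w for doc in tokenized_emails for w in doc})
    # Document frequency: in how many emails each word appears at least once.
    df = Counter(w for doc in tokenized_emails for w in set(doc))
    # Words that appear in every email get weight log(1) = 0.
    idf = {w: math.log(n / df[w]) for w in vocabulary}
    vectors = []
    for doc in tokenized_emails:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[w] / total * idf[w] for w in vocabulary])
    return vocabulary, vectors

# Made-up sample data.
docs = [
    ["win", "money", "now"],
    ["meeting", "at", "noon"],
    ["win", "free", "prize", "now"],
]
vocabulary, vectors = tfidf_vectors(docs)
print(vectors[0])
```

Compared with plain frequency, a word that is rare across the collection ("money" here) gets a larger weight than one that shows up in several emails ("win").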

Olivier De Meulder
1

I have come across several spam filters for email and SMS, and the most effective of them are based on the "Naive Bayes spam filtering" technique, so I suggest looking at that technique first.

As an idea to start with:

You can use a word-weighting technique in a neural network as follows.

First step: create a "dictionary", based on a neural network, which tells you the probability that a given word is spam.

Second step: calculate the probability that the whole message is spam. You may have several inputs: for example, the first input takes the number of words with a spam probability of 0-10%, the second takes the number of words with a probability of 10-20%, and so on up to the last input, which takes the number of words with a probability of 90-100%. The output of this neural network can then be taken as the probability that the message is spam.
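
The two steps can be sketched roughly as below. Note one substitution: instead of a neural-network "dictionary", this sketch estimates each word's spam probability from simple counts in labelled data, just to keep the example short. The function names, the default of 0.5 for unseen words, and the sample data are all assumptions.

```python
from collections import Counter

def word_spam_probabilities(spam_docs, ham_docs):
    # Step 1 (simplified): fraction of a word's occurrences that are in spam.
    spam_counts = Counter(w for doc in spam_docs for w in doc)
    ham_counts = Counter(w for doc in ham_docs for w in doc)
    return {
        w: spam_counts[w] / (spam_counts[w] + ham_counts[w])
        for w in set(spam_counts) | set(ham_counts)
    }

def histogram_features(words, probs, bins=10, unknown=0.5):
    # Step 2: count the message's words per probability band
    # (0-10%, 10-20%, ..., 90-100%); these counts are the network inputs.
    features = [0] * bins
    for w in words:
        p = probs.get(w, unknown)  # assumption: unseen words are treated as neutral
        index = min(int(p * bins), bins - 1)  # p == 1.0 goes in the last band
        features[index] += 1
    return features

# Made-up, pre-tokenized sample data.
spam = [["win", "money", "now"], ["free", "money"]]
ham = [["meeting", "now"], ["lunch", "at", "noon"]]
probs = word_spam_probabilities(spam, ham)

features = histogram_features(["win", "money", "at", "noon", "now", "hello"], probs)
print(features)  # [2, 0, 0, 0, 0, 2, 0, 0, 0, 2]
```

With this encoding the network always has ten inputs, regardless of vocabulary size or message length.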

Mikhailov Valentin