Should the email header be ignored when using an email dataset for machine learning?

Question

I have been looking at email datasets for machine learning and noticed that the emails contain header information in addition to email content. Is it best to ignore or skip over the header and focus on the email content? Or, should the header be included? Does this depend on what you are trying to do?

For training Word2Vec, should headers be used?

For classifying email as spam or non-spam, should headers be used?

score 1 · Answer 1 · answered Oct 26 '17 at 06:12

The header part of an email definitely contains information that helps determine whether a message is spam. The From, Reply-To, and Subject fields are some of the important ones that can be used for spam filtering.

Having said that, you can always experiment with different types of input data (for example, with and without the headers) to see which trains your ML algorithm better.
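As a rough illustration of pulling those header fields out as features, here is a minimal sketch using Python's standard `email` module. The function names (`header_features`, `body_text`), the particular features chosen, and the sample message are my own assumptions for illustration, not a recommended feature set.

```python
from email import message_from_string
from email.utils import parseaddr


def header_features(raw_email: str) -> dict:
    """Extract a few simple header-based features (illustrative choices only)."""
    msg = message_from_string(raw_email)
    # parseaddr returns (display_name, address); missing headers yield empty strings.
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    subject = msg.get("Subject", "")
    from_domain = from_addr.rpartition("@")[2].lower()
    reply_domain = reply_addr.rpartition("@")[2].lower()
    return {
        "from_domain": from_domain,
        # A Reply-To pointing at a different domain than From is one possible signal.
        "reply_to_differs": bool(reply_addr) and reply_domain != from_domain,
        "subject": subject,
        "subject_all_caps": subject.isupper(),
    }


def body_text(raw_email: str) -> str:
    """Return the body of a simple, non-multipart message (empty string otherwise)."""
    payload = message_from_string(raw_email).get_payload()
    return payload if isinstance(payload, str) else ""


# Hypothetical sample message, made up for this example.
sample = (
    "From: Prize Team <win@lottery.example>\n"
    "Reply-To: claims@other.example\n"
    "Subject: YOU WON\n"
    "\n"
    "Click here to claim.\n"
)

print(header_features(sample))
print(body_text(sample))
```

Keeping the header features and the body text separate like this makes it easy to train once on the body alone and once with the header features added, and then compare the results.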
