I am using the Enron dataset for a machine learning problem. I want to merge all the spam files into a single csv file and all the ham files into another single csv for further analysis. I'm using the dataset listed here: https://github.com/crossedbanana/Enron-Email-Classification

I used the code below to merge the emails, and the merge itself works. However, when I try to load the resulting csv file into pandas, I get this error: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2`

Code to merge the txt email files into a csv:

    import glob
    import os

    # append the content of every spam txt file to one output file
    for f in glob.glob("./dataset_temp/spam/*.txt"):
        os.system("cat "+f+" >> OutFile1.csv")

Code to load into pandas:

```
# reading the csv into pandas
import pandas as pd

emails = pd.read_csv('OutFile1.csv')
print(emails.shape)
```

1. How can I get rid of the parser error? I think it occurs because of the commas in the email messages.
2. How can I load each email message into pandas with just the email body?

This is what the email format looks like (an example of a text file in the spam folder).
The commas in line 3 cause a problem when loading into pandas:


*Subject: your prescription is ready . . oxwq s f e
low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu* 


Thanks for any help. 
py_noob

3 Answers

Instead of reading and writing the data as a CSV file, you can use an Excel file. Commas in the text then no longer cause parsing errors, because each email is stored in its own cell. Just replace the csv file with an Excel file.

Here is an example:

    import os
    import pandas as pd
    import codecs

    # Build a list with the text of every email in a folder.
    def create_email_list(folder_path):
        email_list = []
        # folder_path: path to the folder; the folder name is enough if it is in the current directory
        for txt in os.listdir(folder_path):
            file_name = os.path.join(folder_path, txt)
            # read the email, ignoring bytes that are not valid UTF-8
            with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as f:
                email_list.append(f.read())
        return email_list

    spam_list = create_email_list('spam')  # read all spam emails
    spam_df = pd.DataFrame(spam_list)      # create a dataframe of spam
    spam_df.to_excel('spam.xlsx')          # write the spam Excel file

    ham_list = create_email_list('ham')    # read all ham emails
    ham_df = pd.DataFrame(ham_list)        # create a dataframe of ham
    ham_df.to_excel('ham.xlsx')            # write the ham Excel file

You just need to pass the folder path to the function (the folder name is enough if the folder is in the same directory). This code will create the Excel files.

Sachin Gupta

To avoid problems with the commas, you can use a different separator (for example `|`) or put quotes around the field:

"soma , ultram , adipex , vicodin many more"

If there are quotes inside the fields, you have to escape them with another quote:

"soma , ultram , ""adipex"" , vicodin many more"

However, your example will have a csv record for each line in every mail. It might be more logical to have one record per email:

subject,body
your prescription is ready . . oxwq s f e,"low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu"
test subject2,"test
body 2"

The above example gives you a table with 2 columns: subject and body, where body is a multiline field surrounded by double quotes.
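One way to produce this layout from thousands of files without editing them by hand is to let Python's built-in `csv` module do the quoting. The sketch below is only an illustration under stated assumptions: the function name `emails_to_csv` is hypothetical, and it assumes every file starts with a `Subject:` line followed by the body, as in the sample from the question.

```python
import csv
import glob
import os


def emails_to_csv(folder, out_path):
    """Write one CSV record (subject, body) per .txt file in `folder`."""
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)  # quotes fields with commas, quotes or newlines
        writer.writerow(["subject", "body"])
        for path in paths:
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            # Assumption: the first line is "Subject: ..." and the rest is the body.
            first_line, _, body = text.partition("\n")
            if first_line.lower().startswith("subject:"):
                subject = first_line[len("subject:"):].strip()
            else:
                # No subject header: keep the whole text as the body.
                subject, body = "", text
            writer.writerow([subject, body.strip()])
    return len(paths)
```

A call such as `emails_to_csv("./dataset_temp/spam", "spam.csv")` would then write one row per email, and the result should load with `pd.read_csv("spam.csv")`, since quoted multiline fields are part of the CSV format that pandas reads.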

Danny_ds
  • Good suggestion but I have thousands of txt files so I cannot modify and remove the commas or replace them. Confused about what would be a practical yet not so tedious approach. – py_noob Apr 27 '20 at 15:20
  • @py_noob It all depends on how you want to access the data, but if you want your emails in a single csv file (and avoid the read errors you're getting now), the above format would be the one to use. Another format you might want to take a look at is the [Mbox format](https://en.wikipedia.org/wiki/Mbox), although the files you're working with are not complete emails. Or if you don't want to touch the emails maybe just leave them in separate files? – Danny_ds Apr 28 '20 at 09:00

I solved my problem this way. First, read all the txt files:

```
import codecs
import glob

import pandas as pd

BASE_DIR = './'
SPAM_DIR = 'spam/'
HAM_DIR = 'ham/'  # assumed folder name, not defined in the original snippet

def load_text_file(filenames):
    text_list = []
    for filename in filenames:
        with codecs.open(filename, "r", "utf-8", errors='ignore') as f:
            text = f.read().replace('\r\n', ' ')
            text_list.append(text)
    return text_list

# collect the ham file names and read the files into a list
ham_filenames = glob.glob(BASE_DIR + HAM_DIR + '*.txt')
ham_list = load_text_file(ham_filenames)

# load the list into a dataframe
df = pd.DataFrame(ham_list, columns=['emails'])
```

Once I had it in a dataframe, I just parsed the emails into subject and body. Thanks to everyone for their help.
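The parsing step is not shown above, so here is a minimal sketch of one way it could be done. It assumes each email string still has its line breaks and begins with a `Subject:` line, as in the sample in the question (so the `replace('\r\n', ' ')` call would have to be dropped or applied afterwards). `split_email` is a hypothetical helper name.

```python
def split_email(text):
    """Return (subject, body) for one raw email string."""
    # Normalise Windows line endings so the first line can be isolated.
    text = text.replace("\r\n", "\n")
    first_line, _, rest = text.partition("\n")
    if first_line.lower().startswith("subject:"):
        return first_line[len("subject:"):].strip(), rest.strip()
    # No subject header found: treat the whole text as the body.
    return "", text.strip()


sample = (
    "Subject: your prescription is ready\r\n"
    "low cost prescription medications\r\n"
    "click here !"
)
subject, body = split_email(sample)
print(subject)
print(body)
```

On a dataframe with an `emails` column, something like `df['emails'].apply(split_email)` would give one `(subject, body)` pair per row, which can then be expanded into two columns.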

py_noob