Extract a specific block of text and place it in a new document

Question

I am using EmEditor, and I see it has a "find and extract to new document" function that supports regex. I am trying to extract some specific text from a Thunderbird mailbox text file. The mailbox contains copies of customer service chats. Unfortunately, because we use the free version of the chat program, it does not allow us to export the data. The body of each email contains a lot of text, including the chats and decoded attachments, but at the bottom of each chat are the name, email, company name, etc.

It looks like this:

Name: Tan
Email: someone@domcin.com
Operator: OperatorName
Start Time: 07/01/2014 14:43:47
End Time: 07/01/2014 15:35:22
Product/Service: Delivery
Phone: 123 1234567
Company: MyCompany Inc.

I am trying to extract the name, email, operator, product, phone and company. To make matters worse, not all entries have a company, since there are private individuals too. Also, the telephone number sometimes has a +60 or (60) or spaces, since the chat user could enter whatever he wants. I could do this manually, but it's 6k entries.

The question is whether there is a regex to find them. I could then use EmEditor to find this block and place the result in a new document, and with a bit of tweaking I should be able to make an Excel file to import into a CRM.

If this does not work with regex, then does anyone know of a smart way to do this so I do not have to copy and paste all of it?

I tried the following. I did some research, but I am not really good at this and it's hard to learn as a non-programmer: `Name(.*)Source` `(?<=Name.*?)*(?=.*?Source)` `(?<=ID.*?)*(?=.*?Source)` `(?<=ID)(.*)(?=Source)` — Thom, May 27 '20 at 06:33
I think I figured it out: (?<=\nID: )([\S\s]*?)(?=.*?Source:) — Thom, May 27 '20 at 07:01
@Mandy8055 this is great and helps a lot. The one I thought I had figured out did not work in EmEditor. However, yours is way better because it lets me select the particular fields. I did not need the times, so with yours I can exclude them as well. Thank you very much, it worked perfectly for what I needed to do. — Thom, May 27 '20 at 08:29

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

You can use the below regex to achieve your results:

^(?:Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company).*$

Explanation of the above regex:

^ - Matches the start of a line (with multiline matching, which is how the pattern is used here).

(?:Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company) - A non-capturing group that matches any one of the given field labels. Each line is matched on its own, so a record without a Company line simply yields one match fewer; no optional quantifier is needed.

| - Represents alternation.

.* - Greedily matches any characters except a newline, i.e. the rest of the line.

$ - Matches the end of the line.

You can find a demo of the above regex here.
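To try the pattern outside EmEditor, here is a minimal sketch in plain JavaScript. It uses the field list from the regex above with the `g` and `m` flags so that `^` and `$` match at line boundaries. The sample text and the variable names (`mailbox`, `fieldLine`, `extracted`) are made up for illustration.

```javascript
// Hypothetical sample standing in for the mailbox file.
const mailbox = [
  "Some chat text that should be ignored",
  "Name: Tan",
  "Email: someone@domcin.com",
  "Operator: OperatorName",
  "Start Time: 07/01/2014 14:43:47",
  "End Time: 07/01/2014 15:35:22",
  "Product/Service: Delivery",
  "Phone: +60 123 1234567",
  "Company: MyCompany Inc.",
  "More unrelated text",
].join("\n");

// "m" lets ^ and $ match at every line boundary; "g" returns all matches.
const fieldLine =
  /^(?:Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company).*$/gm;

// match() returns null when nothing matches, so fall back to an empty array.
const extracted = mailbox.match(fieldLine) ?? [];

console.log(extracted.join("\n"));
```

To leave out the times, as mentioned in the comments, drop `Start Time` and `End Time` from the alternation.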

Venturer · Answer 2 · 2020-06-01T12:22:17.607

If I understand your question correctly, you want to broadly manipulate your source file to get it into a CSV of some sort that you can load into Excel, etc.

Using EmEditor you could try the following steps (assuming the sample fields mentioned):

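1) Delete blank lines [optional]

2) Join each record onto one tab-separated line. Find: ^(.*)\r?\n(?!Name) Replace: \1\t

3) Remove the field labels. Find: ((Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company): ?) Replace: [blank]

4) Add a header line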
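To see what these find/replace steps do, here is a minimal sketch in plain JavaScript, run outside EmEditor (so it uses `String.prototype.replace` rather than the macro API). The label-stripping pattern is the one used in the macro below. The sample input and the variable names are assumptions made for illustration.

```javascript
// Hypothetical input: two already-extracted records, the second without a Company line.
const input = [
  "Name: Tan",
  "Email: someone@domcin.com",
  "Operator: OperatorName",
  "Start Time: 07/01/2014 14:43:47",
  "End Time: 07/01/2014 15:35:22",
  "Product/Service: Delivery",
  "Phone: 123 1234567",
  "Company: MyCompany Inc.",
  "",
  "Name: Lee",
  "Email: lee@example.com",
  "Operator: OperatorName",
  "Start Time: 08/01/2014 09:00:00",
  "End Time: 08/01/2014 09:10:00",
  "Product/Service: Support",
  "Phone: +60 12 3456789",
].join("\n");

// Step 1: delete blank (or whitespace-only) lines.
const noBlanks = input.replace(/^[ \t]*\r?\n/gm, "");

// Step 2: replace each line break with a tab unless the next line starts a new record.
const joined = noBlanks.replace(/^(.*)\r?\n(?!Name)/gm, "$1\t");

// Step 3: remove the field labels.
const values = joined.replace(
  /(?:Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company): ?/g,
  ""
);

// Step 4: add a header line.
const header =
  "Name\tEmail\tOperator\tStart Time\tEnd Time\tProduct/Service\tPhone\tCompany";
const tsv = header + "\n" + values;

console.log(tsv);
```

Note that JavaScript writes the backreference as `$1` where the EmEditor replace string uses `\1`.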

I've rolled this into a macro you can try on a copy of your source file; it should hopefully produce a tab-separated output file:

editor.ExecuteCommandByID(3882);        //Heading = 0
editor.ExecuteCommandByID(4323);        //Remove existing bookmarks

document.selection.Find("^[ \\t]*$\x0a",eeFindNext | eeFindReplaceCase | eeFindReplaceRegExp | eeFindCount | eeFindBookmark,0); //Bookmark blank lines
editor.ExecuteCommandByID(4589);        //Delete Bookmarked lines 

document.selection.Replace("^(.*)\\r?\\n(?!Name)","\\1\\t",eeFindReplaceCase | eeReplaceAll | eeFindReplaceRegExp,0);       //Find:^(.*)\r?\n(?!Name)       R:\1\t
document.selection.Replace("((Name|Email|Operator|Start Time|End Time|Product\\/Service|Phone|Company): ?)","",eeFindReplaceCase | eeReplaceAll | eeFindReplaceRegExp,0);   //Find:((Name|Email|Operator|Start Time|End Time|Product\/Service|Phone|Company): ?)    R:[blank]

document.selection.StartOfDocument(false);          //Ctrl-Home, insert blank line, and header line
document.selection.NewLine(1);
document.selection.StartOfDocument(false);          //Ctrl-Home
document.write("Name\tEmail\tOperator\tStart Time\tEnd Time\tProduct\/Service\tPhone\tCompany");
editor.ExecuteCommandByID(3901);        // Adjust separator visible lines only

editor.ExecuteCommandByID(3894); //Heading=1
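
One caveat with joining lines: if a record lacks a field that is not the last one (say, Phone), the values after it shift one column to the left. The following is a minimal sketch of a different approach, in plain JavaScript outside EmEditor and not part of the macro above: it parses each record into named fields and always writes all eight columns. The function name `toRows`, the constant `COLUMNS`, and the sample data are hypothetical.

```javascript
// Fixed column order for the output file.
const COLUMNS = [
  "Name", "Email", "Operator", "Start Time",
  "End Time", "Product/Service", "Phone", "Company",
];

// Turn "Label: value" lines into tab-separated rows; a "Name" line starts a new record.
function toRows(text) {
  const records = [];
  let current = null;
  for (const line of text.split(/\r?\n/)) {
    const m = /^([^:]+): ?(.*)$/.exec(line);
    if (!m || !COLUMNS.includes(m[1])) continue; // skip anything that is not a known field
    if (m[1] === "Name") {
      current = {};
      records.push(current);
    }
    if (current) current[m[1]] = m[2].trim();
  }
  // Missing fields become empty cells, so the columns stay aligned.
  return records.map((r) => COLUMNS.map((c) => r[c] ?? "").join("\t"));
}

// Hypothetical sample: the second record has no times and no phone.
const sample = [
  "Name: Tan",
  "Email: someone@domcin.com",
  "Operator: OperatorName",
  "Start Time: 07/01/2014 14:43:47",
  "End Time: 07/01/2014 15:35:22",
  "Product/Service: Delivery",
  "Phone: 123 1234567",
  "Company: MyCompany Inc.",
  "Name: Lee",
  "Email: lee@example.com",
  "Operator: OperatorName",
  "Product/Service: Support",
  "Company: Other Ltd.",
].join("\n");

const rows = toRows(sample);
console.log([COLUMNS.join("\t"), ...rows].join("\n"));
```

Because unknown lines are skipped, this sketch could in principle be run on the raw extracted text as well, but that has not been checked against a real mailbox file.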

This is a great idea. Once I had the vertical list, I faced the challenge of converting it into a CSV file so that I can ultimately import it into a CRM. I did this in Excel using a formula, but of course the ideal way would be to create a delimited file from the start. I had not thought that far. I tried the macro, but it ended up giving an unspecified error. — Thom, Jun 01 '20 at 08:40
The error might be related to SetCell (3rd line from the end; I assumed you had a TAB format). I have amended the code above, please retry; steps 1 to 4 can hopefully be done manually regardless. Let me know if you need further assistance. You can also then take this and convert it to comma-separated using EmEditor, if that is preferred. — Venturer, Jun 01 '20 at 12:23
Yes, this did it; it works very nicely. However, it only works once the fields have been extracted from the mailbox, because the initial issue was to look only for the particular fields and extract them from a very messy MBox file. The regex from @Mandy8055 does this very well. Once that is done and the labels are removed, the macro can do the conversion to tab-separated for further use. Thank you very much for the help, it is very much appreciated. — Thom, Jun 02 '20 at 02:12
