Regex to extract Content-Type

Question

How can I extract the lines with the Content-Type info? In some mails, these headers can span 2, 3 or even 4 lines, depending on how the mail was sent. This is one example:

Content-Type: text/plain;
    charset="us-ascii"
Content-Transfer-Encoding: 7bit

Lorem ipsum dolor sit amet, consectetur adipisicing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna 
aliqua. Ut enim ad minim veniam, quis nostrud exercitation 
ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit 
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint 
occaecat cupidatat non proident, sunt in culpa qui officia 
deserunt mollit anim id est laborum.

I tried this regex: ^(Content-.*:(.|\n)*)* but it grabs everything.

How should I phrase my regex in Java to get only this part:

Content-Type: text/plain;
    charset="us-ascii"
Content-Transfer-Encoding: 7bit

score 2 · Answer 1 · answered Oct 28 '11 at 02:33

Pattern regex = Pattern.compile("^Content-Type(?:.|\\s)*?(?=\n\\s+\n)");

This will match everything which starts with Content-Type until the first completely empty line.

FailedDev

Thanks! But why do I get a `StackOverFlowError` when I use it this way: `mailContent.replaceFirst("^Content-Type(?:.|\\s)*?(?=\n\\s+\n)", "");` – Carven Oct 28 '11 at 02:49
@xEnOn I honestly do not know. Can you post a sample at ideone.com? – FailedDev Oct 28 '11 at 02:55
I don't even know which part of the code I should paste it as a sample. lol. It's like the whole thing works fine but as long as I change the regex to the one you suggested, I get a StackOverFlowError. So the only problem is the `replaceAll` line. It's weird because the regex you had works when I put it into a regex tester. But I don't know why Java throws that error. – Carven Oct 28 '11 at 03:11
I think you may need to escape the newlines in the pattern like so: `"^Content-Type(?:.|\\s)*?(?=\\n\\s+\\n)"` – ridgerunner Oct 28 '11 at 03:13
@ridgerunner Yeah I thought that too but my tool insists that \n is not to be doubly escaped. – FailedDev Oct 28 '11 at 03:18
@ridgerunner Escaping the newlines still gives the StackOverFlowError. I usually don't escape newlines and they work too. Do newlines need to be escaped as well? – Carven Oct 28 '11 at 03:18
@xEnOn Can you try with double escapes at \n too? – FailedDev Oct 28 '11 at 03:18
@FailedDev I put a sample code on http://ideone.com/lLRg5 Somehow, the StackOverFlowError is thrown when the `find()` function is called. – Carven Oct 28 '11 at 03:19
@xEnOn Could you try with a smaller mail body? The code you posted does not compile :D – FailedDev Oct 28 '11 at 03:21
@FailedDev I am trying it with the exact sample mail content above in the question. In a smaller body, the application hangs. The code I posted isn't complete so it doesn't compile. I don't know where I should start posting my code from because it's kind of long but the main part is where I got the `emailContent` into a String already, and then attempt to do a `replaceFirst()` or `find()` with the regex you suggested. I tried some other random regex and there is no StackOverFlowError. It's weird. – Carven Oct 28 '11 at 03:27
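
A possible explanation for the StackOverflowError discussed in these comments, offered as an assumption rather than a diagnosis of the asker's code: `java.util.regex` is known to evaluate a repeated group that contains an alternation, such as `(?:.|\\s)*?`, recursively, so a long input can exhaust the thread stack. Note also that without `Pattern.MULTILINE` the `^` only matches at the very start of the input, and that `\n\\s+\n` needs at least one whitespace character between the two newlines. The sketch below sidesteps all three points by using `Pattern.DOTALL` instead of the alternation; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderBlockSketch {
    // DOTALL lets '.' match line terminators, so no (?:.|\s) alternation is needed.
    // MULTILINE lets '^' match at the start of any line, not only at the start of the input.
    // The lookahead stops the match right before the first empty line (LF or CRLF).
    private static final Pattern HEADER_BLOCK = Pattern.compile(
        "^Content-Type.*?(?=\\r?\\n\\r?\\n)",
        Pattern.DOTALL | Pattern.MULTILINE);

    // Returns the header block starting at Content-Type, or null if none is found.
    static String extractHeaderBlock(String mail) {
        Matcher m = HEADER_BLOCK.matcher(mail);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain;\n"
            + "    charset=\"us-ascii\"\n"
            + "Content-Transfer-Encoding: 7bit\n"
            + "\n"
            + "Lorem ipsum dolor sit amet\n";
        System.out.println(extractHeaderBlock(mail));
    }
}
```

Compiling the pattern once and reusing it also avoids recompiling the regex on every call.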

score 1 · Answer 2 · answered Oct 28 '11 at 03:22

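^Content-(.|\n)*\n\n

This matches from Content- up to and including a blank line. Because the quantifier is greedy, that is the last blank line in the input, so it assumes the body contains no blank lines of its own.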

hllau

Narendra Yadala · Accepted Answer · 2011-10-28T03:44:42.623

1

You can try this regex

Pattern regex = Pattern.compile("Content-Type.*?(?=^\\s*\n?\r?$)", 
                                 Pattern.DOTALL | Pattern.MULTILINE);

edited Oct 28 '11 at 03:44

answered Oct 28 '11 at 03:26

I tried this, but `find()` returns false. It doesn't find the part. – Carven Oct 28 '11 at 03:34
@xEnOn I am not sure why it is returning false, here it shows the match http://regexr.com?2v20l – Narendra Yadala Oct 28 '11 at 03:43
@xEnOn I updated the regex, can you try it now and let me know if it works. – Narendra Yadala Oct 28 '11 at 03:45

score 0 · Answer 4 · answered Oct 28 '11 at 07:59

Check out the relevant RFCs for the exact definition of headers. IIRC, in essence you need to treat every line that begins with one or more whitespace characters (e.g. space or tab) as a continuation of the previous header line. I also believe that you should collapse the line break and whitespace into a single whitespace element (note: there might be more complex rules, so check the RFCs).

A new line starts the next header only if it begins with a non-whitespace character. If a line break is immediately followed by another line break, the header section ends and the body section starts.

BTW: Why not just use JavaMail instead of reinventing the wheel?
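
The folding rules described in this answer can be sketched without a regex by walking the lines one at a time. This is a minimal illustration, not an RFC-complete parser: the class and method names are hypothetical, repeated header names simply overwrite each other, and the fold is collapsed to a single space as the answer suggests (check the RFCs for the exact unfolding rule).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderUnfoldSketch {
    // Parses the header section (everything before the first empty line)
    // into a map of header name -> unfolded value, in order of appearance.
    static Map<String, String> parseHeaders(String mail) {
        Map<String, String> headers = new LinkedHashMap<>();
        String currentName = null;
        for (String line : mail.split("\\r?\\n", -1)) {
            if (line.isEmpty()) {
                break; // an empty line ends the header section
            }
            char first = line.charAt(0);
            if ((first == ' ' || first == '\t') && currentName != null) {
                // Continuation line: append it to the previous header,
                // replacing the line break and indentation with one space.
                headers.put(currentName, headers.get(currentName) + " " + line.trim());
            } else {
                int colon = line.indexOf(':');
                if (colon < 0) {
                    continue; // not a header line; skipped in this sketch
                }
                currentName = line.substring(0, colon).trim();
                headers.put(currentName, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain;\r\n"
            + "    charset=\"us-ascii\"\r\n"
            + "Content-Transfer-Encoding: 7bit\r\n"
            + "\r\n"
            + "Lorem ipsum dolor sit amet\r\n";
        System.out.println(parseHeaders(mail).get("Content-Type"));
    }
}
```

For real mail handling, a library such as JavaMail, as suggested above, covers cases this sketch ignores, for example case-insensitive header names and multiple headers with the same name.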

score 0 · Answer 5 · answered Oct 28 '11 at 15:44

This tested script works for me:

import java.util.regex.*;
public class TEST
{
    public static void main( String[] args )
    {
        String subjectString =
            "Content-Type: text/plain;\r\n" +
            "    charset=\"us-ascii\"\r\n" +
            "Content-Transfer-Encoding: 7bit\r\n" +
            "\r\n" +
            "Lorem ipsum dolor sit amet, consectetur adipisicing elit,\r\n" +
            "sed do eiusmod tempor incididunt ut labore et dolore magna\r\n" +
            "aliqua. Ut enim ad minim veniam, quis nostrud exercitation\r\n" +
            "ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n" +
            "Duis aute irure dolor in reprehenderit in voluptate velit\r\n" +
            "esse cillum dolore eu fugiat nulla pariatur. Excepteur sint\r\n" +
            "occaecat cupidatat non proident, sunt in culpa qui officia\r\n" +
            "deserunt mollit anim id est laborum.\r\n";
        String resultString = null;
        Pattern regexPattern = Pattern.compile(
            "^Content-Type.*?(?=\\r?\\n\\s*\\n)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE |
            Pattern.UNICODE_CASE | Pattern.MULTILINE);
        Matcher regexMatcher = regexPattern.matcher(subjectString);
        if (regexMatcher.find()) {
            resultString = regexMatcher.group();
        } 
        System.out.println(resultString);
    }
}

It works for text with either the valid \r\n line terminators or the Unix-style \n terminators (invalid for mail, but commonly seen in the wild).
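
Since the comments under the first answer apply the regex through `replaceFirst`, here is a hedged sketch of stripping the header block with a precompiled pattern. `String.replaceFirst` takes no flags argument, so the flags are passed to `Pattern.compile` and the replacement goes through the `Matcher`. Unlike the pattern above, this variant consumes the blank line instead of using a lookahead, so the result starts directly at the body; the class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class StripHeadersSketch {
    // Matches from Content-Type through the first blank line (LF or CRLF),
    // allowing spaces or tabs on the otherwise empty line.
    private static final Pattern HEADERS = Pattern.compile(
        "^Content-Type.*?\\r?\\n[ \\t]*\\r?\\n",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the mail text with the first header block removed.
    static String stripHeaders(String mail) {
        return HEADERS.matcher(mail).replaceFirst("");
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain;\r\n"
            + "    charset=\"us-ascii\"\r\n"
            + "Content-Transfer-Encoding: 7bit\r\n"
            + "\r\n"
            + "Lorem ipsum dolor sit amet\r\n";
        System.out.print(stripHeaders(mail));
    }
}
```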
