Regex: take string between <title> and \n or </title>

Question

I can't figure out where I'm wrong.

I have a pile of pages from which I need to extract the content of the <title> tags and use it as a file name.

My regex

title2 = re.search(r'(<title>)(.+)(</title>)', content)
filename_test = str(title2.group(2)+'.txt')

It works fine until it comes to a title like this:

<title>Klaatu - barada nikto
</title>

I've tried a lot of variants, none of them works.

The main idea is that something like this should have worked:

title2 = re.search(r'(<title>)(.+)(\n|(</title>))', content)

i.e. "stop when you reach a newline or the closing tag". But it doesn't.

Don't parse HTML with regex. On a side note, use `(?s)(<title>)(.+?)(</title>)` – rock321987, May 19 '16 at 11:30
If you want `.` in a regex to match `\n` then you need to use the `re.DOTALL` flag, which can be abbreviated as `re.S`. – PM 2Ring, May 19 '16 at 11:37
Parse HTML with regex [at your own peril](http://stackoverflow.com/a/1732454/4014959) :) – PM 2Ring, May 19 '16 at 11:40
Really? What I see as output is "string", then a newline, and then ".txt", instead of string.txt – goto112, May 19 '16 at 11:53
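For anyone who wants to follow the "don't use regex" advice from the comments, here is a minimal sketch using the standard library's `html.parser`. The names `TitleParser` and `extract_title` are made up for illustration and are not from the question; the sketch assumes the first `<title>` element on the page is the one wanted, and it collapses any line breaks inside the title into single spaces.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled as well.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(content):
    parser = TitleParser()
    parser.feed(content)
    parser.close()
    # split()/join() removes leading/trailing whitespace and any \r or \n inside.
    return ' '.join(''.join(parser.parts).split())


page = '<html><head><title>Klaatu - barada nikto\n</title></head></html>'
filename_test = extract_title(page) + '.txt'
print(filename_test)
```

This is longer than a one-line regex, but it does not depend on how the title happens to be wrapped across lines.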

score 0 · Answer 1 · edited May 23 '17 at 12:08

<(title)>[\S\s]*<\/title>

As you've found, `.` will not match newlines by default. You can use `[\S\s]` to match any character that "is not a space or is a space", i.e. any character at all.

There are actually quite a few ways you can approach this - have a look at this question for alternatives: Regex to match any character including new lines
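As a rough sketch of how this could be applied to the file-name problem from the question: the snippet below uses the `re.S` flag instead of `[\S\s]`, a non-greedy `.*?` so the match stops at the first closing tag, and then strips the line break out of the captured text. The function name `title_to_filename` is invented for the example, and collapsing whitespace (rather than cutting at the first newline) is an assumption about what the file name should look like.

```python
import re


def title_to_filename(content):
    # Hypothetical helper, not from the question.
    # re.S makes "." match newlines too; ".*?" is non-greedy, so the match
    # ends at the first </title> instead of running to the last one.
    match = re.search(r'<title>(.*?)</title>', content, re.S | re.I)
    if match is None:
        return None
    # Collapse every run of whitespace (including \r and \n) to one space
    # and drop leading/trailing whitespace.
    title = ' '.join(match.group(1).split())
    return title + '.txt'


print(title_to_filename('<title>Klaatu - barada nikto\n</title>'))
```

Cleaning up the captured group after the match tends to be simpler than trying to make the pattern itself stop at either `\n` or `</title>`, and it also copes with `\r\n` line endings.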

answered May 19 '16 at 11:37

TVOHM

Why use `[\S\s]` workaround when you have `re.S` flag (or inline `(?s)`) in Python? – Wiktor Stribiżew May 19 '16 at 11:39
It's not a duplicate. I do not need to include \n. I thought that was clear. I need everything after <title> and before \n or </title> – goto112 May 19 '16 at 11:44
