
I can't figure out where I'm wrong.

I have a pile of pages from which I need to extract the content of the <title> tag and use it as a file name.

My regex:

title2 = re.search(r'(<title>)(.+)(</title>)', content)
filename_test = str(title2.group(2)+'.txt')

It works fine until it comes to a title like this:

<title>Klaatu - barada nikto
</title>

I've tried a lot of variants, but none of them work.

The main idea is that something like this should have worked:

title2 = re.search(r'(<title>)(.+)(\n|(</title>))', content)

i.e. "stop when you reach a newline or the closing tag". But it doesn't.

goto112

1 Answer

<(title)>[\S\s]*<\/title>

As you've found, . does not match newlines by default. You can use [\S\s] instead: it matches any character that is either non-whitespace or whitespace, which is to say any character at all, newlines included.
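To make that concrete, here is a minimal sketch of how the [\S\s] idea might be applied to the file-name problem from the question. The capture group around the title text, the lazy *? quantifier, and the strip() call are assumptions of this sketch rather than part of the pattern above, and the sample content string is made up for illustration.

```python
import re

# Hypothetical page content with the closing tag on its own line,
# mirroring the problematic title from the question.
content = "<html><head><title>Klaatu - barada nikto\n</title></head></html>"

# [\S\s] matches any character, newlines included.
# The lazy *? stops at the first </title> instead of the last one.
match = re.search(r"<title>([\S\s]*?)</title>", content)

if match:
    # strip() removes the newline that was captured before </title>.
    title = match.group(1).strip()
    filename_test = title + ".txt"
else:
    # No <title> element found in this page.
    filename_test = None
```

Without the strip(), the captured text would still end in a newline, which is usually not wanted in a file name.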

There are actually quite a few ways you can approach this - have a look at this question for alternatives: Regex to match any character including new lines
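One such alternative in Python is the re.DOTALL flag, which makes . match newlines as well. The sketch below assumes that approach; the helper name title_to_filename is hypothetical, and the re.IGNORECASE flag and whitespace collapsing are extra assumptions that the question does not ask for.

```python
import re


def title_to_filename(content):
    """Return '<title text>.txt', or None if no <title> element is found.

    Hypothetical helper for illustration only.
    """
    # re.DOTALL lets '.' match newlines; re.IGNORECASE (an assumption)
    # also accepts <TITLE>. The lazy +? stops at the first closing tag.
    match = re.search(r"<title>(.+?)</title>", content,
                      re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # Collapse any run of whitespace, newlines included, to one space.
    title = " ".join(match.group(1).split())
    return title + ".txt"
```

For example, title_to_filename("<title>Klaatu - barada nikto\n</title>") gives 'Klaatu - barada nikto.txt'. Characters that are invalid in file names on a given operating system are not handled here.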

TVOHM