
Quick Perl question with hopefully a simple answer. I'm trying to split a string containing non-breaking spaces (&nbsp;). The string comes from an HTML page read with HTML::TreeBuilder::XPath, and is retrieved with $titleString = $tree->findvalue('/html/head/title'):

use HTML::TreeBuilder::XPath;
$tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "filename" );
$titleString = $tree->findvalue('/html/head/title');
print "$titleString\n";

Below is the original string, followed by the string that actually gets printed:

Mr Dan Perkins (Active)
Mr?Dan Perkins?(Active)

I've tried splitting $titleString with @parts = split('\?',$titleString); and also with the original &nbsp;, but neither has worked. My hunch is that there's a simple piece of encoding code to be added somewhere?

HTML code:

<html>
<head>
<title>Dan&nbsp;Perkins&nbsp;(Active)</title>
</head>
</html>
dan j
  • Is it `&nbsp` or `&nbsp;`? Those are different. Can you add the original website, or is that local? – simbabque Oct 06 '15 at 14:53
  • Sorry, it's a local HTML page, but I'll add the HTML to the question. It is `&nbsp;` - sorry, didn't see that. – dan j Oct 06 '15 at 15:04

1 Answer


You shouldn't have to know how the text in the document is encoded, so findvalue returns an actual non-breaking space (U+00A0) when the document contains &nbsp;. To split on it, you'd use any of the following:

split(/\xA0/, $title_string)
   -or-
split(/\x{00A0}/, $title_string)
   -or-
split(/\N{U+00A0}/, $title_string)
   -or-
split(/\N{NBSP}/, $title_string)
   -or-
split(/\N{NO-BREAK SPACE}/, $title_string)
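
For illustration, here is a minimal sketch of the first form applied to the title from the question. It assumes the string holds real U+00A0 characters, as described above; the hard-coded $title_string is only a stand-in for the value returned by findvalue, so the example runs without the HTML file.

```perl
use strict;
use warnings;

# Stand-in for the findvalue result (an assumption for this sketch):
# the title text with U+00A0 where the HTML source had &nbsp;.
my $title_string = "Dan\x{A0}Perkins\x{A0}(Active)";

# Split on the non-breaking space character itself,
# not on the "?" that shows up when the string is printed.
my @parts = split /\x{A0}/, $title_string;

# Print each piece on its own line.
print "$_\n" for @parts;
```

Splitting on '\?' finds nothing because the string never contains a question mark; the "?" only appears in the printed output in place of the non-breaking space.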
ikegami