I'm trying to work out why this won't work:

my $url = 'www880740.com';

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0" );

my $tx = $ua->get(
    $url =>
    { 'Accept-Charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' }
);

my $page_title = $tx->result->dom->at( 'title' )->text;

print "GOT: $page_title \n";

foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana  Inherited Kannada Katakana Khmer Lao Limbu  Malayalam  Mongolian Myanmar Ogham Oriya  Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{$type}/) {
        print "$page_title seems to be $type!\n";
        last;
    }
}

Basically, I want to take the title from the URL and check whether it matches any of those Unicode scripts. I'm assuming it fails because I need to decode the title into something the regex can match against. It works fine when I slurp a "curled" version of the page into memory. Devel::Peek::Dump gives me:

SV = PV(0x55cd8264d650) at 0x55cd824c4b10
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55cd82655d80 "\301\371\272\317\264\253\306\34644181.com/\301\371\272\317\264\253\306\346\313\304\262\273\317\361/\302\355\273\341\277\252\275\261\275\341\271\373/\317\343\270\333\301\371\272\317\264\253\306\346/\302\355\273\341\277\252\275\261\274\307\302\274/\317\343\270\333\271\322\305\306|\310\374\302\355\273\341\327\312\301\317"\0
  CUR = 91
  LEN = 96
  COW_REFCNT = 0

UPDATE: I finally got this working:

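use Encode;
use Encode::Detect;
use Encode::HanExtra;

my $page_title = $tx->result->dom->at( 'title' )->text;

# Guess the encoding and decode the raw bytes into Perl characters
# (no second "my": redeclaring the variable would mask the first one)
$page_title = decode("Detect", $page_title);

print "GOT: $page_title \n";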

foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana  Inherited Kannada Katakana Khmer Lao Limbu  Malayalam  Mongolian Myanmar Ogham Oriya  Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {

  if ($page_title =~ /\p{Script_Extensions=$type}/) {

      print "$page_title seems to be $type!\n";
      last;

    }
}

This bit:

$page_title = decode("Detect", $page_title);

attempts to detect the encoding and then converts the text to Perl's internal representation (ready for my regex to work). I tried to post my example output, but for some reason it triggered a spam message.

Andrew Newby
  • If you do it with WWW::Mechanize, it says it's _Han_ and the output is `六合传奇44181.com/六合传奇四不像/马会开奖结果/香港六合传奇/马会开奖记录/香港挂牌|赛马会资料`. – simbabque Nov 02 '20 at 09:06
  • Note: the HTML head declares `charset=gb2312` -- [GB_2312](https://en.wikipedia.org/wiki/GB_2312) – Polar Bear Nov 02 '20 at 09:12
  • @PolarBear thanks - this was just an example one. I do already do a basic check for gb2312, but unfortunately most sites just use utf-8 or don't include the charset in the HTML – Andrew Newby Nov 02 '20 at 09:19
  • @simbabque thanks. Any ideas why it's not playing ball with Mojo::UserAgent? – Andrew Newby Nov 02 '20 at 09:19
  • Note: there may be `` tag available, [From ASCII to UTF-8](https://www.w3schools.com/html/html_charset.asp). – Polar Bear Nov 02 '20 at 09:30
  • @PolarBear - yeah, already doing that as well :) `$page =~ /\/i` . Lots of sites don't put a lang code in those either, which is why I'm resorting to this other method to try to weed out foreign sites (I don't want any non-English sites) – Andrew Newby Nov 02 '20 at 09:42
  • Well, Mechanize uses `decoded_content` from HTTP::Message, which is a completely different, much older implementation than Mojo::UserAgent. Start digging in https://metacpan.org/release/HTTP-Message/source/lib/HTTP/Message.pm#L205. It would be the same with LWP::UA; I just chose to use Mech because I was too lazy to parse the title myself. – simbabque Nov 02 '20 at 14:24
  • @simbabque thanks. I ended up getting it going with Encode: `decode("Detect", $page_title);` – Andrew Newby Nov 03 '20 at 07:56

1 Answer

The title is encoded as gb2312 (`charset=gb2312`), so it has to be decoded into Perl's internal representation before Unicode properties such as `\p{Han}` can match it.
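As a minimal sketch of why the decoding step matters, the snippet below takes the first eight bytes from the Devel::Peek dump in the question and tests them against `\p{Han}` before and after decoding. It assumes those bytes really are gb2312 (`euc-cn` in Encode's naming); the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# First eight bytes of the title, copied from the Dump output in the question.
my $bytes = "\301\371\272\317\264\253\306\346";

# Undecoded: Perl sees eight code points below 256, none of which is Han.
my $raw_is_han = $bytes =~ /\p{Han}/ ? 1 : 0;

# Decoded as gb2312 (assumed encoding): each byte pair becomes one character.
my $chars = decode('euc-cn', $bytes);
my $decoded_is_han = $chars =~ /\p{Han}/ ? 1 : 0;

printf "raw:     length=%d han=%d\n", length $bytes, $raw_is_han;
printf "decoded: length=%d han=%d\n", length $chars, $decoded_is_han;
```

The undecoded string has length 8 and does not match, while the decoded string has length 4 and does. That is the same difference the question sees between the Mojo::UserAgent title and the decoded one.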

The following code decodes the title for this particular website and prints it to the console.

use strict;
use warnings;
use feature 'say';

use utf8;

use Mojo::UserAgent;
use Encode qw/encode decode/;

binmode STDOUT, ':encoding(UTF-8)';

my $url = 'www880740.com';
my $ua  = Mojo::UserAgent->new->max_redirects(3);

$ua->transactor->name( 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0' );

my $res = $ua->get( $url )->result;

my $page_title = decode('euc-cn',$res->dom->at('title')->text);

say 'GOT: ' . $page_title;

my @langs = qw/Arabic Armenian Bengali Bopomofo Braille Buhid
               Canadian_Aboriginal Cherokee Cyrillic Devanagari
               Ethiopic Georgian Greek Gujarati Gurmukhi Han
               Hangul Hanunoo Hebrew Hiragana  Inherited Kannada
               Katakana Khmer Lao Limbu  Malayalam  Mongolian
               Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog
               Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/;

for( @langs ) {
    say "$page_title matches $_!" if $page_title =~ /\p{$_}/;
}
Polar Bear
  • Thanks. That's the kind of thing I'm aiming for. The thing is, though, it's not always going to be `euc-cn`. I'll have to try to work out a way to detect the encoding and decode into the internal representation – Andrew Newby Nov 02 '20 at 10:33