
I've migrated to a new hosting provider with the same FreeBSD system, and one of my Perl scripts stopped working properly.

It downloads data from an external HTTPS site and stores it in a MySQL database. The data is in cp1251 encoding, and the same encoding is used for the MySQL database, tables and connection. From my.cnf:

character-set-server=cp1251
collation-server=cp1251_general_ci
init-connect="SET NAMES cp1251"

When connecting to MySQL from the Perl script:

$dbh->do('SET CHARACTER SET cp1251');

I fetch the data like this:

$ua = LWP::UserAgent->new;
....
$res = $ua->get(....);
$s = $res->decoded_content();

The script then parses $s and inserts the result into MySQL. When it does, the encoding is corrupted!

The funny thing I discovered is that if I just write this data to a text file, then read it back from that file and insert it into MySQL, it's not corrupted!

When I view this text file, I see that the data is in cp1251 encoding.

What changed since the previous hosting:

Perl: from 5.10.1 to 5.14.4

libwww: from 5.835 to 6.05

The MySQL server is the same version, 5.1

UPDATE: Wow, I just found something. If I replace $res->decoded_content() with $res->content(), everything works. Maybe that's because there is no charset in the headers of the page I'm downloading.

I still don't understand how decoded_content changes the string in such a way that it looks like cp1251 but isn't. Some UTF-8 flags, maybe? Please help.

UPDATE2: Here's the script (main parts):

#!/usr/bin/perl

use POSIX qw(strftime);
use LWP::UserAgent;
use HTTP::Headers;
use HTTP::Cookies;
use Digest::MD5 qw(md5_hex);
use DBI;
use common::sense;
no utf8;
no strict;

$ua = LWP::UserAgent->new;
# Header names containing a hyphen must be quoted:
# a bare User-Agent => ... parses as User - 'Agent'.
$hh = HTTP::Headers->new(
  'User-Agent' => 'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20100101 Firefox/21.0',
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-us,en;q=0.7,ru;q=0.3',
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive',
);
$ua->default_headers( $hh );
$ua->cookie_jar({});
$ua->timeout(20);

YMoney();

sub YMoney {

  $res = $ua->get('...');
  $res = $ua->post('...');

...

  $res = $ua->get("...");
  $s = $res->decoded_content();
  @list = reverse split("\n", $s);

  $dbh = DBI->connect("DBI:mysql:database=orders;host=localhost;port=3306", ....);
  $dbh->do('SET CHARACTER SET cp1251');

  for $line (@list) {
    next if ($line !~ /^\+;/);

    @pay{'data', 'amount', 'comment'} = map { s/"+//g; $_ } (split(';', $line))[1, 2, 5];
    $pay{hash} = md5_hex( join('', @pay{'data', 'amount', 'comment'}) );

    $id = $dbh->selectrow_array("SELECT id FROM ymoney WHERE hash = ?", {}, $pay{hash});

    if (!$id) {
      $dbh->do("INSERT INTO ymoney (operator, hash, data, amount, comment) VALUES ('yandex', ?, ?, ?, ?)", {},
      $pay{hash}, DB_Date($pay{data}), DB_Amount($pay{amount}), $pay{comment}
      );
    }
  }
}
Sly

1 Answer


As an approximation, Perl operates either on the raw bytes you give it, or on Unicode codepoints. When dealing with text data, the latter is much more useful. But this means that you have to decode all your input, and encode your output.

 __________  |                  _______________
\ WEB PAGE \ |               __|__             |               _______
 \ -------- \ \-------------\  L  | YOUR APP    \--------------\ DATA |
 / -------- /  〉 encoded data 〉 W  |              〉 encoded data 〉 BASE |
/ -------- /  /-------------/__P _| codepoints  /--------------/______|
\__________\ |                 |_______________|
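
The decode-on-input, encode-on-output flow in the diagram can be sketched with the core Encode module. This is only an illustration: the sample bytes (the Russian word "привет" in cp1251) and the variable names are assumptions, not taken from the question's data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six cp1251 bytes spelling "привет" (illustrative input).
my $bytes = "\xEF\xF0\xE8\xE2\xE5\xF2";

# Input boundary: bytes -> Unicode codepoints.
my $text = decode('cp1251', $bytes);

# Inside the app we work with characters, e.g. uppercase the word.
my $upper = uc $text;

# Output boundary: codepoints -> bytes in whatever encoding the consumer expects.
my $out_cp1251 = encode('cp1251', $upper);
my $out_utf8   = encode('UTF-8',  $upper);

printf "%d characters, %d bytes as cp1251, %d bytes as UTF-8\n",
    length $text, length $out_cp1251, length $out_utf8;
```

The same six characters come out as different byte sequences depending on the target encoding, which is why the encoding has to be chosen deliberately at each boundary.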

When you use decoded_content, LWP is kind enough to give you codepoints directly. The undecoded content is not directly useful: it may be compressed, have a transfer encoding applied, or be in an unexpected charset.
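
If the server does not declare a charset, as the question suspects, decoded_content has to guess. A minimal sketch of two documented options follows, assuming the page really is cp1251: default_charset supplies a fallback when the headers name no charset, and charset => 'none' undoes only the Content-Encoding and leaves the bytes alone. The hand-built response below is a stand-in for the real download.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for the real download: cp1251 bytes for "привет",
# served as text/plain with no charset parameter (assumed scenario).
my $bytes = "\xEF\xF0\xE8\xE2\xE5\xF2";
my $res = HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/plain'], $bytes);

# Tell LWP which charset to assume when the headers do not name one.
my $text = $res->decoded_content(default_charset => 'cp1251');

# Or undo only Content-Encoding (gzip etc.) and keep the raw bytes.
my $raw = $res->decoded_content(charset => 'none');

printf "first character: U+%04X\n", ord $text;
printf "raw bytes unchanged: %s\n", $raw eq $bytes ? 'yes' : 'no';
```

With the first form you get proper codepoints and must encode again before output; with the second you keep bytes end to end, much like calling content() but with any compression removed.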

But this means that you now have to encode that text again. You can either do so explicitly if the server expects a binary blob, or let DBI sort this out for you; no SET CHARACTER SET statement should be necessary.
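
Encoding explicitly could look roughly like the sketch below. The helper to_cp1251 is hypothetical, and the %pay record only imitates the shape used in the question; the database call is left as a comment because it needs a live connection.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a character string into cp1251 bytes,
# dying if a character has no cp1251 representation.
sub to_cp1251 {
    my ($text) = @_;
    return Encode::encode('cp1251', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Illustrative record shaped like %pay in the question; the comment is "оплата".
my %pay = (
    amount  => '100.00',
    comment => "\x{43E}\x{43F}\x{43B}\x{430}\x{442}\x{430}",
);

# Encode every field just before it leaves the program.
my %row = map { $_ => to_cp1251($pay{$_}) } keys %pay;

# A character outside cp1251 (U+4E2D, a CJK ideograph) is rejected.
my $ok = eval { to_cp1251("\x{4E2D}"); 1 };

printf "comment is %d bytes; unmappable input was %s\n",
    length $row{comment}, $ok ? 'accepted' : 'rejected';

# The values of %row would then be passed as bind parameters, e.g.
# $dbh->do($sql, {}, @row{qw(amount comment)});
```

Failing loudly on unmappable characters is a design choice; Encode's other fallback modes substitute a replacement character instead.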

TL;DR: Remove any encoding hacking unless you know what you are doing. If you follow best practices, everything should work out just fine. Otherwise, do your own encoding with Encode.

amon
  • Sorry, I didn't understand what you meant by 'encoding hacking'. I've updated my question with quotes from my script. – Sly Oct 13 '13 at 20:09