
I have a problem reading special characters in Perl. I have the following XML file and I am using a SAX parser that loops over each hotel and takes the values, but when it reads HotelInfo it skips the text because there is a special character in "1000 m�":

<?xml version="1.0" encoding="UTF-8"?>
<XMLResponse>
    <ResponseType>HotelListResponse</ResponseType>
    <RequestInfo>
        <AffiliateCode>NI9373</AffiliateCode>
        <AffRequestId>2</AffRequestId>
        <AffRequestTime>2015-10-29T15:52:05</AffRequestTime>
    </RequestInfo>
    <TotalNumber>264234</TotalNumber>
    <Hotels>
        <Hotel>
            <HotelCode>AD0BFU</HotelCode>
            <OldHotelId>0</OldHotelId>
            <HotelLocation/>
            <HotelInfo>Renovated in 2001, Hotel Bringue features a 1000 m� garden and comprises 5 floors with 105 double rooms, 2 suites and 7 single rooms. Hotel Bringue is situated in the picturesque village El Serrat, boasting the most amazing mountain views in the region and just a short drive to the main ski resort of Vallnord.After an exhausting day, you can go for a relaxing swim in the pool, re-energise your body in the jacuzzi or pamper yourself in the sauna. The rooms are beautifully appointed and come with an array of modern amenities for a pleasant stay.</HotelInfo>
            <HotelTheme>Ski Hotels</HotelTheme>
        </Hotel>
    </Hotels>
</XMLResponse>  

How can I skip characters like these in the SAX parser?

Drew
M Muneer

2 Answers


If you're trying to fix the file, I'm not sure why an XML parser is even needed here.

perl -i~ -pe's/\xC3\xAF\xC2\xBF\xC2\xBD//g' file.xml
ikegami

How would you define "special characters"? One definition might be: non-ASCII characters. ASCII characters are in the range 0x00 - 0x7f (although not all are valid in XML). So you could discard every character that is not in that range with something like:

$data =~ s/[^\x00-\x7f]//g;

But that is potentially going to throw away a lot of perfectly good data. All accented characters will be discarded (e.g. the "ü" in "Zürich", leaving "Zrich"). Currency symbols like €, £ or ¥ (or even ¢) will be lost. You'll also lose otherwise harmless characters like –, —, “, ”, or •, and invisible characters like non-breaking spaces.

So the question is why do you want to discard these characters? At what point are they becoming a problem? I notice you've tagged the question 'mysql' - do you get a problem when you try to insert the data in a database? Have you declared the encoding of the database correctly? Have you enabled mysql_enable_utf8 on your database connection? Maybe you could do your insert in an eval block and only apply the regex above if the insert fails.

Another option may be to pass the data through Encoding::FixLatin, which should make the string safe to insert into a UTF-8 database, even if the resulting characters aren't exactly what was originally intended.

By the way, I think in the specific instance above, the data originally said:

Hotel Bringue features a 1000 m² garden

The SUPERSCRIPT TWO character is Unicode U+00B2 and in UTF-8 that would be encoded as two bytes: C2 B2. Somewhere along the line a process may have read those bytes but decoded them as Latin-1 rather than UTF-8 and each byte got turned into a character. This double encoding can happen over and over when data has the wrong encoding declaration or people fail to understand how to work with Unicode characters - causing one character to turn into many characters of garbage.

Grant McLean