6

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like;

UPDATE nodes SET name=regexp_replace(name, 'à', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, 'á', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, 'â', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.

lynks
  • 5,599
  • 6
  • 23
  • 42
  • You'll want to `VACCUM` aggressively if you're doing this, to avoid huge table bloat. Doing the text processing in a PL is by far the better approach as @SzymonGuz explains. It's possible in SQL using `substring` or `regexp_matches` and a replacement table, but it'll be slow and ugly. – Craig Ringer Aug 28 '12 at 03:21
  • thanks for the VACCUM tip, I shall look into that. – lynks Aug 28 '12 at 10:12

3 Answers3

7

Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities

Of course you need to have perl installed and pl/perl available.

1) First of all create the procedural language pl/perlu:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&.... asasdasdasd …');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)
Randal Schwartz
  • 39,428
  • 4
  • 43
  • 70
Szymon Lipiński
  • 27,098
  • 17
  • 75
  • 77
5

You could use xpath (HTML-encoded content is the same as XML encoded content):

select 
  'AT&T' as input ,
  (xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output 
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442
1

This is what it took for me to get working on Ubuntu 18.04 with PG10, and Perl didn't decode some entities like &comma; for some reason. So I used Python3.

From the command line

sudo apt install postgresql-plpython3-10

From your SQL interface:

CREATE LANGUAGE plpython3u;

CREATE OR REPLACE  FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
    from html.parser import HTMLParser
    h = HTMLParser() 
    if str is None:
        return str
    return h.unescape(str);
$$ LANGUAGE plpython3u;
sorrell
  • 1,801
  • 1
  • 16
  • 27