
I am working on a web crawler, so I parse HTML pages. My problem is that sometimes the page encoding is not UTF-8 (ISO, exotic Windows-125x, etc.) and my analyser fails.

I tried many solutions in PHP/Java/NodeJS to convert the content, but there is always a problem.

Is there a proxy module (nginx, Squid, Varnish, ...) that automatically converts the content charset to UTF-8?

Thomas Decaux

1 Answer


The charset should be declared in the Content-Type header; if it's not UTF-8, convert it. iconv is available on most flavours of Linux and Unix. If you're building a web crawler, it would be easier to integrate this in your code than in a proxy.
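As a minimal sketch of the "read the declared charset, then convert" approach in Java (one of the languages the question mentions): the class name `CharsetNormalizer` and the method `toUtf8` are hypothetical, and this only looks at the Content-Type header, not at `<meta>` tags or byte sniffing, which a real crawler would also need.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetNormalizer {

    // Extracts the charset label from a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([\\w-]+)\"?", Pattern.CASE_INSENSITIVE);

    /**
     * Decode raw page bytes using the charset declared in the
     * Content-Type header, falling back to UTF-8 when none is declared
     * or the label is unknown. The returned Java String is Unicode;
     * re-encode it with StandardCharsets.UTF_8 when writing it out.
     */
    public static String toUtf8(byte[] body, String contentType) {
        Charset cs = StandardCharsets.UTF_8; // fallback default
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    cs = Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or unsupported label: keep the UTF-8 default.
                }
            }
        }
        return new String(body, cs);
    }
}
```

Usage: `toUtf8("café".getBytes(StandardCharsets.ISO_8859_1), "text/html; charset=ISO-8859-1")` yields the correctly decoded string, which can then be written back out as UTF-8 bytes for the analyser.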

symcbean
  • I want a proxy because I prefer to split my project into components. I already tried iconv and auto-detection, but they don't work 100% of the time. I have many crawlers, which is why I'd prefer to isolate the encoding part. – Thomas Decaux Jan 07 '14 at 10:25