
I am working on a web crawler, so I parse HTML pages. My problem is that sometimes the page encoding is not UTF-8 (ISO, exotic Windows-125x, etc.) and my analyser fails.

I tried many solutions in PHP/Java/NodeJS to convert the content, but there is always a problem.

Is there a proxy module (nginx, Squid, Varnish, ...) that automatically converts the content charset to UTF-8?

Thomas Decaux

1 Answer


The charset should be declared in the Content-Type header; if it's not UTF-8, convert it. iconv is available on most flavours of Linux and Unix. If you're building a web crawler, it would be easier to integrate this in your code than in a proxy.
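As a minimal sketch of the "read the declared charset, then convert" approach in Java (one of the languages the question mentions): the class name `CharsetNormalizer` and the method `toUtf8` are hypothetical, and this only looks at the Content-Type header, not at `<meta>` tags or byte sniffing, which a real crawler would also need.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetNormalizer {

    // Extracts the charset label from a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([\\w-]+)\"?", Pattern.CASE_INSENSITIVE);

    /**
     * Decode raw page bytes using the charset declared in the
     * Content-Type header, falling back to UTF-8 when none is declared
     * or the label is unknown. The returned Java String is Unicode;
     * re-encode it with StandardCharsets.UTF_8 when writing it out.
     */
    public static String toUtf8(byte[] body, String contentType) {
        Charset cs = StandardCharsets.UTF_8; // fallback default
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    cs = Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or unsupported label: keep the UTF-8 default.
                }
            }
        }
        return new String(body, cs);
    }
}
```

Usage: `toUtf8("café".getBytes(StandardCharsets.ISO_8859_1), "text/html; charset=ISO-8859-1")` yields the correctly decoded string, which can then be written back out as UTF-8 bytes for the analyser.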

symcbean
  • I want a proxy because I prefer to split my project into components. I already tried iconv and auto-detection, but they don't work 100% of the time. I have many crawlers, which is why I'd prefer to isolate the encoding part. – Thomas Decaux Jan 07 '14 at 10:25