I'm seeing strange behavior with Simple HTML DOM.

$html = str_get_html($output, true, true, DEFAULT_TARGET_CHARSET, false);

Then

var_dump($html->find('title', 0));

returns an object, which is fine.

But

var_dump($html->find('body', 0));

returns NULL.

I can't understand what's wrong.

mb_detect_encoding($output);

returns UTF-8, so the string itself seems to be fine.

I increased MAX_FILE_SIZE to 6000000, but that does not help.

  • Maybe it's case sensitive. If so, try using a for loop to check each possible case: https://stackoverflow.com/a/2213688/10634638 – estinamir Mar 15 '19 at 18:23
  • @bestinamir The second parameter of str_get_html is TRUE, which tells the simple_html parser to lowercase all tags before parsing, I think. – Сергей Сергеев Mar 15 '19 at 18:40
  • See whether it works with a very basic HTML page. How big is the page in megabytes? If it's too big, try splitting it into multiple files and "crawling" them in separate threads. – estinamir Mar 15 '19 at 18:45
  • I found that the problem is in tags with Cyrillic content. But everything is in UTF-8 without a BOM, and on another server it works perfectly. – Сергей Сергеев Mar 15 '19 at 18:51
  • I still don't understand. On one server everything works fine, on the other it doesn't. Both servers use UTF-8 as the mbstring internal encoding. On one of them, $html->find on a tag that contains non-Latin characters anywhere inside it returns NULL. – Сергей Сергеев Mar 15 '19 at 19:58
  • You must be getting close. There are different Cyrillic encodings, such as KOI8, so some may work and others won't. Maybe try different decoding techniques, and check that the decoding worked before passing the string to the parser: https://stackoverflow.com/a/1623715/10634638 – estinamir Mar 15 '19 at 22:37

1 Answer

I set mbstring.func_overload = 0 in php.ini and everything started working perfectly. Maybe this helps somebody else.
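
To check whether a server is affected, something like the sketch below may help. It reads the mbstring.func_overload bitmask and lists which function groups are overloaded. The helper name describe_func_overload is made up for this example and is not part of Simple HTML DOM. My guess at the cause, which I have not verified against the parser source, is that with bit 2 set strlen() and substr() count characters instead of bytes, so the parser's offsets drift once the markup contains multibyte (e.g. Cyrillic) text.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): decode the
// mbstring.func_overload bitmask into the function groups it replaces.
function describe_func_overload(int $mask): array
{
    $groups = [
        1 => 'mail()',
        2 => 'string functions (strlen, strpos, substr, ...)',
        4 => 'regex functions (ereg, split, ...)',
    ];
    $active = [];
    foreach ($groups as $bit => $label) {
        if ($mask & $bit) {
            $active[] = $label;
        }
    }
    return $active;
}

// ini_get() returns false when the directive does not exist (PHP 8.0+);
// the (int) cast turns that into 0.
$mask = (int) ini_get('mbstring.func_overload');
$active = describe_func_overload($mask);

if ($active) {
    echo 'Overloaded: ' . implode(', ', $active) . PHP_EOL;
} else {
    echo 'mbstring.func_overload is off' . PHP_EOL;
}
```

Run it on both servers and compare the output. As far as I know the directive cannot be changed with ini_set(), so it has to be set in php.ini (or the server configuration). It was deprecated in PHP 7.2 and removed in PHP 8.0, so newer PHP versions should not hit this at all.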