I'm making a PHP script to reverse the text within an HTML document to handle badly converted Hebrew PDFs. (sigh :))

Everything works, except that the output is very strange: only SOME of the characters, instead of staying Hebrew letters, turn into replacement characters (those black diamonds with question marks).

I tried several solutions I could find on SO and beyond, but nothing changed. Perhaps you can enlighten me?

You can check the script in action here: pilau.phpnet.us/html_invert.php, and this is the entire source code:

<!DOCTYPE html>
<html lang="he-IL">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
    <form action="html_invert.php" method="post" enctype="application/x-www-form-urlencoded">
        <textarea id="html_code" name="html_code" rows="30" cols="80"><?php
            if (isset($_POST['html_code']))
            {
                function invert_string ($str) {
                        $new_str = '';
                        $i = strlen($str);
                        while ($i > 0) {
                            $new_str .= substr($str, --$i, 1);
                        }
                        return '>'.$new_str.'<';
                    }

                    echo htmlspecialchars(preg_replace('/>(\s*.*\s*)</imUue', 'invert_string("$1")', stripslashes($_POST['html_code'])));
            }
            else { echo 'paste your text here'; }
        ?></textarea>
        <br />
        <input type="submit" value="Process HTML" />
    </form>
</body>
</html>
pilau
  • You want `mb_substr` and `mb_strlen` for multi-byte safety. – Wooble Apr 20 '12 at 15:05
  • Also, I don't think that `stripslashes` is UTF safe. – Matthew Apr 20 '12 at 15:05
  • I didn't get any output from the link. – RyanS Apr 20 '12 at 15:06
  • Are you trying to convert the text directly from the PDF? If so, that is your problem. A PDF's internals are not readable text in raw form; you have to run a parser on it, and support for that is limited. – mseancole Apr 20 '12 at 15:07
  • @Wooble, @Matthew: Thanks! I used `mb_substr()` and `mb_strlen()` with the `'UTF-8'` encoding argument, and replaced `stripslashes()` with this regex: `preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('','\\'), $_POST['html_code'])`. It works now, check it out :) @showerhead: Of course not, I'm letting Google convert it to HTML for me, which is the best free method I could find, but the conversion reverses all the text. @RyanS: I'm sorry buddy, I really have no idea why, it works for me. Thanks anyway. How do I go about marking this question as answered? – pilau Apr 20 '12 at 15:31

2 Answers

It looks like something is wrong with the charset.

Look for default_charset in php.ini; it might be set to iso-8859-1.

Edit: now that I think of it, you could also try sending this header:

header('Content-Type: text/html; charset=utf-8'); 
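As a rough sketch of how the two suggestions above could be combined, the snippet below checks `default_charset` at runtime and sends the header only if output has not started yet. The helper name `ensure_utf8_output` is hypothetical and not part of the original answer; whether `ini_set` is permitted depends on the hosting setup.

```php
<?php
// Hypothetical helper (not from the original answer): declare the response
// as UTF-8 before any output is sent.
function ensure_utf8_output() {
    // default_charset is the php.ini setting mentioned above.
    if (strcasecmp((string) ini_get('default_charset'), 'UTF-8') !== 0) {
        ini_set('default_charset', 'UTF-8');
    }
    // header() only has an effect before output has started.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    // Return the charset now in effect so the caller can verify it.
    return ini_get('default_charset');
}

ensure_utf8_output();
```

Note that this only fixes how the browser interprets the bytes; it would not repair bytes that the script itself has already scrambled.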
e--

I wanted to mark this question as answered, so here is the solution, courtesy of Wooble and Matthew as described in the comments on the question above:

I used mb_substr() and mb_strlen() with the 'UTF-8' encoding argument, and replaced stripslashes() with this regular expression: preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('','\\'), $_POST['html_code']).
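To illustrate why the byte-oriented functions produced replacement characters, here is a minimal sketch comparing a byte-wise reversal with a character-wise one. The function names `reverse_bytes` and `reverse_chars` are made up for this example, and the sample word is written as hex escapes so the result does not depend on the file's encoding. It assumes the mbstring extension is available.

```php
<?php
// Byte-wise reversal: each Hebrew letter is two bytes in UTF-8, so reversing
// byte by byte swaps the lead and continuation bytes of every letter.
function reverse_bytes($str) {
    return strrev($str);
}

// Character-wise reversal: split on UTF-8 character boundaries first.
function reverse_chars($str) {
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // input was not valid UTF-8; leave it untouched
    }
    return implode('', array_reverse($chars));
}

$word = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // the Hebrew word "shalom" in UTF-8

var_dump(mb_check_encoding(reverse_bytes($word), 'UTF-8')); // bool(false)
var_dump(mb_check_encoding(reverse_chars($word), 'UTF-8')); // bool(true)
```

The byte-wise result is no longer valid UTF-8, which is what the browser renders as black diamonds.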

The complete code is as follows:

    <textarea id="html_code" name="html_code" rows="30" cols="80"><?php
        if (isset($_POST['html_code']))
        {
            function add_delimiters ($str, $deli, $optional_suffix) {
                return (isset($optional_suffix) ? $deli.$str.$optional_suffix : $deli.$str.$deli);
            }

            function reverse_string ($str) {
                $new_str = '';
                $i = mb_strlen($str, 'UTF-8');
                while ($i > 0) {
                    $new_str .= mb_substr($str, --$i, 1, 'UTF-8');
                }
                return $new_str;
            }

            function utf_stripslashes ($str) {
                return preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('','\\'), $str);
            }

            function strip_blank_lines ($str) {
                return preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/u", "\n", $str);
            }

            function reverse_html_content ($html) {
                return preg_replace('/>(\s*.*\s*)</imUue', 'add_delimiters(reverse_string("$1"), ">", "<")', utf_stripslashes($html));
            }

            function clear_unsupported_css ($style) {
                return preg_replace(array('/top:\s{0,1}([0-9]*(?!px));{0,1}/iu', '/left:\s{0,1}([0-9]*(?!px));{0,1}/iu'), array('top:$1px;', 'left:$1px;'), $style);
            }

            function process_inline_style ($html, $func) {
                return preg_replace('/style="[a-zA-Z0-9:;\s{0,1}]*"/imUue', $func.'("$0")', $html);
            }               

            echo strip_blank_lines(htmlspecialchars(process_inline_style(reverse_html_content($_POST['html_code']), 'clear_unsupported_css')));
        }
        else { echo 'paste your text here'; }
    ?></textarea>
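One caveat for newer PHP versions: the `e` pattern modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0. A possible adaptation, offered only as a sketch, is to use `preg_replace_callback` instead. The function name `reverse_html_text` and the simplified pattern `[^<>]*` are assumptions of this example, not the original code; unlike the original pattern, it also matches text that spans several lines.

```php
<?php
// Same approach as reverse_string() in the code above.
function reverse_string($str) {
    $new_str = '';
    $i = mb_strlen($str, 'UTF-8');
    while ($i > 0) {
        $new_str .= mb_substr($str, --$i, 1, 'UTF-8');
    }
    return $new_str;
}

// Hypothetical replacement for reverse_html_content() without the /e modifier.
function reverse_html_text($html) {
    // Capture the text between a '>' and the next '<', then reverse it.
    return preg_replace_callback('/>([^<>]*)</u', function ($m) {
        return '>' . reverse_string($m[1]) . '<';
    }, $html);
}

echo reverse_html_text('<p>abc</p><b>xy</b>'); // <p>cba</p><b>yx</b>
```

Because the callback receives the matched text directly, no quoting or slash handling is applied to it on the way in, which is one of the side effects the `e` modifier had.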
pilau