Strip HTML characters and convert to plain text

Question

OK, I've searched for hours for an answer. Nothing I've found does what I want it to do.

Our client likes to copy parts of an HTML website straight into the TinyMCE WYSIWYG editor AND into a plain-text textarea or input field (for the title). The problem is that the copied characters are HTML entities and not raw text.

Here's just ONE example. Keep in mind I want to accommodate ANY possible character that might throw this error.

Companion Dual Massage – Two Seat Walk In Bathtub

That DASH in the middle has the HTML entity &ndash;

Copying the HTML directly and pasting it into the plain text input field or a textarea throws an error

invalid byte sequence for encoding "UTF8": 0x96

When trying to submit to a UTF8 database.

There's also a chance of the client copying trademark, copyright, or reserved symbols.

I don't just want to strip them out. I want to CONVERT them.

I've tried all kinds of converters. I don't want to list every site I've been to.

Any ideas?

Worst case, I take just those 4 characters and convert them to whatever.

*"I dont want to list every site I've been to."* That's the spirit. — GolezTrol, Nov 26 '12 at 22:35
1. Detect character encoding. 2. Convert to utf-8. 3. Have a sandwich. — Musa, Nov 26 '12 at 22:39
What is UTF-8 database? What language is this? Where is the code? — Esailija, Nov 26 '12 at 22:42

score 0 · Answer 1 · edited Nov 26 '12 at 22:51

Try this. It takes a little effort to convert 'old' data to UTF-8. By 'old' I mean data that came from our old database, which could be either UTF-8 or Latin-1, with characters either entity-escaped or not. The result is always a UTF-8 string that contains the original characters (not the entities).

/**
 * Decodes HTML entities and converts the string to UTF-8 if it isn't UTF-8 already.
 * @param string $string LATIN-1 or UTF-8 string that may contain html_encoded characters.
 * @return string
 */
private function tidyUtf8($string)
{
  // Check whether the string is valid UTF-8. iconv with //IGNORE drops
  // invalid bytes, so any difference means the input was not UTF-8.
  $utfCheckString = @iconv(
       'UTF-8',
       'UTF-8//IGNORE',
       $string
  );
  $isUtf = ($string === $utfCheckString);

  // If the string is not UTF-8, convert it to UTF-8
  if ($isUtf === false)
  {
       // Decode HTML entities to prevent double encoding later. 
       // Decode only the ones that are valid LATIN-1 characters.
       $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-1');
       $string = iconv('ISO-8859-1', 'UTF-8', $string);
  }

  // Decode all HTML entities to prevent double encoding later. 
  // Include UTF-8 characters.
  $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

  return $string;
}

This function is aimed at accepting both UTF-8 and Latin-1 (ISO-8859-1). If you don't need the latter, you can drop that part of the function and just use:

html_entity_decode($string, ENT_QUOTES, 'UTF-8');
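For the specific byte in the question (0x96), note that ISO-8859-1 maps it to a control character, whereas Windows-1252 maps it to the en dash. The following is only a sketch of a variation on the function above, assuming the non-UTF-8 input really is Windows-1252 and that the mbstring extension is available. The name toUtf8Text is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns UTF-8 text
// with HTML entities decoded into the characters they stand for.
function toUtf8Text(string $input): string
{
    // Leave valid UTF-8 untouched. Anything else is ASSUMED to be
    // Windows-1252, which is only a guess about the client's data.
    if (!mb_check_encoding($input, 'UTF-8')) {
        $input = mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
    }

    // Turn entities such as &ndash;, &copy;, &reg; and &trade; into
    // the real characters instead of stripping them.
    return html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo toUtf8Text("Companion Dual Massage \x96 Two Seat Walk In Bathtub"), "\n";
echo toUtf8Text("Brand&trade; &copy; 2012 &reg;"), "\n";
```

Checking for valid UTF-8 first means text that is already correct passes through without being converted a second time.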

score 0 · Accepted Answer · edited May 23 '17 at 10:24

This is an encoding problem, not a problem with the HTML entities. When you copy data from HTML into a text box, the browser is not pasting in the entity like &ndash;, it's pasting in the actual character. It looks like the character you are getting is encoded in Windows-1252 (sometimes mistakenly referred to as ISO-8859-1). Since the database is expecting UTF-8, it can't handle this character.
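One way to check this diagnosis, as a minimal sketch assuming PHP with the mbstring extension (the question does not say which language is in use): a lone 0x96 byte is not valid UTF-8, but read as Windows-1252 it is the en dash, U+2013. The helper name describeBytes is hypothetical.

```php
<?php
// Hypothetical helper: reports whether the bytes are valid UTF-8 and
// what they become when interpreted as Windows-1252.
function describeBytes(string $bytes): array
{
    return [
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
        'as_cp1252'  => mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252'),
    ];
}

// The byte from the database error message.
$info = describeBytes("\x96");

var_dump($info['valid_utf8']);                 // bool(false)
var_dump($info['as_cp1252'] === "\u{2013}");   // bool(true)
var_dump(bin2hex($info['as_cp1252']));         // string(6) "e28093"
```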

There are a few possible reasons this might be happening. You didn't list what browser, language, web framework, or database you're using, so I'm going to offer a few suggestions, and hopefully one of them works. In general, it is best to use UTF-8 for your encoding at every stage; but if that's not possible, you either need to use a consistent encoding throughout all of the levels, or you need to convert.

Since your database is using UTF-8, I'll assume that's the encoding that you want to use. One thing to check is whether your pages are being served as UTF-8. Check the headers on your HTTP response; there should be a Content-Type: text/html; charset=utf-8 header. If that is wrong, missing, or missing the charset=utf-8 part, then the browser may choose the wrong charset. One more thing that's good to do is add a <meta charset=utf-8> tag in your <head>; while this isn't necessary if you have the charset sent as part of the HTTP headers, it can help select the correct charset if the headers aren't present, or the document is loaded from a file: URL or the like, which doesn't have headers available.
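If the pages happen to be generated by PHP (an assumption, since the stack is not stated), the header can be sent explicitly. This is a sketch only, and htmlContentType is a hypothetical helper name.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function htmlContentType(string $charset = 'utf-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent to the client.
if (!headers_sent()) {
    header(htmlContentType());
}
```

Most web servers and frameworks have their own setting for the default charset, so this call may be unnecessary once that setting is correct.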

While the browser should use the character set of the document when submitting the form, you can ensure that it submits using the correct charset by using the accept-charset attribute on the form: <form accept-charset=utf-8>. This will ensure that even if the page has no charset set in the headers, forms will submit data as UTF-8.

Finally, even if all of that is correct, IE 5 through 8 will sometimes submit data in a different encoding than what the page is sent in, if the user has changed their encoding settings. To force it to send UTF-8 data, you can use a hidden form field that includes a character that cannot be encoded in a legacy encoding like Windows-1252. Some versions of Ruby on Rails famously used a snowman (☃) for this purpose, though it was later changed to a checkmark (✓) to be less puzzling. You can add a similar element to your form to force IE to use UTF-8: <input name="_utf8" type="hidden" value="✓">.
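The hidden field can also give the server something to check. This goes a step beyond what is described above, so treat it as an optional sketch: assuming a PHP backend, the function below tests whether the sentinel arrived as the UTF-8 bytes for the checkmark. The function name and the field name charset_check are hypothetical. Pass whatever name the hidden input actually has.

```php
<?php
// Hypothetical helper: true when the hidden sentinel field was submitted
// as the UTF-8 encoding of U+2713 (the checkmark), false otherwise.
function submittedAsUtf8(array $post, string $field): bool
{
    return isset($post[$field]) && $post[$field] === "\u{2713}";
}

// Example with a simulated POST body. In a real request this would be $_POST.
$post = ['charset_check' => "\xE2\x9C\x93", 'title' => 'Two Seat Walk In Bathtub'];

if (!submittedAsUtf8($post, 'charset_check')) {
    // The form data came in some other encoding, so it should be
    // converted or rejected before it reaches the database.
    echo "Form was not submitted as UTF-8\n";
}
```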

If the above suggestions don't work, please let us know what browser, programming language, web framework, and database you are using, and try to provide a short, self-contained piece of sample code that demonstrates the problem.
