2

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.

It'll usually look something like this in an HTML file in a browser:

this is the beginning of the file, ��

In the file itself, it appears like this:

this is the beginning of the file, xE2xA0

I tried removing it with a regex editor, but to no avail: the search cannot find it at all. How can I remove this garbage data? Again, some of the files contain all kinds of HTML markup.

Thank you for any help.

user717236
  • 2
    This sounds like a code page problem. You may be viewing the data with the wrong encoding. For instance, it may have been encoded in ISO 8859 while you are viewing it as Unicode. CHEERS – happy coder Mar 08 '13 at 15:21
  • 3
    That's not garbage data, you're using the wrong encoding to read the file. What are you trying to do? I sense you're having an [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Lie Ryan Mar 08 '13 at 15:23
  • 1
    That's a character set problem. Your computer doesn't recognise the characters, so it shows them as ?. xE2 is a representation of a control character; you can't easily remove it using regex. http://facebook.stackoverflow.com/questions/14946109/how-to-remove-escape-sequence-like-xe2-or-x0c-in-python – SoWhat Mar 08 '13 at 15:23
  • Thank you all for your contributions. I agree, it is an XY problem. Unfortunately, I'm having difficulty determining the character encoding. Notepad++ detects it as UTF-8. I changed it to ASCII and the question mark turned into an a with a caret symbol on top of it. The Stack Overflow article referenced helps a great deal, as far as removing it goes. But if it's an XY problem, then removal doesn't technically solve the problem. Nonetheless, if I can't determine the character set, what choice do I have? – user717236 Mar 08 '13 at 16:18

2 Answers

4

Those appear because something is wrong with a character set on your site.

For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
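
To make the mismatch concrete, here is a minimal Python sketch. The sample string is made up for illustration and is not taken from your files; it only shows how UTF-8 bytes turn into stray characters when a client is told they are ISO-8859-1.

```python
# Illustration only: the sample text is hypothetical.
text = "caf\u00e9"  # "café"

# How the file is stored on disk if it was saved as UTF-8.
utf8_bytes = text.encode("utf-8")

# What a client shows if the Content-Type claims charset=ISO-8859-1.
misread = utf8_bytes.decode("iso-8859-1")

# What a client shows if the declared charset matches the stored bytes.
reread = utf8_bytes.decode("utf-8")

# ascii() prints escapes, so the output is safe on any console.
print(ascii(misread))  # 'caf\xc3\xa9', i.e. "cafÃ©"
print(ascii(reread))   # 'caf\xe9', i.e. "café"
```

The bytes never change here. Only the label used to interpret them does, which is why fixing the declared character set is usually preferable to stripping characters.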

Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.

You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.

I recommend using command line tools like `file` to examine what character set a text file is stored in and `iconv` to convert text files from one character set to another.
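
If those tools are not at hand, roughly the same two steps can be sketched in Python. This is an assumption-laden stand-in rather than a replacement for either tool: `guess_encoding`, `convert_file`, and the candidate list are hypothetical names chosen for this example, and trial decoding can only rule encodings out, it cannot prove which one was intended.

```python
from pathlib import Path

# Assumed candidate list. Order matters: single-byte encodings such as
# cp1252 and iso-8859-1 accept almost any byte sequence, so try UTF-8 first.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1"]


def guess_encoding(data: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def convert_file(src: Path, dst: Path, source_encoding: str,
                 target_encoding: str = "utf-8") -> None:
    """Decode src with source_encoding and write it to dst in target_encoding."""
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode(target_encoding))
```

Usage would look like `guess_encoding(Path("page.html").read_bytes())`, where `page.html` is a placeholder file name. A result of `"utf-8"` is fairly strong evidence; a result of `"cp1252"` or `"iso-8859-1"` only means the bytes are not valid UTF-8.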

Stephen Ostermiller
  • Thank you for your help. I'm not sure what the pages were originally encoded in. They are not my pages, unfortunately. It's mostly static code and the database is out of the equation on this one. The pages are displayed as UTF-8 (I'm using Notepad++). – user717236 Mar 08 '13 at 15:37
  • I also wanted to add that I'm not on a Unix/Linux box, as I'm using Windows. So, `file` and `iconv` are not a possibility, but that is good to know for the future, thank you. – user717236 Mar 08 '13 at 15:56
3

There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.

As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.

Is there any chance someone copied & pasted stuff from a Russian version of some software package?

Also, you can get & use iconv on Windows.
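
Both possibilities can be checked by looking at what the two bytes mean under each interpretation. The following is a small Python sketch; the helper name `describe` is made up for this example, and the third byte `0x80` is just one sample value for the byte that follows the pair in a complete Braille sequence.

```python
import unicodedata

PAIR = b"\xe2\xa0"


def describe(data: bytes, encoding: str):
    """Decode data with the given encoding and return the Unicode name of each character."""
    return [unicodedata.name(ch) for ch in data.decode(encoding)]


# Possibility 1: UTF-8 Braille patterns (U+2800..U+283F) are three bytes long,
# E2 A0 followed by a byte in the range 80..BF.
print(describe(PAIR + b"\x80", "utf-8"))

# The pair on its own is an incomplete UTF-8 sequence, so strict decoding fails.
try:
    PAIR.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8 on its own:", exc.reason)

# Possibility 2: each byte is one character in a single-byte encoding.
for encoding in ("iso-8859-5", "iso-8859-1", "cp1252"):
    print(encoding, describe(PAIR, encoding))
```

Under ISO-8859-5 the pair is a Cyrillic small letter te followed by a no-break space. Under ISO-8859-1 or Windows-1252 it is an a with a circumflex followed by a no-break space.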

Sinan Ünür
  • The braille did not check out and would definitely be highly unlikely. I tried ISO-8859-5 and that did not work. It may have been Japanese encoding, which I tried and that did not work. Definitely not Russian. Thank you for the link to iconv. I'll install it and update this space. – user717236 Mar 08 '13 at 16:43
  • 1
    OK. Thanks for the update. Also, try Windows versions of the encodings as well. Good luck. – Sinan Ünür Mar 08 '13 at 16:58
  • 1
    I tried a plethora of different encoding sets and none have eliminated the control characters. In the short-term, I removed them using regex. The files are generated from an external process, I found out, which means this process is probably injecting the control characters. I don't have much control over the external process. So, removing them might be the best option for my task. But it is obviously not the ideal solution, I understand that. – user717236 Mar 08 '13 at 19:34
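
For anyone who ends up taking the same short-term route, here is one way the removal could be sketched in Python. It is not the regex actually used above, which was not posted. It assumes the stray data is the raw byte pair 0xE2 0xA0 and that the rest of each file is UTF-8; the function names are made up for this example.

```python
import re

# Assumption: the injected bytes are E2 A0 NOT followed by a UTF-8
# continuation byte (80..BF). The lookahead leaves complete three-byte
# sequences, such as real Braille patterns, untouched.
STRAY_PAIR = re.compile(rb"\xe2\xa0(?![\x80-\xbf])")


def strip_stray_pair(data: bytes) -> bytes:
    """Remove only the dangling E2 A0 pairs, working on raw bytes."""
    return STRAY_PAIR.sub(b"", data)


def strip_invalid_utf8(data: bytes) -> str:
    """Blunter alternative: drop every byte sequence that is not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")
```

The pattern has to be applied to bytes read with `open(path, "rb")`. Once the file has been decoded as text, the pair has already been turned into replacement characters or something else, which may be why a text-mode regex search finds nothing.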