Questions tagged [mojibake]

Garbled text that results from bytes being decoded using an incorrect character encoding.

Mojibake occurs when text is decoded from a byte stream using the wrong character encoding, producing an unreadable sequence of characters. The term "mojibake" comes from Japanese (文字化け), where it literally means "character transformation".
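The mechanism can be reproduced in a few lines of Python. This is a minimal sketch, assuming the text was really encoded as UTF-8 and then wrongly decoded as Latin-1; the helper names `make_mojibake` and `undo_mojibake` are hypothetical and exist only for illustration. Real cases may involve other encoding pairs (for example cp1252) or lossy steps that cannot be undone.

```python
def make_mojibake(text: str, actual: str = "utf-8", wrong: str = "latin-1") -> str:
    """Encode text with its real encoding, then decode the bytes with the wrong one."""
    return text.encode(actual).decode(wrong)


def undo_mojibake(garbled: str, actual: str = "utf-8", wrong: str = "latin-1") -> str:
    """Reverse the mistake: recover the original bytes, then decode them correctly.

    This only works if the wrong decoding step lost no information.
    Latin-1 maps every byte to a character, so the round trip is lossless here.
    """
    return garbled.encode(wrong).decode(actual)


if __name__ == "__main__":
    garbled = make_mojibake("élève")
    print(garbled)                 # Ã©lÃ¨ve
    print(undo_mojibake(garbled))  # élève
```

When the wrong and actual encodings are unknown, the same round trip can be tried over a handful of candidate pairs and the results inspected by eye.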

Example mojibake:

اÙ"إعÙ"ان اÙ"عاÙ"Ù

150 questions
6
votes
3 answers

Why is text in Swedish from a resource bundle showing up as gibberish?

Possible Duplicate: How to use UTF-8 in resource properties with ResourceBundle I want to allow internationalization to my Java Swing application. I use a bundle file to keep all labels inside it. As a test I tried to set a Swedish title to a…
Brad
  • 4,457
  • 10
  • 56
  • 93
6
votes
1 answer

What does this mojibake/krakozyabry on The Simpsons say?

On Season 12 Episode 07 "The Great Money Caper" of The Simpsons, I noticed a few years ago "gibberish" signs on the Russian spaceship. Randomly today, I decided to search and see if anyone decoded them but couldn't find any results. I suspect that…
chfoo
  • 386
  • 3
  • 13
5
votes
2 answers

python replace unicode characters

I wrote a program to read in Windows DNS debugging log, but inside always got some funny characters in the domain field. Below is one of the example: (13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)' I want to replace all…
kenneth171
  • 55
  • 5
4
votes
1 answer

Python2.7 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11: ordinal not in range(128)

I am currently using python 2.7 and doing web scraping on a Chinese website. How to convert unicode below into a string? Simple str() function does not work and states UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11:…
Perry Zhuang
  • 65
  • 1
  • 4
4
votes
1 answer

Russian symbols in Python output corrupted (ENCODING)

I parsed a HTML document and have Russian text in it. When I'm trying to print it in Python, I get this: ÐлÑбниÑнÑй новогодний пÑÐ½Ñ I tried to decode it and I get ISO-8859-1 encoding. I'm trying to decode it like that: print…
aaaapppp
  • 51
  • 1
  • 5
4
votes
1 answer

Unbaking mojibake

When you have incorrectly decoded characters, how can you identify likely candidates for the original string? Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png I know for a fact that this image filename should have been some Japanese characters. But with…
wim
  • 338,267
  • 99
  • 616
  • 750
4
votes
1 answer

Hebrew text in vba code doesn't decode properly

I've developed a workbook, with some underlying vba code. The workbook is in Hebrew, and the vba code uses Hebrew as well, e.g. comparing strings in Hebrew, or accessing Sheets using their Hebrew names. I've developed this workbook in Excel 2010,…
Matan_ma
  • 151
  • 2
  • 4
  • 8
3
votes
2 answers

Extract text from corrupt (?) pdf document

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online. Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one. If you open it in a PDF reader, it…
mlissner
  • 17,359
  • 18
  • 106
  • 169
3
votes
1 answer

Character Encoding and the ’ Issue

Even today, one sees character encoding problems with significant frequency. Take for example this recent job post: (Note: This is an example, not a spam job post... :-) I have recently seen that exact error on websites, in popular IM…
Eric J.
  • 147,927
  • 63
  • 340
  • 553
3
votes
5 answers

Unexpected output of std::wcout << L"élève"; in Windows Shell

While testing some functions to convert strings between wchar_t and utf8 I met the following weird result with Visual C++ express 2008 std::wcout << L"élève" << std::endl; prints out "ÚlÞve:" which is obviously not what is expected. This is…
chmike
  • 20,922
  • 21
  • 83
  • 106
3
votes
4 answers

How to identify likely broken pdf pages before extracting its text?

TL;DR My workflow: Download PDF Split it into pages using pdftk Extract text of each page using pdftotext Classify text and add metadata Send it to client in a structured format I need to extract consistent text to jump from 3 to 4. If text is…
Kfcaio
  • 442
  • 1
  • 8
  • 20
3
votes
1 answer

python unicode: when written to file, writes in different format

I am using Python 3.4, to write a unicode string to a file. After the file is written, if I open and see, it is totally a different set of characters. CODE:- # -*- coding: utf-8 -*- with open('test.txt', 'w', encoding='utf-8') as f: name =…
Remis Haroon - رامز
  • 3,304
  • 4
  • 34
  • 62
3
votes
2 answers

How do I transform "ТеÑ" (it is russian word) into something readable?

I got MySQL DB which contains UTF8 column with such "ТеÑ" records. PHP's mb_detect_encoding() told me that this is UTF-8. How can I transform this "horror" into something readable? Thank you
Kirzilla
  • 16,368
  • 26
  • 84
  • 129
3
votes
7 answers

Pound symbol not displaying on web page

I have a MySQL database table to store country name and currency symbol - the CHARSET has been correctly set to UTF8. This is example data inserted into the table insert into country ( country_name, currency_name, currency_code, currency_symbol) values…
Gublooo
  • 2,550
  • 8
  • 54
  • 91
3
votes
2 answers

Identify garbage unicode string using python

My script reads data from a CSV file; the CSV file can have multiple strings of English or non-English words. Sometimes the file has garbage strings. I want to identify those strings, skip them, and process the others doc =…
Shashi
  • 2,137
  • 3
  • 22
  • 37