0

I am editing a webpage whose visible text is in Russian and whose encoding is set to UTF-8. I'm not a Russian speaker, by the way. My aim is to copy the rendered Cyrillic from a browser and paste it into the HTML source of a newly edited version of these pages. That doesn't work: what gets pasted is a row of question marks ('???????'). I don't understand why that is, but it's incidental to the main question here, which is: what is this source format called, and, relatedly, how can I reproduce it from the Cyrillic that it renders? Let me give an example. The HTML source for the Russian words for "occupation" or "job" looks like this:

Сфера деÑтельноÑти/Роль

When the html page renders that in a browser it looks like this:

Занимаемая должность/Бизнес

My goal, again, is to copy the rendered Russian Cyrillic text and paste it into a new page successfully, so that it ends up in the first format. Can anyone tell me what the first format is called, and whether there is a way to "reverse engineer" it, using the second format as input?

Thank you.

Frankie
  • What happens in copy and paste depends both on the data and on the programs used. This is not a coding question. And it’s not clear why you would do things like that, copying text from a page into the HTML format, instead of just editing the HTML format. – Jukka K. Korpela Jul 07 '14 at 12:50
  • Yes I know it 'depends' but that doesn't help me unfortunately. Also I didn't say it was a 'coding question'; it's about the format of the data. Why I don't edit the original is too complex to describe here, just that it's inefficient and too time consuming IF there is an alternative - that of perhaps deriving the (still unidentified) first format from the second. – Frankie Jul 07 '14 at 13:02

2 Answers

1

It seems the answer here relates to the fact that many legacy Windows programs still use non-Unicode representations of Cyrillic script, such as 'Cyrillic Windows-1251' and about half a dozen variants. They managed the job 'back in the day' but don't play nicely with Unicode representations. After more digging over the last few days I found a discussion of the issue here. I want to acknowledge Karol S for nudging me in the right direction by pointing me at the Dreamweaver documentation, which made me ask the right questions. The link I reference goes on to specify that some versions of Dreamweaver do use non-Unicode representations. I was using Dreamweaver 4, which is quite old these days.

For completeness, the mystery format (above), while it has no exact name, could properly be described as "Cyrillic characters encoded as 'Western (Latin 1)' characters". More precisely, each Cyrillic letter is stored as two UTF-8 bytes, and an editor set to a Western encoding shows each of those bytes as a separate Latin character. This representation 'works' insofar as the bytes still hold the data, and a modern browser that reads the page as UTF-8 renders them back as Cyrillic, but it's a legacy way of working now that Unicode-aware editors are common.
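As a rough illustration of that description, here is a minimal Python sketch that turns Cyrillic into the "mystery format" and back. It assumes the Western encoding involved is Windows-1252 (the sample contains characters such as „ and €, which suggests this, but it is an assumption), and the function names are made up for this example. Windows-1252 leaves five byte values undefined, so the sketch falls back to Latin-1 for those.

```python
# Sketch only: assumes the "Western" view of the file is Windows-1252.
# Function names are hypothetical, chosen for this example.

def cyrillic_to_source_format(text: str) -> str:
    """Encode text as UTF-8, then show each byte as a Western character."""
    out = []
    for byte in text.encode("utf-8"):
        raw = bytes([byte])
        try:
            out.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252;
            # keep them as the matching Latin-1 control characters.
            out.append(raw.decode("latin-1"))
    return "".join(out)


def source_format_to_cyrillic(garbled: str) -> str:
    """Reverse direction: recover the bytes, then read them as UTF-8."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


if __name__ == "__main__":
    rendered = "Занимаемая должность/Бизнес"
    source = cyrillic_to_source_format(rendered)
    print(source)
    print(source_format_to_cyrillic(source))
```

One caveat: some of the resulting characters are invisible control characters (for example the second byte of "я"), so text copied from a screen or a web page may have lost them and will not always convert back cleanly. Working from the original file's bytes is safer than working from a copy-and-paste.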

Frankie
-1

Get a better editor.

If you're on Windows, Notepad++ is good. Just switch it to UTF-8 and use only that.

The format:

Сфера деÑтельноÑти/Роль

is called garbage.

Karol S
  • It's clearly not garbage as it's the source of the Cyrillic. The editor is Dreamweaver which I think is a tad advanced compared to notepad. – Frankie Jul 07 '14 at 15:25
  • http://help.adobe.com/en_US/dreamweaver/cs/using/WS4A31B6A6-8F51-4b2a-AC51-3AA1F6F709A4a.html – Karol S Jul 07 '14 at 23:16