How to not convert Unicode text files into UTF-8 with cvs2svn?

Question

I converted my CVS repository to SVN with the cvs2svn tool, but all my Unicode text files were converted to UTF-8, and I don't want that.

How can I avoid that? Is there a flag or parameter to keep my Unicode files?

You realize that Unicode is not an encoding, and that UTF-8 is part of Unicode? — dda, Jun 05 '13 at 15:55

score 2 · Answer 1 · answered Jun 05 '13 at 15:59

I assume that what you mistakenly refer to as Unicode is UTF-16LE. There is an option in cvs2svn, and it's in the documentation:

--encoding=ENC

Use ENC as the encoding for filenames, log messages, and author names in the CVS repos. (By using an --options file, it is possible to specify one set of encodings to use for filenames and a second set for log messages and author names.) This option may be specified multiple times, in which case the encodings are tried in order until one succeeds. Default: ascii. Other possible values include the standard Python encodings.

So you could try passing --encoding=utf_16_le on the command line.
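To illustrate the "tried in order until one succeeds" behaviour described in the documentation, here is a minimal Python sketch. It is not cvs2svn's own code: the helper decode_with_fallback is a hypothetical name invented for this example, and it only assumes that the encoding names are standard Python codec names, as the documentation states.

```python
import codecs


def decode_with_fallback(raw, encodings):
    """Try each encoding in order; return (text, name) for the first that works.

    Hypothetical helper for illustration only, not part of cvs2svn.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the input")


# The name used in the answer resolves to a standard Python codec.
codec_name = codecs.lookup("utf_16_le").name

# A log message stored as UTF-8: ascii fails on the accented character,
# so the next encoding in the list is used.
text, used = decode_with_fallback("café".encode("utf_8"), ["ascii", "utf_8"])
```

The order of the list matters, because the first encoding that decodes without error wins.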

score 1 · Answer 2 · answered Jun 06 '13 at 09:15

The encoding Windows (misleadingly) refers to as "Unicode" is UTF-16LE. This is a troublesome encoding because it is not ASCII-compatible; Windows adopted it because at the time (before UTF-8 was invented) it was expected to be the most common encoding for Unicode text. Today UTF-8 is overwhelmingly the preferred encoding for in-file Unicode storage.
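A small Python sketch of what "not ASCII-compatible" means in practice: the same ASCII-only string is encoded three ways and the bytes are compared. The variable names are just for illustration.

```python
text = "Hi"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf_8")
utf16_bytes = text.encode("utf_16_le")

# For ASCII-only text, UTF-8 produces exactly the ASCII bytes.
utf8_matches_ascii = utf8_bytes == ascii_bytes

# UTF-16LE uses two bytes per character here, so every ASCII
# character is followed by a zero byte.
utf16_matches_ascii = utf16_bytes == ascii_bytes
utf16_has_nul = b"\x00" in utf16_bytes
```

Here utf8_bytes is b'Hi' while utf16_bytes is b'H\x00i\x00', so a tool expecting ASCII-like text sees embedded NUL bytes in the UTF-16LE version.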

Whilst dda's answer should probably work (+1), Subversion does not support handling UTF-16 files as text: they'll be handled as binary files, which means you won't get usable diff/patch/merge. For this reason I would strongly recommend letting cvs2svn go ahead and change the files to UTF-8.
