
We have a JDBC program that moves data from one database to another.

  1. The source database uses the Latin-9 character set.
  2. The destination database uses UTF-8 encoding, and its column sizes are specified in bytes instead of characters.
  3. We converted the DDL scripts of the source database to equivalent scripts for the destination database, keeping the column sizes as-is.
  4. In some cases, when the data contains special characters, its size after conversion to UTF-8 exceeds the column size in the destination database, causing the JDBC program to fail.

I understand that UTF-8 is a variable-width encoding scheme that can take 1–4 bytes per character. Given this, the worst-case solution would be to allocate four times the column size in the destination database.
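For reference, the actual post-conversion size of an individual value can be measured directly in Java; a minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    // Number of bytes a string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abc"));  // 3 (ASCII stays 1 byte per char)
        System.out.println(utf8Length("café")); // 5 ('é' becomes 2 bytes)
        System.out.println(utf8Length("€"));    // 3 (the euro sign becomes 3 bytes)
    }
}
```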

Is there a better estimate?

sandeepkunkunuru
  • I thought the size of varchar columns was given in chars, not bytes. Therefore the size should be the same even if the byte size is different. – Alastair McCormack Dec 31 '13 at 14:12
  • That was the expectation, but in this specific case the destination database uses bytes for varchar column sizes. – sandeepkunkunuru Dec 31 '13 at 14:56
  • @AlastairMcCormack For example, Oracle allows you to define the size in bytes or characters; it could also be an indication that the column has no character set and stores pure bytes (so it is actually a degenerate varbinary). – Mark Rotteveel Dec 31 '13 at 15:15

2 Answers


Since there's no telling in advance exactly how much a given text string will grow, I think all you can do is a trial run that converts the text to UTF-8 and generates a warning for any columns that need to be increased in size. ASCII (unaccented) characters remain single bytes, and most accented Latin-9 characters will be 2 bytes each, but a few may be 3. You'd have to look at the Latin-9 and UTF-8 tables to see which characters end up at 3 or 4 bytes after conversion. Either way, you have to examine your actual Latin-9 text to see how much it will grow.
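A dry run of that sort might look like the sketch below (the connection URL, table, and column names are placeholders; the byte limit would come from your destination DDL):

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ColumnGrowthCheck {
    public static void main(String[] args) throws SQLException {
        int columnByteLimit = 50; // size of the destination column, in bytes
        try (Connection con = DriverManager.getConnection("jdbc:source-db-url", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT some_column FROM some_table")) {
            int maxBytes = 0;
            while (rs.next()) {
                String value = rs.getString(1);
                if (value == null) continue;
                // How wide this value becomes once re-encoded as UTF-8.
                int utf8Bytes = value.getBytes(StandardCharsets.UTF_8).length;
                maxBytes = Math.max(maxBytes, utf8Bytes);
            }
            if (maxBytes > columnByteLimit) {
                System.out.println("Warning: some_column needs at least " + maxBytes + " bytes");
            }
        }
    }
}
```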

Phil Perry
  • Adding a bit: the [Wikipedia article for Latin-9](http://en.wikipedia.org/wiki/ISO/IEC_8859-15) shows the differences between Latin-1 and Latin-9. Those 8 characters are the only ones that could be > 2 bytes (since Latin-1 characters are only 1 or 2 bytes in UTF-8). – DPenner1 Dec 31 '13 at 14:19

The euro symbol in Latin-9 takes 3 bytes to represent in UTF-8. ASCII characters take only 1 byte, and the remaining 127 characters take 2 bytes each. Depending on the actual locale (and which characters are commonly used), an estimate between 1.5x and 2x should be sufficient.
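That breakdown is easy to verify by re-encoding the whole Latin-9 repertoire; a minimal sketch ("ISO-8859-15" is the JDK's name for Latin-9):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin9Growth {
    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15"); // Latin-9
        int[] byCount = new int[5];
        // Decode each of the 256 Latin-9 byte values, then re-encode
        // as UTF-8 and record how many bytes the character needs.
        for (int b = 0; b < 256; b++) {
            String ch = new String(new byte[] {(byte) b}, latin9);
            byCount[ch.getBytes(StandardCharsets.UTF_8).length]++;
        }
        for (int n = 1; n <= 4; n++) {
            System.out.println(n + " byte(s): " + byCount[n] + " characters");
        }
        // Expected: 128 one-byte (ASCII), 127 two-byte,
        // and 1 three-byte character (the euro sign).
    }
}
```

Whether 1.5x or 2x suffices then comes down to how often the 2-byte characters actually occur in your data.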

Brett Okken