US-ASCII string (de-)compression into/from a byte array (7 bits/character)

Question

As we all know, ASCII uses 7 bits to encode each character, so I expected the number of bytes used to represent a text to be less than its number of letters.

For example:

    StringBuilder text = new StringBuilder();
    IntStream.range(0, 160).forEach(x -> text.append("a")); // build a 160-letter text
    int letters = text.length();
    int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
    System.out.println(letters); // expected  160,  actual 160
    System.out.println(bytes); //   expected  140,  actual 160

The result is always letters = bytes, but I expected letters > bytes.

The main problem: in the SMPP protocol the SMS body must be <= 140 bytes. With a 7-bit ASCII encoding you can fit 160 letters (140*8/7 = 160), so I'd like to encode the text as 7-bit ASCII. We are using the JSMPP library.

Can anyone explain this to me and point me in the right direction? Thanks in advance (:

How could you expect that 160 1 byte chars could be preserved in a 140 bytes array :? In this straight forward case each of the 160 chars are taking exactly 1 byte... **Change this one** -> `int bytes = text.toString().getBytes(StandardCharsets.UTF_16).length;` results to the couple `160, 322` — dbl, Jul 04 '19 at 09:36
7 bits are used, but those 7 bits are stored in (8-bit) bytes. i.e. there's always a 0 most significant bit. You *could* compact your string across a series of bytes, but in practise that doesn't happen — Brian Agnew, Jul 04 '19 at 09:55
Thanks @BrianAgnew and @dbl for responding, but in the SMPP protocol the SMS body is `140` bytes: when you encode the message in `ascii` you can write 160 letters, but when using `utf-8` you can write just `140` letters — Anas, Jul 04 '19 at 10:53

kry · Answer 1 · 2019-07-04T09:58:45.983

2

(160*8 - 160*7)/8 = 20, so you expect your script to end up using 20 bytes fewer. However, a byte is the smallest unit the encoder writes, so even if a character does not use all 8 bits, the spare bit is not reused for the next character. Each ASCII code still occupies a full 8-bit byte, which is why you get the same number. For example, the lowercase "a" is 97 in ASCII:

01100001

Note that the leading zero is still there, even though it is not used. The standard encoder will not use it to store part of another value.

In conclusion, with the plain US-ASCII encoding the number of letters always equals the number of bytes.

(Or imagine putting size 7 object into size 8 boxes. You can't hack the objects to pieces, so the number of boxes must equal the number of objects - at least in this case.)
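
To make the unused leading zero visible, here is a minimal sketch. The class name `HighBitDemo` and its helper methods are made up for illustration. It encodes a string with `StandardCharsets.US_ASCII`, checks that the top bit of every byte is 0, and computes how many bytes the same text would need if the 7-bit values were packed back to back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HighBitDemo {
  // True if every byte has its most significant bit cleared
  static boolean highBitsAllZero(byte[] bytes) {
    for (byte b : bytes) {
      if ((b & 0x80) != 0) return false;
    }
    return true;
  }

  // Bytes needed when 7-bit values are stored without gaps (rounded up)
  static int packedLength(int characters) {
    return (characters * 7 + 7) / 8;
  }

  public static void main(String[] args) {
    char[] letters = new char[160];
    Arrays.fill(letters, 'a');
    byte[] bytes = new String(letters).getBytes(StandardCharsets.US_ASCII);

    System.out.println(bytes.length);               // 160: one byte per character
    System.out.println(highBitsAllZero(bytes));     // true: bit 7 is never used
    System.out.println(packedLength(bytes.length)); // 140: size after 7-bit packing
  }
}
```

So the 20 bytes are only saved if something explicitly packs the bits; `getBytes` never does that.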

edited Jul 04 '19 at 09:58

answered Jul 04 '19 at 09:52

kry


'You can't hack the objects to pieces' - you *could* compact your 7-bits across a series of bytes. e.g. the first bit of character 2 would be the 8th bit of your first bytes. In theory that would work, although in the vast majority of cases it's really not worth it – Brian Agnew Jul 04 '19 at 09:59
That would require manual programming. Long time ago I had a Wiegand reader, which returned 36 bits worth of data, and I used 2 32-bit int variables, and I wrote the necessary bitwise functions to process it. It's good if the computer knows where one data ends, and where does the other start. (I would use one 64-bit register now...) – kry Jul 04 '19 at 10:03
'That would require manual programming' - yes. I don't dispute that. I'm highlighting that it's possible, however – Brian Agnew Jul 04 '19 at 10:03
Oh, right. It's just that in the question the preset 8 bit registers and functions were used, that's why I was confused. – kry Jul 04 '19 at 10:06
Thank you for responding, but in the SMPP protocol the SMS body is `140` bytes: when you encode the message in `ascii` you can write 160 letters, but when using `utf-8` you can write just `140` letters – Anas Jul 04 '19 at 10:47
That's right, because the length of a UTF-8 code is variable, it can be between 1 and 4 bytes. Normal ASCII characters take up the same 8 bits in UTF-8, but other characters vary in length. If I remember well, the first byte of a UTF-8 code indicates how many bytes are used by the character. For the SMPP protocol, check out @Brian Agnew 's comment on compacting 7-bit characters. – kry Jul 04 '19 at 11:05
I checked the SMPP protocol, it seems it's also using 8-bit transfer: `'short_message', (Hello Wikipedia) ... 48 65 6C 6C 6F 20 57 69 6B 69 70 65 64 69 61` , it's simple ASCII. – kry Jul 04 '19 at 11:24
Cancel that, if SMPP is using SMSC encoding, the characters are indeed compactified. Since SMPP sends the encoding in advance in a header, in the documentation it seems like data is indeed concatenated the way @BrianAgnew mentioned. – kry Jul 04 '19 at 11:33
@kry, do you mean that the SMPP client needs to handle the byte concatenation? – Anas Jul 04 '19 at 12:21
Well, in the documentation it says encoding must be sent - so if it is set to SMSC, it should compact ASCII codes to 7 bit values, allowing more characters to be sent. Of course, it comes down if YOU want to write the encoder method, and only send the data to be sent. – kry Jul 04 '19 at 12:34
Thank you brother, but as an `smpp-client` I have to send the `sms` as an array of bytes. Can you tell me how to convert `text` to a `7-bit` based byte array? Thank you again (: – Anas Jul 04 '19 at 12:54
Well, comments are not a good way to do this, and it would require, like an hour, for me to come up with the encoding, but I would generally do like this: have a 7 step loop (i=1-7), shift a character i times to the left, then copy the last i values of the next character to the first i places. At 140 bytes it should give you 160 characters. Of course, the whole thing depends on HOW is it decoded on the other side. – kry Jul 04 '19 at 12:58
Or if you want funky, make a "How to encode 160 bytes to 140 bytes" question on https://codegolf.stackexchange.com/, I'm sure there will be a string|byte&number^whatever like answer, that is short and fast. :D – kry Jul 04 '19 at 13:03
+1 First, sorry for being late, Stack Overflow didn't show notifications. Second, thank you so much, Allah made things clearer through you. Thank you brother. As for the implementation, I have already done it. Thank you again – Anas Jul 04 '19 at 13:28

https://github.com/opentelecoms-org/jsmpp/blob/bae1d7af6212586c5c683960b93926f4edd636f3/jsmpp/src/main/java/org/jsmpp/bean/GeneralDataCoding.java Check the toByte() function, it seems like it compresses the message if you specify it. – kry Jul 04 '19 at 13:33

kriegaex · Accepted Answer · 2019-07-05T10:31:56.817

Here is a quick & dirty solution without any libraries, using only what ships with the JRE. It is not optimised for efficiency and does not check whether the message is indeed US-ASCII; it just assumes it. It is just a proof of concept:

package de.scrum_master.stackoverflow;

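import java.util.Arrays;
import java.util.BitSet;

public class ASCIIConverter {
  public byte[] compress(String message) {
    BitSet bits = new BitSet(message.length() * 7);
    int currentBit = 0;
    for (char character : message.toCharArray()) {
      // Copy the 7 low bits of each character, least significant bit first
      for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
        if ((character & 1 << bitInCharacter) > 0)
          bits.set(currentBit);
        currentBit++;
      }
    }
    // BitSet.toByteArray() drops trailing zero bytes, so pad to the full packed length
    return Arrays.copyOf(bits.toByteArray(), (currentBit + 7) / 8);
  }

  public String decompress(byte[] compressedMessage) {
    BitSet bits = BitSet.valueOf(compressedMessage);
    // Largest multiple of 7 that fits into the available bits
    int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
    StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
    for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
      // toByteArray() returns an empty array when all 7 bits are 0
      byte[] characterBits = bits.get(currentBit, currentBit + 7).toByteArray();
      char character = (char) (characterBits.length == 0 ? 0 : characterBits[0]);
      decompressedMessage.append(character);
    }
    // Seven bytes hold exactly eight characters, so a trailing NUL is treated as padding
    // (assumption: real messages never end with a NUL character)
    int length = decompressedMessage.length();
    if (length > 0 && compressedMessage.length % 7 == 0 && decompressedMessage.charAt(length - 1) == 0)
      decompressedMessage.setLength(length - 1);
    return decompressedMessage.toString();
  }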

  public static void main(String[] args) {
    String[] messages = {
      "Hello world!",
      "This is my message.\n\tAnd this is indented!",
      " !\"#$%&'()*+,-./0123456789:;<=>?\n"
        + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
        + "`abcdefghijklmnopqrstuvwxyz{|}~",
      "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
        + "1234567890123456789012345678901234567890"
    };

    ASCIIConverter asciiConverter = new ASCIIConverter();
    for (String message : messages) {
      System.out.println(message);
      System.out.println("--------------------------------");
      byte[] compressedMessage = asciiConverter.compress(message);
      System.out.println("Number of ASCII characters = " + message.length());
      System.out.println("Number of compressed bytes = " + compressedMessage.length);
      System.out.println("--------------------------------");
      System.out.println(asciiConverter.decompress(compressedMessage));
      System.out.println("\n");
    }
  }
}

The console log looks like this:

Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!


This is my message.
    And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
    And this is indented!


 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~


1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
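
For comparison, the shift-based approach sketched in the comments (fill each byte with the low bits of the next character) could look roughly like the following. This is only a sketch under assumptions: `SevenBitPacker`, `pack` and `unpack` are illustrative names, the input is assumed to contain only characters 0..127, and the bit order (least significant bit first) is chosen to match the `BitSet` layout above. Whether an SMSC accepts exactly this layout, and which 7-bit alphabet it expects, has to be checked against the SMPP and GSM specifications.

```java
public class SevenBitPacker {
  // Packs 7-bit values back to back, least significant bit first
  static byte[] pack(String message) {
    byte[] packed = new byte[(message.length() * 7 + 7) / 8];
    int bitPosition = 0;
    for (int i = 0; i < message.length(); i++) {
      int value = message.charAt(i) & 0x7F;
      int byteIndex = bitPosition / 8;
      int shift = bitPosition % 8;
      packed[byteIndex] |= (byte) (value << shift);
      if (shift > 1) {
        // The character straddles a byte boundary: spill its upper bits into the next byte
        packed[byteIndex + 1] |= (byte) (value >> (8 - shift));
      }
      bitPosition += 7;
    }
    return packed;
  }

  // The character count is passed explicitly, which avoids guessing about padding bits
  static String unpack(byte[] packed, int characterCount) {
    StringBuilder message = new StringBuilder(characterCount);
    int bitPosition = 0;
    for (int i = 0; i < characterCount; i++) {
      int byteIndex = bitPosition / 8;
      int shift = bitPosition % 8;
      int value = (packed[byteIndex] & 0xFF) >> shift;
      if (shift > 1) {
        // Fetch the bits that were spilled into the following byte
        value |= (packed[byteIndex + 1] & 0xFF) << (8 - shift);
      }
      message.append((char) (value & 0x7F));
      bitPosition += 7;
    }
    return message.toString();
  }

  public static void main(String[] args) {
    String message = "Hello world!";
    byte[] packed = pack(message);
    System.out.println(packed.length);                    // 11
    System.out.println(unpack(packed, message.length())); // Hello world!
  }
}
```

Passing the character count to `unpack` is a design choice: 7 packed bytes can hold either 7 or 8 characters, so the length cannot always be recovered from the byte array alone.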

Thanks for your answer. I also developed some quick code to do the same job, but your method seems to be better — Anas, Jul 07 '19 at 07:56
Thanks for the feedback. Maybe you want to consider accepting my answer in order to close it then. :-) — kriegaex, Jul 07 '19 at 09:13

score 0 · Answer 3 · answered Jul 04 '19 at 09:50

The byte length depends on the encoding. Check the example below.

String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"

byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length); 
// prints "10"

byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22" (2-byte byte order mark + 2 bytes per character)

byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"

Thanks Rowi for your answer, but I think it's far from my question. That is because my question wasn't clear, sorry for that — Anas, Jul 07 '19 at 07:57

score 0 · Answer 4 · answered Jul 04 '19 at 12:20


Nope. In "modern" environments (since 3 or 4 decades ago), the ASCII character encoding for the ASCII character set uses 8 bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.

You could say there was, used long ago, a 7-bit character encoding for the ASCII character set. Even then data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems would zero it out in an attempt to increase interoperability, but in the end this hindered interoperability because they were not "8-bit safe". With strong Internet standards, such systems are almost all in the past.
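
Since the high bit is only guaranteed to be 0 for genuine ASCII input, it can be worth checking the text before relying on that property. A minimal sketch, with `AsciiCheck` and its method names invented for illustration, using the standard `CharsetEncoder.canEncode` call:

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
  // True if every character of the text can be represented in US-ASCII
  static boolean isPureAscii(String text) {
    return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
  }

  // True if each encoded byte has the same numeric value as the corresponding char
  static boolean charValuesMatchAsciiBytes(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    if (bytes.length != text.length()) return false;
    for (int i = 0; i < bytes.length; i++) {
      if (bytes[i] != text.charAt(i)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isPureAscii("Hello"));               // true
    System.out.println(isPureAscii("h\u00e9llo"));          // false: é is outside ASCII
    System.out.println(charValuesMatchAsciiBytes("Hello")); // true
  }
}
```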

answered Jul 04 '19 at 12:20

Tom Blodget


Thanks Tom for the response, but in the `smpp` protocol the SMS body is `140` bytes: when you encode the message in `ascii` you can write `160` letters, but when using utf-8 you can write just 140 letters – Anas Jul 04 '19 at 12:29
Okay. The key point is that an encoding library does what it does. In the abstract, a character encoding is a map between codepoints (as integers) in the character set and code units (as integers) of the character encoding. A character encoding library then serializes a code unit to whatever it wants. The general purpose libraries of Java and other environments serialize to byte sequences. You are using Java's general purpose library. – Tom Blodget Jul 04 '19 at 12:34
You mean that the SMPP client needs to do the `7-bit` encoding by itself? – Anas Jul 04 '19 at 12:37
Well, yes, but surely someone has done that already. And you could skip that step because if the input text is only from the [C0 and Basic Latin](http://www.unicode.org/charts/nameslist/index.html) block, a code comment could cite the well-known fact that each `char` value (UTF-16) has the same value as the ASCII encoding of the same character in both the ASCII and Unicode character sets. – Tom Blodget Jul 04 '19 at 12:42
But take into consideration: the `smpp-client` needs to send the `sms` as an array of bytes. Do you have ideas on how to convert my text to a `7-bit` based array, or is this just imagination? – Anas Jul 04 '19 at 12:50
