-2

I have a std::stringstream strs variable with this serialized binary data:

�G�{SSF��>%�����hgRQ;Tjh A "ʐk�R3 1[Z�yA _�Kx O��� ���f��' ����t %��+>, ���~� 삾�+/ Tb�Ҷ�7 �(���� �Q1�5m& ��( G#�bm 3O�AN ) �DP߇g �0=ʆ�0 ���j�u E�3�� �G�#�" \��!�o% L.�� �WMG?B- 3����}& �.�S� (�B� �j&� �@��%&, 65��0 !G�5R ��N��0 ��b�� hv) �8�� x�%7 5e��: w|7ώJ 8���� ����X/ v�c�h2 3��i� o^���
�A��� �oG��0 +���Ȑ" n�� ���4 F#�>b .��m�=; � �X ��< � c(= ���7Y: �� �� Q��O� w�k�q! �D��G�8 O���l�1 j��DH ��rhJ v͑UF� �P���; �| ���h �U��z�* 0 Ԏ��6 @I� ��,K�� �R�B� ��ﲒi ��o�H�0 �"�� ���B,� $6QP�@ YŽ�05 �8�s�� �>���:> ���*� Kp1>�~< ����� x�5 05 S?�V�" �m�7 )����z$ �Ye��- �nKPz ~8�쩳 dF �̄5 �ɼ��% v�x�O�# 9�B�/�6 �5��[. ��P%:� ���V�tG'�O ��#bQ�9 �����) �0%�[0 f�(? e( �5 rt �O
�[�۠ �ɴKG� �'$�_s ��g:] Aߞ,�Q

This is the same data after using GDB:

"\374G\371{SSF\301\251*>\177%\225\201\253i\b\344\206\341hgRQ\016\022\001;Tjh\002\000\000\000A\000\000\000\001\000\000\000\"ʐk\323R3\000\061[Z\272yA\001\000_\340\016K\004x\005\000O\226\f\262\a\375\020\000\346\302\341f\230\235'\000\026\275\367\371\215t\020\000%\356\033\372+>,\000\307\016\361\327~\223\033\000삾\351\254+/\000Tb\335Ҷ\363\067\000\342(\357\336\346\326\017\000\327Q1\320\065m&\000\034\216\377\035\005(\a\000\020G#\345bm\017\000\063O\324AN )\000\225DP߇g\a\000\355\060=ʆ\260\060\000\232\377\272j\234u\006\000E\304\063\372\355\v\f\000\223G\353\024#\307\"\000\\317\327!\270o%\000\005L\035.\345\306\002\000\257WMG?B-\000\063\275\342\373\304}&\000\365\017.\367S\264\031\000(\236\033B\354\255\n\000\316\034j&\321\021\n\000\266@\226\314%&,\000\066\065\236\024\271\033\060\000!G\332\v5R\030\000\372\f\353N\225\201\060\000\331\377\035b\244\263\033\000h\361\205\267\207v)\000\322\027\070\246\212w\b\000x<\001\026n}\034\000Җp{\362F\t\000\370\352m\026ŵ8\000(\366\366ۡ:\017\000\364\026\210b=\235\"\000\251\214[)\262\342\022\000 m\316fd\345\060\000[VAl\206\233\006\000\035\354o\021'9#\000\363\032\327\372\357\322\034\000|'\256\306K{\021\000B\266\330v\257\332+\000x\346\023\300\315\306!\000qO\016\005wz*\000T\236*\021\272H;\000\ba\006$\376\214\027\000\213\377\a\330\372\023\003\000\361,m\274\300\321\004\000DË\277\272T \000\357\231Ţ\355\270\067\000;:\263\303\070\246\022\000~\267\374\060\272\261\r\000\264>\374\021\027%7\000\065\034e\216\305:\006\000w|7ώJ\006\000\070\242\210\310\343\206\n\000\365\263\370\377\005X/\000v\335c\200h\035\062\000\063\225\363\244i\362\r\000o^\371\353\302\n\027\000\000\000\000\000\000\000\000\000\t\224A\207\331\374\024\000\254oG\352\364\177\060\000+\254\332\330Ȑ\"\000n\024\310\360\265?\b\000\004\234\231\006\250\064\016\000F#\225>b\020 \000.\271\224m\246=;\000\305\t\017\372X\024\024\000\232\023\005\360\277<\017\000\036\242 c\005(=\000\211\200\212\067Y:\000\273\206\017\n\204\233\027\000\002Q\364\376O\277\033\000w\212\033k\273q!\000\333D\204\241G\331\070\000O\242\306\346l\327\061\000j\224\205DH\033\032\000\270\025\375rhJ\n\000v͑UF\227 \000\230P\272\211\247\005;\000\372\177\017\310\b|\032\000\367\357\255\352h\n\036\000\226U\230\255z\247*\000\060\000Ԏ\241\301\066\000\315\b@\024I\227\032\000\275\322,K\275\304\034\000\332R\375B\023\376\001\000\331\324ﲒi\r\000\256\353o\237H\205\060\000\353\"\026\025\360\320\n\000\203\306\344B,\200\006\000$6QP\325@\017\000YŽ\036\203\060\065\000\366\070\251s\264\252\017\000\360>\251\310\340:>\000\257\310\326\005*\216\036\000Kp1>\232~<\000\225\002\253\302\365\350\021\000x\341\065\000\060\065\021\000S?\331V\311\"\000\000\356m\307\003\030\067\004\000)\212\265\351\331z$\000\336Ye\217\323\033-\000\215nK\026Pz\v\000~8\273쩳\r\000dF\r\261̄5\000\206\aɼ\303\365%\000v\365x\304O\346#\000\071\210B\373/\264\066\000\324\065\216\003\274[.\000\363\343P%:\311\033\000\244\301\370V\367t\006\000G\005'\213O\017\000\252\220#bQ\324\071\000\376\272\377\347\016\237)\000\374\060%\311[0\000\177\rf\357\233(?\000e(\r\2045\r\000rt\000\005\230O*\000\345[\204۠x\b\000\207ɴKG\224\v\000\273'$\261_s\036\000\215\240\fg:]\002\000Aߞ,\303Q\t\000\000\000\000\000\000\000\000"

I'm trying to upload this data (stored in std::stringstream strs) to a PostgreSQL database encoded with UTF8 and into the byte_info column excepting bytea using the libpqxx library:

 pqxx::connection Con("My con information");
 pqxx::work W(Con);
 W.exec("UPDATE " + tableName + " SET byte_info =" + strs.str() + " WHERE id = 1;");
 W.commit();

But all I get is this error:

ERROR: invalid byte sequence for encoding "UTF8": 0xfc

What am I missing here or doing wrong?

S-Man
  • 22,521
  • 7
  • 40
  • 63
TalG
  • 677
  • 1
  • 9
  • 26
  • This looks like *not* serialized binary data. – user7860670 Aug 04 '18 at 16:43
  • Missing quotes? And, why not using prepared statements? – Peter VARGA Aug 04 '18 at 16:43
  • @VTT I'm using the **Simple Encrypted Arithmetic Library (SEAL)** from Microsoft. There is a function there to read a stored `Ciphertext` variable from the memory, so that is what gives me this output. I Asked the Microsoft developing team about the output and they said that it is serialized binary data from the `Ciphertext` variable. – TalG Aug 04 '18 at 16:50
  • From the error message it appears to me that data is supposed to be serialized into utf8 text. It is also somewhat strange to use `stringstream` to store binary data. – user7860670 Aug 04 '18 at 16:51
  • @AlBundy first I want to get it to work and if it does then thought abut using prepared statements. – TalG Aug 04 '18 at 16:52
  • UTF8 is not plain binary but it has its own encoding as well as decoding pattern. That's why your binary is invalid sequence. You have to convert 8-bit ASCII into UTF8 first. Check this out => https://stackoverflow.com/a/4059934/6370128 – sandthorn Aug 04 '18 at 16:55
  • @VTT so a column using `bytea` can't except any binary data and since my database excepts `UTF8` encoding it cant work with those symbols? So I have to encode it as `base64` and then upload it into a `varchar varying` column? – TalG Aug 04 '18 at 16:56
  • I would rather say that query string can not contain binary data so you need to escape it properly. You can not just concatenate normal text with some binary block. – user7860670 Aug 04 '18 at 17:01
  • @sandthorn converting the `std::stringstream strs` into `UTF8` or `base64` should solve my problem when the destination column changes from `bytea` to `varchar varying` right? – TalG Aug 04 '18 at 17:03
  • `In short, binary strings are appropriate for storing data that the programmer thinks of as raw bytes, whereas character strings are appropriate for storing text.` ...this means you should send in raw unsigned char because it stores in raw bytes as well. – sandthorn Aug 04 '18 at 17:11
  • Why would you have to store binary as varchar? I think you can just get away with some backend function to convert those stream into printable (utf-7, base64, quoted-printable) when needed. – sandthorn Aug 04 '18 at 17:20
  • @sandthorn this information holds encrypted data. I need to sotre it for later use, when I decrypt the information at some later point it's still has to be correct. That's why I wanted to upluad the date without tempering with it. But it looks like I'll have to encode it as `base64` first. – TalG Aug 04 '18 at 17:25
  • The wrong approach. You are learning it already the wrong way. – Peter VARGA Aug 04 '18 at 21:49
  • @AlBundy Why is it preferable in this case to use a prepared statement? – TalG Aug 04 '18 at 21:56
  • Because you don't have to 1) escape it and 2) you don't have _stupid_ encoding problems between the C++ `strs.c_str()` and your data. Then it is only a _thing_ between the content and the encoding of the database connection. I am doing a lot with something similar and without prepared statements it is a pain in the .... – Peter VARGA Aug 04 '18 at 21:59

1 Answers1

1

According to manual of PostgreSQL bytea, there are 2 ways for writing a statement containing binary stream.

For a string "\x4A\xC3\xA1\xF2\x18"

  1. hex bytea : E'\\x4AC3A1F218'
  2. escaped bytea : E'J\\303\\241\\362\\030'::bytea -- escape \ as \\\\, escape ' as \' and escape non-printable as \\three-digit-octal

So you can come up with functions like these.

std::string ascii_to_hex_bytea(std::string_view sv) {
  std::ostringstream os;
  os << R"(E'\\x)" << std::hex << std::uppercase;
  for (unsigned char ch : sv) {
    os << std::setfill('0') << std::setw(2) << static_cast<uint16_t>(ch);
  }
  os << "'";
  return os.str();

}

std::string ascii_to_escaped_bytea(std::string_view sv) {
  std::ostringstream os;
  os << "E'" << std::oct;
  for (unsigned char ch : sv) {
    if (isprint(ch))
      switch (ch) {
        case('\\') : os << R"(\\\\)"; break; // escape back slash
        case('\'') : os << R"(\')"; break;   // escape single quote
        default    : os << ch;               // simply put printable char
      }
    else // escape the rest as an octal with back slash leading
      os << R"(\\)" << std::setfill('0') << std::setw(3) << static_cast<uint16_t>(ch);
  }
  os << "'::bytea";
  return os.str();

}

Suppose you have ss as stringstream with some data (for demo, we just ramdom it here)

  std::stringstream ss;
  { // random bits for length of 1024
    std::string str(1024,'\0');
    for (char* addr = str.data(); addr < str.data() + str.size(); ++addr )
      *addr = static_cast<char>(std::experimental::randint<uint16_t>(0,255) );
    ss.str(str);
  }

You can write statement using those functions

  auto hex_str = ascii_to_hex_bytea(ss.str() );
  std::cout << hex_str << "\n";

  std::string tableName{"table_name"};
  std::string statement1 = "UPDATE " + tableName + " SET byte_info = " + hex_str + " WHERE id = 1;";
  std::cout << statement1 << "\n\n";

  auto escaped_str = ascii_to_escaped_bytea(ss.str() );
  std::cout << escaped_str << "\n";

  std::string statement2 = "UPDATE " + tableName + " SET byte_info = " + escaped_str + " WHERE id = 1;";
  std::cout << statement2 << "\n";

Print

E'\\x4AC3A1F218E1ED92AB0B3966C3E99CC5BD8419B4A91D504F85AE7621525F305A...'
UPDATE table_name SET byte_info = E'\\x4AC3A1F218E1ED92AB0B3966C3E99C...' WHERE id = 1;

E'J\\303\\241\\362\\030\\341\\355\\222\\253\\0139f\\303\\351\\234\\30...'::bytea
UPDATE table_name SET byte_info = E'J\\303\\241\\362\\030\\341\\355\\...'::bytea WHERE id = 1;

godbolt.org/g/8Ctgcu

wandbox.org/permlink/eaaAWz7pCbGTLcbC

sandthorn
  • 2,770
  • 1
  • 15
  • 59
  • Thanks for the extensive help. I've used it on my stringstream and it worked, the binary data is in the database. I've noticed that some of the characters like **Ȑ** or **쩳** are not in the escaped string, were they removed in the `ascii_to_escaped_bytea()`? If I want to read this `bytea` data from the database and work on it do I have to use `SELECT encode(byte_info, 'escape') FROM table_name;` or can I just use `SELECT byte_info FROM table_name;`? – TalG Aug 05 '18 at 08:28
  • 1
    I've changed the `ascii_to_escaped_bytea()` to only escape the string and I use `"UPDATE " + tableName + " SET byte_info = "encode(E'" + escaped_str + "'::bytea, 'escape') WHERE id = 1;` if the destination column uses the **`varchar varying`** type and I use `"UPDATE " + tableName + " SET byte_info = "decode(E'" + escaped_str + "', 'escape') WHERE id = 1;` if the destination column uses the **`bytea`** type. – TalG Aug 05 '18 at 10:26
  • The PstgreSQL document for `encode()` and `decode()` https://www.postgresql.org/docs/current/static/functions-binarystring.html functions says: **Encode binary data into a textual representation. Supported formats are: base64, hex, escape. escape converts zero bytes and high-bit-set bytes to octal sequences (\nnn) and doubles backslashes.**. If I use the **`'escape'`** format, does it mean that this line `case('\\') : os << R"(\\\\)"; break;` in the `ascii_to_escaped_bytea()` function is not needed anymore? – TalG Aug 05 '18 at 11:12
  • I guess you still have to write parameter `data` of `encode(data bytea, format text)` as either `hex bytea` or `escaped bytea` anyway. So you yet have to do your own conversion anyway. I guess `encode` is used when you want to retrieve your stream from table back to the beckend as plain text. I'm not sure you can retrieve it as raw stream like `SELECT byte_info...`, you may try it yourself. – sandthorn Aug 05 '18 at 11:42
  • @TalG Another issue is that if you want to put binary stream in `var char`, I guess you have to `encode` instead of `decode`, because `encode` yields text and you need text on `var char`. Even better you can just modify the function so that the delimiter at both ends disappear: begin delim `E'`, end delim `'` or `'::bytea`, and just return the text inside single quotes that is your `var char`. – sandthorn Aug 05 '18 at 11:42
  • Thanks, I had a typo in one of my comments. I use `"UPDATE " + tableName + " SET varchar_column = "encode(E'" + escaped_str + "'::bytea, 'escape') WHERE id = 1;` when saving the `binary` into a `varchar varying` column. I've also noticed that some of the characters like `Ȑ` or `쩳` are not in the escaped string after using the `ascii_to_escaped_bytea()` function. Are the characters represented differently after using the function? – TalG Aug 05 '18 at 12:00
  • So I guess, I can just get the `varchar varying` representation of the binary from the database and work on it locally without any further conversion to something else (I'll be checking it tomorrow) Or do I have to write some function to convert it back since the `ascii_to_escaped_bytea()` uses `octal` encoding?. It is **very important** that the uploaded information in the database is equal to the information in the `string var` before using the `ascii_to_escaped_bytea()` data. – TalG Aug 05 '18 at 12:09
  • I tried to decrypt the data from the database and It's not working directly. The decryptor complains that the input is not valid. – TalG Aug 05 '18 at 14:14
  • `Ȑ` and `쩳` must be due to your codepage that override some asciis and replace with those. You can find out their positions by `cout << (unsigned int) uchar << " : " << uchar << "\n"` loop through unsigned char of range [0,255]. And add them to switch case. – sandthorn Aug 05 '18 at 14:59
  • The escaped bytea use octal in `\\000` character representation. I guess it should be a lot easier if you use hex bytea instead. It's a lot shorter and less complicate in both encode and decode. Moreover, you don't have to find out what are those problematic Ȑ and 쩳 at all. – sandthorn Aug 05 '18 at 15:17
  • The decryptor complains perhaps are due to unproperly escaped Ȑ and 쩳. – sandthorn Aug 05 '18 at 15:19
  • If you want to store a stream of bytes like `std::string{"\x4A\xC3\xA1\xF2\x18"}` as varchar, you have to store it as either `\\x4AC3A1F218` (hex bytea) or `J\\303\\241\\362\\030` (escaped bytea). You may notice that hex bytea consumes less storage than escaped bytea. – sandthorn Aug 05 '18 at 15:32
  • I'll check it later with the `hex` encoding. I think I need some kind another function `hex_to_ascii()` when reading the information back from the database for decryption, right? So the downloaded data from the database resembles the data before encoding and uploading it to database, the decryptor needs to use the same Input. – TalG Aug 05 '18 at 15:45
  • Yes, if you store as varchar, you have to convert it back after retrieve. If you store byte stream in a bytea field, I guess you can retrieve it back just using `.c_str()` like [`row["name"].c_str()`](http://pqxx.org/development/libpqxx/). I'm not sure you can try it. – sandthorn Aug 05 '18 at 15:56
  • But If I've used the function `ascii_to_hex_bytea()' for storing it in a `bytea` column, then I'll have to pass the results through the `bytea_hex_to_ascii()` anyways, or am I missing here something? The difference is between `'\\x4AC3A1F218'` for `varchar` column and `E'\\x4AC3A1F218'` for `bytea` column. But both of them were created with the `ascii_to_hex_bytea()` function. – TalG Aug 05 '18 at 16:12
  • I'm sorry I was wrong that `.c_str()` could retrieve raw bytes. It apparently can not because it expects *null-terminated* string which is can not be raw bytes that might contain some '\0' bytes. – sandthorn Aug 06 '18 at 03:39
  • Currently `bytea_hex_to_ascii()` above return `E'\\x4AC3A1F218'`, not `\\x4AC3A1F218`, even though it's quite easy to modify for doing that. – sandthorn Aug 06 '18 at 03:44