
I've noticed that GNU binutils `strings` can print out UTF-16 content in a file. Is it possible for the program to print out UTF-8 strings as well? If so, which arguments are appropriate? I'm working in a Python environment using `subprocess` and would like to work with the output that a `subprocess.Popen` call to `strings` would generate through a pipe.

ct_

1 Answer


I'm not experienced with `strings`, but the version I have (2.21.51.20110605) has an 8-bit encoding option (`-eS`) that works with UTF-8 text. It has to cast a wide net, looking for 'text' delimited by non-printable characters (byte values below 32), so expect a lot of noise. In a test on a random executable, the `-eS` (8-bit) output was five times larger than the `-es` (7-bit) output.
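Wiring this up from Python is straightforward. Below is a minimal sketch of the `subprocess.Popen` side the question asks about; it assumes GNU binutils `strings` is on `PATH`, and the helper name `utf8_strings` is my own, not part of any library:

```python
import subprocess

def utf8_strings(path):
    """Run GNU strings with the 8-bit encoding (-e S) and decode as UTF-8.

    Assumes the GNU binutils 'strings' binary is on PATH.
    """
    proc = subprocess.Popen(
        ["strings", "-e", "S", path],
        stdout=subprocess.PIPE,
    )
    results = []
    for raw in proc.stdout:
        # strings emits raw bytes; since -e S matches any run of 8-bit
        # "printable" bytes, some runs may not be valid UTF-8, so replace
        # undecodable sequences instead of raising.
        results.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
    proc.wait()
    return results
```

Reading `proc.stdout` line by line keeps memory bounded for large inputs; `errors="replace"` is the pragmatic choice given the noise the answer mentions.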

Eryk Sun
  • Thanks for the suggestion. This worked like a champ. I went back and double-checked the man page; I hadn't noticed that `-eS` was an option. To test it out, I downloaded the front page of CNN in a couple of UTF-8 languages and ran `strings -eS` over the output file. The ASCII HTML content was there, as were the UTF-8 encodings for the language-specific news stories. Note also that `strings` tosses out control characters; I had been concerned that the "printable character" aspect of the `-e` flag would break things. – ct_ Oct 25 '11 at 19:07
  • While it may accidentally work, it does not appear to be intentional. I have a core dump I'm searching and `strings -eS` is failing to find some UTF-8 text that I can clearly see within emacs. – hackerb9 Aug 23 '19 at 10:39
  • @hackerb9, a UTF-8 character is either an ASCII byte or a sequence of 2-4 bytes with printable byte values. According to the docs, `strings` defaults to looking for a sequence of at least 4 printable characters, i.e. 4 non-control characters. Are your strings shorter than this? The documentation also says that it may default to searching only in the data section of a binary unless the `-a` option is used to scan all of the file. – Eryk Sun Aug 23 '19 at 18:36
  • That's a good point. It's true that if all it is doing is removing bytes below 0x20, then it ought to preserve the UTF-8 text I'm interested in. I'm not sure why the text didn't appear; perhaps a bug, perhaps I missed the needle in that huge haystack of binary data dumped by `strings -eS`. Anyhow, I've written my own version of `strings` that uses the _self-synchronization_ property of UTF-8 to find valid characters without having to worry about where the sequence starts. It works great for me. – hackerb9 Aug 24 '19 at 09:54
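The self-synchronization idea hackerb9 describes can be sketched in a few lines of Python: because every valid UTF-8 sequence has a distinctive lead byte followed by continuation bytes in `0x80-0xBF`, a scanner can recognize well-formed sequences at any offset without knowing where a string starts. This is a minimal sketch, not hackerb9's actual tool; the regex encodes the valid sequence ranges (rejecting overlongs and surrogates) and, like `strings`' default, keeps runs of at least 4 characters:

```python
import re

# Runs of >= 4 valid UTF-8 characters: printable ASCII, or well-formed
# 2-4 byte sequences (overlong encodings and surrogates excluded).
UTF8_RUN = re.compile(
    rb"(?:[\x20-\x7e]"                   # printable ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"          # 2-byte sequence
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"      # 3-byte, no overlongs
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"      # 3-byte, no surrogates
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"   # 4-byte, no overlongs
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"   # 4-byte, up to U+10FFFF
    rb"){4,}"
)

def utf8_runs(data: bytes):
    """Return decoded UTF-8 string runs found anywhere in raw bytes."""
    return [m.group().decode("utf-8") for m in UTF8_RUN.finditer(data)]
```

Invalid bytes simply fail to match and act as delimiters, which is exactly the self-synchronizing behavior: the scanner recovers at the next valid lead byte.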