
I have a bunch of text files that need cleaning up. Example:

    `E..4B?@.@...
..9J5.....P0.z.n9.9.. ........
 .k#a..5
E...y^@.r...J5..

E...y_@.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..@.yr`

Is there any way sed can do this? Like notice weird patterns?

1 Answer


For this answer, I will assume that you have access to standard unix/linux tools.

Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which format it is with the `file` utility:

$ file mysteryfile 
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....

If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called `strings`:

$ strings mysteryfile
Some
Recovered Text
...

The behavior of `strings` can be fine-tuned with several options. See `man strings`.
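As a rough sketch of that fine-tuning, the two options I would reach for first are `-n` (minimum run length) and `-e` (character encoding). The function names below are my own, hypothetical wrappers, and the `-e l` form is what the GNU binutils manpage documents for 16-bit little-endian text; other implementations of `strings` may not accept it.

```shell
# Hypothetical helper: print runs of printable characters from a file.
# $1: file to scan
# $2: minimum run length to report (falls back to 4, the usual default)
# -a asks strings to scan the whole file, not just selected sections.
recover_text() {
    strings -a -n "${2:-4}" "$1"
}

# Hypothetical helper, assuming GNU binutils strings: "-e l" selects
# 16-bit little-endian characters, which is how UTF-16LE text is stored.
recover_utf16le_text() {
    strings -a -e l -n "${2:-4}" "$1"
}

# Example use (mysteryfile is the file name from the answer above):
#   recover_text mysteryfile 8 > cleaned.txt
```

Raising the minimum length is usually the quickest way to cut down on short bursts of junk that happen to be printable.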

John1024
  • +1 for suggesting `strings`; that is quite suitable if you can't work out the source format of the document, or you don't have the tools needed to manipulate the document. It isn't perfect; it wouldn't handle UTF-16 encoded data, I think. But it's a good first step. – Jonathan Leffler Aug 06 '14 at 23:23
  • @JonathanLeffler Thanks for that! I looked into the character set issue a little further. From its manpage, `strings` claims that it supports UTF-16 if the `-el` (or -eb?) option is given (I did not test that). Also, according to this post http://stackoverflow.com/questions/7863986/gnu-binutils-strings-utf-8-instead-of-utf-16-or-ascii , `strings` works on UTF-8 if it is given the `-eS` option. I tested the UTF-8 option _lightly_ and it seemed to work. – John1024 Aug 06 '14 at 23:50
  • Interesting; I'd not looked at it in any detail in…oh, this millennium…and things change, of course. – Jonathan Leffler Aug 07 '14 at 01:09
  • Sadly, `strings` still doesn't support UTF-8. I ran across this question while finding that out. Ultimately, I just wrote my own UTF-8 strings. https://github.com/hackerb9/utf8strings – hackerb9 Aug 24 '19 at 11:32