I have a file containing this header FIELD1 FIELD2 : 0x30070040 and a lot of junk characters (half the file's size). To get rid of all of them I execute these commands:

dos2unix -q -n file
sed -i $'s/[^[:print:]\t]//g' file #Removing every non-printable character (yes, dos2unix was not enough)

But then I end up having a file containing this odd header. If I copy and paste it from shell it looks like this:

PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar  nfsnobodynfsnobody▒▒FIELD1   FIELD2 : 0x30070040

If I copy and paste from a text editor like VIM it looks like this:

PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar  nfsnobodynfsnobodyÿþFIELD1   FIELD2 : 0x30070040

Note the two special characters just before FIELD1.

Now I would like to end up with a header like this:

FIELD1   FIELD2

It is important to keep everything that is between FIELD1 and FIELD2 too because that is the fields separator of the file. I thought about using this:

sed -i -r '1 s/.+(FIELD1.+) : 0x.+/\1/g' file

But apparently .+FIELD1 does not match PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar nfsnobodynfsnobody▒▒FIELD1 or PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar nfsnobodynfsnobodyÿþFIELD1 (whichever is the true one), so I can't extract \1 from the regex.

Shouldn't . match every character? Why does it not match whatever comes before FIELD1?
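
A likely explanation, assuming GNU sed running in a UTF-8 locale: there, . matches only validly encoded characters, and the bytes FF FE in front of FIELD1 can never be part of valid UTF-8, so .+ cannot cross them. A minimal sketch with a made-up, shortened header line shows the same regex matching once sed is forced into the byte-oriented C locale:

```shell
# Made-up sample line: ASCII junk, then the raw bytes FF FE, then the real header.
# LC_ALL=C makes sed treat every byte as one character, so '.' matches FF and FE too.
header=$(printf 'junk 0ustar nfsnobody\377\376FIELD1   FIELD2 : 0x30070040\n' |
  LC_ALL=C sed -E '1 s/.+(FIELD1.+) : 0x.+/\1/')

printf '%s\n' "$header"
```

The separator between FIELD1 and FIELD2 is kept because it sits inside the capture group.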

jackscorrow
  • Can you add sample input file and expected output in question – anubhava Jul 30 '19 at 14:16
  • `ustar`? Is it possible that the file you're dealing with is a tar archive? – melpomene Jul 30 '19 at 14:21
  • @anubhava I've added the initial header of the file. it looks like a normal header, but if I execute the two commands I've listed the file's size halves – jackscorrow Jul 30 '19 at 14:23
  • @melpomene nope, it's a regular text file – jackscorrow Jul 30 '19 at 14:24
  • show us the characters before FIELD1 in the original file – jhnc Jul 30 '19 at 14:30
  • `ÿþ` is `FF` `FE` in Latin-1, possibly a Unicode BOM being misinterpreted by vim having guessed the wrong character set. Obviously that's where the original file contents start, and this *is* a tar archive, though it may be a corrupted one. –  Jul 30 '19 at 14:38
  • @WumpusQ.Wumbley But how can a tar file (even if corrupted) contain that many hidden characters? Shouldn't tar be a compressed file type? – jackscorrow Jul 30 '19 at 14:52
  • @WumpusQ.Wumbley you're right, by the way. I removed the Unicode BOM sequence before executing sed -i $'s/[^[:print:]\t]//g' and I end up with a correct header with no strange characters. If you want to post your solution as answer I will accept it – jackscorrow Jul 30 '19 at 14:59
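
Following the comments above, here is a rough sketch of an alternative to deleting bytes with sed. It assumes the file really is a tar archive whose member is UTF-16 text starting with an FF FE BOM, which would also explain why stripping non-printable characters halves the size. The names sample.txt and archive.tar are hypothetical stand-ins built only for the demonstration:

```shell
# Build a small stand-in: a UTF-16LE text file with a BOM, wrapped in a tar archive.
workdir=$(mktemp -d)
{
  printf '\377\376'   # BOM bytes FF FE
  printf 'FIELD1   FIELD2 : 0x30070040\nA   B\n' | iconv -f UTF-8 -t UTF-16LE
} > "$workdir/sample.txt"
tar -cf "$workdir/archive.tar" -C "$workdir" sample.txt

# Extract the member to stdout, decode it (iconv uses the BOM to pick the byte
# order and drops it), then trim the first line down to the field names.
result=$(tar -xOf "$workdir/archive.tar" sample.txt |
  iconv -f UTF-16 -t UTF-8 |
  sed -E '1 s/^(FIELD1.+) : 0x.+$/\1/')

printf '%s\n' "$result"
rm -rf "$workdir"
```

Decoding with iconv keeps any non-ASCII text intact, whereas removing every non-printable byte would damage it.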
