Extended regex "." seems not to be matching everything

Question

I have a file whose header should be FIELD1 FIELD2 : 0x30070040, but the file also contains a lot of junk characters (about half of its size). To get rid of them I run these commands:

dos2unix -q -n file
sed -i $'s/[^[:print:]\t]//g' file #Removing every non-printable character (yes, dos2unix was not enough)

But then I end up having a file containing this odd header. If I copy and paste it from shell it looks like this:

PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar  nfsnobodynfsnobody▒▒FIELD1   FIELD2 : 0x30070040

If I copy and paste from a text editor like VIM it looks like this:

PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar  nfsnobodynfsnobodyÿþFIELD1   FIELD2 : 0x30070040

Note the two special characters just before FIELD1.

Now I would like to end up with a header like this:

FIELD1   FIELD2

It is important to keep everything between FIELD1 and FIELD2 as well, because that is the file's field separator. I thought about using this:

sed -i -r '1 s/.+(FIELD1.+) : 0x.+/\1/g' file

But apparently .+FIELD1 does not match PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar nfsnobodynfsnobody▒▒FIELD1 or PFcount_01032019.txt0000777017777601777760116201541013436157760015052 0ustar nfsnobodynfsnobodyÿþFIELD1 (whichever is the real content), so the substitution never fires and I can't extract \1.

Shouldn't . match every character? Why does it not match whatever comes before FIELD1?

Can you add sample input file and expected output in question — anubhava, Jul 30 '19 at 14:16
`ustar`? Is it possible that the file you're dealing with is a tar archive? — melpomene, Jul 30 '19 at 14:21
@anubhava I've added the initial header of the file. it looks like a normal header, but if I execute the two commands I've listed the file's size halves — jackscorrow, Jul 30 '19 at 14:23
`ÿþ` is `FF` `FE` in Latin-1, possibly a Unicode BOM being misinterpreted by vim having guessed the wrong character set. Obviously that's where the original file contents start, and this *is* a tar archive, though it may be a corrupted one. — , Jul 30 '19 at 14:38
@WumpusQ.Wumbley But how can a tar file (even a corrupted one) contain that many hidden characters? Shouldn't tar be a compressed file type? — jackscorrow, Jul 30 '19 at 14:52
@WumpusQ.Wumbley you're right, by the way. I removed the Unicode BOM sequence before executing sed -i $'s/[^[:print:]\t]//g' and I end up with a correct header with no strange characters. If you want to post your solution as answer I will accept it — jackscorrow, Jul 30 '19 at 14:59
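
A possible explanation for the behaviour discussed in the comments, offered as a sketch and not as a confirmed diagnosis: when GNU sed runs in a UTF-8 locale, . matches only valid characters, and the bytes FF FE are not valid UTF-8, so .+ cannot cross them. Running sed in the C locale makes every byte count as one character. The sample line, the temporary file and the helper name extract_header below are made up for illustration; the regex uses .* instead of .+ so that it also works when nothing precedes FIELD1.

```shell
# Hypothetical helper: reduce line 1 to "FIELD1<separator>FIELD2".
# LC_ALL=C makes sed treat each byte as one character, so "." can
# also match the stray bytes FF and FE.
extract_header() {
    LC_ALL=C sed -E '1 s/.*(FIELD1.+) : 0x.+/\1/' "$1"
}

# Made-up sample: some tar-header debris, the bytes FF FE (octal 377 376),
# then the real header.
sample=$(mktemp)
printf 'junk 0ustar  nfsnobody\377\376FIELD1   FIELD2 : 0x30070040\n' > "$sample"

extract_header "$sample"
```

On the sample line this should print only FIELD1, the original separator, and FIELD2. To change the real file in place, the same LC_ALL=C prefix can be combined with sed -i as in the question.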
