Problem
I have been given a pipe-delimited text file that contains filenames and some indexed information from each file. My goal is to make this a tab-delimited file. However, I want to know where the empty entries are, so each one should be marked explicitly. For example, lorem||dolor should become lorem '\t' <empty> '\t' dolor.
Let me give another couple of examples for what I've been given and what is desired:
Example with multiple lines: (N.B. There are the same number of entries on each line.)
Given:
||dolor|sit
amet,||adipiscing|
sed|do|eiusmod|tempor
Desired:
<empty> '\t' <empty> '\t' dolor '\t' sit '\n'
amet, '\t' <empty> '\t' adipiscing '\t' <empty> '\n'
sed '\t' do '\t' eiusmod '\t' tempor '\n'
Example with empty entries at the beginning and end:
Given:
|ut|labore||dolore||
Desired:
<empty> '\t' ut '\t' labore '\t' <empty> '\t' dolore '\t' <empty> '\t' <empty>
(I don't want the spaces; I just thought they would make the desired format easier to read.)
The problem comes with consecutive empty entries. The files I've been given can have from 1 to 36 consecutive pipes (0 to 37 consecutive empty entries.)
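A quick way to check that claim against a given file is to measure its longest run of consecutive pipes. This is only a minimal sketch, assuming a POSIX-style awk; the function name longest_pipe_run is made up for illustration.

```shell
# Hypothetical helper: print the length of the longest run of
# consecutive pipes found on any line of stdin.
longest_pipe_run() {
  awk '
    {
      line = $0
      # Walk through the line one run of pipes at a time.
      while (match(line, /[|]+/)) {
        if (RLENGTH > max) max = RLENGTH
        line = substr(line, RSTART + RLENGTH)
      }
    }
    END { print max + 0 }   # prints 0 when no pipe was seen
  '
}
```

Usage would be something like longest_pipe_run < file_pipe-delim.txt. For the single line |ut|labore||dolore|| it prints 2.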
Clarification
The solution doesn't have to be sed, awk, grep, tr, etc. Those are just the solutions I've looked at. A perl or python script (or any other idea I haven't thought of) would be welcome as well.
My attempts and research
For the attempts I made before and during my research, the commands and their output are included as an image1 and a text file2 so as to not over-clutter the question.
Links to things I looked up: finding consecutive pipes with sed (and replacing any such series of pipes): ref. here; counting the number of empty fields (possibly useful in knowing how many <empty>'s are needed): ref. here; longest sequence: ref. here.
System information
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
$ bash --version
GNU bash, version 4.3.42(4)-release (x86_64-unknown-cygwin) ...
$
I'm running this version of Cygwin on Windows 10 (because the job requires it.)
Edit1
I was unclear on what exactly was desired.
Here's a short example showing what I would like with pipes at the beginning and end:
(This is what you'll see if you type the first line, hit Enter, type the second line, hit Enter, and so on. It can't be copied and pasted as is, because the > prompts only show up after you hit Enter on the previous line.)
$ cat > myfile.txt<<EOF
> ||foo|||bar||
> EOF
$ <**command-to-be-used**> myfile.txt | cat -A
<empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$
Here ^I is how cat -A shows a '\t', and $ marks the end of the line. From the answers given using some example text I gave, I realized that I would like an <empty> at the end, after labore (see the command below). Note that the answers received (thanks @Neil_McGuigan and @Ed_Morton) DO give a '\t' after labore, just not an <empty>. This is my fault, as I was not clear enough in my original description. My apologies.
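The difference between a trailing '\t' and a trailing <empty> comes down to the loop bound. Here is a minimal sketch of the two behaviours, assuming a POSIX-style awk. The function names fill_all_but_last and fill_all are made up for illustration, and -v OFS='\t' is used in place of OFS=$'\t' so that the snippet does not depend on bash.

```shell
# Hypothetical helpers for illustration; neither name comes from the answers.

# The loop stops before the last field (i < NF), so a trailing empty
# field stays empty and the output line ends in a bare tab.
fill_all_but_last() {
  awk -F'|' -v OFS='\t' '{ $1 = $1; for (i = 1; i < NF; i++) if ($i == "") $i = "<empty>"; print }'
}

# The loop includes the last field (i <= NF), so a trailing empty
# field is filled as well.
fill_all() {
  awk -F'|' -v OFS='\t' '{ $1 = $1; for (i = 1; i <= NF; i++) if ($i == "") $i = "<empty>"; print }'
}
```

Piping tempor|||labore| through fill_all_but_last and then cat -A should show tempor^I<empty>^I<empty>^Ilabore^I$, while fill_all should show tempor^I<empty>^I<empty>^Ilabore^I<empty>$. The $1 = $1 assignment is kept in both because it makes awk rebuild the line with OFS even when no field on that line is empty.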
I was able to accomplish my goal with a little tweaking of @Neil_McGuigan's command. The command is split across several lines here for readability; the line breaks inside the single-quoted awk program need no trailing \, and a line ending in | continues onto the next line automatically.
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" |
awk '
{
  $1=$1; n_empty=0;
  for(i=1; i<=NF; i++)
  {
    if($i=="") {$i="<empty>"; n_empty++;}
  };
  print
}
END {print n_empty" entries are empty" | "cat 1>&2";}
' FS='|' OFS=$'\t' | cat -A
gives the result:
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty
(Note that writing the count of empty entries to stderr was not necessary, but it is nice.)
Sorry for not being clear about what I wanted.
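One thing to watch with the counter: because n_empty=0 sits inside the main block, it is reset for every input line, so the END message reports only the last line's count. That is fine for a one-line input, but for a multi-line file a running total might be preferable. Here is a sketch of that idea, assuming a POSIX-style awk; the function name convert_and_count and the -v OFS='\t' spelling are my own choices for illustration, not something from the answers.

```shell
# Hypothetical helper: convert pipe-delimited stdin to tab-delimited stdout,
# fill empty fields with <empty>, and report the total number filled on stderr.
convert_and_count() {
  awk -F'|' -v OFS='\t' '
    {
      $1 = $1                      # make awk rebuild the record with OFS
      for (i = 1; i <= NF; i++)
        if ($i == "") { $i = "<empty>"; n_empty++ }   # counter is never reset
      print
    }
    END { printf "%d entries are empty\n", n_empty | "cat 1>&2" }
  '
}
```

Usage would be something like convert_and_count < file_pipe-delim.txt > file_tab-delim.txt. On a five-line input consisting of ||dolor|sit, amet,||adipiscing|, sed|do|eiusmod|tempor, ||| and |aliqua.|Ut| it should report 10 entries, where the per-line reset would report only the 2 from the last line.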
What I Used Successfully
Thanks to @Neil_McGuigan and @Ed_Morton, I was able to get the solution for which I was searching. My final command was as follows:
$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt
$
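In case you don't want to scroll, here is a wrapped variant that should give the same tab-delimited output on the examples here. Note that it is not the identical command: it drops the counter and fills a trailing empty entry with sed instead.
$ awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' \
FS='|' OFS=$'\t' file_pipe-delim.txt | sed 's/\t$/\t<empty>/g' > \
file_tab-delim.txt
$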
Here's an example where the file is made, converted, and saved:
(This is what you'll see if you type the first line, hit Enter, type the second line, hit Enter, and so on. It can't be copied and pasted as is, because the > prompts only show up after you hit Enter on the previous line.)
$ cat > file_pipe-delim.txt<<EOF
> ||dolor|sit
> amet,||adipiscing|
> sed|do|eiusmod|tempor
> |||
> |aliqua.|Ut|
> EOF
$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt
$ cat -A file_tab-delim.txt
<empty>^I<empty>^Idolor^Isit$
amet,^I<empty>^Iadipiscing^I<empty>$
sed^Ido^Ieiusmod^Itempor$
<empty>^I<empty>^I<empty>^I<empty>$
<empty>^Ialiqua.^IUt^I<empty>$
$
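Since every line is supposed to have the same number of entries, one quick sanity check on the converted file is to compare the number of tab-separated fields on each line. Here is a sketch, assuming a POSIX-style awk; the helper name check_field_counts is hypothetical.

```shell
# Hypothetical helper: succeed only if every line of tab-delimited stdin
# has the same number of fields as the first line.
check_field_counts() {
  awk -F'\t' '
    NR == 1 { expected = NF }      # the first line sets the reference count
    NF != expected {
      printf "line %d has %d fields, expected %d\n", NR, NF, expected | "cat 1>&2"
      bad = 1
    }
    END { exit bad }               # exit status 0 when all lines agree
  '
}
```

It could be used as check_field_counts < file_tab-delim.txt && echo "field counts agree". For the five-line file above, every line has four fields, so the check should pass.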
Finally, let's return to the string that gave me trouble. We can get the desired output as follows:
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' | cat -A
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty
Now, the same command without the pipe to cat -A, meaning that we won't see the ^I for each '\t'; we will just see the text as it is "tabbed."
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | \
awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t'
<empty> <empty> lorem ipsum <empty> sit amet, <empty> <empty> <empty> eiusmod tempor <empty> <empty> labore <empty>
9 entries are empty