Pipe-delimited file with empty entries; convert to tab-delimited with '<empty>' between

Question

Problem

I have been given a pipe-delimited text file that contains filenames and some indexed information from each file. My goal is to make this a tab-delimited file. However, I want to know where the empty entries are, so each one should be marked explicitly. For example, lorem||dolor should become lorem '\t' <empty> '\t' dolor.

Let me give another couple of examples for what I've been given and what is desired:

Example with multiple lines: (N.B. There are the same number of entries on each line.)

Given:

||dolor|sit
amet,||adipiscing|
sed|do|eiusmod|tempor

Desired:

<empty> '\t' <empty> '\t' dolor '\t' sit '\n'
amet, '\t' <empty> '\t' adipiscing '\t' <empty> '\n'
sed '\t' do '\t' eiusmod '\t' tempor '\n'

Empty entries at the beginning and end.

Given:

|ut|labore||dolore||

Desired:

<empty> '\t' ut '\t' labore '\t' <empty> '\t' dolore '\t' <empty> '\t' <empty>

(I don't want the spaces; I just thought they would make the desired format easier to read.)

The problem comes with consecutive empty entries. The files I've been given can have from 1 to 36 consecutive pipes (0 to 37 consecutive empty entries.)
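One way to think about runs of pipes: when awk splits on a single-character FS, every pipe is its own separator, so a run of pipes simply yields a series of empty fields that can be filled in one loop. The following is only a minimal sketch under that assumption; the function name pipes_to_tabs is hypothetical and not part of any tool.

```shell
# Hypothetical helper: read pipe-delimited text on stdin, write tab-delimited
# text on stdout with every empty field replaced by the literal <empty>.
pipes_to_tabs() {
  awk -F'|' -v OFS='\t' '{
    $1 = $1                          # force awk to rebuild the record with OFS
    for (i = 1; i <= NF; i++)        # visit every field, including the last
      if ($i == "") $i = "<empty>"
    print
  }'
}

# The "empty entries at the beginning and end" example from above.
printf '%s\n' '|ut|labore||dolore||' | pipes_to_tabs
```

Because each pipe is treated separately, the length of a run should not matter: a line made of 36 pipes is split into 37 empty fields.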

Clarification

The solution doesn't have to be sed, awk, grep, tr, etc. Those are just the solutions I've looked at. A perl or python script (or any other idea I haven't thought of) would be welcome as well.

My attempts and research

For the attempts I made before and during my research, the commands and their output are included as an image¹ and a text file² so as to not over-clutter the question.

My Attempts image

My Attempts text

Links to things I looked up -- Finding consecutive pipes with sed (and replacing any such series of pipes) : ref. here ; Counting the number of empty fields (possibly useful in knowing how many <empty>'s are needed) : ref. here ; Longest sequence : ref here ;

System information

$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
$ bash --version
GNU bash, version 4.3.42(4)-release (x86_64-unknown-cygwin) ...
$

I'm running this version of Cygwin on Windows 10 (because the job requires it.)

Edit1

I was unclear on what exactly was desired.

Here's a short example showing what I would like with pipes at the beginning and end:

(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because each > only shows up after you hit enter on the previous line.)

$ cat > myfile.txt<<EOF
> ||foo|||bar||
> EOF

$ <**command-to-be-used**> myfile.txt | cat -A
<empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$

Where the ^I is how my version of bash shows a '\t'. From the answers given using some example text I gave, I realized that I would like an <empty> at the end, after labore (see the command below). Note that the answers received (thanks @Neil_McGuigan and @Ed_Morton) DO give a '\t' after labore, just not an <empty>. This is my fault, as I was not clear enough in my original description. My apologies.

I was able to accomplish my goal with a little tweaking of @Neil_McGuigan's command. Note that, if you want to type this "line-by-line" as shown, you'll need to include a space and a \ at the end of each line.

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | 
  awk '
       {
         $1=$1; n_empty=0; 
         for(i=1; i<=NF; i++) 
         { 
           if($i=="") {$i="<empty>"; n_empty++;}
         }; 
         print
       }
       END {print n_empty" entries are empty" | "cat 1>&2";}
      ' FS='|' OFS=$'\t' | cat -A

gives the result:

<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

(Note that the count of empty entries being written to stderr was not necessary, but it is nice.)
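One caveat about the count: since n_empty=0 sits in the main block, it is reset for every input line, so for a multi-line file the END message reports only the last line's empties. A sketch of a file-wide total follows; convert_and_count is a hypothetical name, and printf is used instead of print only so that a total of zero still prints as 0.

```shell
# Hypothetical helper: convert stdin as above and report, on stderr, the total
# number of empty fields across all input lines.
convert_and_count() {
  awk -F'|' -v OFS='\t' '
    {
      $1 = $1
      for (i = 1; i <= NF; i++)
        if ($i == "") { $i = "<empty>"; n_empty++ }   # never reset, so it accumulates
      print
    }
    END { printf "%d entries are empty\n", n_empty | "cat 1>&2" }
  '
}

# Two lines with two empty fields each: stderr shows "4 entries are empty".
printf '%s\n' '||dolor|sit' 'amet,||adipiscing|' | convert_and_count
```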

Sorry for not being clear about what I wanted.

What I Used Successfully

Thanks to @Neil_McGuigan and @Ed_Morton, I was able to get the solution for which I was searching. My final command was as follows:

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt

$

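An alternative that produces the same converted file (without the count) fills all but the last field in awk and then fixes a trailing tab with sed. It relies on a sed that understands \t, such as GNU sed:

$ awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' \
  FS='|' OFS=$'\t' file_pipe-delim.txt | sed 's/\t$/\t<empty>/g' > \
  file_tab-delim.txt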

$
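The only difference between the i<NF and i<=NF loops is how the last field is treated. A small sketch makes that visible; the two function names are hypothetical, and tabs are converted to commas purely so they can be seen.

```shell
# Hypothetical helpers that differ only in the loop bound.
fill_all_but_last() {
  awk -F'|' -v OFS='\t' '{ $1=$1; for (i=1; i<NF;  i++) if ($i=="") $i="<empty>"; print }'
}
fill_all() {
  awk -F'|' -v OFS='\t' '{ $1=$1; for (i=1; i<=NF; i++) if ($i=="") $i="<empty>"; print }'
}

# Tabs are turned into commas here only to make them visible.
printf 'a||\n' | fill_all_but_last | tr '\t' ','   # a,<empty>,
printf 'a||\n' | fill_all          | tr '\t' ','   # a,<empty>,<empty>
```

With i<NF the trailing empty field is left alone, which is why a separate step is needed to patch the end of the line; with i<=NF no second step should be required.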

Here's an example where the file is made, converted, and saved:

(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because each > only shows up after you hit enter on the previous line.)

$ cat > file_pipe-delim.txt<<EOF
> ||dolor|sit
> amet,||adipiscing|
> sed|do|eiusmod|tempor
> |||
> |aliqua.|Ut|
> EOF

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt


$ cat -A file_tab-delim.txt
<empty>^I<empty>^Idolor^Isit$
amet,^I<empty>^Iadipiscing^I<empty>$
sed^Ido^Ieiusmod^Itempor$
<empty>^I<empty>^I<empty>^I<empty>$
<empty>^Ialiqua.^IUt^I<empty>$

$
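Since every line is supposed to have the same number of entries, a quick sanity check on the converted file is to list the distinct field counts. This is just a sketch; field_counts is a hypothetical name.

```shell
# Hypothetical check: print each distinct field count found in a tab-delimited
# stream on stdin. A consistent file yields exactly one line of output.
field_counts() {
  awk -F'\t' '{ print NF }' | sort -n | uniq
}

# Two lines with four fields each, so a single "4" is printed.
printf '<empty>\t<empty>\tdolor\tsit\nsed\tdo\teiusmod\ttempor\n' | field_counts
```

For the example above this could be run as field_counts < file_tab-delim.txt; more than one line of output would suggest that some row has a different number of fields.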

Finally, let's return the string that gave me trouble. We can get the desired output as follows:

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' | cat -A
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

Now, the same command without the pipe to cat -A, meaning that we won't see the ^I for each '\t'; we will just see the text as it is "tabbed."

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | \
awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t'

<empty> <empty> lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor  <empty> <empty> labore  <empty>
9 entries are empty

The trouble with fixing both empty fields in `a|||b` with `s/||/|<empty>|/g` or something similar is that the first match uses up the first two pipes, so when the scan continues, the third pipe is not paired. You can overcome that by repeating the original match: `sed -e 's/||/|<empty>|/g' -e 's/||/|<empty>|/g'`. However, when you're changing the delimiters too, you have to work a bit harder, but that's why there's a problem. – Jonathan Leffler, Aug 10 '16 at 18:14
Yes, I thought about that problem, which is why I hadn't tried that route. It seems that @Ed_Morton has that figured out. – bballdave025, Aug 10 '16 at 21:06

Neil McGuigan · Accepted Answer · 2016-08-10T18:07:36.217

2

awk '
     {
       $1=$1; 
       for(i=1; i<NF; i++) { 
         if($i=="") { $i="<empty>"; empty++ }
       }; 
       print
     }
     END { print empty" empty" | "cat 1>&2"; }
' FS='|' OFS=$'\t'

Should do the trick. $1=$1 tells awk to "rebuild" the record so that its fields are joined with the new output field separator (OFS).

print empty" empty" | "cat 1>&2" prints "n empty" to stderr. You can omit it if you like
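The effect of $1=$1 can be checked in isolation. In this small sketch (the function names are hypothetical), the same input is printed once without touching any field and once after the self-assignment.

```shell
# Without any field assignment, awk prints $0 unchanged and OFS is never used.
no_rebuild() { awk -F'|' -v OFS='\t' '{ print }'; }

# Assigning to a field, even to itself, makes awk rebuild $0 from its fields,
# joined with OFS.
rebuild() { awk -F'|' -v OFS='\t' '{ $1 = $1; print }'; }

printf 'a|b|c\n' | no_rebuild   # still pipe-delimited
printf 'a|b|c\n' | rebuild      # now tab-delimited
```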

edited Aug 10 '16 at 18:07

answered Aug 10 '16 at 17:51

Neil McGuigan


Thanks! It worked like a charm. It also resolved the comma issue. I appreciate the explanation of what you added as well. I can't up-vote the answer yet (not enough reputation points), but I gave it the checkmark. If there's a +1 I can give it for using `awk`, I'd love to do so. $ `echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' FS='|' OFS=$'\t' | cat -A` `<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I$` – bballdave025 Aug 10 '16 at 18:00
I just realized that there is something I didn't clarify. This answer is affected by it. I'm not sure if I should edit my question, or just comment on relevant posts. The basic problem is this: I would like an `<empty>` at the end, after `labore^I`. I apologize for my lack of clarity; I've updated my question. This actually relates to a situation that I might run into with the data I have; the data is produced on a Windows machine, which means that there is not necessarily a line-feed (`'\n'`) character or any other character at the end of the file. See Edit1. – bballdave025 Aug 11 '16 at 18:21

score 1 · Answer 2 · answered Aug 10 '16 at 20:19

You only need to do the || -> |<empty>| substitution twice no matter how many times that pattern appears as long as you do it globally each time:

$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor <empty>  <empty> labore

or if you prefer awk:

$ awk '{while(gsub(/\|\|/,"|<empty>|")); gsub(/\|/,"\t")} 1' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor <empty>  <empty> labore

With some seds you might need '$'\t'' instead of just \t.
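To see why one global pass is not enough but two are, and to sidestep the \t question, here is a small sketch. The function names are hypothetical, and the tab is produced by printf so that the sed script itself contains a literal tab.

```shell
# A literal tab produced by printf, for sed implementations that do not
# understand \t in the replacement text.
tab=$(printf '\t')

one_pass() { sed 's/||/|<empty>|/g'; }
two_pass() { sed 's/||/|<empty>|/g; s/||/|<empty>|/g'; }
to_tabs()  { sed "s/|/$tab/g"; }

# The first pass consumes pipes in pairs, so the gap between pairs is missed.
printf 'a||||b\n' | one_pass             # a|<empty>||<empty>|b
# After the first pass no more than two pipes are adjacent, so a second pass
# fills what is left.
printf 'a||||b\n' | two_pass             # a|<empty>|<empty>|<empty>|b
printf 'a||||b\n' | two_pass | to_tabs
```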

answered Aug 10 '16 at 20:19

Ed Morton


I like this approach. It helps to have these ideas with commands included in a standard UNIX-type install, like `sed` and `awk`. You also answered a question I was asking myself about running multiple substitutions on the `||`. Thanks – bballdave025 Aug 10 '16 at 21:16
I can't make head nor tail of Edit1. There's multiple commands and inputs with a whole lot of ambiguous text, we've gone from `<empty>` to `E` etc. in the examples and I can't tell desired output from actual output, etc. Take a second to come up with 1 sample input file that demonstrates the problem you are having along with the actual output you are getting and the desired output you want to get instead and then edit your question to show us that 1 clear, concise example of the issue. You've been using `<empty>` and `\t` until now so just stick with that. – Ed Morton Aug 10 '16 at 22:44
Thanks for letting me know. I'll try to clean it up. I think I've been looking at this for too long. – bballdave025 Aug 10 '16 at 22:48
Just to clarify here, the concern I have is (I think) more of a difference with my installation of cygwin. I don't see a problem with your answer. My input/output are as follows: `$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' outfile.txt <empty> lorem ipsum <empty> sit amet, <empty> <empty> <empty> eiusmod tempor <empty> <empty> labore` – bballdave025 Aug 10 '16 at 22:51
Sorry, being new to SO, I'm not sure how to separate the input and output. The input ends at `outfile.txt`. – bballdave025 Aug 10 '16 at 22:57
`$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' outfile.txt | cat -A` `^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I$` – bballdave025 Aug 10 '16 at 22:58
`^I^Ilorem^Iipsum^I^Isit^Iamet,^I^I^I^Ieiusmod^Itempor^I^I^Ilabore^I$` is the desired output. – bballdave025 Aug 10 '16 at 23:15
Edit your question to include the relevant info, don't put it in a comment where you can't format it properly. wrt your new edit to your question - get rid of all the ticks you added around every field under `I would like the following output:` unless you really want them. – Ed Morton Aug 11 '16 at 02:51
Why on earth would you name your **input** file `outfile.txt`??? Anyway, I THINK the problem is (ignoring the `|` -> `\t` part) you told us you wanted `||` to become `|<empty>|` but you didn't tell us that you wanted `|$` to become `|<empty>$`. Is that right? And is it also correct that you don't want `^|` to become `^<empty>|`? So a line that just contains `||` should become `\t<empty>\t<empty>` - right? Please edit your question to get rid of all the unnecessary text and replace the whole thing with just one clear simple question with one example that captures all of your requirements. – Ed Morton Aug 11 '16 at 03:21
Thanks for the advice and for your patience with a first-time SO poster. I appreciate knowing about the format in which things are usually posted here. As for the `outfile.txt`, that was a dumb mistake on my part - using something from previous notes on how to use `cat` to create files. You answered the question I asked perfectly, for which I am grateful. I also appreciate the help you're giving so that I can post things more clearly. – bballdave025 Aug 11 '16 at 14:56
I was definitely unclear about the `|$` becoming `|<empty>`. That is the desired behavior. I'm grateful to you for pointing that out. I DO want `^|` to become `^<empty> '\t'` (I incorrectly stated my concern in the post at 2016-8-10 22:37:18Z.) – bballdave025 Aug 11 '16 at 14:58