
In the context of the bash shell and command output:

  1. Is there a process/approach to help determine/measure the width of fields that appear to be fixed width? (apart from the Mark One human eyeball and counting-on-the-screen method...)
  2. If the output appears to be fixed width, is it possible/likely that it's actually delimited by some sort of non-printing character(s)?
  3. If so, how would I go about hunting down said character?

I'm mostly after a way to do this in bash shell/script, but I'm not averse to a programming language approach.

Sample Worst Case Data:

Name                   value 1    empty_col    simpleHeader  complex multi-header
foo                    bar                     -someVal1     1someOtherVal       
monty python           circus                  -someVal2     2someOtherVal       
exactly the field_widthNextVal                 -someVal3     3someOtherVal       

My current approach: The best I have come up with is redirecting the output to a file, then using a ruler/index type of feature in the editor to manually work out field widths. I'm hoping there is a smarter/faster way...

What I'm thinking:

  • With Headers:

  • Perhaps an approach that measures from the first character of a header to the next character encountered after a run of multiple spaces?

  • Without Headers:

  • Drawing a bit of a blank on this one....?

This strikes me as the kind of problem that was cracked about 40 years ago though, so I'm guessing there are better solutions than mine to this stuff...

Some Helpful Information:

Column Widths

fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')

This is proving to be helpful for determining column widths. I don't understand it fully enough yet to provide a complete explanation, but it might be helpful to a future someone else. Source: https://unix.stackexchange.com/questions/465170/parse-output-with-dynamic-col-widths-and-empty-fields
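To see what each stage of that pipeline contributes, here is a minimal walk-through on a tiny made-up file (the file name `/tmp/demo.data` and its contents are invented for this demo; `grep -P` requires GNU grep):

```shell
# Hypothetical fixed-width file: columns of width 10, 8 and 6.
printf 'Name      Value   Status\nfoo       bar     ok\n' > /tmp/demo.data

# head -n 1      : take only the header line
# grep -Po       : print each match of "non-spaces plus their trailing
#                  spaces" on its own line (one match per column)
# awk length($0) : the length of each match is exactly that column's width
fieldwidths=$(head -n 1 /tmp/demo.data | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo "$fieldwidths"    # prints: 10 8 6
```

Note the caveat visible in the sample data above: a header containing spaces (like `value 1` or `complex multi-header`) splits into two matches, so this trick only measures widths correctly when headers are single words.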

File Examination

Redirect output to a file: command > file.data

Use hexdump or xxd against file.data to look at its raw information. See the links below for some basics on those tools:

hexdump output vs xxd output

https://nwsmith.blogspot.com/2012/07/hexdump-and-xxd-output-compared.html?m=1

hexdump

https://man7.org/linux/man-pages/man1/hexdump.1.html

https://linoxide.com/linux-how-to/linux-hexdump-command-examples/

https://www.geeksforgeeks.org/hexdump-command-in-linux-with-examples/

xxd

https://linux.die.net/man/1/xxd

https://www.howtoforge.com/linux-xxd-command/

Chris
  • Easy way to answer 2 and 3: `hexdump -Cv filename` to dump the file and look at what each character is. For 1, `awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' filename` to output the length of each field in tabular columns – David C. Rankin Aug 03 '20 at 23:29
  • Nevermind, I had the last single quote outside the file name rather than before it. The awk command is working nicely now, that's a really helpful start, thanks! I notice that it is splitting fields where the field value contains multiple space separated words, any idea if it could be modified to treat those as one entity, perhaps by looking for more than a single space? – Chris Aug 03 '20 at 23:40
  • `hexdump -Cv` is exactly what I was after for that purpose, thanks! – Chris Aug 03 '20 at 23:45
  • ***"any idea if it could be modified to treat those as one entity, perhaps by looking for more than a single space?"*** While overall you've done a very good job of posing your problem, you forgot to include a small sample (less than 80 chars wide by 4-5 lines) of your data. If our answer works on a small/reduced sample of your data, it will almost certainly work on larger sets that follow the same formatting "rules", right? The default FS (field separator) for awk is 1 or more whitespace chars (space/tab), which is what DCR's example code is giving you. `hexdump` is good too! Good luck. – shellter Aug 03 '20 at 23:49
  • ***"hexdump -Cv is exactly what I was after"*** . If you found an interesting char being used in your data as a field separator, please share ;-) . Good luck! – shellter Aug 03 '20 at 23:52
  • Well... If I knew what the input was -- I could make a better guess at how to split the fields with `awk`; right now they are splitting on spaces. If another character makes sense, then use `awk -F'that_char' ...` to split on `that_char`. – David C. Rankin Aug 03 '20 at 23:56
  • Nothing interesting found as a delimiting char in this case, but a perfect tool for addressing the problem! I'll put together a sample of data and update the question, thanks shellter. And yep, understood David, as mentioned I'll put up a sample. The command is helpful even in its current state: you can see records that will be split into more fields than anticipated, so it identifies that there will be an issue, and where it will be :-) – Chris Aug 03 '20 at 23:57
  • After perusing the inputs I have at the moment, I think the sample data I have added reflects the worst case scenarios in terms of complexity. – Chris Aug 04 '20 at 00:25
  • Good show on adding sample data! Yes, `complex multi(line?)-header` and `exactly the field_widthNextVal` are the enemies of easy normalization of data. Of course the best solution is to get control of exporting the data you need and use `|` or some other visible but not-in-the-data delimiter. Failing that, you need to write a two-pass process: one that captures the number of fields in each record and then the size of each field. In any case you're looking for the "consensus" of the field sizes that you scanned. Good luck! – shellter Aug 04 '20 at 00:58

1 Answer


tl;dr:

# Determine Column Widths
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')

# Iterate
while IFS= read -r line
do
    # You can put the awk command in a separate variable if that is clearer to you
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    
    # Or do it all in one line if you prefer:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"    

        *** Code Stuff Here ***

done <<< "$(appropriate-command)"

Some explanation of the above - for newbies (like me)

Okay, so I'm a complete newbie, but this is my answer, based on a grand total of about two days of clawing around in the dark. This answer is relevant to those who are also new and trying to process data in the bash shell and bash scripts.

Unlike the *nix wizards and warlocks that have presented many of the solutions you will find to specific problems (some impressively complex), this is just a simple outline to help people understand what it is that they probably don't know that they don't know. You will have to go and look this stuff up separately; it's way too big to cover it all here.

EDIT:

I would strongly suggest just buying a book/video/course for shell scripting. You do learn a lot doing it the school-of-hard-knocks way, as I have for the last couple of days, but it's proving to be painfully slow. The devil is very much in the details with this stuff. A good structured course probably instils good habits from the get-go too, rather than letting you develop your own habits/shorthand 'that seems to work' but will likely, and unwittingly, bite you later on.

Resources:

Bash references:

https://linux.die.net/man/1/bash

https://tldp.org/LDP/Bash-Beginners-Guide

https://www.gnu.org/software/bash/manual/html_node

Common Bash Mistakes, Traps and Pitfalls:

https://mywiki.wooledge.org/BashPitfalls

http://www.softpanorama.org/Scripting/Shellorama/Bash_debugging/typical_mistakes_in_bash_scripts.shtml

https://wiki.bash-hackers.org/scripting/newbie_traps

My take is that there is no 'one right way that works for everything' to achieve this particular task of processing fixed-width command output. Notably, the fixed widths are dynamic and might change each time the command is run. It can be done somewhat haphazardly using standard bash tools (it depends on the types of values in each field, particularly whether they contain whitespace or unusual/control characters). That said, expect any fringe cases to trip up the 'one bash pipeline to parse them all' approach, unless you have really looked at your data and it's quite well sanitised.

My uninformed, basic approach:

Pre-reqs:

To get much out of all this:

  • Learn the basics of how IFS= read -r line (and its variants) works; it's one way of processing multiple lines of data, one line at a time. When doing this, you need to be aware of how things are expanded differently by the shell.
  • Grasp the basics of process substitution and command substitution, understand when data is being manipulated in a sub-shell, otherwise it disappears on you when you think you can recall it later.
  • It helps to grasp what Regular Expressions (regex) are. Half of the hieroglyphics that you encounter are probably regex in action.
  • Even further, it helps to understand when/what/why you need to 'escape' certain characters at certain times, as this is why there are even more \ characters than you would expect amongst the hieroglyphics.
  • When doing redirection, be aware of the difference in > (overwrites without prompting) and >> (which appends to any existing data).
  • Understand differences in comparison operators and conditional tests (such as used with if statements and loop conditions).
  • if [ cond ] is not necessarily the same as if [[ cond ]]
  • look into the basics of arrays, and how to load, iterate over and query their elements.
  • bash -x script.sh is useful for debugging. Targeted debugging of specific lines is done by wrapping them in set -x ... set +x within the script.
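The first bullet above is worth seeing in action. A quick sketch (the string `'  padded  '` is invented for the demo) of why `IFS=` matters with whitespace-padded, fixed-width lines:

```shell
# With IFS cleared, leading and trailing whitespace on the line survives;
# with the default IFS, read strips it. For fixed-width data that
# stripping silently shifts your column positions.
with_ifs=$(printf '  padded  \n' | { IFS= read -r line; printf '[%s]' "$line"; })
without_ifs=$(printf '  padded  \n' | { read -r line; printf '[%s]' "$line"; })
echo "$with_ifs"      # prints: [  padded  ]
echo "$without_ifs"   # prints: [padded]
```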

As for the fixed width data:

If it's delimited:

Use the delimiter. Most *nix tools use a single white space as a default delimiter, but you can typically also set a specific delimiter (google how to do it for the specific tool).
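For instance, a minimal sketch with invented `|`-delimited data, using the standard delimiter options of awk and cut:

```shell
# -F sets awk's field separator; -d sets cut's delimiter.
printf 'a|b|c\n' | awk -F'|' '{print $2}'    # prints: b
printf 'a|b|c\n' | cut -d'|' -f2             # prints: b
```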

Optional Step:

If there is no obvious delimiter, you can check to see if there is some secret hidden delimiter to take advantage of. There probably isn't, but you can feel good about yourself for checking. This is done by looking at the hex data in the file. Redirect the output of a command to a file (if you don't have the data in a file already). Do it using command > file.data and then explore file.data using hexdump -Cv file.data (another tool is xxd).
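As a sketch of what that check looks like, here is a made-up file with a tab "hiding" between two fields (the file name `/tmp/check.data` is hypothetical):

```shell
# Create a file where the fields are secretly tab-separated.
printf 'foo\tbar\n' > /tmp/check.data

# Dump every byte; the hex column shows '66 6f 6f 09 62 61 72 0a',
# and the 09 byte between foo and bar is the tab delimiter.
hexdump -Cv /tmp/check.data
```

Non-printing characters show up as `.` in the right-hand ASCII panel, so the hex column is where the delimiter hunt actually happens.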

If you're stuck with fixed width:

Basically to do something useful, you need to:

  • Read line by line (i.e. record by record).
  • Split the lines into their columns (i.e. field by field, this is the fixed-width aspect)
  • Check that you are really doing what you think you are doing; particularly if expanding or redirecting data. What you see on shell as command output, might not actually be exactly what you are presenting to your script/pipe (most commonly due to differences in how the shell expands args/variables, and tends to automatically manipulate whitespace without telling you...)
  • Once you know exactly what your processing pipe/script is seeing, you can then tidy up any unwanted whitespace and so forth.

Starting Guidelines:

  • Feed the pipe/script an entire line at a time, then chop up fields (unless you really know what you are doing). Doing the field separation inside a loop such as while IFS= read -r line; do stuff; done is less error prone in terms of the 'what is my pipe actually seeing' problem. When I was doing it outside, it tended to produce more scenarios where the data was being modified without me understanding that it was being altered (let alone why) before it even reached the pipe/script. This obviously meant I got extremely confused as to why a pipe that worked in one setting on the command line fell over when I 'fed the same data' in a script or by some other method (the pipe really wasn't getting the same data). This comes back to preserving whitespace with fixed-width data, particularly during expansion, redirection, process substitution and command substitution. Typically it amounts to liberal use of double quotes when calling a variable, i.e. not $someData but "$someData", and similarly when capturing the entire output of a command. Use braces to make clear which variable you mean, i.e. ${var}bar.

  • If there is nothing to leverage as a delimiter, you have some choices. Hack away directly at the fixed width data using tools like:

  • cut -c n1-n2 this directly cuts things out, starting from character n1 through to n2.

  • awk '{print $1}' this uses a single space by default to separate fields and print the first field.
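A minimal sketch of the `cut -c` option on an invented fixed-width line (widths 10, 8 and 2 are made up for the demo):

```shell
# 'foo' padded to 10 chars, 'bar' padded to 8, then 'ok'.
line='foo       bar     ok'
printf '%s\n' "$line" | cut -c 1-10    # prints 'foo       ' (padding comes too)
printf '%s\n' "$line" | cut -c 11-18   # prints 'bar     '
```

Note that `cut -c` hands you the field *including* its padding, which is why the whitespace-trimming step later on is needed.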

Or, you can try to be a bit more scientific and 'measure twice, cut once'.

  1. You can work out the field widths fairly easily if there are headers. This line is particularly helpful (sourced from an answer I link below):
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo "$fieldwidths"

You can also look at all the data to see what length of data you are seeing in each field, and if you are actually getting the number of fields you expect (Thanks to David C. Rankin for this one!):

awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' file.data
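To see what that diagnostic looks like, here is a run over a tiny invented file (`/tmp/len.data` is a hypothetical name): one length per field per record, so a row with an unexpected count of numbers is a row that will split wrongly.

```shell
# Two records, two whitespace-separated fields each.
printf 'alpha  beta\ngamma  d\n' > /tmp/len.data

# One tab-separated length per field, one output line per record:
awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' /tmp/len.data
# prints:
# 5       4
# 5       1
```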

  2. With that information, you can then set about chopping fields up with a bit more certainty that you are actually capturing the entire field (and only the entire field). Tool options are many and varied, but I'm finding GNU awk (gawk) and perl's unpack to be the clearest. As part of a pipe/script consider this (sub in your relevant field widths and whichever field you want out in the {print $fieldnumber}, obviously):

awk 'BEGIN {FIELDWIDTHS="10 20 30 10"}{print $1}'

For command output with dynamic field widths, if you feed it into a while IFS= read -r line; do ...; done loop, you will need to parse the output using the awk above, as the field widths might have changed each time. Since I originally couldn't get the expansion right, I built the awk command on a separate line and stored it in a variable, which I then called in the pipe. Once you have it figured out though, you can just shove it all back into one line if you want:

# Determine Column Widths:
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')

# Iterate
while IFS= read -r line
do
    # Separate the awk command if you want:
    # This uses GNU awk to split the column widths and pipes it to sed to remove leading and trailing spaces.
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"

    # Or do it all in one line, rather than two:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"    

    if [ "${DELETIONS[0]}" == 'all' ] && [ "${#DELETIONS[@]}" -eq 1 ] && [ "$field1" != 'UUID' ]; then 
        *** Code Stuff ***
    fi
    
    *** More Code Stuff ***

done <<< "$(appropriate-command)"

Remove excess whitespace using various approaches:

  • tr -d '[:blank:]' and/or tr -d '[:space:]' (the latter eliminates newlines and vertical whitespace, not just horizontal whitespace like [:blank:] does; they both also remove internal whitespace).
  • sed 's/^[ ]*//;s/[ ]*$//' this cleans up only leading and trailing whitespace.
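The difference between the two bullets matters for multi-word values, so here is a minimal sketch (the string is invented for the demo):

```shell
# sed trims only the ends; tr deletes every blank, including the
# ones inside the value.
s='  hello  world  '
printf '%s\n' "$s" | sed 's/^[ ]*//;s/[ ]*$//'   # prints 'hello  world'
printf '%s\n' "$s" | tr -d '[:blank:]'           # prints 'helloworld'
```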
  3. Now you should basically have clean, separated fields to work with, one at a time, having started from multi-field, multi-line command output.

  4. Once you get what is going on fairly well with the above, you can start to look into other more elegant approaches as presented in these answers:

Finding Dynamic Field Widths:

https://unix.stackexchange.com/a/465178/266125

Using perl's unpack:

https://unix.stackexchange.com/a/465204/266125

Awk and other good answers:

https://unix.stackexchange.com/questions/352185/awk-fixed-width-columns

  5. Some stuff just can't be done in a single pass. Like the perl answer above, it basically breaks the problem down into two parts. The first is turning the fixed-width data into delimited data (just choose a delimiter that doesn't occur within any of the values in your fields/records!). Once you have it as delimited data, it makes the processing substantially easier from there on out.
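A sketch of that "convert to delimited first" idea using GNU awk's FIELDWIDTHS (the file name, the widths 10/8/2, and the `|` delimiter are all choices made up for this demo; pick a delimiter that cannot occur in your values):

```shell
# One fixed-width record: 'foo' padded to 10, 'bar' padded to 8, 'ok'.
printf 'foo       bar     ok\n' > /tmp/fw.data

# Split on the fixed widths, trim each field's padding, and re-join
# with a visible delimiter. From here on, everything downstream can
# simply split on '|'.
awk 'BEGIN { FIELDWIDTHS = "10 8 2" }
     { for (i = 1; i <= NF; i++) {
           gsub(/^ +| +$/, "", $i)                 # trim the padding
           printf "%s%s", $i, (i < NF ? "|" : "\n")
       } }' /tmp/fw.data
# prints: foo|bar|ok
```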
Chris
  • Had to upvote to honor your hard work and struggle, but the others were right to vote-to-close your Q. S.O. is really about "one problem -> the best answer". Often there is iteration/refinement of the problem definition involved; sometimes the "best answer" is never found. And even sometimes (often), the same "best answer" applies to many, seemingly different problems. Your learning travel would be perfectly alright as a series of messages to news group comp.unix.shell, or with a slightly different bent on Quora (All IMHO, of course). .... – shellter Aug 04 '20 at 14:14
  • Keep posting, but try to boil shell and awk Qs down to a simple ***`echo "abc" | sed 's/b*/B/'` doesn't produce the output I expect*** Q/A. Ideally copy/pastable input and code into a typical shell environment. With concision and precision in mind, please keep posting here. Also consider applying to write for Linux Journal (or others), as your focus on practicality is highly commendable! Good luck – shellter Aug 04 '20 at 14:15
  • Thanks shellter, I appreciate you taking the time to provide feedback. Having finally resolved my problem (not elegantly), I can see how the question (and answer) was insufficiently defined. The key in my case was that I was also parsing command output, not just static file data, so the column widths were actually dynamic each time you call the command. Once I realised that, the brilliant little pipe for `fieldwidths=` was essential, taken from this post: [link](https://unix.stackexchange.com/a/465178/266125) – Chris Aug 04 '20 at 22:28
  • The hard thing about asking clear and concise questions related to a practical project, particularly when learning, is that you really aren't sure what the problem actually is. So, to fit the SO guidelines, it's easy to fall into the XY problem, or, if avoiding that, to produce really broad questions in the vein of `'my code doesn't work, why?'` I'm trying to avoid the latter, but often realising too late that I was really asking the wrong question. I've just bought a bash scripting book instead, probably what I should have done at the start :-) – Chris Aug 04 '20 at 22:37