
I am looking for a way to highlight the differences between 2 strings. The idea is to show, in a terminal, what characters were changed by iconv. Both strings are already processed to remove leading and trailing spaces, but internal spaces must be handled.

RED="$(tput setaf 1)"    ##    Short variables for the tput ->
CYA="$(tput setaf 6)"    ## -> commands to make output strings ->
CLS="$(tput sgr0)"       ## -> easier to read
str1="[String nâmè™]"    # String prior to iconv
str2="[String name[tm]]" # String after iconv -f utf-8 -t ascii//translit

Ultimately I want to automate the formatting of the differences so they are surrounded by tput color codes that I can echo to the terminal.

${str1} = Highlight in red, characters not common to both strings

${str2} = Highlight in cyan, characters not common to both strings

Wanted Output:

output1="[String n${RED}â${CLS}m${RED}è™${CLS}]"
output2="[String n${CYA}a${CLS}m${CYA}e[tm]${CLS}]"

Most diff utilities I looked at work on the line or word level. I was thinking of parsing the output of cmp for the byte# of the first diff, but I would have to re-parse for multiple differences it seems.
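
For reference, `cmp -l` does list every differing byte in a single pass, so no re-parsing would be needed. It compares the two inputs position by position, though, so it cannot cope with insertions. A minimal sketch, assuming a POSIX-style shell with `mktemp` available; `byte_diffs` is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: list every differing byte between two strings as
# "offset octal1 octal2" (1-based offsets, byte values in octal).
byte_diffs() {
  tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
  printf '%s' "$1" > "$tmp1"
  printf '%s' "$2" > "$tmp2"
  # cmp -l exits non-zero when the inputs differ, which is expected here;
  # its "EOF on ..." message for unequal lengths goes to stderr.
  { cmp -l "$tmp1" "$tmp2" 2>/dev/null || true; } | awk '{ print $1, $2, $3 }'
  rm -f "$tmp1" "$tmp2"
}

# Same length, one substitution: only that byte is reported.
byte_diffs 'name' 'nbme'

# One inserted byte: every byte after the insertion is reported as different.
byte_diffs 'name' 'naame'
```

In the second call a single inserted `a` shifts everything after it, which is why a `diff`-style alignment looks like a better fit for replacements such as `[tm]`.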

Any way I think about it, it seems like it's going to be an involved process, so I just want to make sure I'm not missing an obvious solution or tool.

Right now I'm thinking the easiest way would be to format each string to put a single byte on a new line and then my options open up.

nstr1="$(fold -w1 <<< "$(echo "${str1}")")"
nstr2="$(fold -w1 <<< "$(echo "${str2}")")"
diff <(echo -e "${nstr1}") <(echo -e "${nstr2}")

This is as far as I got, and I didn't want to go further unless I was on the right track. I'm certain there are a zillion ways to do this, but is there a more efficient way to go here?

akovia
  • My ultimate goal is stated in the "Wanted Output", but I think I get your meaning. I thought it relevant to show how I wanted to use the difference output as it might affect how to get there. I'm guessing the question is too broad. I'll remove the question if you think it best. – akovia Dec 20 '15 at 01:31
  • The approach is clear enough - a bit of a nuisance to parse the output of `diff` and construct the outputs, but still better than reinventing `diff`. – Thomas Dickey Dec 20 '15 at 02:19
  • Hopefully I made the question a bit clearer. @Thomas, it seemed like a "nuisance" to me as well which is why I thought I might be overlooking something. – akovia Dec 20 '15 at 02:51
  • Agreed - I don't recall anyone doing exactly this parser, though it reminds me of things that I've done [here](http://invisible-island.net/personal/oldprogs.html#y1983) and [here](http://invisible-island.net/diffstat/diffstat.html). – Thomas Dickey Dec 20 '15 at 02:57
  • Thanks @Thomas Dickey. That actually answers my question. I just didn't want to dive in head first if there was a vastly simpler way. It's certainly not critical for my script, but seemed like a good project to learn from. – akovia Dec 20 '15 at 12:13

2 Answers


To put it all together:

#!/usr/bin/env bash

# Using stdin input, outputs each char. on its own line, with actual newlines
# in the input represented as literal '\n'.
toSingleCharLines() {
  sed 's/\(.\)/\1\'$'\n''/g; s/\n$/\'$'\n''\\n/'
}

# Using stdin input, reassembles a string split into 1-character-per-line output
# by toSingleCharLines().
fromSingleCharLines() {
  awk '$0=="\\n" { printf "\n"; next} { printf "%s", $0 }'
}

# Prints a colored string read from stdin by interpreting embedded color references
# such as '${RED}'.
printColored() {
  local str=$(</dev/stdin)
  local RED="$(tput setaf 1)" CYA="$(tput setaf 6)" RST="$(tput sgr0)"
  str=${str//'${RED}'/${RED}}
  str=${str//'${CYA}'/${CYA}}
  str=${str//'${RST}'/${RST}}
  printf '%s\n' "$str"
}

# The non-ASCII input string.
strOrg='[String nâmè™]'

# Create its ASCII-chars.-only transliteration.
strTransLit=$(iconv -f utf-8 -t ascii//translit <<<"$strOrg")

# Print the ORIGINAL string with the characters that NEED transliteration
# highlighted in RED.
diff --changed-group-format='${RED}%=${RST}' \
  <(toSingleCharLines <<<"$strOrg") <(toSingleCharLines <<<"$strTransLit") |
    fromSingleCharLines | printColored

# Print the TRANSLITERATED string with the characters that RESULT FROM
# transliteration highlighted in CYAN.
diff --changed-group-format='${CYA}%=${RST}' \
  <(toSingleCharLines <<<"$strTransLit") <(toSingleCharLines <<<"$strOrg") |
    fromSingleCharLines | printColored

This yields:

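[Screenshot: both strings printed to the terminal, with the differing characters highlighted in red and cyan respectively.]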

mklement0
  • I almost finished putting this together but you beat me to it and glad you did. I like the `sed` solution and will use it in my code. – akovia Dec 21 '15 at 11:48
  • Good work! On macOS, when the built-in _"Terminal"_ app is set to emulate `vt100`, this results in no colored output. However, declaring the _"Terminal"_ as `ansi` or `xterm-color` results in the desired color output (i.e. change it in the _"Advanced"_ settings tab). Is there any way to get this to also work when set to `vt100`, perhaps by changing the value passed to `tput setaf`? – RobC Jan 17 '19 at 12:12
  • Thanks, RobC, but from what I can tell `xterm-256color` is the default (try as the Guest user, for instance), which does support colors; I'm not familiar with `vt100`. – mklement0 Jan 17 '19 at 12:22
  • Yes, agreed, you're correct. I just restored the defaults (I must have tinkered previously). I'm not fully sure of the differences (if any at all); I've always used the term [ANSI/VT100](https://misc.flogisoft.com/bash/tip_colors_and_formatting) interchangeably regarding color/formatting codes, so I was a bit surprised that there was no colored output when the _"Terminal"_ app is set to `vt100`. I guess this is less of a concern for me now that I know the default setting on macOS (it's also fine with iTerm). – RobC Jan 17 '19 at 12:42
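
The `vt100` observation above is consistent with that terminal type not declaring any color capability in terminfo, in which case `tput setaf` has nothing to emit. A minimal sketch of guarding the color variables, assuming `tput colors` reports the number of supported colors (or `-1`) for the current `TERM`; `set_fg` is a hypothetical helper name:

```shell
# Hypothetical helper: print the escape sequence for foreground color $1,
# or nothing when the current terminal type reports fewer than 8 colors
# (or when tput / the terminfo entry is unavailable).
set_fg() {
  ncolors=$(tput colors 2>/dev/null) || ncolors=0
  if [ "${ncolors:-0}" -ge 8 ] 2>/dev/null; then
    tput setaf "$1"
  fi
}

RED=$(set_fg 1)
CLS=$(tput sgr0 2>/dev/null) || CLS=''

# Prints "changed" in red where colors are supported, plain text otherwise.
printf '%s\n' "${RED}changed${CLS}"
```

This does not make `vt100` show colors; it only keeps the script's output clean when the terminal type cannot.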

@Thomas Dickey answered this in the comments: there is no tool or process that is vastly easier than the approach I was attempting.

Just to finish up, I was able to produce the "Wanted Output" simply enough with the following diff lines.

diff --changed-group-format="\${RED}%=\${CLS}" <(echo -e "${nstr1}") <(echo -e "${nstr2}")|tr -d '\n'
diff --changed-group-format="\${CYA}%>\${CLS}" <(echo -e "${nstr1}") <(echo -e "${nstr2}")|tr -d '\n'

Unfortunately I haven't figured out how to echo the output to interpret the color codes, but that's another question.
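
One possible way to make the markers print as colors is to substitute them after `diff` has run. A minimal sketch, assuming hard-coded ANSI SGR sequences in place of the `tput` output and a terminal that understands them; `expand_colors` is a hypothetical helper name:

```shell
# Hypothetical helper: read text on stdin and replace the literal markers
# ${RED}, ${CYA} and ${CLS} with ANSI SGR escape sequences.
# Assumption: ESC[31m / ESC[36m / ESC[0m stand in for what
# tput setaf 1 / tput setaf 6 / tput sgr0 would emit on a color terminal.
expand_colors() {
  esc=$(printf '\033')
  sed -e 's/[$]{RED}/'"$esc"'[31m/g' \
      -e 's/[$]{CYA}/'"$esc"'[36m/g' \
      -e 's/[$]{CLS}/'"$esc"'[0m/g'
}

# The markers stay literal inside single quotes until expand_colors runs.
printf '%s\n' '[String n${RED}a${CLS}me]' | expand_colors
```

Piping the output of either `diff ... | tr -d '\n'` command through such a filter would be one way to get colored text; leaving the `$` unescaped in the format string, so that the shell expands the variables before `diff` sees them, is another.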

akovia
  • Interesting, but how does that generate _character_-level differences? How do `nstr1` and `nstr2` relate to `str1` and `str2` from your question? – mklement0 Dec 21 '15 at 00:39
  • @mklement0: If I understand what you are asking, I couldn't use `diff` on `str1` & `str2` to find what characters to highlight as `diff` works on the line level. So `nstr1` & `nstr2` are identical to `str1` & `str2` but with a single character on each line so `diff` will work at the character level. – akovia Dec 21 '15 at 01:07
  • Ah, I see; I'd forgotten about your `fold -w1` command to create the each-character-on-its-own line representation. +1 for the approach. Perhaps you can amend your answer to show a self-contained working example. Also, your second command mistakenly contains `nstr1` _twice_. In case you aren't already aware: you can accept your own answer after 48 hours. – mklement0 Dec 21 '15 at 01:16
  • I've posted a solution that puts it all together. Note that I'm using `sed` rather than `fold` to split the strings into indiv. characters, because _GNU_ `fold` is not Unicode-aware, and splits into _bytes_ rather than characters. – mklement0 Dec 21 '15 at 03:22
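
To illustrate the bytes-versus-characters point from the last comment: a non-ASCII character such as `è` occupies more than one byte in UTF-8, so a tool that emits one byte per line splits it into pieces that no longer match anything. A minimal sketch, assuming a POSIX `wc`; `byte_count` and `char_count` are hypothetical helper names:

```shell
# è (U+00E8) written as its two UTF-8 bytes (octal 303 250), so the example
# does not depend on the encoding of the script file itself.
e_grave=$(printf '\303\250')

# wc -c always counts bytes.
byte_count() { printf '%s' "$1" | wc -c | tr -d ' '; }

# wc -m counts characters according to the current locale, so the result
# depends on whether a UTF-8 locale is active.
char_count() { printf '%s' "$1" | wc -m | tr -d ' '; }

byte_count "$e_grave"   # prints 2
char_count "$e_grave"
```

In a UTF-8 locale the two counts differ for this string, which is the situation in which a byte-oriented splitter and a character-oriented one produce different per-line output.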