I work with Git on Windows via TortoiseGit, and I'm currently trying to use this commit-msg hook to check the length of the lines in commit messages.

All is fine when I write messages exclusively with ASCII characters. But when I write a message in Russian, the character counter reports a result twice the actual length. It looks like the counter uses the default Windows encoding or something similar, while the message is saved as a UTF-8 file.

Some highlights:

  • .git/COMMIT_EDITMSG has UTF-8 encoding;
  • echo $line in my hook displays non-ASCII characters correctly;
  • ${#line} returns a value equal to actual_length * 2;
  • I tried different ways of iterating over the characters in a line, and each time the iterator treated each byte as a separate character.
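
The doubling described above is what bash produces when it runs in a byte-oriented locale, where every byte counts as one character. Here is a minimal sketch that reproduces the symptom, assuming bash; the variable name byte_oriented_length is only illustrative, and forcing LC_ALL=C is an assumption standing in for whatever locale the hook actually runs under:

```shell
#!/bin/bash
# "тест" written as raw UTF-8 bytes, so the example does not depend on the
# encoding of this script file.
line=$'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82'

# In the byte-oriented C locale, bash treats every byte as one character.
# LC_ALL is set inside the command substitution only, so the caller's
# locale is left untouched.
byte_oriented_length=$(LC_ALL=C; echo "${#line}")

echo "$byte_oriented_length"   # prints 8: four Cyrillic letters, two bytes each
```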

Update 1: I want to achieve my goal without adding environment dependencies (that is, without installing additional interpreters like Python).

Igor Melnichenko

2 Answers

1

Don't count bytes, count characters: convert (decode) the input from bytes to characters in your programming language. Russian characters encoded in UTF-8 take 2 bytes each. Example (in Python 2, where a plain string literal is a byte string):

$ python

>>> len('тест')
8

>>> len(u'тест')
4

>>> len('тест'.decode('utf-8'))
4
phd
  • I don't want to use Python; I want to modify the shell script from the link I provided. I thought that was clear from the question, but it looks like I need to clarify it – Igor Melnichenko Jan 28 '18 at 16:22
  • And of course I wouldn't ask this question if I knew how to count characters in an encoding-aware manner in Git shell :) – Igor Melnichenko Jan 28 '18 at 16:31
  • I didn't say "Use Python"; Python was only an example. What I'm saying is basically "Use a real programming language". If you prefer Java, let it be Java. I doubt bash is suitable for manipulating Unicode *characters* (as opposed to bytes). – phd Jan 28 '18 at 16:58
0

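For now, echo $line | iconv --from-code UTF-8 --to-code cp866 did the trick. cp866 is a single-byte encoding, so the length of the converted line matches the number of characters.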

It covers my use case (only Cyrillic or Basic Latin characters are expected in messages) but lacks generality. I hope someone knows a cleaner solution.

Here is my current script:

#!/bin/bash
# Rules from http://chris.beams.io/posts/git-commit/#seven-rules
cnt=0

while IFS='' read -r line || [[ -n "$line" ]]; do
  cnt=$((cnt+1))

  # cp866 uses one byte per character, so the length of the converted line
  # equals its character count. If the conversion fails (a character has
  # no cp866 equivalent), fall back to the raw length of the line.
  # printf with a quoted argument preserves whitespace, unlike echo $line.
  if cp866_line=$(printf '%s' "$line" | iconv --from-code UTF-8 --to-code cp866); then
    length=${#cp866_line}
  else
    length=${#line}
  fi

  if [ "$cnt" -eq 1 ]; then
    # Subject line must not exceed 50 characters
    if [ "$length" -gt 50 ]; then
      echo "Your subject line exceeds 50 characters"
      exit 1
    fi
    # Subject line must not end with a period. Match against the original
    # line: indexing it with the converted length points at the wrong byte
    # when the line contains multi-byte characters.
    if [[ $line == *. ]]; then
      echo "Your subject line ends with a period"
      exit 1
    fi
  elif [ "$cnt" -eq 2 ]; then
    # Subject must be followed by a blank line
    if [ "$length" -ne 0 ]; then
      echo "Your subject line is followed by a non-empty line"
      exit 1
    fi
  else
    # Any line in body must not exceed 72 characters
    if [ "$length" -gt 72 ]; then
      echo "The line \"$line\" exceeds 72 characters"
      exit 1
    fi
  fi
done < "$1"
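
A possibly more general variant of the same idea, sketched under the assumption that the installed iconv supports UTF-32LE: that encoding uses exactly four bytes per code point and writes no byte order mark, so dividing the byte count by four gives the number of characters for any valid UTF-8 input, not only Cyrillic and Basic Latin. The helper name char_count is hypothetical and I have only reasoned about it, not tested it in Git for Windows:

```shell
#!/bin/bash
# Hypothetical helper: prints the number of Unicode code points in $1.
# Assumes $1 is valid UTF-8 and that iconv knows the UTF-32LE encoding.
char_count() {
  local bytes
  # pipefail is enabled only inside the command substitution, so a failing
  # iconv (for example on invalid UTF-8) makes the whole pipeline fail.
  bytes=$(set -o pipefail
          printf '%s' "$1" | iconv -f UTF-8 -t UTF-32LE | wc -c) || return 1
  # UTF-32LE stores every code point in exactly 4 bytes.
  echo $((bytes / 4))
}

# "тест" as raw UTF-8 bytes (8 bytes, 4 characters).
sample=$'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82'
char_count "$sample"   # prints 4
```

In the hook, length=$(char_count "$line") could then replace the cp866 conversion, keeping ${#line} as the fallback when the helper returns a non-zero status.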
Igor Melnichenko