How to count empty translations in .po with grep (or other LSB tool)?

Question

I can search for empty translations in vim with a command like this:

/""\n\n

But my task is to count the non-translated strings. Any ideas on how to do this with standard tools that every Linux box should have (no separate packages, please)?

Here is an example of a .po file containing 2 translated and 2 non-translated strings (long and short variants).

msgid "translated string"
msgstr "some translation"

msgid "non-translated string"
msgstr ""

msgid ""
"Some long translated string which starts from new line "
"and can last for few lines"
msgstr ""
"Translation of some long string which starts from new line "
"and lasts for few lines"

msgid ""
"Some long NON-translated string which starts from new line "
"and can last for few lines"
msgstr ""

Steve · Accepted Answer · 2013-02-16T14:14:55.413


Here's one way using awk:

awk '$NF == "msgstr \"\"" { c++ } END { print c }' FS="\n" RS= file

Results for the sample file in the question:

2

Explanation:

Put awk in paragraph mode. Then test the last line in each block. If the last line matches the pattern exactly, count it. Then, at the end of the script, print out the count. If you later decide you want to count the number of translated strings, simply change == to !=. HTH.
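As a small extension of the same paragraph-mode idea, the sketch below reports both counts in one pass. The function name po_counts is made up for illustration. It assumes entries are separated by truly empty lines, and it counts a real PO header entry (empty msgid with metadata in its msgstr) as translated.

```shell
#!/bin/sh
# Sketch only: po_counts is a hypothetical helper name, not a gettext tool.
po_counts() {
    awk '
        # Paragraph mode: each blank-line-separated entry is one record
        # and each line of the entry is one field.
        # An entry whose last line is an empty msgstr is untranslated.
        $NF == "msgstr \"\"" { untranslated++; next }
        # Any other entry that has a msgstr line counts as translated.
        # Comment-only paragraphs have no msgstr line and are skipped.
        /(^|\n)msgstr / { translated++ }
        END { printf "translated=%d untranslated=%d\n", translated, untranslated }
    ' FS='\n' RS= "$1"
}
```

Usage: po_counts messages.po. Because the totals are printed with %d, a file with no matching entries prints 0 rather than an empty field.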

From the comments below, to handle empty lines containing whitespace:

You'll need to use a regular expression, like: RS="\n{2,}|\n([ \t]*\n)+|\n$" (this could perhaps be simplified). However, it should be noted that the ability for RS to be a regex is a GNU awk extension. Other awks will fail to handle multi-character record separators in some way. Fortunately, the above file format looks fairly rigid, so handling lines containing whitespace shouldn't be necessary.

If faced with separators including whitespace, the quick fix is a call to sed:

< file sed 's/^[[:space:]]*$//' | awk ...
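If a regex RS is not available, one alternative is to skip paragraph mode and track state line by line. The following is a minimal sketch under a few assumptions: count_untranslated is a made-up name, plural entries (msgstr[0] and so on) are not handled, and a continuation line after msgstr "" is taken to mean the entry has a translation.

```shell
#!/bin/sh
# Sketch only: count_untranslated is a hypothetical helper name.
count_untranslated() {
    awk '
        # Strip trailing blanks and carriage returns, so a line holding
        # only whitespace behaves like an empty separator line.
        { sub(/[ \t\r]+$/, "") }
        # An empty msgstr is only a candidate until the next line is seen.
        /^msgstr ""$/   { if (pending) n++; pending = 1; next }
        # A quoted continuation line means this is a multi-line translation.
        pending && /^"/ { pending = 0; next }
        # Any other line (blank, comment, next msgid) confirms it was empty.
        pending         { n++; pending = 0 }
        # Handle an empty msgstr on the last line, then print 0 if none.
        END { if (pending) n++; print n + 0 }
    ' "$1"
}
```

Usage: count_untranslated messages.po. The bracket expression [ \t\r] is used instead of [[:space:]] because some older awk builds do not support POSIX character classes.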

edited Feb 16 '13 at 14:14

answered Feb 10 '13 at 14:57



Tested on two files - seems to work ok. I'll test a little bit more and if there would be no issues you'll get your bounty and my best thanks :) – Sergey P. aka azure Feb 10 '13 at 18:11
That's a very good solution (took me a while to understand it). But to be 100% sure, is there a way to make `RS` handle empty lines that contain some spaces? – 244an Feb 16 '13 at 01:37
@244an: Thanks. Yes, but only `gawk` can handle multi-character record separators. SO YMMV. Please see the update above. – Steve Feb 16 '13 at 14:26
@Steve, can the script be modified to print 0 if no non-translated strings are found? Current version just outputs empty line. – Sergey P. aka azure Feb 21 '13 at 09:04
@SergeyP.akaazure: Yes. Simply change `print c` to `print c ? c : "0"`. That's a [ternary operator](http://en.wikipedia.org/wiki/Ternary_operation). HTH. – Steve Feb 21 '13 at 10:34

mr.spuratic · Answer 2 · 2013-02-13T17:17:33.020

I suggest using the available gettext tools, instead of trying to parse .po files directly:

$ msggrep -v -T -e "." test.po 
msgid "non-translated string"
msgstr ""

msgid ""
"Some long NON-translated string which starts from new line and can last for "
"few lines"
msgstr ""

The msggrep flags are:

-v invert match
-T apply next pattern to msgstr
-e search pattern

i.e. show any msgstr which does not match /./, and is therefore empty.

Since msggrep doesn't have -c, the count in a one-liner is:

 msggrep -v -T -e "." test.po  | grep -c ^msgstr

(msggrep has been part of the gettext package since v0.11, Jan 2002. LSB Core aka ISO/IEC 23360-1:2006(E) only mandates the gettext and msgfmt binaries, but I've yet to see a system without it, so it should hopefully meet your requirements.)
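Since msgfmt is one of the binaries mentioned above, another option is to let it do the counting through its --statistics option. The sketch below assumes GNU gettext's msgfmt is on PATH and that its English summary names the untranslated count as a number followed by the word "untranslated"; the function name po_untranslated_msgfmt is made up for illustration.

```shell
#!/bin/sh
# Sketch only: po_untranslated_msgfmt is a hypothetical helper name.
# Assumes GNU gettext msgfmt is installed.
po_untranslated_msgfmt() {
    # --statistics writes its summary to stderr, so merge it into stdout.
    # -o /dev/null discards the compiled catalog.
    # LC_ALL=C keeps the summary in English so it can be parsed.
    LC_ALL=C msgfmt --statistics -o /dev/null "$1" 2>&1 |
        awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^untranslated/) n = $(i - 1) }
             END { print n + 0 }'
}
```

Usage: po_untranslated_msgfmt test.po. When the summary does not mention untranslated messages at all, the function prints 0.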

F. Hauri - Give Up GitHub · Answer 3 · 2013-02-13T16:08:23.953

As a (nice) awk solution has already been given, here are five other ways:

All commands were tested with your sample and a good .po file.

Using `sed`

sed -ne '/msgstr ""/{N;s/\n$//p}' <poFile | wc -l
2

Explained: each time I find msgstr "", I append the next line (N). Then, if a newline can be removed from the end of the pattern space (s/\n$//), which means the appended line was empty, I print the result (p). Finally, wc -l counts the printed lines.
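One case the N-based command can miss is an untranslated entry on the very last line of the file, because there is no next line to append. A possible variant is sketched below; count_untranslated_sed is a made-up name, and the variant also accepts separator lines that contain only whitespace.

```shell
#!/bin/sh
# Sketch only: count_untranslated_sed is a hypothetical helper name.
count_untranslated_sed() {
    # $!N             append the next line unless this is the last line
    # s/\n[...]*$//   drop the appended line if it is empty or whitespace
    # /\n/!p          print only if no continuation line remains
    sed -n '/^msgstr ""[[:space:]]*$/{
        $!N
        s/\n[[:space:]]*$//
        /\n/!p
    }' "$1" | grep -c '^msgstr'
}
```

Usage: count_untranslated_sed poFile. Note that grep -c prints 0 but exits with status 1 when nothing matches, which matters if the calling script uses set -e.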

Bash only

Without the use of any binary other than bash:

total=0
while read -r line; do
    # An empty msgstr followed by an empty line (or end of file) is untranslated
    if [ "$line" = 'msgstr ""' ]; then
        read -r line
        [ -z "$line" ] && ((total++))
    fi
done <poFile
echo "$total"
2

Explained: each time I find msgstr "", I read the next line; then, if it is empty, I increment my counter.

Other bash way

mapfile -t line <poFile
count=0
for ((i=${#line[@]};i--;));do
    [ -z "${line[i]}" ] && [ "${line[i-1]}" == 'msgstr ""' ] && ((count++))
  done
echo $count
2

Explained: read the entire .po file into one array, then walk the array looking for an empty element whose previous element contains msgstr "", increment the counter for each one, then print.

Perl (in command line mode)

perl -ne '$t++if/^$/&&$l=~/msgstr\s""\s*$/;$l=$_;END{printf"%d\n",$t}' <poFile
2

Explained: each time I find an empty line and the previous line (stored in variable $l) contains msgstr "", I increment the counter.

Dash (not bash!)

count=0
while read line ; do
    [ "$line" = "" ] && [ "$prev" = 'msgstr ""' ] && true $((count=count+1))
    prev="$line"
  done <poFile
echo $count
2

Based on the perl sample, this works in both bash and dash.

imp25 · Answer 4 · 2013-01-28T04:24:07.837


~~Try:~~

grep -c '^""$'

it counts the lines where the only content is two ".

EDIT:

Following from your comment I see that the above does not meet your needs. To perform a multi-line match you could use GNU grep in the following way:

grep -Pzo '^msgstr ""\n\n' en.po | grep -c msgstr

This was tested and found to work using GNU grep 2.14. I however do not know if GNU grep is standard enough for you.

Explanation of 1st grep:

-P activate the Perl regex extension.

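-z treat the input as NUL-separated records instead of newline-separated lines. A .po file contains no NUL bytes, so the whole file becomes a single record and the pattern can match across newlines.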

-o print 'only-matching', required because -z is in use; otherwise we'd print the whole file.

Explanation of 2nd grep:

-c count the number of lines matching, in this case msgstr. This has to be in a separate grep statement as -c would return 1 if used with -z.

edited Jan 28 '13 at 04:24

answered Jan 25 '13 at 14:50


`msgstr ""` - this is a line for non translated string. And it would not count with such grep invocation. – Sergey P. aka azure Jan 25 '13 at 20:22
grep -Pzo '^msgstr ""\n\n' language/locale/en_US/LC_MESSAGES/messages.po | grep -c msgstr 0 And file contains many string but no translations. Example: msgid "User Name" msgstr "" msgid "Password" msgstr "" msgid "Forgot Password ?" msgstr "" – Sergey P. aka azure Jan 29 '13 at 14:02
Does `grep -Pzo '^msgstr ""\n\n' language/locale/en_US/LC_MESSAGES/messages.po` produce any output? If not, try revising the search string `^msgstr ""\n\n`, starting with removing the requirement for there to be no text before `msgstr` (remove the ^). – imp25 Jan 29 '13 at 14:38
`grep -Pzo '^msgstr ""\n\n' language/locale/en_US/LC_MESSAGES/messages.po | grep -c msgstr 0` And file contains many string but no translations. Example: ` msgid "User Name" msgstr "" msgid "Password" msgstr "" msgid "Forgot Password ?" msgstr ""` – Sergey P. aka azure Jan 29 '13 at 14:38
yes, it can produce one-string message that file does contains the string matching to the pattern – Sergey P. aka azure Jan 29 '13 at 14:40
In that case, edit the search string as I suggested, and report back what you've tried, ideally by updating your original post. To truly match the vim string you could use the pattern `'""\n\n'` – imp25 Jan 29 '13 at 14:42
I tried `$ LANG="" grep -Pzo '^msgstr ""\n\n' messages.po` and got `Binary file messages.po matches` – Sergey P. aka azure Jan 30 '13 at 10:28
do you have any other suggestions? – Sergey P. aka azure Feb 05 '13 at 06:07

John Kugelman · Answer 5 · 2013-01-25T14:58:09.217

grep -n ^msg your.po | grep -v '""' | uniq -D -f1

This looks for lines starting with msg, ignores ones that are just empty strings (""), and then uses uniq to look for duplicate lines (ignoring the msgid/msgstr field).
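Note that uniq -D is a GNU extension. As a rough portable alternative, the sketch below counts entries whose msgstr is a verbatim copy of the msgid. It assumes single-line entries only, and count_copied is a made-up name for illustration.

```shell
#!/bin/sh
# Sketch only: count_copied is a hypothetical helper name.
# Counts single-line entries whose msgstr repeats the msgid verbatim.
count_copied() {
    awk '
        # Remember the quoted text that follows "msgid ".
        /^msgid "/  { id = substr($0, 7); next }
        # Compare the quoted text after "msgstr " with it. Empty strings
        # are skipped, since they mark multi-line or untranslated entries.
        /^msgstr "/ { if (id != "\"\"" && substr($0, 8) == id) n++ }
        END { print n + 0 }
    ' "$1"
}
```

Usage: count_copied your.po.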

Sample output from a CUPS file:

$ grep -n ^msg /usr/share/locale/es/cups_es.po | grep -v '""' | uniq -D -f1
3742:msgid "ParamCustominCutInterval"
3743:msgstr "ParamCustominCutInterval"
3745:msgid "ParamCustominTearInterval"
3746:msgstr "ParamCustominTearInterval"
3858:msgid "Quarto"
3859:msgstr "Quarto"
3967:msgid "Stylus Color Series"
3968:msgstr "Stylus Color Series"
3970:msgid "Stylus Photo Series"
3971:msgstr "Stylus Photo Series"
3973:msgid "Super A"
3974:msgstr "Super A"
