gettext .po file comparing/working with strings between files

Question

I want to translate a project that consists of multiple files for different apps. To keep things consistent across all files, it would be useful to have a translation tool that can load a bunch of .po files and, for example, cross-check them for identical or similar reference msgid strings to make sure the translations match. Ideally it would also allow translating multiple files/strings in one go when the reference is the same.

Does something like this exist?

score 0 · Answer 1 · answered Nov 14 '14 at 15:21

I had to do the same thing for a project (CiviCRM). One suggestion I received was to check with OpenRefine, which presumably has a few tools to find similar strings, but I wanted to automate the process with something simple, so I wrote a short script.

Fair warning: this is not the most efficient approach. It compares every string against every other string, so it can take a while to run on big projects (we have around 16 000 strings in CiviCRM).

For reference: https://github.com/civicrm/l10n/blob/master/bin/find-similar-strings.php

And since SO doesn't like links as answers, more details here:

#!/usr/bin/php
<?php

/**
* Reads from STDIN and finds similar-looking strings.
*
* Usage:
* cat *.pot | ../bin/find-similar-strings.php
*
* Context:
* http://forum.civicrm.org/index.php/topic,34805.0.html
*/

// Default match threshold is 90%. An optional first argument overrides it.
$threshold = (! empty($argv[1]) ? (float) $argv[1] : 90);

// Read all input from stdin.
$src = file_get_contents("php://stdin");

// http://stackoverflow.com/a/1070937/2387700
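// Extract the "msgid" strings. Note: only the first quoted segment is
// captured, so a multi-line msgid (which starts with msgid "") comes back
// empty and is skipped by the loop below.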
preg_match_all('/msgid\s+\"([^\"]*)\"/', $src, $matches);
$msgids = $matches[1];

// Optional: sort the strings alphabetically so the output is easier to scan.
// sort($msgids);
foreach ($msgids as $key1 => $msgid1) {
  foreach ($msgids as $key2 => $msgid2) {
    $percent = 0;
    if ($msgid1 && $msgid2 && $msgid1 != $msgid2) {
      if (similar_text($msgid1, $msgid2, $percent)) {
        if ($percent > $threshold) {
          $percent = (int) $percent;
          echo "$msgid1 [$percent %]\n";
          echo "$msgid2 \n\n";
        }
      }
    }
  }

  // To avoid comparing the same pair twice, we unset the string we just
  // processed, so that the inner loop gets shorter on every pass.
  unset($msgids[$key1]);
}

This reads the .pot files from standard input (source strings, but I guess you could run it on .po files as well) and compares each string against all the others, printing every pair whose similarity is above the threshold.
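One caveat: the regex in the script captures only the first quoted segment after msgid, so entries that are wrapped over several lines are not compared. A minimal sketch of one way to join the continuation lines follows. The function name extract_msgids is hypothetical (it is not in the original script), and escape sequences such as \" are left as they appear in the file:

```php
<?php
// Hypothetical helper, not part of the original script: returns every
// non-empty msgid found in .po/.pot source text, joining wrapped lines.
function extract_msgids(string $src): array {
  // "msgid" followed by one or more quoted segments, each of which may
  // sit on its own line. Backslash escapes are allowed inside a segment.
  $pattern = '/^msgid\s+((?:"(?:[^"\\\\]|\\\\.)*"\s*)+)/m';
  preg_match_all($pattern, $src, $matches);

  $msgids = [];
  foreach ($matches[1] as $block) {
    // Collect the contents of each quoted segment and concatenate them.
    preg_match_all('/"((?:[^"\\\\]|\\\\.)*)"/', $block, $segments);
    $msgid = implode('', $segments[1]);
    // The header entry has an empty msgid; skip it.
    if ($msgid !== '') {
      $msgids[] = $msgid;
    }
  }
  return $msgids;
}
```

The result could replace $matches[1] in the script, for example $msgids = extract_msgids($src);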

I hesitated over sorting the strings alphabetically, but I found more than a few cases where a string had a stray leading space, a typo, etc.

Another possible improvement would be to first check for the length of the string, and skip strings that are of very different lengths.
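A rough sketch of that length check, under some assumptions. The percentage reported by similar_text() is twice the number of matching characters divided by the combined length of both strings, times 100, so it can never exceed 200 * min(len1, len2) / (len1 + len2). Pairs whose upper bound is already at or below the threshold can be skipped without calling similar_text() at all. The function names max_possible_percent and find_similar are made up for this example, and unlike the original script it returns the pairs instead of printing them:

```php
<?php
// Hypothetical helper: upper bound on the percentage that similar_text()
// can report for two strings, based only on their byte lengths.
function max_possible_percent(string $a, string $b): float {
  $total = strlen($a) + strlen($b);
  if ($total === 0) {
    return 0.0;
  }
  return 200.0 * min(strlen($a), strlen($b)) / $total;
}

// Hypothetical variant of the main loop: returns [string1, string2, percent]
// for every pair of distinct strings that is more than $threshold % similar.
function find_similar(array $msgids, float $threshold = 90.0): array {
  // Drop empty strings and exact duplicates, then reindex from 0.
  $msgids = array_values(array_unique(array_filter(
    $msgids,
    fn($s) => $s !== ''
  )));

  $pairs = [];
  $count = count($msgids);
  for ($i = 0; $i < $count; $i++) {
    // Start at $i + 1 so that each pair is compared only once.
    for ($j = $i + 1; $j < $count; $j++) {
      // Cheap pre-filter: the lengths alone rule this pair out.
      if (max_possible_percent($msgids[$i], $msgids[$j]) <= $threshold) {
        continue;
      }
      similar_text($msgids[$i], $msgids[$j], $percent);
      if ($percent > $threshold) {
        $pairs[] = [$msgids[$i], $msgids[$j], (int) $percent];
      }
    }
  }
  return $pairs;
}
```

With the default threshold of 90, this skips any pair where the shorter string is less than roughly 82% of the length of the longer one. How much time it saves depends on how varied the string lengths are in the project.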
