Chromium translations explained: part 2
In the first part of this series of posts about the Chromium translations, I covered Grit, the format of translations used by upstream for Chromium (and Google Chrome, ChromeOS, etc.). In another post, I recently explained the release management of this project, showing that multiple branches evolve in parallel, inside the so-called Channels. In this part, I will cover the interaction with Launchpad, and show how the strings are converted back and forth, how the Launchpad contributed strings are merged with the upstream strings, and the various problems that have come up since contributions started to flow.
The idea of importing the Chromium translation into Launchpad came up during the Ubuntu Lucid cycle when Chromium was considered as the default browser to replace Firefox. Obviously, it didn’t happen. One of the arguments against Chromium was that it was not translatable by the community. I had never worked on translations before. As the packager of Chromium, I knew that Grit was used, and that Launchpad only knew about GetText (and that it would not change anytime soon). I had no idea of what it would mean to create the bi-directional converter, or how it would work (the import/export, the merging, etc.). Once it became clear that no one else would do it, I decided to give it a try. After all, how hard could that be? That was in early October 2010.
For me, the only way such a thing could work was that:
- it must be automatic, even the merging
- it must benefit all branches, not only the stable builds like most projects do
- everything contributed in Launchpad must be usable by upstream and other projects
So all those conditions have to be achievable somehow. What was needed was to:
- select a branch to follow
- define a usable workflow
- write the converter
- select the templates to expose in Launchpad
and see how it goes.
Choice of a branch to follow
As explained in a previous post, Chromium lives in a trunk which has 3 main branches: dev, beta and stable.
It looks like this:
I had to decide which branch the Launchpad translations should track. After pondering stable vs trunk for a while, I decided to go with trunk. The reason is that the release cycle is now 6 weeks long. That’s awfully short even if you do the string freeze at the branching date (the earliest date possible). It is probably achievable by contracted translators, but I didn’t see it happening with the community translators. They need more time. So I went with trunk, with the idea of merging the strings downward into the more stable branches each time there is an upstream release (the orange dots in the diagram above). The delta after the first 2 cycles should be minimal (strings that were in a branch but are no longer in trunk).
Being automatic means the extraction of translation files must be done by the package itself or by the packager’s assistant (in my case, Drobotik). In most projects, pkgstriptranslations is used at build time but here, it could not work because a/ my builds of trunk happen in a PPA, which doesn’t allow that and b/ the package doesn’t contain the expected gettext files.
Fortunately, Launchpad is able to work with a bzr branch for the import. That branch must contain the gettext templates (.pot files) and the corresponding upstream translation files (.po). It is also possible to get the results in the same or in another bzr branch. In the case of another export branch, it only contains the improved “.po” files, not the templates (as they are not modified). So for Chromium, it was just a matter of populating a bzr branch with generated gettext files for the export (LP import), and reading the other branch back for the import (LP export).
Here is what it currently looks like:
Here are the steps:
To feed Launchpad:
- after each daily build of trunk (where all the Grit files are updated along with the rest of the tree), the converter (in light blue) takes the selected templates (grd) and translation files (xtb) and turns them into gettext pot and po files (the step 1 in yellow in the diagram). The result is committed into a bzr branch (only when there is something new). Note that this is not a merge: only the upstream translations are committed in this branch.
- that bzr branch is pushed to Launchpad to the location configured as “Import Branch” (yellow 2).
- Launchpad/Rosetta gets the commit, and merges it (that step takes minutes to hours).
- Some random time later, Launchpad/Rosetta commits this merge into the configured “Export Branch”.
Note: this happens after the source package is created. Doing it during the process would mean that a/ the logic would live in the packaging and b/ the commit and the push would be triggered (and fail) for anyone trying to build the package, which is surely not desirable.
To receive from Launchpad:
- during a build (any build, trunk/dev/beta/stable), when the source tarball is created (by the get-orig-source rule), the export branch is fetched from Launchpad (yellow 3)
- the converter runs with all the selected grd templates for the considered branch, reads the gettext files just fetched in the previous step and merges the strings matching the (possibly older) template (yellow 4)
- after doing that, it generates the improved grit files, and a series of patches that are both bundled in the source tarball and made public (yellow 5).
Note: the choice of bundling the translation improvements into the source tarball was made because having them in the packaging branch (debian/ dir) would have meant letting the get-orig-source rule commit changes into its own branch. The drawback is that for stable builds, where the updates are less frequent, it is not (easily) possible to refresh the strings without creating a new tarball.
A first problem is already visible here: we have no control over the Launchpad export. It usually happens too late for a Launchpad contributed string to catch the next daily build (it takes 2 cycles).
If the choice of trunk for the import proves to be inefficient for the stable builds in the official repositories, it should be pretty easy to feed Launchpad with the merge of all branches stacked with the freshest ones on top. Launchpad will then have a pool of strings and each build will still be able to extract what it needs. It should represent only one or two dozen extra strings (less than 2%) based on the current figures.
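To give an idea of what the merge against a (possibly older) template means, here is a minimal sketch. It is not the converter's actual code: the function name and the dictionary layout are made up for illustration, and it assumes that a non-empty Launchpad string wins over the upstream one.

```python
def merge_translations(template_ids, upstream, launchpad):
    """Pick the translations to ship for one language (hypothetical helper).

    template_ids: message ids found in the template of the branch being built
    upstream:     {message id: translation} read from the upstream xtb file
    launchpad:    {message id: translation} read from the Launchpad export
    """
    merged = {}
    for msg_id in template_ids:
        if launchpad.get(msg_id):
            # Assumption: a non-empty Launchpad string is preferred.
            merged[msg_id] = launchpad[msg_id]
        elif msg_id in upstream:
            merged[msg_id] = upstream[msg_id]
        # Strings unknown to this template are simply never visited.
    return merged
```

Because the loop is driven by the template of the branch being built, strings that only exist in trunk are left out of a beta or stable build without any special handling.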
The converter has to deal with the Grit and GetText formats, which both have their own pros and cons:
- The Grit files are XML files. It means that the strings they contain must be properly encoded so that XML parsers (like the python minidom and sax parsers) are happy with them. It applies to numeric character references (such as &#1234;) and to XML entities such as &lt; and &quot;. Exposing those encoded characters in Launchpad doesn’t make sense, so they have to be decoded, and re-encoded when they are merged back to Grit.
- placeholders (for variables) are named, and the translated strings must have the same named placeholders as in the templates. In addition to the description, placeholders also have a sample value (to give an even better idea of the context to translators). Gettext has no way to expose those.
- in GetText, if either the ‘msgid’ or the ‘msgstr’ entry begins/ends with a ‘\n’, both must begin/end with a ‘\n’
- no control chars allowed other than \t \n \r. Some langs use things like \d or \v or even regular ‘\’ which must be quoted
- lines are folded almost randomly depending on what tool wrote them (so a diff between the gettext imports and exports is usually unreadable)
- Grit strings could be made conditional (per lang, platform, various defines) but GetText has no way to do that (so I had to pass that as comments and of course Launchpad cannot do anything useful with them)
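The decoding and re-encoding of XML entities mentioned in the first item can be sketched with the Python standard library. This is only an illustration with made-up function names, not what the converter does internally. Note that this naive version does not remember which characters were originally written as numeric references, so on its own it would not give a minimal diff.

```python
from html import unescape
from xml.sax.saxutils import escape


def decode_for_gettext(xml_text):
    """Turn entities and numeric character references into plain characters."""
    return unescape(xml_text)


def encode_for_grit(text):
    """Escape &, < and > (plus double quotes) before writing back to XML."""
    return escape(text, {'"': '&quot;'})
```

For example, `decode_for_gettext('Tom &amp; Jerry')` gives `'Tom & Jerry'`, which is what a translator should see in Launchpad.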
Some strings need real line breaks (\n), but both in Grit and in GetText, a ‘\n’ is meaningless, so they have to be quoted (and python has its own weird way of dealing with \n and \\ in strings and regexes).
There is also a difference regarding the lang codes between upstream and Launchpad:
- underscores (ubuntu: zh_CN, upstream: zh-CN)
- lang aliases (ubuntu: pt, pt_BR, upstream: pt-PT, pt-BR)
- more specific lang codes (no vs nb)
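A small mapping is enough to handle the first two differences. The sketch below only covers the codes quoted in this list; the table and function names are hypothetical, and other special cases (such as no vs nb) would need their own entries.

```python
# Upstream codes whose Launchpad name is not a plain '-' to '_' swap.
UPSTREAM_TO_LAUNCHPAD = {'pt-PT': 'pt'}
LAUNCHPAD_TO_UPSTREAM = {lp: up for up, lp in UPSTREAM_TO_LAUNCHPAD.items()}


def to_launchpad(code):
    """Convert an upstream lang code (zh-CN) to the Launchpad form (zh_CN)."""
    return UPSTREAM_TO_LAUNCHPAD.get(code, code.replace('-', '_'))


def to_upstream(code):
    """Convert a Launchpad lang code (zh_CN) back to the upstream form (zh-CN)."""
    return LAUNCHPAD_TO_UPSTREAM.get(code, code.replace('_', '-'))
```

Keeping the exceptions in a single table makes the reverse mapping trivial to derive, which matters since the codes have to survive a full round trip.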
Also, because the final goal is to submit the strings back to upstream, the end-result must be a minimal diff, suitable to be a reviewable patch. It means the whole process must be a bijection: grit2gettext(gettext2grit(string)) == string
Fortunately, it proved to be possible without exposing too much complexity in the gettext files.
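The round-trip requirement is easy to turn into a test. The sketch below uses a toy pair of functions that only quote backslashes and line breaks; they stand in for the real conversion functions and are not taken from the converter.

```python
def quote(text):
    # Double the backslashes first, then write each line break as
    # a backslash followed by the letter n.
    return text.replace('\\', '\\\\').replace('\n', '\\n')


def unquote(text):
    # Walk the string once so that a doubled backslash is never
    # mistaken for the start of a quoted line break.
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text):
            nxt = text[i + 1]
            out.append('\n' if nxt == 'n' else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)


def is_round_trip_safe(samples):
    """Check that unquote(quote(s)) gives back s for every sample."""
    return all(unquote(quote(s)) == s for s in samples)
```

Running such a check over every string of every template is a cheap way to catch a conversion that is not reversible before anything is sent to Launchpad.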
The first few weeks exposed all kinds of unexpected problems, forcing me to add extra layers of protection.
First, various unit tests, to make sure all new problems are covered for good and that changes don’t introduce regressions. The converter also performs a lot of sanity checks to make sure it will not send bogus strings to either Launchpad, or to the package and to upstream. There is a problem though. When a bogus string is detected, besides skipping it, how should the error be reported back to the translator? As it’s an external process, Launchpad is not aware of the error, and cannot pass it to the translators. Translators see this string as done in Launchpad so they have no reason to come back to it. As of today, this issue is still unsolved.
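As an illustration of what such a sanity check can look like, here is a minimal sketch covering two of the GetText rules listed earlier (matching leading/trailing newlines, and no control characters other than \t, \n and \r). The function name and the messages are hypothetical; the real converter checks more than this.

```python
import re

# Control characters other than \t (0x09), \n (0x0a) and \r (0x0d).
_FORBIDDEN = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def check_translation(msgid, msgstr):
    """Return a list of problems found in one translated string."""
    problems = []
    if msgid.startswith('\n') != msgstr.startswith('\n'):
        problems.append('leading newline mismatch')
    if msgid.endswith('\n') != msgstr.endswith('\n'):
        problems.append('trailing newline mismatch')
    if _FORBIDDEN.search(msgstr):
        problems.append('forbidden control character')
    return problems
```

Returning the reasons instead of a simple yes/no keeps the door open for reporting them to translators, if a channel for that ever exists.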
If you are interested in the implementation details, the code is here. All in all, it was a lot of guesswork, trial and error, sometimes fun, sometimes frustrating. I also filed a handful of bugs against Rosetta, most of which are now fixed.
Among the 30 grd files (templates) that the upstream tree carries, only a handful contain strings that are translatable, and that are worth importing into Launchpad.
It is the biggest template of all (~2680 strings). It is also the most complex one, as it has a lot of conditions for platform specific (Linux, Mac, Windows, ChromeOS) or lang specific strings. It contains everything that is not related to a/ the branding (Chrome vs Chromium), b/ the developer tool and c/ webkit. It should not contain Chrome or Chromium branding specific strings; they belong in the next two templates. If you find some, you should file a bug.
It contains about 100 strings related to the Chromium branding. It should not contain anything related to Google Chrome.
Similarly, it is related to the Google Chrome branding. Both this and the previous template must contain the same set of strings (unless it is about a Chrome only feature). I initially imported this template into Launchpad and later dropped it for two reasons: a/ it is not used by Chromium and b/ it became obvious that Google will never accept contributions for it (more on this in Part 3).
Strings related to the new policy template feature that can be used to pre-configure Chromium system-wide.
A small template containing (~50) UI Strings, mostly used in Dialog windows (Close, Open File, Cut, Paste..)
Also coming from Webkit, but it is ahead of the Webkit already translated in Launchpad. It is small enough (~60 strings) to not mandate a complex gateway to tie those two together.
Of course, figures vary, especially in generated_resources, which is the most active template. Here is what we get from upstream for the 50 langs supported (without any contribution from Launchpad):
What about the other grd files? As explained in Part 1, Grit is not only about translations, it is also about resources. So Grit is also able to bundle files within the final lang-packs (.pak files). As such, when we want to create lang-packs for langs not supported by upstream (like for example Galician and Basque), the grd mentioned above and a few others must be updated to generate the corresponding .pak files (and in addition to that, the main build rules file – a gyp file called build/common.gypi - must also be patched to include those new langs). The converter is able to do all that automatically.
There are also other grd files that could be used to tweak the UI per lang. For example chrome/app/resources/locale_settings_linux.grd or chrome/app/resources/locale_settings.grd. They could be used to change a font, a font size, or tweak the width/height of a widget per lang. I’m currently not exposing those in Launchpad. While it could fit into the Rosetta UI, it seems to be far too technical, so if something ever needs to be changed there, it is best to do it manually with a patch and upstream it quickly. File a bug if you feel you need to change something in those templates.
Upstream vs Launchpad: about a month ago, Launchpad introduced some presumably minor changes in the way translations are handled. The effect on the Chromium translations was disastrous. There used to be a clear way to identify translations as being “from upstream” vs “new in Launchpad” vs “updated in Launchpad”. Once those changes landed, the thousands of strings “updated in Launchpad” turned green as being “from upstream”. I reacted quickly to that by extending my converter to produce my own dashboard as I depend on those figures. It worked fine for a few days, then the situation worsened when upstream landed a huge batch of translations from its contracted translators. All those strings took precedence over the community strings (which moved to the “need review” state). That in itself annoyed some of the most active translators, enough for them to contact me and ask what was going on, but it was still quite acceptable. The situation became even worse last week, when all the improved strings simply disappeared from Launchpad. I’ve been told by my Launchpad contact that nothing has been lost (even if it had, I still have everything in the bzr branches so you don’t have to worry if you have contributed) but still, something is definitely weird. I hope we will find a solution quickly.
Error reporting: as we have seen, because the conversion happens outside of Launchpad (on my own hardware), there is no easy way to inform the translators when an error is detected. Even if I expose those errors in my dashboard, it won’t reach enough translators as it is foreign to their own workflow. Maybe the LP API could be extended to tag strings, preferably with a reason, as needing review.
Propagation delay: when a translator submits a string in Launchpad, say at 6pm UTC, the next daily build (at 3am UTC the next day) will not see it as Launchpad will only export that string hours later (often around 5 or 6am UTC). It means it takes 2 days, while 1 should be enough. A faster export, or a better scheduling would be nice to have.
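The arithmetic behind this delay can be spelled out with a few lines of Python. The dates below are arbitrary and the function name is made up; only the 3am UTC build time and the typical export time come from the description above.

```python
from datetime import datetime, timedelta


def first_build_with_string(exported_at, build_hour=3):
    """Return the first daily build starting at or after the export time."""
    build = exported_at.replace(hour=build_hour, minute=0, second=0, microsecond=0)
    if build < exported_at:
        build += timedelta(days=1)
    return build


# A string submitted at 6pm UTC and exported at 5:30am UTC the next day
# (arbitrary example dates).
submitted = datetime(2011, 1, 10, 18, 0)
exported = datetime(2011, 1, 11, 5, 30)
delay = first_build_with_string(exported) - submitted
```

With these values the string only lands in the build started 33 hours after it was submitted. Had the export happened before 3am UTC, the delay would have been 9 hours.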
Conditional strings: some strings are wrapped in an <if expr="…"></if> condition. Such conditions cannot be exposed in Launchpad. It means that in order to reach the golden 100%, most translators would translate those strings even when they are not supposed to. For example, there are some strings with "lang not in ['ar', 'ro', 'lv']" that Romanian translators translated. I resorted to pruning those out during the final merges, but it is still confusing to expose those strings to translators.
Some other conditions are platform specific, like "os == 'darwin' or pp_ifdef('chromeos')". Those are more a matter of priorities. If you are a translator with little time to dedicate to this project and you are only willing to contribute to Chromium on Linux, you may wish to skip those strings. This particular problem could theoretically be solved by splitting the template but there are so many conditions that it is difficult to achieve. Maybe those tests could be turned into tags, and those tags could be used as filters in Launchpad.
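To illustrate the pruning, here is a minimal sketch that evaluates such a condition for a given lang and platform. It assumes the expressions are plain Python expressions over `lang`, `os` and `pp_ifdef`, as the two examples quoted above suggest; the function name and the default platform value are hypothetical, and `eval` should only ever be fed expressions coming from the trusted upstream tree.

```python
def condition_applies(expr, lang, platform='linux', defines=()):
    """Evaluate a condition such as "lang not in ['ar', 'ro', 'lv']"."""
    env = {
        'lang': lang,
        'os': platform,
        # pp_ifdef('name') is true when 'name' is among the build defines.
        'pp_ifdef': lambda name: name in defines,
    }
    # No builtins are exposed: only the three names above can be used.
    return bool(eval(expr, {'__builtins__': {}}, env))
```

With this, a Romanian translation of a string guarded by "lang not in ['ar', 'ro', 'lv']" can be dropped at merge time because `condition_applies(expr, 'ro')` is false.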
Once again, a long post, with way too much information. If you read this far, please let me know what you think and what else you want to see covered in the third (and last) post.
I would be curious to know if other projects have to deal with such conversions (I only know of Firefox and its xpi langpacks).