1# This document contains text in Perl "POD" format.
2# Use a POD viewer like perldoc or perlman to render it.
3
4=head1 NAME
5
6Locale::Maketext::TPJ13 -- article about software localization
7
8=head1 SYNOPSIS
9
  # This is an article, not a module.
11
12=head1 DESCRIPTION
13
14The following article by Sean M. Burke and Jordan Lachler
15first appeared in I<The Perl Journal> #13
16and is copyright 1999 The Perl Journal. It appears
17courtesy of Jon Orwant and The Perl Journal.  This document may be
18distributed under the same terms as Perl itself.
19
20=head1 Localization and Perl: gettext breaks, Maketext fixes
21
22by Sean M. Burke and Jordan Lachler
23
24This article points out cases where gettext (a common system for
25localizing software interfaces -- i.e., making them work in the user's
26language of choice) fails because of basic differences between human
27languages.  This article then describes Maketext, a new system capable
28of correctly treating these differences.
29
30=head2 A Localization Horror Story: It Could Happen To You
31
32=over
33
34"There are a number of languages spoken by human beings in this
35world."
36
37-- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
38Identification of Languages"
39
40=back
41
42Imagine that your task for the day is to localize a piece of software
43-- and luckily for you, the only output the program emits is two
44messages, like this:
45
46  I scanned 12 directories.
47
48  Your query matched 10 files in 4 directories.
49
50So how hard could that be?  You look at the code that
51produces the first item, and it reads:
52
53  printf("I scanned %g directories.",
54         $directory_count);
55
56You think about that, and realize that it doesn't even work right for
57English, as it can produce this output:
58
59  I scanned 1 directories.
60
61So you rewrite it to read:
62
63  printf("I scanned %g %s.",
64         $directory_count,
65         $directory_count == 1 ?
66           "directory" : "directories",
67  );
68
69...which does the Right Thing.  (In case you don't recall, "%g" is for
70locale-specific number interpolation, and "%s" is for string
71interpolation.)
72
73But you still have to localize it for all the languages you're
74producing this software for, so you pull Locale::gettext off of CPAN
75so you can access the C<gettext> C functions you've heard are standard
76for localization tasks.
77
78And you write:
79
80  printf(gettext("I scanned %g %s."),
81         $dir_scan_count,
82         $dir_scan_count == 1 ?
83           gettext("directory") : gettext("directories"),
84  );
85
86But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
87that this is not a good idea, since how a single word like "directory"
88or "directories" is translated may depend on context -- and this is
true, since in a case language like German or Russian, you may need
90these words with a different case ending in the first instance (where the
91word is the object of a verb) than in the second instance, which you haven't even
92gotten to yet (where the word is the object of a preposition, "in %g
93directories") -- assuming these keep the same syntax when translated
94into those languages.
95
96So, on the advice of the gettext manual, you rewrite:
97
98  printf( $dir_scan_count == 1 ?
99           gettext("I scanned %g directory.") :
100           gettext("I scanned %g directories."),
101         $dir_scan_count );
102
103So, you email your various translators (the boss decides that the
104languages du jour are Chinese, Arabic, Russian, and Italian, so you
105have one translator for each), asking for translations for "I scanned
106%g directory." and "I scanned %g directories.".  When they reply,
107you'll put that in the lexicons for gettext to use when it localizes
108your software, so that when the user is running under the "zh"
109(Chinese) locale, gettext("I scanned %g directory.") will return the
110appropriate Chinese text, with a "%g" in there where printf can then
interpolate $dir_scan_count.
112
113Your Chinese translator emails right back -- he says both of these
114phrases translate to the same thing in Chinese, because, in linguistic
115jargon, Chinese "doesn't have number as a grammatical category" --
116whereas English does.  That is, English has grammatical rules that
117refer to "number", i.e., whether something is grammatically singular
118or plural; and one of these rules is the one that forces nouns to take
119a plural suffix (generally "s") when in a plural context, as they are when
120they follow a number other than "one" (including, oddly enough, "zero").
121Chinese has no such rules, and so has just the one phrase where English
122has two.  But, no problem, you can have this one Chinese phrase appear
123as the translation for the two English phrases in the "zh" gettext
124lexicon for your program.
125
126Emboldened by this, you dive into the second phrase that your software
127needs to output: "Your query matched 10 files in 4 directories.".  You notice
128that if you want to treat phrases as indivisible, as the gettext
129manual wisely advises, you need four cases now, instead of two, to
130cover the permutations of singular and plural on the two items,
$directory_count and $file_count.  So you try this:
132
133  printf( $file_count == 1 ?
134    ( $directory_count == 1 ?
135     gettext("Your query matched %g file in %g directory.") :
136     gettext("Your query matched %g file in %g directories.") ) :
137    ( $directory_count == 1 ?
138     gettext("Your query matched %g files in %g directory.") :
139     gettext("Your query matched %g files in %g directories.") ),
140   $file_count, $directory_count,
141  );
142
143(The case of "1 file in 2 [or more] directories" could, I suppose,
144occur in the case of symlinking or something of the sort.)
145
146It occurs to you that this is not the prettiest code you've ever
147written, but this seems the way to go.  You mail off to the
148translators asking for translations for these four cases.  The
149Chinese guy replies with the one phrase that these all translate to in
150Chinese, and that phrase has two "%g"s in it, as it should -- but
151there's a problem.  He translates it word-for-word back: "In %g
152directories contains %g files match your query."  The %g
153slots are in an order reverse to what they are in English.  You wonder
154how you'll get gettext to handle that.
155
156But you put it aside for the moment, and optimistically hope that the
157other translators won't have this problem, and that their languages
158will be better behaved -- i.e., that they will be just like English.
159
160But the Arabic translator is the next to write back.  First off, your
161code for "I scanned %g directory." or "I scanned %g directories."
162assumes there's only singular or plural.  But, to use linguistic
163jargon again, Arabic has grammatical number, like English (but unlike
164Chinese), but it's a three-term category: singular, dual, and plural.
165In other words, the way you say "directory" depends on whether there's
166one directory, or I<two> of them, or I<more than two> of them.  Your
test of C<($directory_count == 1)> no longer does the job.  And it means
168that where English's grammatical category of number necessitates
169only the two permutations of the first sentence based on "directory
170[singular]" and "directories [plural]", Arabic has three -- and,
171worse, in the second sentence ("Your query matched %g file in %g
172directory."), where English has four, Arabic has nine.  You sense
173an unwelcome, exponential trend taking shape.
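
To make that arithmetic concrete, here is a minimal sketch, assuming the
simplified three-way split the translator describes (real Arabic number
agreement has further wrinkles).  The names C<number_category> and
C<phrase_variants> are hypothetical, invented for this illustration:

  use strict;
  use warnings;

  # Hypothetical helper: which grammatical number a count calls for,
  # under the simplified singular/dual/plural split described above.
  sub number_category {
    my $n = shift;
    return $n == 1 ? 'singular'
         : $n == 2 ? 'dual'
         :           'plural';
  }

  # Hypothetical helper: how many separate strings a lexicon of plain
  # phrases needs, given the number of categories and of numeric slots.
  sub phrase_variants {
    my( $categories, $slots ) = @_;
    return $categories ** $slots;
  }

  printf "%d -> %s\n", $_, number_category($_) for 1, 2, 3;
  printf "English, two slots: %d phrases\n", phrase_variants(2, 2);
  printf "Arabic, two slots:  %d phrases\n", phrase_variants(3, 2);

Every additional numeric slot in a sentence multiplies the count again,
which is the trend you were just sensing.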
174
Your Italian translator emails you back and says that "I scanned 0
176directories" (a possible English output of your program) is stilted,
177and if you think that's fine English, that's your problem, but that
178I<just will not do> in the language of Dante.  He insists that where
179$directory_count is 0, your program should produce the Italian text
180for "I I<didn't> scan I<any> directories.".  And ditto for "I didn't
181match any files in any directories", although he says the last part
182about "in any directories" should probably just be left off.
183
184You wonder how you'll get gettext to handle this; to accommodate the
185ways Arabic, Chinese, and Italian deal with numbers in just these few
186very simple phrases, you need to write code that will ask gettext for
187different queries depending on whether the numerical values in
188question are 1, 2, more than 2, or in some cases 0, and you still haven't
189figured out the problem with the different word order in Chinese.
190
191Then your Russian translator calls on the phone, to I<personally> tell
192you the bad news about how really unpleasant your life is about to
193become:
194
195Russian, like German or Latin, is an inflectional language; that is, nouns
196and adjectives have to take endings that depend on their case
197(i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of
what role they have in the syntax of the sentence --
199as well as on the grammatical gender (i.e., masculine, feminine, neuter)
200and number (i.e., singular or plural) of the noun, as well as on the
201declension class of the noun.  But unlike with most other inflected languages,
202putting a number-phrase (like "ten" or "forty-three", or their Arabic
numeral equivalents) in front of a noun in Russian can change the case and
204number that noun is, and therefore the endings you have to put on it.
205
206He elaborates:  In "I scanned %g directories", you'd I<expect>
207"directories" to be in the accusative case (since it is the direct
208object in the sentence) and the plural number,
209except where $directory_count is 1, then you'd expect the singular, of
210course.  Just like Latin or German.  I<But!>  Where $directory_count %
10 is 1 ("%" for modulo, remember), assuming $directory_count is an
212integer, and except where $directory_count % 100 is 11, "directories"
213is forced to become grammatically singular, which means it gets the
214ending for the accusative singular...  You begin to visualize the code
215it'd take to test for the problem so far, I<and still work for Chinese
216and Arabic and Italian>, and how many gettext items that'd take, but
217he keeps going...  But where $directory_count % 10 is 2, 3, or 4
218(except where $directory_count % 100 is 12, 13, or 14), the word for
219"directories" is forced to be genitive singular -- which means another
220ending... The room begins to spin around you, slowly at first...  But
221with I<all other> integer values, since "directory" is an inanimate
222noun, when preceded by a number and in the nominative or accusative
223cases (as it is here, just your luck!), it does stay plural, but it is
224forced into the genitive case -- yet another ending...  And
225you never hear him get to the part about how you're going to run into
226similar (but maybe subtly different) problems with other Slavic
227languages like Polish, because the floor comes up to meet you, and you
228fade into unconsciousness.
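
For readers still conscious, the translator's rules can be written down
as a short sketch.  The function name C<russian_number_form> and the
labels it returns are hypothetical, chosen for this illustration; the
code assumes a non-negative integer count and an inanimate noun in an
accusative slot, as in the story:

  use strict;
  use warnings;

  # Hypothetical helper: which case and number the counted noun takes,
  # following the rules as the translator states them.
  sub russian_number_form {
    my $n = shift;
    my $last_one = $n % 10;
    my $last_two = $n % 100;

    # 1, 21, 31, ... but not 11, 111, ...
    return 'accusative singular'
      if $last_one == 1 && $last_two != 11;

    # 2-4, 22-24, ... but not 12-14, 112-114, ...
    return 'genitive singular'
      if $last_one >= 2 && $last_one <= 4
         && !( $last_two >= 12 && $last_two <= 14 );

    # everything else, including 0, 5-20, 25-30, ...
    return 'genitive plural';
  }

  printf "%3d -> %s\n", $_, russian_number_form($_)
    for 0, 1, 2, 5, 11, 12, 21, 22, 101, 111;

That is three outcomes for one noun in one slot, selected by a test
that has nothing in common with C<$directory_count == 1>.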
229
230
231The above cautionary tale relates how an attempt at localization can
232lead from programmer consternation, to program obfuscation, to a need
233for sedation.  But careful evaluation shows that your choice of tools
234merely needed further consideration.
235
236=head2 The Linguistic View
237
238=over
239
240"It is more complicated than you think."
241
242-- The Eighth Networking Truth, from RFC 1925
243
244=back
245
246The field of Linguistics has expended a great deal of effort over the
247past century trying to find grammatical patterns which hold across
248languages; it's been a constant process
249of people making generalizations that should apply to all languages,
250only to find out that, all too often, these generalizations fail --
251sometimes failing for just a few languages, sometimes whole classes of
252languages, and sometimes nearly every language in the world except
253English.  Broad statistical trends are evident in what the "average
254language" is like as far as what its rules can look like, must look
255like, and cannot look like.  But the "average language" is just as
256unreal a concept as the "average person" -- it runs up against the
fact that no language (or person) is, in fact, average.  The wisdom of past
experience leads us to believe that any given language can do whatever
it wants, in any order, with appeal to any kind of grammatical
categories it wants -- case, number, tense, real or metaphoric
261characteristics of the things that words refer to, arbitrary or
262predictable classifications of words based on what endings or prefixes
263they can take, degree or means of certainty about the truth of
264statements expressed, and so on, ad infinitum.
265
266Mercifully, most localization tasks are a matter of finding ways to
267translate whole phrases, generally sentences, where the context is
268relatively set, and where the only variation in content is I<usually>
269in a number being expressed -- as in the example sentences above.
270Translating specific, fully-formed sentences is, in practice, fairly
271foolproof -- which is good, because that's what's in the phrasebooks
272that so many tourists rely on.  Now, a given phrase (whether in a
273phrasebook or in a gettext lexicon) in one language I<might> have a
274greater or lesser applicability than that phrase's translation into
275another language -- for example, strictly speaking, in Arabic, the
276"your" in "Your query matched..." would take a different form
277depending on whether the user is male or female; so the Arabic
278translation "your[feminine] query" is applicable in fewer cases than
279the corresponding English phrase, which doesn't distinguish the user's
280gender.  (In practice, it's not feasible to have a program know the
281user's gender, so the masculine "you" in Arabic is usually used, by
282default.)
283
284But in general, such surprises are rare when entire sentences are
285being translated, especially when the functional context is restricted
286to that of a computer interacting with a user either to convey a fact
287or to prompt for a piece of information.  So, for purposes of
288localization, translation by phrase (generally by sentence) is both the
289simplest and the least problematic.
290
291=head2 Breaking gettext
292
293=over
294
295"It Has To Work."
296
297-- First Networking Truth, RFC 1925
298
299=back
300
301Consider that sentences in a tourist phrasebook are of two types: ones
302like "How do I get to the marketplace?" that don't have any blanks to
303fill in, and ones like "How much do these ___ cost?", where there's
304one or more blanks to fill in (and these are usually linked to a
305list of words that you can put in that blank: "fish", "potatoes",
306"tomatoes", etc.).  The ones with no blanks are no problem, but the
307fill-in-the-blank ones may not be really straightforward. If it's a
308Swahili phrasebook, for example, the authors probably didn't bother to
309tell you the complicated ways that the verb "cost" changes its
310inflectional prefix depending on the noun you're putting in the blank.
311The trader in the marketplace will still understand what you're saying if
312you say "how much do these potatoes cost?" with the wrong
313inflectional prefix on "cost".  After all, I<you> can't speak proper Swahili,
314I<you're> just a tourist.  But while tourists can be stupid, computers
315are supposed to be smart; the computer should be able to fill in the
316blank, and still have the results be grammatical.
317
318In other words, a phrasebook entry takes some values as parameters
319(the things that you fill in the blank or blanks), and provides a value
320based on these parameters, where the way you get that final value from
321the given values can, properly speaking, involve an arbitrarily
322complex series of operations.  (In the case of Chinese, it'd be not at
323all complex, at least in cases like the examples at the beginning of
324this article; whereas in the case of Russian it'd be a rather complex
325series of operations.  And in some languages, the
326complexity could be spread around differently: while the act of
327putting a number-expression in front of a noun phrase might not be
328complex by itself, it may change how you have to, for example, inflect
329a verb elsewhere in the sentence.  This is what in syntax is called
330"long-distance dependencies".)
331
332This talk of parameters and arbitrary complexity is just another way
333to say that an entry in a phrasebook is what in a programming language
334would be called a "function".  Just so you don't miss it, this is the
335crux of this article: I<A phrase is a function; a phrasebook is a
336bunch of functions.>
337
338The reason that using gettext runs into walls (as in the above
339second-person horror story) is that you're trying to use a string (or
340worse, a choice among a bunch of strings) to do what you really need a
function for -- which is futile.  Performing (s)printf interpolation
342on the strings which you get back from gettext does allow you to do I<some>
343common things passably well... sometimes... sort of; but, to paraphrase
344what some people say about C<csh> script programming, "it fools you
345into thinking you can use it for real things, but you can't, and you
346don't discover this until you've already spent too much time trying,
347and by then it's too late."
348
349=head2 Replacing gettext
350
351So, what needs to replace gettext is a system that supports lexicons
352of functions instead of lexicons of strings.  An entry in a lexicon
353from such a system should I<not> look like this:
354
355  "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"
356
357[\xE9 is e-acute in Latin-1.  Some pod renderers would
358scream if I used the actual character here. -- SB]
359
360but instead like this, bearing in mind that this is just a first stab:
361
362  sub I_found_X1_files_in_X2_directories {
363    my( $files, $dirs ) = @_[0,1];
364    $files = sprintf("%g %s", $files,
365      $files == 1 ? 'fichier' : 'fichiers');
366    $dirs = sprintf("%g %s", $dirs,
367      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
368    return "J'ai trouv\xE9 $files dans $dirs.";
369  }
370
371Now, there's no particularly obvious way to store anything but strings
372in a gettext lexicon; so it looks like we just have to start over and
373make something better, from scratch.  I call my shot at a
374gettext-replacement system "Maketext", or, in CPAN terms,
375Locale::Maketext.
376
377When designing Maketext, I chose to plan its main features in terms of
378"buzzword compliance".  And here are the buzzwords:
379
380=head2 Buzzwords: Abstraction and Encapsulation
381
382The complexity of the language you're trying to output a phrase in is
383entirely abstracted inside (and encapsulated within) the Maketext module
384for that interface.  When you call:
385
386  print $lang->maketext("You have [quant,_1,piece] of new mail.",
387                       scalar(@messages));
388
389you don't know (and in fact can't easily find out) whether this will
390involve lots of figuring, as in Russian (if $lang is a handle to the
391Russian module), or relatively little, as in Chinese.  That kind of
392abstraction and encapsulation may encourage other pleasant buzzwords
393like modularization and stratification, depending on what design
394decisions you make.
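
Here is a minimal sketch of what sits behind such a call, assuming the
Locale::Maketext module as currently distributed with Perl.  The class
names C<BogoQuery::Locale> and C<BogoQuery::Locale::en> are hypothetical,
and a real project would keep each language class in its own file:

  use strict;
  use warnings;

  # Hypothetical project base class.  Locale::Maketext supplies
  # get_handle, maketext, and a default English-style quant method.
  package BogoQuery::Locale;
  use parent 'Locale::Maketext';

  # Hypothetical English lexicon.  With _AUTO set, a phrase that is not
  # listed serves as its own entry.
  package BogoQuery::Locale::en;
  use parent -norequire, 'BogoQuery::Locale';
  our %Lexicon = ( _AUTO => 1 );

  package main;

  my $lang = BogoQuery::Locale->get_handle('en')
    or die "No language handle available";

  for my $count (1, 3) {
    print $lang->maketext("You have [quant,_1,piece] of new mail.",
                          $count), "\n";
  }

The calling code never tests whether C<$count> is 1; that decision
belongs to whatever C<quant> method the handle's class provides.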
395
396=head2 Buzzword: Isomorphism
397
398"Isomorphism" means "having the same structure or form"; in discussions
399of program design, the word takes on the special, specific meaning that
400your implementation of a solution to a problem I<has the same
401structure> as, say, an informal verbal description of the solution, or
402maybe of the problem itself.  Isomorphism is, all things considered,
403a good thing -- it's what problem-solving (and solution-implementing)
404should look like.
405
What's wrong with gettext-using code like this...
407
408  printf( $file_count == 1 ?
409    ( $directory_count == 1 ?
410     "Your query matched %g file in %g directory." :
411     "Your query matched %g file in %g directories." ) :
412    ( $directory_count == 1 ?
413     "Your query matched %g files in %g directory." :
414     "Your query matched %g files in %g directories." ),
415   $file_count, $directory_count,
416  );
417
418is first off that it's not well abstracted -- these ways of testing
419for grammatical number (as in the expressions like C<foo == 1 ?
420singular_form : plural_form>) should be abstracted to each language
421module, since how you get grammatical number is language-specific.
422
423But second off, it's not isomorphic -- the "solution" (i.e., the
424phrasebook entries) for Chinese maps from these four English phrases to
425the one Chinese phrase that fits for all of them.  In other words, the
426informal solution would be "The way to say what you want in Chinese is
427with the one phrase 'For your question, in Y directories you would
428find X files'" -- and so the implemented solution should be,
429isomorphically, just a straightforward way to spit out that one
430phrase, with numerals properly interpolated.  It shouldn't have to map
431from the complexity of other languages to the simplicity of this one.
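
As a sketch of what that isomorphic solution might look like, assuming
the Locale::Maketext module as currently distributed with Perl, a
Chinese lexicon needs only one entry, and that entry is free to use the
parameters in whatever order Chinese wants.  The class names are
hypothetical, and the "Chinese" text is the translator's word-for-word
English gloss from the horror story:

  use strict;
  use warnings;

  # Hypothetical project base class
  package BogoQuery::Locale;
  use parent 'Locale::Maketext';

  # Hypothetical Chinese lexicon: one phrase, with the second
  # parameter used before the first.
  package BogoQuery::Locale::zh;
  use parent -norequire, 'BogoQuery::Locale';
  our %Lexicon = (
    'Your query matched [quant,_1,file] in [quant,_2,directory].'
      => 'In [_2] directories contains [_1] files match your query.',
  );

  package main;

  my $lang = BogoQuery::Locale->get_handle('zh')
    or die "No language handle available";

  print $lang->maketext(
    'Your query matched [quant,_1,file] in [quant,_2,directory].',
    10, 4 ), "\n";

The caller passes the file count and then the directory count, as in
English, and never learns that the Chinese phrase reverses them.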
432
433=head2 Buzzword: Inheritance
434
435There's a great deal of reuse possible for sharing of phrases between
436modules for related dialects, or for sharing of auxiliary functions
437between related languages.  (By "auxiliary functions", I mean
438functions that don't produce phrase-text, but which, say, return an
439answer to "does this number require a plural noun after it?".  Such
440auxiliary functions would be used in the internal logic of functions
441that actually do produce phrase-text.)
442
443In the case of sharing phrases, consider that you have an interface
444already localized for American English (probably by having been
445written with that as the native locale, but that's incidental).
446Localizing it for UK English should, in practical terms, be just a
447matter of running it past a British person with the instructions to
448indicate what few phrases would benefit from a change in spelling or
449possibly minor rewording.  In that case, you should be able to put in
450the UK English localization module I<only> those phrases that are
451UK-specific, and for all the rest, I<inherit> from the American
452English module.  (And I expect this same situation would apply with
Brazilian and Continental Portuguese, possibly with some I<very>
454closely related languages like Czech and Slovak, and possibly with the
455slightly different "versions" of written Mandarin Chinese, as I hear exist in
456Taiwan and mainland China.)
457
458As to sharing of auxiliary functions, consider the problem of Russian
459numbers from the beginning of this article; obviously, you'd want to
460write only once the hairy code that, given a numeric value, would
461return some specification of which case and number a given quantified
462noun should use.  But suppose that you discover, while localizing an
463interface for, say, Ukrainian (a Slavic language related to Russian,
464spoken by several million people, many of whom would be relieved to
465find that your Web site's or software's interface is available in
466their language), that the rules in Ukrainian are the same as in Russian
467for quantification, and probably for many other grammatical functions.
468While there may well be no phrases in common between Russian and
469Ukrainian, you could still choose to have the Ukrainian module inherit
470from the Russian module, just for the sake of inheriting all the
471various grammatical methods.  Or, probably better organizationally,
472you could move those functions to a module called C<_E_Slavic> or
473something, which Russian and Ukrainian could inherit useful functions
474from, but which would (presumably) provide no lexicon.
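
The phrase-sharing case might be sketched as follows, assuming the
Locale::Maketext module as currently distributed with Perl, which
searches the lexicons of a language class and then those of its parent
classes.  The class names and the phrases are hypothetical:

  use strict;
  use warnings;

  # Hypothetical project base class
  package BogoQuery::Locale;
  use parent 'Locale::Maketext';

  # American English: the full lexicon
  package BogoQuery::Locale::en_us;
  use parent -norequire, 'BogoQuery::Locale';
  our %Lexicon = (
    'Pick a color.'      => 'Pick a color.',
    'Your query failed.' => 'Your query failed.',
  );

  # UK English: only the phrases that differ; the rest are inherited
  package BogoQuery::Locale::en_gb;
  use parent -norequire, 'BogoQuery::Locale::en_us';
  our %Lexicon = (
    'Pick a color.' => 'Pick a colour.',
  );

  package main;

  my $uk = BogoQuery::Locale->get_handle('en-GB')
    or die "No language handle available";

  print $uk->maketext('Pick a color.'), "\n";       # UK-specific entry
  print $uk->maketext('Your query failed.'), "\n";  # inherited entry

Grammatical helper methods are shared in the same way: a method defined
in a parent class (the C<_E_Slavic> idea) is available to every language
class that inherits from it.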
475
476=head2 Buzzword: Concision
477
478Okay, concision isn't a buzzword.  But it should be, so I decree that
479as a new buzzword, "concision" means that simple common things should
480be expressible in very few lines (or maybe even just a few characters)
481of code -- call it a special case of "making simple things easy and
482hard things possible", and see also the role it played in the
483MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].
484
485Consider our first stab at an entry in our "phrasebook of functions":
486
487  sub I_found_X1_files_in_X2_directories {
488    my( $files, $dirs ) = @_[0,1];
489    $files = sprintf("%g %s", $files,
490      $files == 1 ? 'fichier' : 'fichiers');
491    $dirs = sprintf("%g %s", $dirs,
492      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
493    return "J'ai trouv\xE9 $files dans $dirs.";
494  }
495
496You may sense that a lexicon (to use a non-committal catch-all term for a
497collection of things you know how to say, regardless of whether they're
498phrases or words) consisting of functions I<expressed> as above would
499make for rather long-winded and repetitive code -- even if you wisely
500rewrote this to have quantification (as we call adding a number
501expression to a noun phrase) be a function called like:
502
503  sub I_found_X1_files_in_X2_directories {
504    my( $files, $dirs ) = @_[0,1];
505    $files = quant($files, "fichier");
506    $dirs =  quant($dirs,  "r\xE9pertoire");
507    return "J'ai trouv\xE9 $files dans $dirs.";
508  }
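
The C<quant> function called there is not shown.  A minimal sketch of
it, assuming the same naive rule as the first-stab code (append "s" for
any count other than 1, which ignores irregular plurals), might be:

  use strict;
  use warnings;

  # Naive quantifier: the number, a space, and the noun, with "s"
  # appended unless the count is exactly 1.
  sub quant {
    my( $number, $noun ) = @_;
    return sprintf( "%g %s", $number,
                    $number == 1 ? $noun : $noun . 's' );
  }

  print quant( 1, "fichier"), "\n";
  print quant(10, "fichier"), "\n";

The rule now lives in one place, so a language with different rules
needs only a different C<quant>.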
509
510And you may also sense that you do not want to bother your translators
511with having to write Perl code -- you'd much rather that they spend
512their I<very costly time> on just translation.  And this is to say
513nothing of the near impossibility of finding a commercial translator
514who would know even simple Perl.
515
516In a first-hack implementation of Maketext, each language-module's
517lexicon looked like this:
518
 %Lexicon = (
   "I found %g files in %g directories"
   => sub {
      my( $files, $dirs ) = @_[0,1];
      $files = quant($files, "fichier");
      $dirs =  quant($dirs,  "r\xE9pertoire");
      return "J'ai trouv\xE9 $files dans $dirs.";
    },
   # ... and so on with other phrase => sub mappings ...
 );
529
530but I immediately went looking for some more concise way to basically
531denote the same phrase-function -- a way that would also serve to
532concisely denote I<most> phrase-functions in the lexicon for I<most>
533languages.  After much time and even some actual thought, I decided on
534this system:
535
536* Where a value in a %Lexicon hash is a contentful string instead of
537an anonymous sub (or, conceivably, a coderef), it would be interpreted
538as a sort of shorthand expression of what the sub does.  When accessed
539for the first time in a session, it is parsed, turned into Perl code,
540and then eval'd into an anonymous sub; then that sub replaces the
541original string in that lexicon.  (That way, the work of parsing and
542evaling the shorthand form for a given phrase is done no more than
543once per session.)
544
545* Calls to C<maketext> (as Maketext's main function is called) happen
546thru a "language session handle", notionally very much like an IO
547handle, in that you open one at the start of the session, and use it
548for "sending signals" to an object in order to have it return the text
549you want.
550
551So, this:
552
553  $lang->maketext("You have [quant,_1,piece] of new mail.",
554                 scalar(@messages));
555
556basically means this: look in the lexicon for $lang (which may inherit
557from any number of other lexicons), and find the function that we
558happen to associate with the string "You have [quant,_1,piece] of new
559mail" (which is, and should be, a functioning "shorthand" for this
560function in the native locale -- English in this case).  If you find
561such a function, call it with $lang as its first parameter (as if it
562were a method), and then a copy of scalar(@messages) as its second,
563and then return that value.  If that function was found, but was in
564string shorthand instead of being a fully specified function, parse it
565and make it into a function before calling it the first time.
566
567* The shorthand uses code in brackets to indicate method calls that
568should be performed.  A full explanation is not in order here, but a
569few examples will suffice:
570
571  "You have [quant,_1,piece] of new mail."
572
573The above code is shorthand for, and will be interpreted as,
574this:
575
  sub {
    my $handle = $_[0];
    my(@params) = @_;
    return join '',
      "You have ",
      $handle->quant($params[1], 'piece'),
      " of new mail.";
  }
584
where "quant" is the name of a method you're using to quantify the
noun "piece" with the number $params[1] ($params[0] being the handle).
587
588A string with no brackety calls, like this:
589
590  "Your search expression was malformed."
591
592is somewhat of a degenerate case, and just gets turned into:
593
594  sub { return "Your search expression was malformed." }
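
The compile-once scheme described above can be sketched in miniature.
This is a toy, not Maketext's actual compiler: it builds closures
directly instead of generating Perl source and eval'ing it, it handles
only the simplest bracket groups, and every name in it (C<Toy::Handle>,
C<compile_shorthand>, C<toy_maketext>) is hypothetical:

  use strict;
  use warnings;

  # Hypothetical stand-in for a language handle with one method
  package Toy::Handle;
  sub new { return bless {}, shift }
  sub quant {
    my( $handle, $number, $noun ) = @_;
    return $number == 1 ? "$number $noun" : "$number ${noun}s";
  }

  package main;

  my %Lexicon = (
    'You have [quant,_1,piece] of new mail.'
      => 'You have [quant,_1,piece] of new mail.',
  );

  # Turn "text [method,_1,arg] text" into a single sub
  sub compile_shorthand {
    my $text = shift;
    my @parts;
    foreach my $chunk ( split /(\[[^\]]*\])/, $text ) {
      if( $chunk =~ /^\[(.*)\]$/s ) {
        my( $method, @args ) = split /,/, $1;
        push @parts, sub {
          my( $handle, @params ) = @_;
          my @values;
          foreach my $arg (@args) {
            # _1, _2, ... name the caller's parameters
            push @values, $arg =~ /^_(\d+)$/ ? $params[$1 - 1] : $arg;
          }
          return $handle->$method(@values);
        };
      }
      else {
        my $literal = $chunk;
        push @parts, sub { return $literal };
      }
    }
    return sub { my @in = @_; return join '', map { $_->(@in) } @parts };
  }

  sub toy_maketext {
    my( $handle, $key, @params ) = @_;
    my $entry = $Lexicon{$key};
    die "No lexicon entry for '$key'" unless defined $entry;
    # Compile the shorthand on first use, and keep the result
    $entry = $Lexicon{$key} = compile_shorthand($entry)
      unless ref($entry) eq 'CODE';
    return $entry->($handle, @params);
  }

  my $handle = Toy::Handle->new;
  print toy_maketext( $handle,
    'You have [quant,_1,piece] of new mail.', $_ ), "\n" for 1, 3;

After the first call, the lexicon slot holds a code reference, so later
calls skip the parsing step entirely.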
595
596However, not everything you can write in Perl code can be written in
597the above shorthand system -- not by a long shot.  For example, consider
598the Italian translator from the beginning of this article, who wanted
599the Italian for "I didn't find any files" as a special case, instead
600of "I found 0 files".  That couldn't be specified (at least not easily
601or simply) in our shorthand system, and it would have to be written
602out in full, like this:
603
604  sub {  # pretend the English strings are in Italian
605    my($handle, $files, $dirs) = @_[0,1,2];
606    return "I didn't find any files" unless $files;
607    return join '',
608      "I found ",
609      $handle->quant($files, 'file'),
610      " in ",
611      $handle->quant($dirs,  'directory'),
612      ".";
613  }
614
615Next to a lexicon full of shorthand code, that sort of sticks out like a
616sore thumb -- but this I<is> a special case, after all; and at least
617it's possible, if not as concise as usual.
618
619As to how you'd implement the Russian example from the beginning of
620the article, well, There's More Than One Way To Do It, but it could be
621something like this (using English words for Russian, just so you know
622what's going on):
623
624  "I [quant,_1,directory,accusative] scanned."
625
626This shifts the burden of complexity off to the quant method.  That
627method's parameters are: the numeric value it's going to use to
628quantify something; the Russian word it's going to quantify; and the
629parameter "accusative", which you're using to mean that this
630sentence's syntax wants a noun in the accusative case there, although
631that quantification method may have to overrule, for grammatical
632reasons you may recall from the beginning of this article.
633
634Now, the Russian quant method here is responsible not only for
635implementing the strange logic necessary for figuring out how Russian
636number-phrases impose case and number on their noun-phrases, but also
637for inflecting the Russian word for "directory".  How that inflection
638is to be carried out is no small issue, and among the solutions I've
639seen, some (like variations on a simple lookup in a hash where all
640possible forms are provided for all necessary words) are
641straightforward but I<can> become cumbersome when you need to inflect
642more than a few dozen words; and other solutions (like using
643algorithms to model the inflections, storing only root forms and
644irregularities) I<can> involve more overhead than is justifiable for
645all but the largest lexicons.
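
The first of those approaches, the plain lookup, might be sketched like
this.  The table C<%Forms>, the function C<inflect>, and the form labels
are hypothetical; the word forms are transliterated Russian for
"catalog", one common word for a directory:

  use strict;
  use warnings;

  # Every form that is needed, listed outright for every word
  my %Forms = (
    katalog => {
      'accusative singular' => 'katalog',
      'genitive singular'   => 'kataloga',
      'genitive plural'     => 'katalogov',
    },
  );

  # Look up one form of one word, and complain if it was never listed
  sub inflect {
    my( $headword, $form ) = @_;
    my $entry = $Forms{$headword}
      or die "No forms recorded for '$headword'";
    die "No '$form' form recorded for '$headword'"
      unless exists $entry->{$form};
    return $entry->{$form};
  }

  print "5 ", inflect('katalog', 'genitive plural'), "\n";

The table is easy to write and easy to check, but it grows by one block
per word, which is where the cumbersomeness comes from.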
646
Mercifully, this design decision becomes crucial only in the hairiest
of inflected languages, of which Russian is by no means the I<worst> case
scenario, but is worse than most.  Most languages have simpler
inflection systems; for example, in English or Swahili, there are
generally no more than two possible inflected forms for a given noun
("error/errors"; "kosa/makosa"), and the
rules for producing these forms are fairly simple -- or at least,
simple rules can be formulated that work for most words, and you can
then treat the exceptions as just "irregular", at least relative to
your ad hoc rules.  A simpler inflection system (simpler rules, fewer
forms) means that design decisions are less crucial to maintaining
sanity, whereas the same decisions could incur
overhead-versus-scalability problems in languages like Russian.  It
is I<also> likely that code (possibly in Perl, as with
Lingua::EN::Inflect, for English nouns) has already
been written for the language in question, whether simple or complex.

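For a language like English, the "simple rules plus irregulars" idea
might be sketched as follows.  The function name C<plural_en> and its
tiny exception table are hypothetical, and the rules are deliberately
ad hoc; a module such as Lingua::EN::Inflect covers far more cases.

  use strict;
  use warnings;

  # Words that the ad hoc rules below would get wrong.
  my %IRREGULAR = (
    child => 'children',
    mouse => 'mice',
  );

  # Return a plural form for an English noun: exceptions first,
  # then a few spelling rules, then the default "-s".
  sub plural_en {
    my ($noun) = @_;
    return $IRREGULAR{$noun} if exists $IRREGULAR{$noun};
    return $noun . 'es'      if $noun =~ /(?:s|x|z|ch|sh)\z/;
    return $1 . 'ies'        if $noun =~ /\A(.*[^aeiou])y\z/;
    return $noun . 's';
  }

  print "$_ -> ", plural_en($_), "\n" for qw(error directory match child);

Anything the rules mishandle is simply added to C<%IRREGULAR>, so the
table stays small as long as the rules cover most of the vocabulary.
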
Moreover, a third possibility may even be simpler than anything
discussed above: just require that all possible (or at least
applicable) forms be provided in the call to the given language's quant
method, as in:

  "I found [quant,_1,file,files]."

That way, quant just has to choose which form it needs, without having
to look up or generate anything.  While possibly not optimal for
Russian, this should work well for most other languages, where
quantification is not as complicated an operation.

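With Locale::Maketext as distributed with Perl, that third approach
looks roughly like the minimal sketch below.  The class names
C<MyApp::L10N> and C<MyApp::L10N::en> are made up for the example, and
C<_AUTO> is switched on only so that the phrase need not be listed in
the lexicon.

  use strict;
  use warnings;

  package MyApp::L10N;                  # hypothetical project base class
  use parent 'Locale::Maketext';

  package MyApp::L10N::en;              # hypothetical English lexicon class
  use parent -norequire, 'MyApp::L10N';
  our %Lexicon = (
    _AUTO => 1,   # unknown keys are used as their own Bracket Notation
  );

  package main;

  my $lh = MyApp::L10N->get_handle('en')
    or die "No language handle available";

  # The inherited quant method picks "file" or "files" from the call.
  for my $count (1, 3) {
    print $lh->maketext("I found [quant,_1,file,files].", $count), "\n";
  }

Here the stock quant method does nothing more than choose between the
two forms it was handed, which is all that English needs.
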
=head2 The Devil in the Details

There's plenty more to Maketext than described above -- for example,
there are the details of how language tags ("en-US", "i-pwn", "fi",
etc.) or locale IDs ("en_US") interact with actual module naming
("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there are the
details of how to record (and possibly negotiate) what character
encoding Maketext will return text in (UTF-8? Latin-1? KOI8?).  There's
the interesting fact that Maketext is for localization, but nowhere
in it is there actually a "C<use locale;>".  For the curious,
there are the somewhat frightening details of how I actually
implement something like data inheritance so that searches across
modules' %Lexicon hashes can parallel how Perl implements method
inheritance.

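As a rough illustration of two of those details, the sketch below
shows a language tag being turned into a module name, and a lexicon
lookup falling back along the inheritance chain.  The C<MyApp::L10N>
class names and the lexicon entries are invented for the example; only
C<get_handle> and C<maketext> come from Locale::Maketext itself.

  use strict;
  use warnings;

  package MyApp::L10N;                  # hypothetical project base class
  use parent 'Locale::Maketext';

  package MyApp::L10N::en;
  use parent -norequire, 'MyApp::L10N';
  our %Lexicon = (
    'Color'  => 'Color',
    'Cancel' => 'Cancel',
  );

  package MyApp::L10N::en_gb;           # the class tried for the tag "en-GB"
  use parent -norequire, 'MyApp::L10N::en';
  our %Lexicon = (
    'Color' => 'Colour',                # only the differences are listed
  );

  package main;

  my $lh = MyApp::L10N->get_handle('en-GB')
    or die "No language handle available";

  print ref($lh), "\n";                 # class chosen for the tag
  print $lh->maketext('Color'), "\n";   # found in en_gb's own %Lexicon
  print $lh->maketext('Cancel'), "\n";  # found by way of en's %Lexicon

The en_gb class lists only what differs from en, and everything else
is found the same way an inherited method would be.
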
And, most importantly, there are all the practical details of how to
actually go about deriving from Maketext so you can use it for your
interfaces, and the various tools and conventions for starting out and
maintaining individual language modules.

That is all covered in the documentation for Locale::Maketext and the
modules that come with it, available on CPAN.  After having read this
article, which covers the why's of Maketext, the documentation,
which covers the how's of it, should be quite straightforward.

=head2 The Proof in the Pudding: Localizing Web Sites

Maketext and gettext have a notable difference: gettext is in C,
accessible through C library calls, whereas Maketext is in Perl, and
really can't work without a Perl interpreter (although I suppose
something like it could be written for C).  Accidents of history (and
not necessarily lucky ones) have made C++ the most common language for
the implementation of applications like word processors, Web browsers,
and even many in-house applications like custom query systems.  Current
conditions make it somewhat unlikely that the next one of any of these
kinds of applications will be written in Perl, albeit clearly more for
reasons of custom and inertia than out of consideration of what is the
right tool for the job.

However, other accidents of history have made Perl a well-accepted
language for design of server-side programs (generally in CGI form)
for Web site interfaces.  Localization of static pages in Web sites is
trivial, feasible either with simple language-negotiation features in
servers like Apache, or with some kind of server-side inclusions of
language-appropriate text into layout templates.  However, I think
that the localization of Perl-based search systems (or other kinds of
dynamic content) in Web sites, be they public or access-restricted,
is where Maketext will see the greatest use.

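A CGI program built on Locale::Maketext might choose its language
along the lines of the sketch below.  The C<MyApp::L10N> classes, the
lexicon entries, and the header values are all assumptions for the
example; the two C<%ENV> settings merely stand in for what a Web server
would supply.  Calling C<get_handle> with no arguments leaves it to
Maketext to consult the request's Accept-Language preferences.

  use strict;
  use warnings;

  package MyApp::L10N;                  # hypothetical project base class
  use parent 'Locale::Maketext';

  package MyApp::L10N::en;
  use parent -norequire, 'MyApp::L10N';
  our %Lexicon = (
    'Your query matched [quant,_1,file,files].'
      => 'Your query matched [quant,_1,file,files].',
  );

  package MyApp::L10N::fr;
  use parent -norequire, 'MyApp::L10N';
  our %Lexicon = (
    'Your query matched [quant,_1,file,files].'
      => 'Votre recherche donne [quant,_1,fichier,fichiers].',
  );

  package main;

  # Stand-ins for the environment a Web server gives a CGI program.
  local $ENV{REQUEST_METHOD}       = 'GET';
  local $ENV{HTTP_ACCEPT_LANGUAGE} = 'fr-CA, fr;q=0.8, en;q=0.5';

  my $lh = MyApp::L10N->get_handle()
    or die "No language handle available";

  print "Content-Type: text/plain\n\n";
  print $lh->maketext('Your query matched [quant,_1,file,files].', 10), "\n";

No fr_ca class exists in this sketch, so the request for "fr-CA" is
served by the fr class, and a browser asking only for languages that
are not provided would be handed the English one.
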
I presume that it would be only the exceptional Web site that gets
localized for English I<and> Chinese I<and> Italian I<and> Arabic
I<and> Russian, to recall the languages from the beginning of this
article -- to say nothing of German, Spanish, French, Japanese,
Finnish, and Hindi, to name a few languages that benefit from large
numbers of programmers or Web viewers or both.

However, the ever-increasing internationalization of the Web (whether
measured in terms of amount of content, of numbers of content writers
or programmers, or of size of content audiences) makes it increasingly
likely that the interface to the average Web-based dynamic content
service will be localized for two or maybe three languages.  It is my
hope that Maketext will make that task as simple as possible, and will
remove previous barriers to localization for languages dissimilar to
English.

 __END__

Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
from Northwestern University; he specializes in language technology.
Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
Linguistics at the University of New Mexico; he specializes in
morphology and pedagogy of North American native languages.

=head2 References

Alvestrand, Harald Tveit.  1995.  I<RFC 1766: Tags for the
Identification of Languages.>
C<L<http://www.ietf.org/rfc/rfc1766.txt>>
[Now see RFC 3066.]

Callon, Ross, editor.  1996.  I<RFC 1925: The Twelve
Networking Truths.>
C<L<http://www.ietf.org/rfc/rfc1925.txt>>

Drepper, Ulrich, Peter Miller,
and FranE<ccedil>ois Pinard.  1995-2001.  GNU
C<gettext>.  Available in C<L<ftp://prep.ai.mit.edu/pub/gnu/>>, with
extensive docs in the distribution tarball.  [Since
I wrote this article in 1998, the gettext docs have
begun trying harder to come to terms with
plurality.  Whether useful conclusions have come from it
is another question altogether. -- SMB, May 2001]

Forbes, Nevill.  1964.  I<Russian Grammar.>  Third Edition, revised
by J. C. Dumbreck.  Oxford University Press.

=cut

#End