<?xml version="1.0" standalone="yes"?>
<!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN"
     "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd"
[
    <!ENTITY % entities SYSTEM "program_options.ent" >
    %entities;
]>
<section id="program_options.design">
  <title>Design Discussion</title>

  <para>This section focuses on some of the design questions.
  </para>

  <section id="program_options.design.unicode">

    <title>Unicode Support</title>

    <para>Unicode support was one of the features specifically requested
      during the formal review. Throughout this document, "Unicode support" is
      a synonym for "wchar_t" support, assuming that "wchar_t" always uses
      a Unicode encoding.  Also, "ascii" (in lowercase) does not mean
      strict 7-bit ASCII encoding, but rather "char" strings in the local
      8-bit encoding.
    </para>

    <para>
      Generally, &quot;Unicode support&quot; can mean
      many things, but for the program_options library it means that:

      <itemizedlist>
        <listitem>
          <para>Each parser should accept either <code>char*</code>
          or <code>wchar_t*</code>, correctly split the input into option
          names and option values and return the data.
          </para>
        </listitem>
        <listitem>
          <para>For each option, it should be possible to specify whether the conversion
            from string to value uses ascii or Unicode.
          </para>
        </listitem>
        <listitem>
          <para>The library guarantees that:
            <itemizedlist>
              <listitem>
                <para>ascii input is passed to an ascii value without change
                </para>
              </listitem>
              <listitem>
                <para>Unicode input is passed to a Unicode value without change</para>
              </listitem>
              <listitem>
                <para>ascii input passed to a Unicode value, and Unicode input
                  passed to an ascii value will be converted using a codecvt
                  facet (which may be specified by the user).
                </para>
              </listitem>
            </itemizedlist>
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The important point is that it's possible to have some "ascii
      options" together with "Unicode options". There are two reasons for
      this. First, for a given type you might not have the code to extract the
      value from a Unicode string, and it's not good to require that such code
      be written. Second, imagine a reusable library which has some options and
      exposes its options description in its interface. If <emphasis>all</emphasis>
      options must be either ascii or Unicode, and the library does not use any
      Unicode strings, then the author is likely to use ascii options, making
      the library unusable inside Unicode
      applications. Essentially, it would be necessary to provide two versions
      of the library -- ascii and Unicode.
    </para>

    <para>Another important point is that ascii strings are passed through
      without modification. In other words, the library cannot simply convert
      ascii to Unicode and process only Unicode from then on. The problem is that the
      default conversion mechanism -- the <code>codecvt</code> facet -- might
      not work with 8-bit input without additional setup.
    </para>
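
    <para>As a rough illustration of that default mechanism, the sketch below
      widens a <code>char</code> string through the <code>std::codecvt</code>
      facet of a locale. The helper name <code>widen_with_codecvt</code> is
      hypothetical and not part of the library. With the classic "C" locale
      only 7-bit input can be relied on; bytes above 0x7F may be rejected
      unless a suitable locale is supplied.
    </para>

<programlisting language="c++"><![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of program_options): widen a "char" string
// using the codecvt facet of the given locale.
std::wstring widen_with_codecvt(const std::string& in,
                                const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Every wchar_t consumes at least one input byte, so in.size() suffices.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state, in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // Anything other than "ok" is treated as a failure in this sketch;
    // "error" is what an unprepared facet may report for bytes above 0x7F.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("codecvt could not convert the input");

    assert(to_next >= &out[0]);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```
]]></programlisting>

    <para>Passing a locale that matches the local 8-bit encoding as the second
      argument is the kind of additional setup meant above; how such a locale
      is named is platform-specific.
    </para>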

    <para>The Unicode support outlined above is not complete. For example, we
      don't support Unicode option names. Full Unicode support is hard and
      requires a Boost-wide solution. Even comparing two arbitrary Unicode
      strings is non-trivial. Finally, using Unicode in option names is
      related to internationalization, which has its own
      complexities. For example, if option names depend on the current locale,
      then every part of the program, and any other code that uses those names,
      must be internationalized too.
    </para>

    <para>The primary question in implementing the Unicode support is whether
      to use templates and <code>std::basic_string</code> or to use some
      internal encoding and convert between internal and external encodings on
      the interface boundaries.
    </para>

    <para>The choice is mostly between code size and execution
      speed. A templated solution would either link library code into every
      application that uses the library (thereby making a shared library
      impossible), or provide explicit instantiations in the shared library
      (increasing its size). A solution based on an internal encoding would
      necessarily perform conversions in a number of places and would be
      somewhat slower. Since speed is generally not an issue for this library,
      the second solution looks more attractive, but we'll take a closer look
      at the individual components.
    </para>

    <para>For the parsers component, we have three choices:
      <itemizedlist>
        <listitem>
          <para>Use a fully templated implementation: given a string of a
            certain type, a parser will return a &parsed_options; instance
            with strings of the same type (i.e. the &parsed_options; class
            will be templated).</para>
        </listitem>
        <listitem>
          <para>Use internal encoding: same as above, but strings will be converted to and
            from the internal encoding.</para>
        </listitem>
        <listitem>
          <para>Use and partly expose the internal encoding: same as above,
            but the strings in the &parsed_options; instance will be in the
            internal encoding. This might avoid a conversion if the
            &parsed_options; instance is passed directly to other components,
            but can also be dangerous or confusing for a user.
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The second solution appears to be the best -- it does not increase
    the code size much and is cleaner than the third. To avoid extra
    conversions, the Unicode version of &parsed_options; can also store
    strings in the internal encoding.
    </para>
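
    <para>The second choice can be sketched roughly as follows: a single
      non-template core works on <code>std::string</code> in the internal
      encoding, and a thin wide-character entry point converts at the
      boundary. All names here (<code>to_internal</code>,
      <code>parse_core</code>, <code>parse_wide</code>) are hypothetical, the
      internal encoding is assumed to be UTF-8, and each
      <code>wchar_t</code> is assumed to hold one whole code point.
    </para>

<programlisting language="c++"><![CDATA[
```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoder: wchar_t -> UTF-8. Assumes every wchar_t holds one
// whole code point (UTF-16 surrogate pairs are not combined).
std::string to_internal(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        const unsigned long c = static_cast<unsigned long>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | ((c >> 18) & 0x07));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Stand-in for the single, non-template parser core. It only ever sees
// strings in the internal encoding and returns (name, value).
std::pair<std::string, std::string> parse_core(const std::string& token)
{
    // Skip a leading "--" if present, then split at the first '='.
    const std::string::size_type start =
        (token.compare(0, 2, "--") == 0) ? 2 : 0;
    const std::string::size_type eq = token.find('=', start);
    if (eq == std::string::npos)
        return std::make_pair(token.substr(start), std::string());
    return std::make_pair(token.substr(start, eq - start),
                          token.substr(eq + 1));
}

// Thin wide-character entry point: convert once at the interface boundary,
// then reuse the core. No parser code is instantiated for wchar_t.
std::pair<std::string, std::string> parse_wide(const std::wstring& token)
{
    return parse_core(to_internal(token));
}
```
]]></programlisting>

    <para>In this sketch the result stays in the internal encoding, which
      corresponds to letting the Unicode version of the parsed data keep its
      strings unconverted until a value is actually extracted.
    </para>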

    <para>For the options descriptions component, we don't have much
      choice. Since it's not desirable to have either all options use ascii or all
      of them use Unicode, but rather to allow some ascii and some Unicode options, the
      interface of &value_semantic; must work with both. The only way is
      to pass an additional flag indicating whether the strings use ascii or the
      internal encoding. The instance of &value_semantic; can then convert to some
      other encoding if needed.
    </para>
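
    <para>A minimal sketch of such a flag-carrying interface is shown below.
      The class names (<code>value_semantic_sketch</code>,
      <code>narrow_value</code>, <code>wide_value</code>) and the helpers are
      hypothetical stand-ins rather than the library's actual classes, the
      internal encoding is assumed to be UTF-8, and conversions involving the
      local 8-bit encoding are only handled for 7-bit text; a fuller version
      would use a codecvt facet there.
    </para>

<programlisting language="c++"><![CDATA[
```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if every byte is below 0x80.
bool is_seven_bit(const std::string& s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            return false;
    return true;
}

// Hypothetical minimal UTF-8 decoder. Assumes wchar_t can hold each code
// point; overlong forms are not rejected.
std::wstring decode_utf8(const std::string& in)
{
    std::wstring out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::string::size_type extra = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::string::size_type k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += static_cast<wchar_t>(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical stand-in for the value_semantic interface: the flag says
// whether the token is in the internal (UTF-8) or the local 8-bit encoding.
class value_semantic_sketch {
public:
    virtual ~value_semantic_sketch() {}
    virtual void parse(const std::string& token, bool utf8) = 0;
};

class narrow_value : public value_semantic_sketch {
public:
    std::string value;
    void parse(const std::string& token, bool utf8)
    {
        // Sketch limitation: UTF-8 -> local 8-bit is only done for 7-bit text.
        if (utf8 && !is_seven_bit(token))
            throw std::runtime_error("conversion to local encoding not sketched");
        value = token;  // ascii input reaches an ascii value unchanged
    }
};

class wide_value : public value_semantic_sketch {
public:
    std::wstring value;
    void parse(const std::string& token, bool utf8)
    {
        // Sketch limitation: local 8-bit -> wide is only done for 7-bit text.
        if (!utf8 && !is_seven_bit(token))
            throw std::runtime_error("conversion from local encoding not sketched");
        value = decode_utf8(token);  // 7-bit text is already valid UTF-8
    }
};
```
]]></programlisting>

    <para>The caller only needs to know which encoding it is handing over;
      each value object decides for itself whether a conversion is required.
    </para>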

    <para>For the storage component, the only affected function is &store;.
      For Unicode input, the &store; function should convert the value to the
      internal encoding.  It should also inform the &value_semantic; class
      about the encoding used.
    </para>

    <para>Finally, what internal encoding should we use? The
    alternatives are
    <code>std::wstring</code> (using UCS-4 encoding) and
    <code>std::string</code> (using UTF-8 encoding). The alternatives
    differ in:
      <itemizedlist>
        <listitem>
          <para>Speed: UTF-8 is a bit slower</para>
        </listitem>
        <listitem>
          <para>Space: UTF-8 takes less space when input is ascii</para>
        </listitem>
        <listitem>
          <para>Code size: UTF-8 requires additional conversion code. However,
            it allows one to use existing parsers without converting them to
            <code>std::wstring</code>, and such a conversion is likely to create a
            number of new instantiations.
          </para>
        </listitem>

      </itemizedlist>
      There's no clear leader, but the last point seems important, so UTF-8
      will be used.
    </para>

    <para>Choosing the UTF-8 encoding allows the use of existing parsers,
      because 7-bit ascii characters retain their values in UTF-8,
      so searching for 7-bit strings is simple. However, there are
      two subtle issues:
      <itemizedlist>
        <listitem>
          <para>We need to assume that character literals use ascii encoding
          and that inputs use Unicode encoding.</para>
        </listitem>
        <listitem>
          <para>A Unicode character (say '=') can be followed by a 'composing
          character', and the combination is not the same as just '=', so a
          simple search for '=' might find the wrong character.
          </para>
        </listitem>
      </itemizedlist>
      Neither of these issues appears to be critical in practice, since ascii is
      an almost universal encoding and since composing characters following '=' (and
      other characters with special meaning to the library) are not likely to appear.
    </para>
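
    <para>The following sketch illustrates both the benefit and the second
      caveat. It relies on the fact that every byte of a multi-byte UTF-8
      sequence is 0x80 or above, so a byte-wise search for <code>'='</code>
      can never match inside another character. The helper
      <code>split_at_equals</code> is hypothetical, and the code assumes that
      the <code>'='</code> literal is ascii-encoded.
    </para>

<programlisting language="c++"><![CDATA[
```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split "name=value" at the first '=' byte of a
// UTF-8 string. Returns (whole input, "") if there is no '='.
std::pair<std::string, std::string> split_at_equals(const std::string& utf8)
{
    const std::string::size_type pos = utf8.find('=');
    if (pos == std::string::npos)
        return std::make_pair(utf8, std::string());
    return std::make_pair(utf8.substr(0, pos), utf8.substr(pos + 1));
}

// Hypothetical demonstration of the two cases discussed in the text.
bool demo_split()
{
    // Value is two Cyrillic letters (bytes D0 B4 D0 B0); none of those
    // bytes can be mistaken for '=' (0x3D).
    const std::pair<std::string, std::string> plain =
        split_at_equals("name=\xD0\xB4\xD0\xB0");

    // '=' followed by U+0338 COMBINING LONG SOLIDUS OVERLAY (bytes CC B8).
    // A reader sees a "not equal" sign, yet the byte search still splits here
    // and the combining mark ends up at the start of the value.
    const std::pair<std::string, std::string> composed =
        split_at_equals("a=\xCC\xB8" "b");

    return plain.first == "name" && plain.second == "\xD0\xB4\xD0\xB0"
        && composed.first == "a" && composed.second == "\xCC\xB8" "b";
}
```
]]></programlisting>

    <para>The second call shows the composing-character problem in its
      simplest form: the search succeeds, but arguably at a position that is
      not a plain <code>'='</code> from the user's point of view.
    </para>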

  </section>


</section>

<!--
     Local Variables:
     mode: xml
     sgml-indent-data: t
     sgml-parent-document: ("program_options.xml" "section")
     sgml-set-face: t
     End:
-->