<?xml version="1.0" standalone="yes"?>
<!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN"
     "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd"
[
    <!ENTITY % entities SYSTEM "program_options.ent" >
    %entities;
]>
<section id="program_options.design">
  <title>Design Discussion</title>

  <para>This section focuses on some of the design questions.
  </para>

  <section id="program_options.design.unicode">

    <title>Unicode Support</title>

    <para>Unicode support was one of the features specifically requested
      during the formal review. Throughout this document "Unicode support" is
      a synonym for "wchar_t" support, assuming that "wchar_t" always uses
      Unicode encoding. Also, when talking about "ascii" (in lowercase) we do
      not mean strict 7-bit ASCII encoding, but rather "char" strings in the
      local 8-bit encoding.
    </para>

    <para>
      Generally, "Unicode support" can mean
      many things, but for the program_options library it means that:

      <itemizedlist>
        <listitem>
          <para>Each parser should accept either <code>char*</code>
          or <code>wchar_t*</code>, correctly split the input into option
          names and option values, and return the data.
          </para>
        </listitem>
        <listitem>
          <para>For each option, it should be possible to specify whether the conversion
            from string to value uses ascii or Unicode.
          </para>
        </listitem>
        <listitem>
          <para>The library guarantees that:
            <itemizedlist>
              <listitem>
                <para>ascii input is passed to an ascii value without change
                </para>
              </listitem>
              <listitem>
                <para>Unicode input is passed to a Unicode value without change</para>
              </listitem>
              <listitem>
                <para>ascii input passed to a Unicode value, and Unicode input
                  passed to an ascii value, will be converted using a codecvt
                  facet (which may be specified by the user).
                </para>
              </listitem>
            </itemizedlist>
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The important point is that it's possible to have some "ascii
      options" together with "Unicode options". There are two reasons for
      this. First, for a given type you might not have the code to extract the
      value from a Unicode string, and it's not good to require that such code be written.
      Second, imagine a reusable library which has some options and exposes
      the options description in its interface. If <emphasis>all</emphasis>
      options are either ascii or Unicode, and the library does not use any
      Unicode strings, then the author is likely to use ascii options, making
      the library unusable inside Unicode
      applications. Essentially, it would be necessary to provide two versions
      of the library -- ascii and Unicode.
    </para>

    <para>Another important point is that ascii strings are passed through
      without modification. In other words, it's not possible to just convert
      ascii to Unicode and process the Unicode further. The problem is that the
      default conversion mechanism -- the <code>codecvt</code> facet -- might
      not work with 8-bit input without additional setup.
    </para>

    <para>The Unicode support outlined above is not complete. For example, we
      don't support Unicode option names. Unicode support is hard and
      requires a Boost-wide solution. Even comparing two arbitrary Unicode
      strings is non-trivial. Finally, using Unicode in option names is
      related to internationalization, which has its own
      complexities. For example, if option names depend on the current locale,
      then all program parts and other parts which use the name must be
      internationalized too.
    </para>

    <para>The primary question in implementing the Unicode support is whether
      to use templates and <code>std::basic_string</code> or to use some
      internal encoding and convert between internal and external encodings on
      the interface boundaries.
    </para>

    <para>The choice, mostly, is between code size and execution
      speed. A templated solution would either link library code into every
      application that uses the library (thereby making a shared library
      impossible), or provide explicit instantiations in the shared library
      (increasing its size). The solution based on internal encoding would
      necessarily make conversions in a number of places and would be somewhat slower.
      Since speed is generally not an issue for this library, the second
      solution looks more attractive, but we'll take a closer look at
      individual components.
    </para>

    <para>For the parsers component, we have three choices:
      <itemizedlist>
        <listitem>
          <para>Use a fully templated implementation: given a string of a
            certain type, a parser will return a &parsed_options; instance
            with strings of the same type (i.e. the &parsed_options; class
            will be templated).</para>
        </listitem>
        <listitem>
          <para>Use internal encoding: same as above, but strings will be converted to and
            from the internal encoding.</para>
        </listitem>
        <listitem>
          <para>Use and partly expose the internal encoding: same as above,
            but the strings in the &parsed_options; instance will be in the
            internal encoding. This might avoid a conversion if the
            &parsed_options; instance is passed directly to other components,
            but can also be dangerous or confusing for a user.
          </para>
        </listitem>
      </itemizedlist>
    </para>

    <para>The second solution appears to be the best -- it does not increase
      the code size much and is cleaner than the third. To avoid extra
      conversions, the Unicode version of &parsed_options; can also store
      strings in the internal encoding.
    </para>

    <para>For the options descriptions component, we don't have much
      choice.
      Since it's not desirable to have either all options use ascii or all
      of them use Unicode, but rather have some ascii and some Unicode options, the
      interface of the &value_semantic; must work with both. The only way is
      to pass an additional flag telling whether the strings use ascii or the internal encoding.
      The instance of &value_semantic; can then convert into some
      other encoding if needed.
    </para>

    <para>For the storage component, the only affected function is &store;.
      For Unicode input, the &store; function should convert the value to the
      internal encoding. It should also inform the &value_semantic; class
      about the encoding used.
    </para>

    <para>Finally, what internal encoding should we use? The
      alternatives are:
      <code>std::wstring</code> (using UCS-4 encoding) and
      <code>std::string</code> (using UTF-8 encoding). The differences between
      the alternatives are:
      <itemizedlist>
        <listitem>
          <para>Speed: UTF-8 is a bit slower</para>
        </listitem>
        <listitem>
          <para>Space: UTF-8 takes less space when input is ascii</para>
        </listitem>
        <listitem>
          <para>Code size: UTF-8 requires additional conversion code. However,
            it allows one to use existing parsers without converting them to
            <code>std::wstring</code>, and such a conversion is likely to create a
            number of new instantiations.
          </para>
        </listitem>

      </itemizedlist>
      There's no clear leader, but the last point seems important, so UTF-8
      will be used.
    </para>

    <para>Choosing the UTF-8 encoding allows the use of existing parsers,
      because 7-bit ascii characters retain their values in UTF-8,
      so searching for 7-bit strings is simple.
      However, there are
      two subtle issues:
      <itemizedlist>
        <listitem>
          <para>We need to assume the character literals use ascii encoding
          and that inputs use Unicode encoding.</para>
        </listitem>
        <listitem>
          <para>A Unicode character (say '=') can be followed by a 'composing
            character' and the combination is not the same as just '=', so a
            simple search for '=' might find the wrong character.
          </para>
        </listitem>
      </itemizedlist>
      Neither of these issues appears to be critical in practice, since ascii is
      an almost universal encoding and since composing characters following '=' (and
      other characters with special meaning to the library) are not likely to appear.
    </para>

  </section>


</section>

<!--
     Local Variables:
     mode: xml
     sgml-indent-data: t
     sgml-parent-document: ("program_options.xml" "section")
     sgml-set-face: t
     End:
-->