------------------------------------------------------------------------------
--                                                                          --
--                         GNAT RUN-TIME COMPONENTS                         --
--                                                                          --
--          A D A . W I D E _ C H A R A C T E R S . U N I C O D E           --
--                                                                          --
--                                 S p e c                                  --
--                                                                          --
--          Copyright (C) 2005-2020, Free Software Foundation, Inc.         --
--                                                                          --
-- GNAT is free software;  you can  redistribute it  and/or modify it under --
-- terms of the  GNU General Public License as published  by the Free Soft- --
-- ware  Foundation;  either version 3,  or (at your option) any later ver- --
-- sion.  GNAT is distributed in the hope that it will be useful, but WITH- --
-- OUT ANY WARRANTY;  without even the  implied warranty of MERCHANTABILITY --
-- or FITNESS FOR A PARTICULAR PURPOSE.                                     --
--                                                                          --
-- As a special exception under Section 7 of GPL version 3, you are granted --
-- additional permissions described in the GCC Runtime Library Exception,   --
-- version 3.1, as published by the Free Software Foundation.               --
--                                                                          --
-- You should have received a copy of the GNU General Public License and    --
-- a copy of the GCC Runtime Library Exception along with this program;     --
-- see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see    --
-- <http://www.gnu.org/licenses/>.                                          --
--                                                                          --
-- GNAT was originally developed  by the GNAT team at  New York University. --
-- Extensive contributions were provided by Ada Core Technologies Inc.      --
--                                                                          --
------------------------------------------------------------------------------

--  Unicode categorization routines for Wide_Character. Note that this
--  package is strictly speaking Ada 2005 (since it is a child of an
--  Ada 2005 unit), but we make it available in Ada 95 mode, since it
--  only deals with wide characters.

with System.UTF_32;

package Ada.Wide_Characters.Unicode is
   pragma Pure;

   --  The following type defines the categories from the Unicode definitions.
   --  The one addition we make is Fe, which represents the characters FFFE
   --  and FFFF in any of the planes.

   type Category is new System.UTF_32.Category;
   --  Cc   Other, Control
   --  Cf   Other, Format
   --  Cn   Other, Not Assigned
   --  Co   Other, Private Use
   --  Cs   Other, Surrogate
   --  Ll   Letter, Lowercase
   --  Lm   Letter, Modifier
   --  Lo   Letter, Other
   --  Lt   Letter, Titlecase
   --  Lu   Letter, Uppercase
   --  Mc   Mark, Spacing Combining
   --  Me   Mark, Enclosing
   --  Mn   Mark, Nonspacing
   --  Nd   Number, Decimal Digit
   --  Nl   Number, Letter
   --  No   Number, Other
   --  Pc   Punctuation, Connector
   --  Pd   Punctuation, Dash
   --  Pe   Punctuation, Close
   --  Pf   Punctuation, Final quote
   --  Pi   Punctuation, Initial quote
   --  Po   Punctuation, Other
   --  Ps   Punctuation, Open
   --  Sc   Symbol, Currency
   --  Sk   Symbol, Modifier
   --  Sm   Symbol, Math
   --  So   Symbol, Other
   --  Zl   Separator, Line
   --  Zp   Separator, Paragraph
   --  Zs   Separator, Space
   --  Fe   relative position FFFE/FFFF in plane

   function Get_Category (U : Wide_Character) return Category;
   pragma Inline (Get_Category);
   --  Given a Wide_Character, returns the corresponding Category, or Cn if
   --  the code does not have an assigned Unicode category.

   --  The following functions perform category tests corresponding to lexical
   --  classes defined in the Ada standard. There are two interfaces for each
   --  function. The second takes a Category (e.g. returned by Get_Category).
   --  The first takes a Wide_Character. The form taking the Wide_Character is
   --  typically more efficient than calling Get_Category, but if several
   --  different tests are to be performed on the same code, it is more
   --  efficient to use Get_Category to get the category, then test the
   --  resulting category.
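   --
   --  As a minimal illustration of that second pattern (a sketch only, not
   --  part of this unit; the names Kind and Classify are hypothetical, and a
   --  use clause for this package is assumed), a client can fetch the
   --  category once and reuse it for several tests:
   --
   --     type Kind is (Letter, Digit, Mark, Something_Else);
   --
   --     function Classify (W : Wide_Character) return Kind is
   --        Cat : constant Category := Get_Category (W);
   --     begin
   --        if Is_Letter (Cat) then
   --           return Letter;
   --        elsif Is_Digit (Cat) then
   --           return Digit;
   --        elsif Is_Mark (Cat) then
   --           return Mark;
   --        else
   --           return Something_Else;
   --        end if;
   --     end Classify;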

   function Is_Letter (U : Wide_Character) return Boolean;
   function Is_Letter (C : Category)       return Boolean;
   pragma Inline (Is_Letter);
   --  Returns true iff U is a letter that can be used to start an identifier,
   --  or if C is one of the corresponding categories, which are the following:
   --    Letter, Uppercase (Lu)
   --    Letter, Lowercase (Ll)
   --    Letter, Titlecase (Lt)
   --    Letter, Modifier  (Lm)
   --    Letter, Other     (Lo)
   --    Number, Letter    (Nl)

   function Is_Digit (U : Wide_Character) return Boolean;
   function Is_Digit (C : Category)       return Boolean;
   pragma Inline (Is_Digit);
   --  Returns true iff U is a digit that can be used to extend an identifier,
   --  or if C is one of the corresponding categories, which are the following:
   --    Number, Decimal Digit (Nd)

   function Is_Line_Terminator (U : Wide_Character) return Boolean;
   pragma Inline (Is_Line_Terminator);
   --  Returns true iff U is an allowed line terminator for source programs,
   --  that is, if U is in the category Zp (Separator, Paragraph) or Zl
   --  (Separator, Line), or if U is a conventional line terminator (CR, LF,
   --  VT, FF). There is no category version for this function, since the
   --  set of characters does not correspond to a set of Unicode categories.

   function Is_Mark (U : Wide_Character) return Boolean;
   function Is_Mark (C : Category)       return Boolean;
   pragma Inline (Is_Mark);
   --  Returns true iff U is a mark character which can be used to extend an
   --  identifier, or if C is one of the corresponding categories, which are
   --  the following:
   --    Mark, Non-Spacing       (Mn)
   --    Mark, Spacing Combining (Mc)

   function Is_Other (U : Wide_Character) return Boolean;
   function Is_Other (C : Category)       return Boolean;
   pragma Inline (Is_Other);
   --  Returns true iff U is an other format character, which means that it
   --  can be used to extend an identifier, but is ignored for the purposes of
   --  matching of identifiers, or if C is one of the corresponding categories,
   --  which are the following:
   --    Other, Format (Cf)

   function Is_Punctuation (U : Wide_Character) return Boolean;
   function Is_Punctuation (C : Category)       return Boolean;
   pragma Inline (Is_Punctuation);
   --  Returns true iff U is a punctuation character that can be used to
   --  separate pieces of an identifier, or if C is one of the corresponding
   --  categories, which are the following:
   --    Punctuation, Connector (Pc)

   function Is_Space (U : Wide_Character) return Boolean;
   function Is_Space (C : Category)       return Boolean;
   pragma Inline (Is_Space);
   --  Returns true iff U is considered a space to be ignored, or if C is one
   --  of the corresponding categories, which are the following:
   --    Separator, Space (Zs)

   function Is_NFKC (U : Wide_Character) return Boolean;
   pragma Inline (Is_NFKC);
   --  Returns True if the Wide_Character designated by U could be present
   --  in a string normalized to Normalization Form KC (as defined by Clause
   --  21 of ISO/IEC 10646:2017), otherwise returns False.
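
   --  Illustrative sketch only (not part of this unit; Is_Identifier_Like is
   --  a hypothetical client function, and a use clause for this package is
   --  assumed). It combines the identifier-related tests declared above and
   --  deliberately ignores the further rules of the Ada standard, such as
   --  those on reserved words and on the placement of connector punctuation:
   --
   --     function Is_Identifier_Like (S : Wide_String) return Boolean is
   --     begin
   --        --  The first character must be able to start an identifier
   --
   --        if S'Length = 0 or else not Is_Letter (S (S'First)) then
   --           return False;
   --        end if;
   --
   --        --  Remaining characters: fetch the category once per character
   --
   --        for J in S'First + 1 .. S'Last loop
   --           declare
   --              Cat : constant Category := Get_Category (S (J));
   --           begin
   --              if not (Is_Letter (Cat)
   --                       or else Is_Digit (Cat)
   --                       or else Is_Mark (Cat)
   --                       or else Is_Other (Cat)
   --                       or else Is_Punctuation (Cat))
   --              then
   --                 return False;
   --              end if;
   --           end;
   --        end loop;
   --
   --        return True;
   --     end Is_Identifier_Like;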

   function Is_Non_Graphic (U : Wide_Character) return Boolean;
   function Is_Non_Graphic (C : Category)       return Boolean;
   pragma Inline (Is_Non_Graphic);
   --  Returns true iff U is considered to be a non-graphic character, or if C
   --  is one of the corresponding categories, which are the following:
   --    Other, Control                           (Cc)
   --    Other, Private Use                       (Co)
   --    Other, Surrogate                         (Cs)
   --    Separator, Line                          (Zl)
   --    Separator, Paragraph                     (Zp)
   --    FFFE or FFFF positions in any plane      (Fe)
   --
   --  Note that the Ada category format effector is subsumed by the above
   --  list of Unicode categories.
   --
   --  Note that Other, Not Assigned (Cn) is quite deliberately not included
   --  in the list of categories above. This means that should any of these
   --  code positions be defined in the future with graphic characters, they
   --  will be allowed without a need to change implementations or the
   --  standard.
   --
   --  Note that Other, Format (Cf) is also quite deliberately not included
   --  in the list of categories above. This means that these characters can
   --  be included in character and string literals.

   function Is_Basic (U : Wide_Character) return Boolean;
   pragma Inline (Is_Basic);
   --  Returns True if the Wide_Character designated by U has no
   --  Decomposition Mapping in the code charts of ISO/IEC 10646:2017,
   --  otherwise returns False.

   function To_Basic (U : Wide_Character) return Wide_Character;
   pragma Inline (To_Basic);
   --  Returns the Wide_Character whose code point is given by the first value
   --  of its Decomposition Mapping in the code charts of ISO/IEC 10646:2017 if
   --  any, returns U otherwise.

   --  The following function is used to fold to upper case, as required by
   --  the Ada 2005 standard rules for identifier case folding. Two
   --  identifiers are equivalent if they are identical after folding all
   --  letters to upper case using this routine.
   --  A corresponding function to fold to lower case is also provided.

   function To_Lower_Case (U : Wide_Character) return Wide_Character;
   pragma Inline (To_Lower_Case);
   --  If U represents an upper case letter, returns the corresponding lower
   --  case letter, otherwise U is returned unchanged. The folding is locale
   --  independent as defined by documents referenced in the note in section
   --  1 of ISO/IEC 10646:2003.

   function To_Upper_Case (U : Wide_Character) return Wide_Character;
   pragma Inline (To_Upper_Case);
   --  If U represents a lower case letter, returns the corresponding upper
   --  case letter, otherwise U is returned unchanged. The folding is locale
   --  independent as defined by documents referenced in the note in section
   --  1 of ISO/IEC 10646:2003.

end Ada.Wide_Characters.Unicode;