 %***************************************************************************%
 %                                                                           %
 %       Copyright (C) 2005, 2006 Sampo Pyysalo, Sophie Aubin                %
 %       Copyright (C) 2009, 2012 Linas Vepstas                              %
 %  See file "LICENSE" for information about commercial use of this system   %
 %                                                                           %
 %***************************************************************************%

% This file contains regular expressions that are used to match
% tokens not found in the dictionary. Each regex is given a name which
% determines the disjuncts assigned when the regex matches; this name
% must be defined in the dictionary along with the appropriate disjuncts.
% Note that the order of the regular expressions matters: matches will
% be attempted in the order in which the regexes appear in this file,
% and only the first match will be used.
%
% Regexes that are preceded by !, if they match a token, stop
% further match tries of the same regex name. Thus, they can serve
% as a kind of negative look-ahead.

% Numbers.
% XXX, we need to add utf8 U+00A0 "no-break space"
%
% Allows at most two colons in hour-minute-second HH:MM:SS expressions
% Allows at most two digits between colons
<HMS-TIME>: /^[0-9][0-9]?(:[0-9][0-9]?(:[0-9][0-9]?)?)?$/

% e.g. 1950's leading number can be higher, for science fiction.
% Must be four digits, or possibly three. Must end in s, 's ’s
<DECADE-DATE>: /^([1-4][0-9][0-9]|[1-9][0-9])0(s|'s|’s)$/

% Similar to above, but does not end in s. Allows at most four digits.
% We process this before NUMBERS below, so that this is matched first.
<YEAR-DATE>: /^([1-4][0-9]{3}|[1-9][0-9]{0,2})$/

% Day-of-month names; this regex will match before the one below.
<DAY-ORDINALS>: /^(1st|2nd|3rd|[4-9]th|1[0-9]th|2(0th|1st|2nd|3rd|[4-9]th)|30th|31st)$/

% Ordinal numbers; everything except 1st through 13th
% is handled by regex.
<ORDINALS>: /^[1-9][0-9]*(0th|1st|2nd|3rd|[4-9]th)$/

% Allows any number of commas or periods
% Be careful not to match the period at the end of a sentence;
% for example: "It happened in 1942."
<NUMBERS>: /^[0-9,.]*[0-9]$/
% This parses signed numbers and ranges, e.g. "-5" and "5-10" and "9+/-6.5"
<NUMBERS>: /^[0-9.,-]*[0-9](\+\/-[0-9.,-]*[0-9])?$/
% Parses simple fractions e.g. "1/60" with no decimal points or anything fancy
<FRACTION>: /^[0-9]+\/[0-9]+$/
% "10(3)" exponent (used in PubMed)
<NUMBERS>: /^[0-9.,-]*[0-9][0-9.,-]*\([0-9:.,-]*[0-9][0-9.,-]*\)$/

% Roman numerals
% The first expr has the problem that it matches an empty string. The
% cure for this is to use look-ahead, but neither the GNU nor the BSD
% regex libs support look-ahead. I can't think of a better solution.
<ROMAN-NUMERAL-WORDS>: /^M*(CM|D?C{0,3}|CD)(XC|L?X{0,3}|XL)(IX|V?I{0,3}|IV)$/
% ROMAN-NUMERAL-WORDS: /^(?=(M|C|D|L|X|V|I)+)M*(CM|D?C{0,3}|CD)(XC|L?X{0,3}|XL)(IX|V?I{0,3}|IV)$/
% ROMAN-NUMERAL-WORDS: /^(?=.+)M*(CM|D?C{0,3}|CD)(XC|L?X{0,3}|XL)(IX|V?I{0,3}|IV)$/

% Strings of initials. e.g. "Dr. J.G.D. Smith lives on Main St."
% Make it at least two letters long, as otherwise it clobbers
% single-letter handling in the dict, which is different.
<INITIALS>: /^[A-Z]\.([A-Z]\.)+$/

% Strings of two or more upper-case letters. These might be initials,
% but are more likely to be titles (e.g. MD LLD JD) and might also
% be part numbers (see below, PART-NUMBER:)
<ALL-UPPER>: /^[A-Z]([A-Z])+$/

% Greek letters with numbers
<GREEK-LETTER-AND-NUMBER>: /^(alpha|beta|gamma|delta|epsilon|zeta|eta|theta|iota|kappa|lambda|mu|nu|xi|omicron|pi|rho|sigma|tau|upsilon|phi|chi|psi|omega)-?[0-9]+$/
<PL-GREEK-LETTER-AND-NUMBER>: /^(alpha|beta|gamma|delta|epsilon|zeta|eta|theta|iota|kappa|lambda|mu|nu|xi|omicron|pi|rho|sigma|tau|upsilon|phi|chi|psi|omega)s-?[0-9]+$/

% Some "safe" derived units. Simple units are in dictionary.
% The idea here is for the regex to match something that is almost
% certainly part of a derived unit, and allow the rest to be
% anything; this way we can capture difficult derived units such
% as "mg/kg/day" and even oddities such as "micrograms/mouse/day"
% without listing them explicitly.
% TODO: add more.
% Some (real) misses from these:
% micrograms.kg-1.h-1 microM-1 J/cm2 %/day mN/m cm/yr
% m/s days/week ml/s degrees/sec cm/sec cm/s mm/s N/mm (is that a unit?)
% cuts/minute clicks/s beats/minute x/week W/kg/W %/patient-year
% microIU/ml degrees/s counts/mm2 cells/mm3 tumors/mouse
% mm/sec ml/hr mJ/cm(2) m2/g amol/mm2 animals/group
% h-1 min-1 day-1 cm-1 mg-1 kg-1 mg.m-2.min-1 ms.cm-1 g-1
% sec-1 ms-1 ml.min.-1kg-1 ml.hr-1
% also, both kilometer and kilometers seem to be absent(!)
% remember "mm"!

% grams/anything
<UNITS>: /^([npmk]|milli|micro|nano|pico|femto|atto|kilo|mega|tera)?(g|grams?)\//

% mol/anything
<UNITS>: /^([fnmp]|milli|micro|nano|pico|femto|atto|mu)?mol(es)?\//

% common endings
<UNITS>: /^[a-zA-Z\/.]+\/((m|micro)?[lLg]|mg|kg|mol|min|day|h|hr)$/

% common endings, except in the style "mg.kg-1" instead of "mg/kg".
<UNITS>: /^[a-zA-Z\/.1-]+\.((m|micro)?[lLg]|mg|kg|mol|min|day|h|hr)(-1|\(-1\))$/

% combinations of numbers and units, e.g. "50-kDa", "1-2h"
% TODO: Clean up and check that these are up-to-date wrt the
% dictionary-recognized units; this is quite a mess currently.
% TODO: Extend the "number" part of the regex to allow anything
% that the NUMBER regex matches.
% One problem here is a failure to split up the expression ...
% e.g. "2hr" becomes 2 - ND - hr with the ND link.
% But 2-hr is treated as a single word ('It is a 2-hr wait')
% NUMBER-AND-UNIT: /^[0-9.,-]+(msec|s|min|hour|h|hr|day|week|wk|month|year|yr|kDa|kilodalton|base|kilobase|base-pair|kD|kd|kDa|bp|nt|kb|mm|mg|cm|nm|g|Hz|ms|kg|ml|mL|km|microm|\%)$/
% The above is commented out; it screws up handling of unit suffixes,
% for example: "Zangbert stock fell 30% to $2.50 yesterday."

% fold-words. Matches NUMBER-fold, where NUMBER can be either numeric
% or a spelled-out number, and the hyphen is optional. Note that for
% spelled-out numbers, anything is allowed between the "initial" number
% and "fold" to catch e.g. "two-to-three fold" ("fourteen" etc. are absent
% as the prefix "four" is sufficient to match).
<FOLD-WORDS>: /^[0-9.,:-]*[0-9]([0-9.,:-]|\([0-9.,:-]*[0-9][0-9.,:-]*\)|\+\/-)*-?fold$/
<FOLD-WORDS>: /^(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fifteen|twenty|thirty|fifty|hundred|thousand|million).*fold$/

% Plural proper nouns.
% Make sure that apostrophe-s is split out correctly.
<PL-CAPITALIZED-WORDS>: /^[[:upper:]].*[^iuoys'’]s$/

% Other proper nouns.
% We demand that these end with an alphanumeric, i.e. explicitly
% reject punctuation. We don't want this regex to "swallow" any trailing
% commas, colons, or periods/question-marks at the end of sentences.
% In addition, this must not swallow words ending in 's 'll etc.
% (... any affix, for that matter ...) and so no embedded apostrophe
<CAPITALIZED-WORDS>: /^[[:upper:]][^'’]*[^[:punct:]]$/

% SUFFIX GUESSING
% For all suffix-guessing patterns, we insist that the pattern start
% with an alphanumeric. This is needed to guarantee that the
% prefix-stripping code works correctly, as otherwise, the regex will
% gobble the prefix. So for example: "We left (carrying the dog) and
% Fred followed."
% Since "(carrying" is not in the dict, we need to be
% sure to not match the leading paren so that it will get stripped.
%
<ING-WORDS>: /^\w.+ing$/

% Plurals or verb-s. Make sure that apostrophe-s is split out correctly.
% e.g. "The subject's name is John Doe." should be
%      +--Ds--+---YS--+--Ds-+
%      |      |       |     |
%     the subject.n 's.p name.n
<S-WORDS>: /^\w.+[^iuoys'’]s$/

% Verbs ending -ed.
<ED-WORDS>: /^\w.+ed$/

% Adverbs ending -ly.
<LY-WORDS>: /^\w.+ly$/

% Nouns ending in -ism, -asm (chiliasm .. ) Usually mass nouns
% Stubbed out for now; I'm not convinced this improves accuracy.
% ISM-WORDS: /^\w.+asm$/
% ISM-WORDS: /^\w.+ism$/

% Corresponding count noun version of above (chiliast...)
% AST-WORDS: /^\w.+ast$/
% AST-WORDS: /^\w.+ist$/

% Corresponding adjectival form of above
<ADJ-WORDS>: /^\w.+astic$/
<ADJ-WORDS>: /^\w.+istic$/

% Nouns ending -ation stubbed out in BioLG, stub out here ...
%ATION-WORDS: /^\w.+ation$/

% Extension by LIPN 11/10/2005
% nouns -- typically seen in (bio-)chemistry texts
% synthetase, kinase
% 5-(hydroxymethyl)-2’-deoxyuridine
% hydroxyethyl, hydroxymethyl
% septation, reguion
% isomaltotetraose, isomaltotriose
% glycosylphosphatidylinositol
% iodide, oligodeoxynucleotide
% chronicity, hypochromicity
<MC-NOUN-WORDS>: /^\w.+ase$/
<MC-NOUN-WORDS>: /^\w.+ene$/
<MC-NOUN-WORDS>: /^\w.+ine?$/
<MC-NOUN-WORDS>: /^\w.+yl$/
<MC-NOUN-WORDS>: /^\w.+ion$/
<MC-NOUN-WORDS>: /^\w.+ose$/
<MC-NOUN-WORDS>: /^\w.+ol$/
<MC-NOUN-WORDS>: /^\w.+ide$/
<MC-NOUN-WORDS>: /^\w.+ity$/

% Can take TOn+.
% Must appear after above, to avoid clash with +ity
<NOUN-TO-WORDS>: /^\w.+ty$/
<NOUN-TO-WORDS>: /^\w.+cy$/
<NOUN-TO-WORDS>: /^\w.+nce$/

% replicon, intron
<C-NOUN-WORDS>: /^\w.+o[rn]$/

% adjectives
% exogenous, heterologous
% intermolecular, intramolecular
% glycolytic, ribonucleic, uronic
% ribosomal, ribsosomal
% nonpermissive, thermosensitive
% inducible, metastable
<ADJ-WORDS>: /^\w.+ous$/
<ADJ-WORDS>: /^\w.+ar$/
<ADJ-WORDS>: /^\w.+ic$/
<ADJ-WORDS>: /^\w.+al$/
<ADJ-WORDS>: /^\w.+ive$/
<ADJ-WORDS>: /^\w.+ble$/

% Usually capitalized place names: Georgian, Norwegian
<ADJ-WORDS>: /^\w.+ian$/

% latin (postposed) adjectives
% influenzae, tarentolae
% pentosaceus, luteus, carnosus
<LATIN-ADJ-WORDS>: /^\w.+ae$/
<LATIN-ADJ-WORDS>: /^\w.+us$/ % must appear after -ous in this file

% latin (postposed) adjectives or latin plural noun
% brevis, israelensis
% japonicum, tabacum, xylinum
<LATIN-ADJ-P-NOUN-WORDS>: /^\w.+is?$/
<LATIN-ADJ-S-NOUN-WORDS>: /^\w.+um$/


% Hyphenated words. In the original LG morpho-guessing system that
% predated the regex-based system, hyphenated words were detected
% before ING-WORDS, S-WORDS etc., causing e.g. "cross-linked" to be
% treated as a HYPHENATED-WORD (a generic adjective/noun), and
% never a verb. To return to this ordering, move this regex just
% after the CAPITALIZED-WORDS regex.
% We also match on commas, dots, brackets:
% n-amino-3-azabicyclo[3.3.0]octane
% 3'-Amino-2',3'-dideoxyguanosine
% N-Phenylsulphonyl-N'-(3-azabicycloalkyl)
% []...]
% means "match right-bracket"
% Explicitly call out (5'|3') so that we don't allow a generic match to 'll
% /^[[:alnum:]][][:alnum:],:.\[-]*-[][:alnum:],:.\[-]*[[:alnum:]]$/
<HYPHENATED-WORDS>: !/--/
<HYPHENATED-WORDS>: !/[[:punct:]]$/
<HYPHENATED-WORDS>:
  /^([[:alnum:]]|5'|3'|2'|N')([][:alnum:],:.()[-]|5'|3'|2'|N')*-[][:alnum:],:.()[-]*[[:alnum:]]*$/

% Emoticon checks must come *after* the above, so that the above take precedence.
% See Wikipedia List_of_emoticons (also the References section).
%
% Emoticons must be entirely made of punctuation, length 2 or longer ;)
% XXX [:punct:] is strangely broken, I have to add ;-< explicitly
% XXX: Don't use [:punct:]. Do NOT include period!!
% XXX: The problem with below is that 5. 7. 8. get recognized as emoticons,
% which then prevents splitting for list numbers. (e.g. "step 5. Do this.")
%
% Arghh. Other valid number expressions are clobbered by the emoticons.
% For example: $5 $7 8%. The quick fix is to remove the numbers.
% Other breakages: The below clobbers "Bob, who ..." because it
% matches Bob, as an emoticon.
%
% EMOTICON: /^[[:punct:];BDOpTX0578C☆ಠ●@◎~][[:punct:]<bcdDLmoOpPSTvX0358ಠっ○ 。゜✿☆*レツ◕●≧∇≦□◇@◎∩ω旦ヨ彡ミ‿◠ ̄ー~━-]+$/
% EMOTICON: /^[!"#$%&'()*+,\-/:;<=>?@[\\\]^_`{|}~;BDOpTX0578C☆ಠ●@◎~][!"#$%&'()*+,\-/:;<=>?@[\\\]^_`{|}~<bcdDLmoOpPSTvX0358ಠっ○ 。゜✿☆*レツ◕●≧∇≦□◇@◎∩ω旦ヨ彡ミ‿◠ ̄ー~━-]+$/
<EMOTICON>: !/^"|[[:alnum:]]+"$/
% "◠" is matched by [:punct:] using "libc" or "tre", but not using PCRE.
% Hence it has been added to the leading character subexpression. (Maybe
% there are additional such characters.)
<EMOTICON>: /^[[:punct:];BC☆ಠ●@◎~◠][-!"#$%&'()+,:;<=>?@[\\^_`{|}~<cdDLmoOpPSTvXಠっ○ 。゜✿☆*レツ◕●≧∇≦□◇@◎∩ω旦ヨ彡ミ‿◠ ̄ー~━-]+$/

% Part numbers should not match words with punctuation at their end.
% Else sentences like "I saw him on January 21, 1990" have problems.
% They should contain at least one number, and should not have dashes at their
% start or end. A $ sign at the start is also too confusing.
% The current regex system and the syntax of this file are not expressive enough
% for things that should not be included. For example, we cannot prevent several
% sequential "#" or dashes. It may match a word consisting of number+units, but
% separate_word() will generate an alternative anyway.
% The second part of this regex is for NNN-NNN in sentences like
% "The plane is a 747-400". However, such words currently match NUMBERS.
<PART-NUMBER>:
  /^[A-Z0-9#][A-Z0-9$\/#]*[A-Z0-9$\/#,.-]*[0-9][A-Z0-9$\/#,.-]*[A-Z0-9$\/#]+$|^[1-9][0-9]+[\/-][0-9+]$/

% Single, stand-alone "quoted" "words" (so-called "scare" quotes).
<QUOTED-WORD>: /^"[[:alnum:].-]+"$/

% Sequence of punctuation marks. If some mark appears in the affix table
% such as a period, comma, dash or underscore, and there's a sequence of
% these, then treat it as a "fill-in-the-blank" placeholder.
% This matters only for punc. appearing in the affix table, since the
% tokenizer explicitly mangles based on these punctuation marks.
%
% Look for at least four in a row.
<UNKNOWN-WORD>: /^[.,-]{4}[.,-]*$/
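% For illustration, taking the UNKNOWN-WORD regex in isolation (that is,
% assuming the token reaches it intact and no earlier regex matched first):
% "...." and "....." match, while "..." does not, since it has only
% three marks.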