%
% Affixes get stripped off the left and right side of words
% i.e. spaces are inserted between the affix and the word itself.
%
% Some of the funky UTF-8 parentheses are used in Asian texts.
% In order to allow single straight quote ' and double straight quote ''
% to be stripped off from both the left and the right, they are
% distinguished by the suffix .x and .y (as with Mr.x Mrs.x or Jr.y Sr.y)
%
% 。 is an end-of-sentence marker used in Japanese texts.

% Punctuation appearing on the right-side of words.
% Note: the ellipsis ....y must appear *before* the dot ".", else the
% splitting won't work right.
")" "}" "]" ">" "".y" » 〉 ) 〕 》 】 ] 』 」 "’’" "’" “ ''.y '.y `.y
"%" "," ....y "." 。.y ‧ ":" ";" "?" "!" ‽ ؟ ? ! ….y "”" ━.y –.y ー.y ‐.y 、.y
~ ¢ ₵ ™ ℠
 : RPUNC+;

% Punctuation appearing on the left-side of words.
"(" "{" "[" "<" "".x" « 〈 ( 〔 《 【 [ 『 「 、.x `.x `` „ ‘ ''.x '.x ….x ....x
¿ ¡ "$" US$ USD C$
£ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점
† †† ‡ § ¶ © ® ℗ № "#"
* • ⁂ ❧ ☞ ◊ ※ ○ 。.x ゜ ✿ ☆ * ◕ ● ∇ □ ◇ @ ◎
–.x ━.x ー.x -- - ‧.x
 : LPUNC+;


% The below is a quoted list, used during tokenization. Do NOT put
% spaces in between the various quotation marks!!
""«»《》【】『』`„“": QUOTES+;

% The below is a quoted list, used during tokenization. Do NOT put
% spaces in between the various symbols!!
"()¿¡†‡§¶©®℗№#*•⁂❧☞◊※○。゜✿☆*◕●∇□◇@◎–━ー---‧": BULLETS+;