README

Dictionary Data
---------------
Research notes.

There are currently 63 data files in the 'words' directory.
Of these, 8 are not distinct (*biolg*, *medical*) and so there
are effectively just 55 "clusters" here.

There are 1754 semicolons in 4.0.dict and 1772 colons.  This implies
that there are approximately 1650 to 1700 word clusters in 4.0.dict,
since many of the semicolons appear in lines that merely define
new classes.
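
The punctuation counts above can be reproduced with a short script. This is
a minimal sketch, not the procedure actually used for these notes: it assumes
the dictionary is plain text, it counts every ';' and ':' character
(including any that sit inside comments), and the helper name
count_punctuation is hypothetical.

```python
from collections import Counter


def count_punctuation(text, marks=";:"):
    """Count how often each character in `marks` occurs in `text`."""
    tally = Counter(ch for ch in text if ch in marks)
    return {mark: tally[mark] for mark in marks}


# Tiny made-up dictionary fragment, used only to exercise the function.
sample = "dog cat: D- & S+;\n<class>: A+ or B-;\n"
print(count_punctuation(sample))

# For the real file (the path is an assumption), something like:
# with open("4.0.dict", encoding="utf-8", errors="replace") as f:
#     print(count_punctuation(f.read()))
```

A raw character count like this is only an upper bound on the number of
clusters, which is why the estimate above is lower than either count.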

A better count of the contents of 4.0.dict yields 1430 distinct clusters.

There seem to be 86863 word forms in the dicts.

Example cluster from Siva's dataset:

cluster469
   bets.n -- ../blah/words.n.2.s
   doubts.n -- ../blah/blah-29
   excuses.n -- ../blah/blah-34
   foes.n -- ../blah/words.n.2.s
   warnings.n -- ../blah/blah-29

Actual disjunct usage:

SELECT inflected_word, disjunct, count, log_cond_probability
  FROM disjuncts
  WHERE inflected_word='bets.n'
  ORDER BY log_cond_probability;

Columns: inflected_word | disjunct | count | log_cond_probability

 bets.n         | Jp- Dmc-                      |   5.38320328295231 |     2.68897695164809
 bets.n         | Op-                           |   6.59906960930676 |     2.79728207561233
 bets.n         | Op- Dmc-                      |   4.49985344521703 |     2.94756384236018
 bets.n         | Jp- A- MXp+ MXp+              |   2.94644784927368 |      3.5584651263364
 bets.n         | Jp- A-                        |    2.8032719194889 |     3.63033016407109
 bets.n         | Op- Mv+                       |   2.38083738088607 |     3.86597277463304

 doubts.n       | Op-                           |    14.7235374869777 |     2.53482148983126
 doubts.n       | Op- Dmc-                      |    12.8798744678498 |     2.75360123030737
 doubts.n       | Jp- A-                        |    3.70244218036532 |     4.39933529761974
 doubts.n       | Op- A-                        |    4.28538444498555 |     4.52871084843059
 doubts.n       | Opt-                          |    2.90120184421541 |     4.75116183218627
 doubts.n       | Jp- Dmc-                      |    2.40070396848023 |     5.02435498790713

 excuses.n      | Op- Dmc-                 |   5.50880998373031 |     2.32890577902052
 excuses.n      | Op-                      |   5.03419046103953 |     2.45888667993668
 excuses.n      | Jp- Dmc-                 |   4.23024629056454 |     2.70990481825512
 excuses.n      | Op- TOn+                 |   1.90192013978957 |     3.86318980988967
 excuses.n      | Op- AN- TOn+             |   1.79344245046377 |      3.9479150280805
 excuses.n      | Opn-                     |   1.65557911992073 |     4.06331052106999

 foes.n         | Op- Dmc-                    |    7.72758442535996 |     3.08401721340472
 foes.n         | Jp-                         |    5.78156289178878 |     3.50257518460873
 foes.n         | Jp- Dmc-                    |    8.53048111009413 |     3.55652688759394
 foes.n         | Op-                         |    4.24155412614344 |     3.94944175213513

 warnings.n     | Op-                                 |    13.1191083714365 |     2.73150374115749
 warnings.n     | Op- Dmc-                            |    12.4493113420905 |     2.80710747394272
 warnings.n     | Jp- Dmc-                            |    8.38247973471882 |     3.37772441764546
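
One way to read the listing above is to ask which disjuncts the five words
of cluster469 have in common. The sketch below does that with plain set
operations. It is only an illustration: the sets hold just the top rows
quoted in these notes (not the full contents of the disjuncts table), the
counts are ignored, and the variable names are made up for the example.

```python
# Top disjuncts per word, copied from the listing above (counts omitted).
top_disjuncts = {
    "bets.n": {"Jp- Dmc-", "Op-", "Op- Dmc-", "Jp- A- MXp+ MXp+",
               "Jp- A-", "Op- Mv+"},
    "doubts.n": {"Op-", "Op- Dmc-", "Jp- A-", "Op- A-", "Opt-", "Jp- Dmc-"},
    "excuses.n": {"Op- Dmc-", "Op-", "Jp- Dmc-", "Op- TOn+",
                  "Op- AN- TOn+", "Opn-"},
    "foes.n": {"Op- Dmc-", "Jp-", "Jp- Dmc-", "Op-"},
    "warnings.n": {"Op-", "Op- Dmc-", "Jp- Dmc-"},
}

# Disjuncts that appear in the quoted rows of every word in the cluster.
shared = set.intersection(*top_disjuncts.values())
print("shared:", sorted(shared))

# Disjuncts of each word that are not shared by the whole cluster.
for word, disjuncts in top_disjuncts.items():
    print(word, sorted(disjuncts - shared))
```

On the rows quoted here, three disjuncts (Op-, Op- Dmc- and Jp- Dmc-) are
common to all five words, and warnings.n has nothing beyond those three.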

Here's another curious one:

cluster992
   banker.n
   fisherman.n
   illustrator.n
   lyricist.n
   mechanic.n
   periodical.n
   psychiatrist.n
   sculptor.n

All are from words.n.1, so this cluster does not broaden coverage ... but
very nearly all of them are professions!

 mechanic.n     | Js- Ds-                              |    13.7642659600825 |     2.88500665850946
 mechanic.n     | Os- Ds- AN-                          |    7.06177791953084 |     3.84783097573959
 mechanic.n     | AN+                                  |    6.95599334826693 |     3.86960587427955
 mechanic.n     | Js- Ds- AN-                          |    6.24886311846786 |     4.02426868916609
 mechanic.n     | Ost- Ds- R+ Bs+                      |    5.70536887645721 |     4.15554226141072

 fisherman.n    | Ost- Ds-                           |    6.96868003904821 |     3.15873229404902
 fisherman.n    | Js- Ds-                            |    6.63831343245697 |     3.22880096148911
 fisherman.n    | Ost- Ds- A-                        |    5.21447241306305 |     3.57709641838825
 fisherman.n    | AN+                                |    5.15915525704624 |     3.59248284744609

 illustrator.n  | Js- Ds-                     |    23.8048364557326 |     2.60514384269322
 illustrator.n  | Ost- Ds-                    |    16.1435659294952 |     3.55061043888198
 illustrator.n  | Ost- Ds- A-                 |    12.5473636660028 |      3.5717400794719
 illustrator.n  | Ost- Ds- R+ Bs+             |    6.37835476174951 |     4.43506613246927
 illustrator.n  | Ost- Ds- AN-                |    6.57567423582078 |     4.43628494235792
 illustrator.n  | AN+                         |    5.92789142578842 |     4.54073145072105

 periodical.n   | AN+                            |    13.523933645105 |     2.25884735662492
 periodical.n   | Ost- Ds- R+ Bs+                |   4.69391736388206 |     3.78549785079099
 periodical.n   | Os- Ds-                        |   3.54950597882271 |     4.18867205040734
 periodical.n   | Os- Ds- Mv+                    |    4.4908520579338 |     4.32151671611172
 periodical.n   | Js- Ds-                        |   3.46312434598804 |     4.51594109897193

Examined 1165 clusters, recorded 626
Examined 13026 words, and 2218422 disjuncts
Average 11.181116 words/cluster; average 3543.805112 dj's/recorded-cluster

real	3m42.396s
user	3m35.157s

recorded 628
recorded 622

Examined 1165 clusters, recorded 622
Examined 12952 words, and 2239866 disjuncts
Average 11.117597 words/cluster; average 3601.070740 dj's/recorded-cluster
Got 74 mismatch warnings
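
The two averages in each summary can be checked by hand: the words/cluster
figure matches words divided by the clusters examined, and the dj's figure
matches disjuncts divided by the clusters recorded. The sketch below redoes
that arithmetic from the totals quoted above; the function name
run_averages is hypothetical and is not part of the tool that produced
the summaries.

```python
def run_averages(clusters_examined, clusters_recorded, words, disjuncts):
    """Return (words per examined cluster, disjuncts per recorded cluster)."""
    return words / clusters_examined, disjuncts / clusters_recorded


# (clusters examined, clusters recorded, words, disjuncts) for the two runs.
runs = [
    (1165, 626, 13026, 2218422),
    (1165, 622, 12952, 2239866),
]

for run in runs:
    words_per_cluster, djs_per_recorded = run_averages(*run)
    print(f"Average {words_per_cluster:.6f} words/cluster; "
          f"average {djs_per_recorded:.6f} dj's/recorded-cluster")
```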

fixes w/o: 226   w/: 225
biolg w/o: 38    w/: 38


To get the full-length list:

Disjunct *d1 = build_disjuncts_for_dict_node(dn);  -- but this is obsolete ...
free_disjuncts(d1);

Instead, use build_sentence_disjuncts(), which uses build_disjuncts_for_X_node().


Make floating point:
in build-disjuncts.c == done
todo -- build_disjuncts_for_X_node == done
build_clause == done
build_disjunct == done
build_sentence_disjuncts -- preparation.c

but preparation.c ...

prepare_to_parse from api.c
sentence_parse
and retry from link-parser with more null counts.

======================
Historical trends:

enwiki/A: grep  -- version 4.3.5
num_skipped_words= * | wc  773352
num_skipped_words="0" 388819  or 50.3%
num_skipped_words="1" 148214  or 19.2%
num_skipped_words="2"  83234  or 10.8%
num_skipped_words="3"  43957  or  5.7%
num_skipped_words="4"  28998  or  3.8%
num_skipped_words="5"  19677  or  2.5%

enwiki/E: grep  -- version 4.3.5 or so
num_skipped_words= * | wc 980218
num_skipped_words="0" 479076 or 48.9%
num_skipped_words="1" 190183 or 19.4%
num_skipped_words="2" 107265 or 10.9%
num_skipped_words="3"  56875 or  5.8%
num_skipped_words="4"  39240 or  4.0%
num_skipped_words="5"  27431 or  2.8%

enwiki/J: grep  -- version 4.5.7 or so
num_skipped_words= * | wc 1744284
num_skipped_words="0" 914187 or 52.4%
num_skipped_words="1" 332653 or 19.1%
num_skipped_words="2" 176185 or 10.1%
num_skipped_words="3"  87241 or  5.0%
num_skipped_words="4"  57509 or  3.3%
num_skipped_words="5"  38483 or  2.2%
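
Each percentage above is the count for one num_skipped_words value divided
by the total on the "wc" line of the same run. The sketch below recomputes
the shares from the counts quoted in these notes; the function name
skipped_word_shares and the layout of the runs table are choices made for
this example only.

```python
def skipped_word_shares(total, counts):
    """Map each num_skipped_words value to its share of `total`, in percent."""
    return {skipped: 100.0 * n / total for skipped, n in counts.items()}


# run name -> (total sentences, {num_skipped_words: sentence count})
runs = {
    "enwiki/A": (773352, {0: 388819, 1: 148214, 2: 83234,
                          3: 43957, 4: 28998, 5: 19677}),
    "enwiki/E": (980218, {0: 479076, 1: 190183, 2: 107265,
                          3: 56875, 4: 39240, 5: 27431}),
    "enwiki/J": (1744284, {0: 914187, 1: 332653, 2: 176185,
                           3: 87241, 4: 57509, 5: 38483}),
}

for name, (total, counts) in runs.items():
    shares = skipped_word_shares(total, counts)
    row = "  ".join(f"{k}: {v:4.1f}%" for k, v in shares.items())
    print(f"{name}  {row}")
```

Printing the three runs side by side makes the trend easier to compare
than the separate grep listings.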