NAME
    CHANGES - Revision history for WordNet::Similarity

DESCRIPTION
  Version 2.07 (Released 10/05/2015)
    (1) Fix make test error in lesktrace.t due to overlap results returning
        in unpredictable orders - problem is documented here :
        <https://rt.cpan.org/Ticket/Display.html?id=86437> and fix is
        provided by Phil Goetz, philgoetz@gmail.com and involves sorting
        overlaps in lesk.pm to guarantee order in testing. Note that keys
        had to be regenerated after this fix was installed, using
        perl t/trace.t --key (TDP)

    (2) Install patch to fix WordNet version detection issues in Windows.
        Problem description and patch provided here :
        <https://rt.cpan.org/Ticket/Display.html?id=79065>

    (3) add doc/update-pod.sh in order to create plain text documentation
        (TDP)

    (4) fix WordNet download location in install.pod (TDP)

    (5) update prereqs in Makefile.PL (TDP)

  Version 2.05 (Released 06/16/2008)
    (1) Created new module WordNet::Similarity::FrequencyCounter containing
        common support code for information content programs. (Sid)

    (2) Updated all the frequency counting programs in /utils (*Freq.pl) to
        use the common code in WordNet::Similarity::FrequencyCounter. (Sid)

    (3) Changed the default path to Perl from /usr/local/bin to /usr/bin in
        all scripts and tests in the package. (Sid)

    (4) Fixed incorrect handling of BNC header information. (Sid)

    (5) Modified the compoundify() method in WordNet::Tools to include
        compounds containing special characters (period, hyphen,
        forward-slash, single-quote). (Sid)

    (6) Updated compoundify() to handle larger compounds. (Sid)

    * 04/23/08

        (1) Fixed the "excessive ROOTs" bug in *Freq.pl. (Sid)

        (2) Fixed the extra verb concept counts in *Freq.pl. (Sid)

  Version 2.04 (Released 04/19/2008)
    * 04/17/08

        (1) Reorganized similarity_server initialization. (Sid)

        (2) The similarity server now prints more intuitive messages.
            (Sid)

        (3) Attached timestamps to log messages. (Sid)

        (4) Added additional checks to input strings from clients. (Sid)

    * 04/12/08

        (1) Added more detailed description of information content to
            rawtextFreq.pl, and made minor copy editing and formatting
            changes to other /utils files (TDP)

        (2) Made minor copy editing and formatting changes to files in /doc
            (TDP)

    * 04/10/08

        (1) Moved get_wn_info, stem and vectorFile modules under WordNet,
            i.e., they are now WordNet::get_wn_info, WordNet::stem and
            WordNet::vectorFile. (Sid)

        (2) Updated all the modules and programs using the above modules.
            (Sid)

        (3) Added copyright notices in all module and program headers. (Sid)

        (4) Added method getCompoundsList() to WordNet::Tools. (Sid)

        (5) Made a more distributable version of similarity_server. The
            similarity_server is now "daemonized", and is installed in
            /usr/bin along with the other utils. (Sid)

    * 03/23/08

        (1) Added SIGNATURE to distribution to enable package verification.
            (Sid)

        (2) Updated MANIFEST to reflect new SIGNATURE. (Sid)

        (3) Set the LICENSE to gpl in META.yml and Makefile.PL. (Sid)

    * 03/17/08

        (1) Added NO_META option to Makefile.PL to prevent automatic
            generation of META.yml during 'make dist'. (Sid)

        (2) Removed unused variable "loaded" from Makefile.PL. (Sid)

  Version 2.03 (Released 03/11/2008)
    * 03/07/08

        (1) Removed all references to WordNet::QueryData from Makefile.PL.
            This is based on the following advice present in the
            ExtUtils::MakeMaker documentation: "Module installation tools
            have ways of resolving unmet dependencies but to do that they
            need a Makefile". By checking for the presence of
            WordNet::QueryData during 'perl Makefile.PL', we were preventing
            any opportunity for automated dependency resolution.
            (Sid)

        (2) The WordNet path (if specified by the WNHOME option during 'perl
            Makefile.PL') is no longer checked for validity beforehand, and
            is now directly provided as-is to build/Infocontent.PL and
            build/Depthfiles.PL. In case of a WNHOME error, now 'make'
            should fail instead of 'perl Makefile.PL' (which is more
            appropriate). (Sid)

        (3) Corrected a typo in DepthFinder.pm synopsis that referred to
            getTaxonomyRoot rather than getTaxonomies. Removed some cut and
            paste documentation from the template used for GlossFinder.pm
            and PathFinder.pm (Ted)

        (4) Made synopsis examples WordNet version independent by not hard
            coding offsets, etc. Did this in DepthFinder.pm, PathFinder.pm,
            ICFinder, and GlossFinder.pm (Ted)

        (5) Made minor changes in path names and file names in the /samples
            directory and the /config-files subdirectory. (Ted)

  Version 2.02 (Released 03/04/2008)
    * 03/04/08

        (1) Applied patch from Ben Haskell to fix a bug report (submitted by
            Quang Do Xuan) about failing self-similarity of tilde#n#1 using
            wup and lch measures. (Sid)

        (2) Added tests for above bug to t/wup.t and t/lch.t. (Sid)

        (3) Added WordNet::Similarity package version info to similarity.pl
            --version. (Sid)

    * 01/31/08

        (1) Changed some default options in the similarity_server.conf
            configuration. (Sid)

        (2) Reformatted some of the similarity_server code. (Sid)

    * 01/10/08

        (1) Reduced version requirements of some of the PREREQ_PM modules.
            (Sid)

        (2) Changed WordNet::QueryData requirements to v1.40 in the
            documentation. (Sid)

  Version 2.01 (Released 10/14/2007)
    * 10/13/07

        (1) Fixed error in loading WordNet::Tools for similarity_server.pl.
            (Sid)

        (2) Removed the use of default (hardcoded) stoplist and word-vectors
            file for similarity_server.pl.
            (Sid)

        (3) Print WordNet hash-code instead of WordNet version, for
            similarity.cgi WordNet version information. (Sid)

    * 10/09/07

        (1) Updated the Pathfinder code to handle loops in the WordNet is-a
            hierarchy (like the one in WN3.0). (Sid)

        (2) Updated MANIFEST, changelog and documentation to reflect the new
            changes. (Sid)

    * 10/08/07

        (1) The modules now are not dependent on the version() method of
            WordNet::QueryData (which is no longer reliable). Instead they
            now use a 'hash-code' representing a specific WordNet
            distribution. (Sid)

        (2) Added module WordNet::Tools which provides the hashCode and
            compoundify methods used by most of the other modules and
            utilities. (Sid)

        (3) Completely modified the build procedure to generate data files
            during the 'make' step instead of the 'perl Makefile.PL' step.
            (Sid)

        (4) Removed the WordNet version numbers appended to synsetdepths.dat
            and treedepths.dat. (Sid)

        (5) Added two "build" utilities -- build/Infocontent.PL and
            build/Depthfiles.PL -- which are run during the 'make' step to
            generate data files. (Sid)

        (6) The default WordNet version is now v3.0. Changed all
            documentation, code and examples to reflect this. (Sid)

        (7) The package now requires WordNet::QueryData version 1.46 or
            above. (Sid)

        (8) Revised all tests and test-keys for the new code and new version
            of WordNet and QueryData. (Sid)

        (9) Removed the multiple pieces of code implementing "compoundify"
            and moved it all into a single method in WordNet::Tools. (Sid)

    * 10/04/07

        (1) Included a default word vectors file in the distribution and
            eliminated the creation of a default word vectors file at
            install time. (Sid)

    * 02/25/07

        (1) Fixed documentation where module WordNet::Similarity::path was
            referred to as WordNet::Similarity::edge (old name).
            (Sid)

    * 01/30/07

        (1) Fixed wnDepths.pl man-page to display the wnpath option
            consistently in the usage and the description. (Sid)

        (2) Fixed the "deep recursion" error (only with WN3.0) in the
            findWPSDepths() subroutine in the wnDepths.pl script. (Sid)

  Version 1.04 (Released 12/13/2006)
    * 12/13/06

        (1) Fixed major bug reported in vector_pairs, where every alternate
            function is skipped because of a loop variable being incremented
            twice. (Sid)

    * 04/21/06

        (1) The web-interface was still not working for the vector measure,
            because only one side of the client-server interface had been
            updated. Updated the similarity server with code to support
            both the vector and vector_pairs measures. (Sid)

        (2) Updated the description of the Gloss Vector measure in
            measures.html (web interface). (Sid)

  Version 1.03 (Released 04/14/2006)
    * 04/14/06

        (1) Applied Ben Haskell's patch to ICFinder.pm (to make the
            behaviour of the probability() and IC() functions consistent
            with their comments).

    * 04/05/06

        (1) Updated the names for the Extended Gloss Overlaps measure and
            the Gloss Vector measure in the documentation. (Sid)

    * 02/19/06

        (1) Updated PODs for all modules. (Sid)

        (2) Added tests for POD errors and for POD coverage. (Sid)

    * 03/31/06

        (1) Changed "hash-style" constants (Perl v5.8) to single line
            constants (Perl v5.6) for compatibility with Perl v5.6.0. (Sid)

  Version 1.02 (Released 02/07/2006)
    * 02/06/06

        (1) Added utility rankFormat.pl for ranking the output of
            similarity.pl and making the output suitable for input to
            rank.pl (to compute Spearman's correlation coefficient) of the
            Text::NSP package. (Sid)

    * 01/15/06

        (1) Fixed issue in lesk.pm where undefined values for $wc1 and $wc2
            caused errors with the normalize option.
            (Sid)

        (2) Fixed minor UI issues in wnDepths.pl. (Sid)

  Version 1.01 (Released 12/21/2005)
    * 12/09/05

        (1) Modified get_wn_info.pm with Wybo Wiersma's changes. (Sid)

        (2) Modified lesk.pm, vector.pm and vector_pairs.pm to be compatible
            with above changes. (Sid)

    * 12/07/05

        (1) Updated all utilities to use WordNet 2.1 (WordNet::QueryData
            1.39 or above). (Sid)

        (2) Updated all modules and test cases for WordNet 2.1. (Sid)

    * 12/05/05

        (1) Changed order of authors in package documentation. (Sid)

  Version 0.16 (Released 12/12/2005)
    * 12/01/05

        (1) Added Wybo Wiersma's super-gloss caching code to GlossFinder.pm.
            (Sid)

        (2) Updated documentation to reflect above changes. (Sid)

  Version 0.15 (Re-released 12/11/2005)
    * 12/11/05

        (1) The tar file for the June 12 release of v 0.15 unpacked as
            WordNet-Similarity; it now unpacks as WordNet-Similarity-0.15,
            which is consistent with all previous versions. (Ted)

        (2) Similarity.pm version was shown as 0.14, is now 0.15. Our
            general convention for modules is that their version number only
            changes when the module itself changes, so the module version
            number can tell you when a module last changed.
            However, for Similarity.pm this is needlessly confusing, so it
            will always carry the same version number as the release. (Ted)

  Version 0.15 (Released 6/12/2005)
    * 06/10/05

        (1) Fixed a minor bug in MANIFEST. (Sid)

        (2) Updated modules.pod and developers.pod to reflect new software
            architecture. (Jason)

  Version 0.14 (Released 6/9/2005)
    * 06/08/05

        (1) Re-introduced the previous (non-pairwise-comparison) vector.
            (Sid)

        (2) Updated documentation and test cases to support the new vector
            measure. (Sid)

        (3) Added default relation file for new vector measure. (Sid)

        (4) Expunged erroneous references to LCSFinder, esp.
            in test scripts. (JM)

  Version 0.13 (Released 5/9/2005)
    * 04/21/05

        (1) removed LCSFinder module; moved LCS methods to DepthFinder,
            ICFinder, and PathFinder (JM)

        (2) renamed vector measure vector_pairs (JM)

    * 03/24/05

        (1) Modified the documentation to reflect the relation file format
            for vector and for lesk. (Sid)

    * 03/02/05

        (1) Set up selective test cases for "make test", depending upon the
            default data files installed by user. (Sid)

    * 02/24/05

        (1) Reinstated default relation files for vector and lesk. In case
            the default relation files (vector-relation.dat and
            lesk-relation.dat) are missing, both modules would default to
            the glosexample-glosexample relation. (Sid)

        (2) Modified Makefile.PL to query the user before installing default
            data files. (Sid)

        (3) Removed infocontent file generation code from Makefile.PL. Now
            Makefile.PL simply calls utilities from the /utils directory
            (wnDepths.pl, semCorFreq.pl and wordVectors.pl) to generate
            all the default data files. (Sid)

        (4) Installation process now generates a default word vectors file.
            The vectordb configuration variable for vector is now optional.
            (Sid)

        (5) Earlier, the WNHOME option was given to Makefile.PL as --WNHOME
            <path>, whereas the PREFIX option was written as PREFIX=<path>.
            This inconsistent (and potentially confusing) notation has now
            been fixed. Now, the WNHOME option is provided to Makefile.PL as
            WNHOME=<path>. (Sid)

        (6) Added some basic tests for vector in t/vector.t.

    * 12/11/04

        (1) Created WordNet::Similarity::GlossFinder.pm, a super-class of
            WordNet::Similarity::vector and WordNet::Similarity::lesk. (Sid)

        (2) Removed default relation file for lesk. Vector and lesk both
            default to glosexample-glosexample.
            (Sid)

  Version 0.12 (Released 10/29/04)
    * 10/29/04

        (1) Added vector to the CGI interface. (JM)

        (2) Incorporated a configuration file into similarity_server.pl.
            (JM)

    * 10/28/04

        (1) Removed readDB.pl. (JM)

    * 10/27/04

        (1) Modified string overlap finding in lesk to use the
            Text::OverlapFinder module. Removed string_compare.pm. This
            fixed an old bug where the relatedness of word1 and word2 wasn't
            always equal to the relatedness of word2 and word1. (JM)

        (2) Updated Makefile.PL, INSTALL, and doc/install.pod to reflect new
            dependency on Text::OverlapFinder. (JM)

        (3) Removed lib/dbInterface.pm and lib/string_compare.pm from
            MANIFEST. (JM)

    * 10/19/04

        (1) Word vectors are no longer stored in a BerkeleyDB database; a
            plain text file is now used. Modified wordVectors.pl,
            WordNet::Similarity::vector to use the plain text word vectors
            file. New module vectorFile.pm now used to access this plain
            text database. Module dbInterface.pm is obsolete. (Sid)

        (2) Modified Makefile.PL to no longer check for BerkeleyDB
            dependency. All modules are installed. (Sid)

  Version 0.11 (Released 09/23/04)
    * 09/23/04

        (1) Fixed bug in wup that allowed some relatedness scores to be
            greater than 1. This bug is discussed in the archives of the
            mailing list. (JM)

  Version 0.10 (Released 09/03/04)
    * 09/01/04

        (1) Modified vector to look like the other measures. It now is
            derived from WordNet::Similarity.pm. (Sid)

        (2) Updated the MANIFEST. (Sid)

        (3) Fixed some minor typos in Makefile.PL. (Sid)

        (4) Added single test case (for vector) to t/access.t. (Sid)

        (5) Fixed config option name conflict in WordNet::Similarity.pm.
            (JM)

        (6) Fixed WNHOME and WNSEARCHDIR related bugs. (JM)

        (7) Updated documentation for the web interface.
            (JM)

  Version 0.09 (Released 05/19/04)
    * 05/19/04

        (1) Fixed over-counting problem in *Freq.pl programs. Under certain
            conditions, word senses would sometimes get counted twice. (JM)

        (2) Updated *Freq.pl programs to use WordNet 2.0. (JM)

        (3) Input files to rawtextFreq.pl are now specified with the
            --infile option. (JM)

        (4) Improved speed of compound identification in rawtextFreq.pl by
            adding ',', ';', and ':' to the list of characters that we
            consider to be the end of a sentence (compound identification
            time is proportional to the square of the length of the
            sentence). (JM)

  Version 0.08 (Released 04/28/04)
    * 04/28/2004

        (1) Created a CGI-based web interface for the relatedness modules.
            (JM)

    * 04/19/2004

        (1) Fixed problem with path to Perl interpreter in Makefile.PL. This
            was causing problems during installation if there was no
            /usr/local/bin/perl. (JM)

        (2) wnDepths.pl had forgotten that on Windows some filenames are
            different; for example, data.noun is noun.dat. (JM)

  Version 0.07 (Released 03/24/04)
    * 03/23/2004

        (1) In /t, save diff files between 0.06 and 0.07. Make sure to run
            diff tests for path/0.07 and edge/0.06.

    * 03/16/2004

        (1) make sure that every .pm and .pl file has the same GNU copyleft
            language. Use PathFinder.pm as a template.

        (2) make sure that documentation is clear that vector and lesk
            require different format relation files (i.e., they are not
            interchangeable).

        (3) convert README into a series of pod documents in doc directory.
            In the intro.pod, provide a table of contents like structure
            (much like perldoc perl does).

            Make sure that each pod document follows the cpan style (name,
            synopsis, etc.) This should be true of any pod documentation in
            the package.

        (4) Modify INSTALL to describe local install correctly.
            In particular, the description of how to do a 'use lib' or -I
            may need adjustment.

    * 03/12/2004

        (1) Make developers.pod into a self contained document that provides
            a step by step tutorial on how to write a measure of
            relatedness. The file NewStats.txt in NSP provides an example of
            the style of presentation that is expected.

        (2) developers.pod should be a tutorial that explains how to create
            a new measure. It should take the reader through a complete
            example, such as creating a measure that returns the sum of the
            information content of the concepts found in the shortest path
            between two concepts. This should include an example of how to
            use all of the available configuration options, and also adding
            a new one.

    * 03/11/2004

        (1) document measure modules (lch.pm, wup.pm, etc.) with information
            about effect of hypo root node. (Take discussion from email
            explaining why it has an effect, and why it doesn't have an
            effect) and make it a part of the .pm perldoc. This will
            eventually be used in thesis writing, so it should be complete
            and detailed. Of particular importance is the behavior of lch.pm,
            but all of the modules should have their expected behaviour with
            and without the hypo root node clearly documented. Also, you
            should note what the behavior was in 0.06 for both nouns and
            verbs, and if this has changed.

    * 03/09/2004

        (1) lch.pm does not yet support not having a hypo root. Remember
            that the lack of hypo root will change (potentially) the max
            path length found for each taxonomy.

    * 03/08/2004

        (1) depth finding code should be contained within DepthFinder.pm. We
            should not do any depth finding on the fly; rather, that should
            all be precomputed (like we do info content). That includes the
            depth of individual concepts, and the max depths of taxonomies.

        (2) When wup.pm encounters two or more paths to the root, the trace
            output "condenses" those paths into a single path. It would be
            better to show all paths in the trace (as res does, for
            example). Also, make sure that the depth reported in such cases
            is always the minimum (shortest path to root).

    * 03/05/2004

        (1) Modify wnDepths such that it shows both the depths of individual
            concepts, as well as the max distance from a root node. In the
            case of multiple inheritance, wnDepths should show the depth of
            the concept in each case, and also the relevant root node.
            wnDepths should sort these depths from shortest to longest. The
            output of wnDepths should be formatted like infocontent.dat,
            anticipating an eventual merger.

    * 03/02/2004

        (1) in docs, update/replace current discussion of modules. Include
            example usage as well. Make sure that path length is clearly
            defined for lch, edge, and wup.

    * 02/25/2004

        (1) In PathFinder.pm, Infocontent.pm, Similarity.pm, and
            LCSFinder.pm each function should be documented in perldoc form
            such that their input, output and basic functionality is
            described. This should then appear in the DESCRIPTION portion of
            the perldoc. The SYNOPSIS should contain examples or templates
            of each function being used.

    * 02/23/2004

        (1) redo random pairs testing such that we have 60 noun-noun pairs,
            25 verb-verb pairs, and 15 mixed pairs.

    * 02/20/2004

        (1) Revisit the distance versus similarity issue in jcn.pm. It may
            be that simply inverting the distance is too extreme a solution.
            One possibility is to make it a linear transformation via
            maxdist - dist instead. (JM - we'll stick with inverting the
            distance, but added a discussion of this issue to the
            documentation)

    * 02/18/2004

        (1) document all multiple inheritance issues that are being handled
            for measures.

    * 02/16/2004

        (1) validateSynset should check wps format fairly closely, and issue
            descriptive errors if the wps is ill formed. Words can
            apparently be about anything (except #) but pos should be lower
            case nvra, and senses should be digits. Error messages should
            point out which field is the problem, or if there are too few or
            too many fields.

        (2) place all hypo root handling node code in PathFinder.pm. The
            measures should not have any hypo root handling code in them.

        (3) PathFinder.pm should include a function getAllPaths.pm that
            returns all paths between two concepts, their length, and their
            "tops" (the candidate LCSs). This should be used as the main
            source of input for the getLCS* functions, and for
            getShortestPath.

        (4) remove all "input verification" code from the measures. That
            should be inherited from Similarity.pm.

        (5) There is replicated code in the measure modules that checks
            validity of input. This should be removed to a common module
            that can be called by all of the measures. Any other replicated
            code should be removed as well. The goal of 0.07 is to largely
            eliminate replicated code via the use of inheritance, and to
            make the writing of new measures simpler.

    * 02/13/2004

        (1) add pod/perldoc to lib/ICFinder.pm. Should also be done for all
            other files as they are modified for other reasons. In
            particular, introductory material that appears in source code
            comments, author information, GPL, etc. should be moved into pod
            and removed from source code comments. See similarity.pl for an
            example.

        (2) path should use getShortestPath from PathFinder.pm.

    * 02/09/2004

        (1) getLCSDepth, getLCSInfo, getLCSPath should appear in
            LCSFinder.pm, which should inherit from both ICFinder and
            PathFinder.

        (2) The measures (lch, path, jcn, lin, res, wup) should default to
            having the hypo root node turned on (for both nouns and verbs).
            This will eventually be true of hso, but is not currently. hypo
            root nodes could also be used for lesk and vector, although they
            are not currently.

    * 02/04/2004

        (1) Wps and offsets will be supported internally. The user can
            request either mode via an option to getRelatedness. Offset is
            our default. Profiling has shown wps to be somewhat faster, in
            that it makes fewer calls to getSense, although it does make
            some. For input, we only support wps. For trace output we
            support wps and offset. For output we support wps and offset.

    * 01/29/2004

        (1) modify option in config files such that an option without a
            value reverts to the default in all cases (except vectordb).

    * 01/24/2004

        (1) Provide support for undefined values in the path finding and
            info content measures (path, wup, lch, res, lin, jcn). If two
            concepts are not in the same taxonomy then an error should be
            issued and a large negative integer should be returned. This can
            occur in two cases, between the same part of speech (noun-noun,
            verb-verb), or between nouns and verbs. Distinct error messages
            should be indicated in both cases.

    * 01/20/2004

        (1) Clean up configuration file examples (in samples). Make them
            consistent by having a master list (all-options.conf) that is
            what we make changes to. Then specific example files can be
            created via copy and paste. Make sure all possible options for a
            measure are included, and that the explanations describe all
            possible values as well as default handling. (TDP updated
            all-options.conf on 12/10/03, use this as source of cut and
            paste).

    * 01/19/2004

        (1) Create test scripts that can be run to verify the correctness of
            output - they should include "correct" answers that can be
            compared to (automatically) and rerun as the system changes. We
            should use the CPAN module Test::More, and create .t files in a
            /t directory that test specific situations/problems, etc. The .t
            files themselves should be documented with an explanation of
            what is being tested. We should have lots of smaller, specific
            .t tests (rather than a few big test files). Whenever a bug is
            found and fixed, a .t file should be created that tests the fix,
            and this should be mentioned in the source code comments where
            the fix is made (this fix is tested by t/xyz.t).

            Make sure that the testing system can be easily
            extended/modified, and that it can support the use of multiple
            input files and configuration files. We should have multiple *.t
            files to run our tests, and each module and utility should have
            at least its own *.t file (maybe more than one in some cases).
            We should also have *.t files that are dedicated to particular
            situations that affect a number of measures (like what happens
            when info content is zero for one concept, what happens if one
            of the concepts being compared is the lcs of the other, what if
            the two concepts are the same (self similarity), and so forth).

        (2) Test cases for configuration file handling should include:

            repeated options in configuration file, as in

                trace::0
                trace::1

            bad values in configuration file, as in

                trace::nothankyou

            bad options in configuration file, as in

                tracer::0

        (3) Test cases for similarity.pl should include:

            ill formed file input for similarity.pl, as in

                cat#dog#1 cat#n#2
                cat#n#n cat#n#2
                cat

        (4) Test cases for measures should include:

            show that wps and offset methods of path finding are equivalent

            check trace output for each of the measures. use wps format, as
            that is subject to fewer changes than offsets.

            a "big" file of word pairs (maybe 100 pairs) that run all the
            measures and compare values to what is obtained in 0.6. If there
            are differences, let's see what they are.

        (5) Test cases for information content programs should include:

            an information content file based on one of our resident text
            files that is large enough to be interesting (readme, gpl, etc.)
            as computed in 0.6/0.7 (should be the same). This can be used as
            a reference point when we make changes in future.

            Information content computed with a very small number of
            concepts, to expose the counting problem that Ted mentions
            below.

        (6) Test cases for wnDepth...

            Generate output for 0.07 to use as a point of reference. A few
            specific manual checks would be good too (leather_carp, entity,
            etc.)

        (7) run tests to determine where the system now provides different
            results from version 0.06 - make sure to document these cases
            (that are different).

    * 01/12/2004

        (1) document configuration options extensively in a separate pod
            called doc/config.pod. Organize such that you have options that
            are used with all measures, and then those that are used with
            certain classes of measures.
            Then, use this as a master copy to update .pm files with.

    * 01/09/2004

        (1) modify option handling such that multiple occurrences of an
            option in a config file cause an error. For example

                trace::
                trace::1

            should cause an error.

    * 12/17/2003

        (1) SemCor1.7Freq.pl and SemTagFreq.pl need to be renamed. They are
            now called semCorRawFreq.pl and SemCorFreq.pl. semCorRawFreq.pl
            counts without sense tags and SemCorFreq.pl counts the sense
            tags. (TDP)

    * 12/09/2003

        (1) In similarity.pl cache error strings that indicate that two
            input synsets are from different parts of speech so that we only
            print out a warning once for each unique word1#pos1 word2#pos2
            combination (JM)

        (2)

            (a) Enhance similarity.pl file handling (for input files).
                Comments should be allowed - this will help in creation of
                test data (we can explain in the comment what "case" is
                being tested by a particular set of pairs). Use standard
                Perl commenting style: a line starting with a # is a
                comment. Note that I don't think we can use the convention
                of # anywhere in a line as being the start of a comment
                (due to w#p#s) but I think any line that starts with a #
                can be safely treated as a comment. (JM -- we are using //
                to indicate the start of a comment)

            (b) Enhance similarity.pl file handling (for input files). At
                present if a single word (not a pair) appears on a line, no
                error is issued. It silently ignores this case. This should
                result in an error to the effect that the input format is
                invalid, only one word. Also, I'm not sure what happens if
                you have more than two words on a line. An error of some
                sort would also be necessary in that case. Also, I am not
                sure if similarity.pl checks to see that the word pairs are
                "well formed", that is to say do they adhere to the word,
                word#pos, or word#pos#number format.
                It would be good to have a simple check that verifies we
                have alphanumeric words, pos of n, v, a, or r, and numeric
                numbers. (JM)

    * 12/08/2003

        (1) Clean up configuration file examples (in samples). Make them
            consistent by having a master list (all-options.conf) that is
            what we make changes to. Then specific example files can be
            created via copy and paste. Make sure all possible options for a
            measure are included, and that the explanations describe all
            possible values as well as default handling. (JM)(TDP updated
            all-options.conf on 12/10/03, use this as source of cut and
            paste).

        (2) Determine if it is feasible (not too difficult or time
            consuming) to modify --version option so it can display both the
            version of similarity.pl and the version of the module used when
            --type is specified. (JM -- version will show module version as
            well if a module is specified)

    * 12/05/2003

        (1) all configuration options are now printed to traceString after
            module initialization. (JM)

        (2) explain the distinction between compounds and collocations
            raised in sample README. (Drop the distinction, and clarify what
            we mean by WordNet compounds. TDP Dec 3). (JM)

    * 12/04/2003

        (1) document caching for random (normally random uses an unlimited
            cache size) (JM -- random now uses the same default as all other
            measures)

        (2) determine a reasonable default cache size. Should not be
            unlimited. Current default is 1000, maybe it can be increased to
            5000 or 10000. Let lesk with trace be the standard as to what is
            reasonable. (JM -- default is now 5,000).

        (3) Improve error handling when processing config files. Make sure
            the values specified are valid and that filenames refer to
            extant files. All options should allow the value to be omitted,
            in which case the default is used.
            (JM)

    * 12/01/2003

        (1) Adjust Makefile.PL to account for new contents of samples
            directory. Added entries to MANIFEST as well. (JM)

        (2) update samples/sample.pl to run with the new files (and
            organization) provided in the samples directory. This was also a
            problem in 0.06, where it did not run for hso properly due to a
            mismatch in the name specified in sample.pl and the
            configuration file.

        (3) Rename infocontent.dat in Makefile.PL to use our standard name
            for semcor information content files. Name should reflect
            options used in computing information content values (if any).
            JM

        (4) relation.dat is in lib/WordNet. Should be referred to as
            lesk-relation.dat. Should also have vector-relation.dat I would
            think. (if not, what does vector do?). JM (vector doesn't try
            finding a default relation file--it fails silently).

        (5) /sample/vector-relation.dat is wrong. Calls itself
            LeskRelationFile. JM

        (6) In intro.pod, provide instruction on how to convert to html or
            whatever if user wishes (just point them to documentation that
            describes this elsewhere even). JM

    * 11/28/2003

        (1) remove wordnet 1.7.1 compounds from samples directory. (TDP)

        (2) change comment in Similarity.pm to explain the pluses and
            minuses of using/not using a unique root node.
            (JM)

    * 11/26/2003

        (1) added info content files in samples/Infocontent

        (2) changed version numbers to 0.07 in all modules and utils

        (3) fixed bug in wup: if user supplies car#n#1 and auto#n#1, the LCS
            found by wup is motor_vehicle#n#1, not car#n#1

        (4) added POD to all programs in /samples

    * 11/24/2003

        (1) added documentation (in the form of POD) to /doc

    * 11/21/2003

        (1) added /doc directory to contain documentation

    * 11/18/2003

        (1) ensured that each measure initializes a part-of-speech list in
            _initialize

        (2) all measures (except vector) now use fetchFromCache and
            storeToCache

        (3) updated README:

            (a) Replaced most references to WordNet 1.7.1 with 2.0

            (b) Added some documentation on how to write a new measure

        (4) added an INSTALL file

        (5) cleaned up /samples. relation.dat is now named lesk-relation.dat
            and added vector-relation.dat. A sample config file is also
            provided for each measure (in /samples/config-files)

    * 11/15/2003

        (1) updated jcn, hso, random, and lesk to use the functions that
            have been moved to Similarity.pm (such as the cache management
            functions).

        (2) cleaned up the /samples directory. Removed outdated files. Put
            sample config files in samples/config-files. Added README in
            /samples.

    * 11/12/2003

        (1) Added fetchFromCache() and storeToCache() to Similarity.pm to
            make caching easier and cleaner.

        (2) Updated wup, edge, lch, res, and lin to use fetchFromCache() and
            storeToCache().

    * 10/25/2003

        (1) Reduced the amount of duplicated code in the measure modules by
            moving some common code to WordNet::Similarity.
            WordNet::Similarity is now a base class for all the measures.
            Also added a module called infocontent.pm from which all
            information content measures are descended (i.e., res, lin,
            jcn).

        (2) Removed @ symbol from all email addresses in all files (I
            think). This might help keep spammers from harvesting our email
            addresses.

  Version 0.06
    * 10/18/2003

        (1) Removed dependence of the vector measure on PDL. Implemented
            "in-house" sparse vector manipulation functions.

        (2) Modified the README with updated documentation of similarity.pl
            (--interact option) and wordVectors.pl.

    * 10/15/2003

        (1) Changed Makefile.PL so that it checks for version 1.30 of
            QueryData

    * 10/13/2003

        (1) Added "maxCacheSize" option to all measures.

        (2) Added "maxCacheSize" option info to the man/pod documentation.

        (3) Used the new dataPath() method of QueryData 1.31 in all the
            utilities to obtain the path of the WordNet data files.

        (4) Modified Makefile.PL to check for PDL and BerkeleyDB dependency
            during installation. vector.pm is not installed on failed
            dependencies.

    * 10/11/2003

        (1) Replaced instances of deprecated WordNet::QueryData::query with
            WordNet::QueryData::queryWord in hso.pm

        (2) made hso.pm check QueryData version. queryWord was broken in
            QueryData 1.29 and earlier

        (3) added support for new relations in WordNet 2.0 to get_wn_info.pm

        (4) updated test scripts to work with WN 2.0 (and WN 1.7.1)

    * 10/06/2003

        (1) Added rootNode option to wup.pm

    * 09/27/2003

        (1) Fixed syntax error in wordVectors.pl.

        (2) Added readDB.pl to utils.

        (3) Changed contact information in docs.

        (4) Re-organized the samples subdirectory.

        (5) Fixed typo in random.pm.

        (6) Updated the MANIFEST.

    * 09/21/2003

        (1) Updated POD for WordNet::Similarity::wup

        (2) Added option to wup to specify a cache size in a configuration
            file.

        (3) similarity.pl now 'use's QueryData 1.30 or later.
            Previous versions of QueryData will not work. t/access.t also
            'use's QueryData 1.30. get_wn_info.pm and lesk.pm both check for
            QueryData 1.30 and will die if it is not found.

        (4) Reorganized the bibliography in README and slightly re-worded
            part of the introduction.

    * 09/18/2003

        (1) Added new Wu Palmer measure of similarity
            (lib/WordNet/Similarity/wup.pm)

        (2) Updated README to mention wup

        (3) Added t/wup.t

        (4) Updated POD for WordNet::Similarity to mention wup

        (5) Updated the help message of similarity.pl to mention wup

        (6) Added t/wup.t and lib/WordNet/Similarity/wup.pm to MANIFEST

    * 09/05/2003

        (1) Added '--interact' option to similarity.pl.

        (2) Changed the structure of the Vector Relation File.

        (3) Fixed a minor bug in similarity.pl. (s///g)

        (4) Updated the perldocs for the measures.

        (5) Incorporated some new features into the 'wordVectors.pl'
            utility. These features were used for thesis experiments.

        (6) Added documentation about the Lesk and Vector relation files
            (they have different formats now).

  Version 0.05
    * 06/03/2003

        (1) Added new measure of semantic relatedness, based on
            co-occurrence vectors of WordNet glosses.

        (2) Set up the package so that similarity.pl and the other perl
            utilities get installed in "/usr/local/bin".

        (3) Complete rewrite of similarity.pl with cleaner code and added
            functionality:

            (a) Multiple parts of speech can be specified as car#nv (noun
                and verb forms of car) or cool#nar (noun, adjective and
                adverb forms of cool).

            (b) Word senses can now be specified as car#n#2, jump#v#2, etc.

            (c) Added functionality to similarity.pl to use a local install
                of WordNet::Similarity modules (in non-standard
                directories).

            (d) Output of similarity.pl now specifies the senses that
                represent the relatedness of two words.

        (4) Enforced limit on the cache size of modules.

        (5) Updated README to reflect the changes and to specify options for
            local installs of similarity.pl and the other utilities.

        (6) Fixed the perl docs (remove leading spaces).

        (7) Added mailing list address to documentation --
            (http://groups.yahoo.com/group/wn-similarity).

        (8) Improved jcn and lin tracing ("bird-crane" problem obvious now).

        (9) Added new utility wordVectors.pl required for
            WordNet::Similarity::vector module.

  Version 0.04
    * 05/02/2003

        (1) *Fixed* newline in traces.

        (2) *Fixed* blank line bug in brownFreq.pl.

        (3) *Fixed* "--offset" option bug in similarity.pl.

        (4) *Fixed* lin measure non-normalized scores... added zero
            infocontent handling in jcn and lin.

        (5) New utility rawtextFreq.pl, to generate information content
            files from plain text.

        (6) similarity.pl supports option to specify part-of-speech of input
            words while measuring relatedness.

        (7) Added option to specify (configuration / information content)
            file in similarity.pl.

        (8) Added Resnik counting option to the information content
            generation utilities.

        (9) More documentation on information content utilities.

        (10)
            Added Add-1 smoothing option to the information content
            generation utilities.

  Version 0.03
    * 03/10/2003

        (1) Removed trace bug in hso.pm.

        (2) Added test cases for all modules.

  Version 0.01
    * 02/10/2003

        (1) Created CPAN modules from distance ver 0.11.

        (2) Modules are completely object oriented.

        (3) Added Adapted Lesk semantic relatedness measure -- lesk.pm.

        (4) Added simple edge counting semantic relatedness measure --
            edge.pm.

        (5) Added a random relatedness measure -- random.pm.

        (6) jcn, res and lin measures now support verb hierarchies.

        (7) Information content files can now be specified as parameters to
            the modules.

        (8) Tools provided to build information content files from various
            publicly available corpora.

        (9) Various parameters now control the behavior of the modules.
            These parameters are passed to the modules through
            'configuration files'.

AUTHORS
    Ted Pedersen, University of Minnesota, Duluth
    tpederse at d.umn.edu

    Siddharth Patwardhan, University of Utah, Salt Lake City
    sidd at cs.utah.edu

    Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
    banerjee+ at cs.cmu.edu

    Jason Michelizzi

SEE ALSO
    todo.pod

COPYRIGHT
    Copyright (c) 2005, Ted Pedersen, Siddharth Patwardhan, Satanjeev
    Banerjee and Jason Michelizzi

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.