2        README file for sgrep version 1.91-alpha -
3	a tool to search and index text, SGML, XML and HTML files using
4	structured patterns
5
6        Copyright (C) 1998  University of Helsinki,
7                            Department of Computer Science
8
9        Authors: Jani Jaakkola           Jani.Jaakkola@cs.helsinki.fi
10                 Pekka Kilpelainen       Pekka.Kilpelainen@cs.helsinki.fi
11
12---------------------------------------------------------------------------
13
14This README file is intended for describing the new features of
15sgrep-1.91a. If you want to know what sgrep is and what the old features
16are, see:
17http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
18
19See the section "NEW QUERY LANGUAGE FEATURES" for description of the
20new operators available in version 1.91a.
21
22Sgrep-1.91 supports 16-bit wide characters and Unicode in XML-documents.
23See the section "WIDE CHARACTER SUPPORT" for information on wide
24characters and UTF-8 and UTF-16 encodings.
25
26This file (and newer versions of this file) is available from
27http://www.cs.helsinki.fi/~jjaakkol/sgrep/README.txt
28
29Sgrep is distributed under GNU General Public License. See file COPYING
30for details.
31
32This piece of software is still under development. This means that:
33- New features might be included before final sgrep-2.0 release.
34- Existing features might be changed.
35- It is guaranteed to have bugs.
36- All suggestions are welcome.
37- All available documentation of the new features is contained in
38  this file.
39
40---------------------------------------------------------------------------
41NEW FEATURES
42---------------------------------------------------------------------------
43
44Major new features since sgrep-1.0 which are already present:
45- Indexing of both structure and content.
46- SGML/XML/HTML scanner.
47- Official Win32 binary.
48- sgtool has been dumped. It never really worked and even when it
49  did, it wasn't very useful.
50- Should be completely compatible with older versions of sgrep.
- Sgrep now supports direct containment. In the SGML and XML world this
  means that you can query children or parents of given elements.
53- Sgrep uses GNU autoconf
54- Also the sources are now available
55- Operators for supporting direct containment
56- Nearness operators
57- 16-bit wide characters and Unicode support
58
59Features which will be present in sgrep-2.0:
60- Proper documentation
61- Support for querying notations, element type declarations and
62  attribute list declarations inside SGML/XML document prolog
63- Scanning of all well-formed XML-documents.
64
Features which probably won't be present in sgrep-2.0:
- Regular expressions, since they are probably better handled by other
  software, like Perl. However, sgrep still needs some new options for
  better Perl support.
69
70---------------------------------------------------------------------------
71Win32-BINARY RELEASE
72---------------------------------------------------------------------------
73
74The Win32-binary release contains both sgrep binary and m4 binary.
75Sgrep binary is compiled with MSVC and requires no additional libraries.
76
Please note that the examples in this README file and in the sgrep
WWW-pages have been written using sh shell-syntax. When you use sgrep
under the Windows shell "COMMAND.COM", you have to either use the -f option
or translate the query from:
81
82% sgrep 'word("foo") or word("bar")' foobar
83
84to
85
86C:\> sgrep "word(\"foo\") or word(\"bar\")" foobar
87
88Alternatively, you can install bash from the Cygnus Cygwin project.
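
The quoting translation shown above can also be produced mechanically.
The following Python sketch is not part of sgrep. It assumes that the
Win32 binary parses its command line with the Microsoft C runtime rules
(plausible for an MSVC build, but an assumption), and it relies on
subprocess.list2cmdline, an undocumented helper of the Python standard
library.

```python
import shlex
import subprocess


def sh_command(argv):
    """Quote an argument list for an sh-compatible shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


def windows_command(argv):
    """Quote an argument list following the Microsoft C runtime rules."""
    return subprocess.list2cmdline(argv)


# The query from the example above, as a plain argument list.
argv = ["sgrep", 'word("foo") or word("bar")', "foobar"]
print(sh_command(argv))
print(windows_command(argv))
```

Keeping the query as an argument list and quoting it only at the last
moment avoids maintaining two hand-written versions of the same command.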
89
90The m4 binary comes from the Cygnus Cygwin project. See
91http://sourceware.cygnus.com/cygwin/ for details.
92Included binary release of m4 requires the cygwin.dll DLL-library.
93Both of them are distributed under GNU General Public License (GPL).
94See file COPYING for details.
95
96---------------------------------------------------------------------------
97SGML-SCANNER
98---------------------------------------------------------------------------
99
100Sgrep has a built-in scanner for XML, SGML and HTML-documents. This means
101that complex macros for querying SGML-files are no longer
102needed. However, sgrep still does not contain a full blown
103SGML-parser: the thing which it does contain could be described as an
104SGML-scanner. It does not recognize any syntax errors, it does
105not provide a parse tree and it does not provide any event stream.
106It just recognizes regions from SGML-files corresponding to different
107SGML tokens: start tags, end tags, attributes, etc.
108
109Since version 1.90a the SGML-scanner maintains an element stack. This
110is needed for the ability to support direct containment in queries
111to SGML/XML-files. Query language primitive 'elements' returns
112all elements of queried XML/SGML-documents. (see the 'childrening'
113and 'parenting' operators for examples).
114
115Since version 1.91a sgrep has support for 16-bit wide characters in
116query terms and support for UTF-8 and UTF-16 encodings in the
117SGML-scanner. See the "WIDE CHARACTER SUPPORT" below.
118
119SGML has many features which make it very difficult to parse. The
120SGML-scanner implemented in sgrep does not attempt to be a complete
121and error free SGML-parser; valid SGML-documents might confuse it.
122However, my goal is that all well formed XML-documents will be
123parsed correctly.
124
125The scanner has two modes:
126- SGML/HTML-mode
127  o Names are case insensitive
128  o PIs end with '>'
129- XML-mode
130  o Names are case sensitive
131  o PIs end with '?>'
132
133Sgrep will recognize empty XML elements (<ELEMENT/>) in both
134modes.
135
136The scanner does not automatically include entity references.
137However, it can automatically add external parsed entities
138defined in the internal document type definition subset to scanned files.
E.g. if you have a line
	<!ENTITY chapter1 SYSTEM "chapter1.sgml">
in your document, the scanner can automatically add the file
"chapter1.sgml" to the list of scanned files when it sees
this line in the internal document type definition subset.
To use this feature you need to use the "-g include-entities" option.
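
To illustrate what kind of declaration is picked up, here is a small
Python sketch that lists the system identifiers of SYSTEM entity
declarations in a document. It is only an illustration and not how sgrep
implements the feature: the regular expression and the function name are
my own, it only handles double-quoted system identifiers, and it does not
check that the declaration is inside the internal subset.

```python
import re

# Matches declarations such as: <!ENTITY chapter1 SYSTEM "chapter1.sgml">
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]*)"\s*>')


def external_entity_files(document_text):
    """Return (entity name, system id) pairs found in document_text."""
    return ENTITY_DECL.findall(document_text)


sample = '<!DOCTYPE book [\n<!ENTITY chapter1 SYSTEM "chapter1.sgml">\n]>'
for name, system_id in external_entity_files(sample):
    print(name, system_id)
```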
145
146---------------------------------------------------------------------------
147WIDE CHARACTER SUPPORT
148---------------------------------------------------------------------------
149
150Sgrep version 1.91a introduces 16-bit wide character support in index terms
151and in the SGML-parser.
152
Since the sgrep query language is still strictly 8-bit, wide characters
in queries need to be encoded. I chose to use an encoding which looks
just like character entity references in SGML: "\#<decimal number>;"
for a character number in decimal and "\#x<hex number>;" for a character
number in hexadecimal. For example, the ISO-8859-1 letter a with two
dots on top of it ('ä', assuming you are reading this file with an
ISO-8859-1 font) is the '&auml;' entity in HTML, '&#228;' as a decimal
character reference and '&#xe4;' as a hexadecimal character reference.
In an sgrep query it can be encoded either as "\#228;" or as "\#xe4;".
162
So the Finnish word "älämölö" ("&auml;l&auml;m&ouml;l&ouml;" in HTML) can
be queried either with a query like 'word("älämölö")', since the sgrep query
language supports 8-bit characters, or with an encoded query like
'word("\#228;l\#228;m\#246;l\#246;")' or 'word("\#xe4;l\#xe4;m\#xf6;l\#xf6;")'.
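
The encoding described above is easy to generate. This Python sketch
(the function names are my own, sgrep does not ship such a tool) replaces
every non-ASCII character of a term with the decimal or hexadecimal
escape. Characters above 16 bits are rejected, since sgrep-1.91 only
supports 16-bit wide characters.

```python
def _check_16_bit(ch):
    """Reject characters that do not fit in 16 bits."""
    if ord(ch) > 0xFFFF:
        raise ValueError("character %r does not fit in 16 bits" % ch)


def encode_term(text):
    """Encode non-ASCII characters as \\#<decimal number>; escapes."""
    out = []
    for ch in text:
        _check_16_bit(ch)
        out.append(ch if ord(ch) < 128 else "\\#%d;" % ord(ch))
    return "".join(out)


def encode_term_hex(text):
    """Encode non-ASCII characters as \\#x<hex number>; escapes."""
    out = []
    for ch in text:
        _check_16_bit(ch)
        out.append(ch if ord(ch) < 128 else "\\#x%x;" % ord(ch))
    return "".join(out)


# Build a complete query for a term containing non-ASCII letters.
query = 'word("%s")' % encode_term("\u00e4l\u00e4m\u00f6l\u00f6")
print(query)
```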
167
168The SGML-parser supports UTF-8 and UTF-16 encodings. You can select
169the encoding with the -g option:
170
171- "-g encoding=utf-8" selects UTF-8 encoding. This is the default if you
172  are using the SGML-scanner in XML-mode (with -g xml option).
- "-g encoding=utf-16" selects UTF-16 encoding. Note that currently (in
  version 1.91a), this is a synonym for "-g encoding=utf-8",
  since sgrep switches automatically from UTF-8 mode to UTF-16 mode when
  it sees the byte order mark (this also means that you must have the byte
  order mark if you are using UTF-16).
- "-g encoding=iso-8859-1" selects ISO-8859-1 encoding, which is also the
  default encoding when the SGML-scanner is in any other mode than XML.
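
The switch from UTF-8 to UTF-16 on a byte order mark can be sketched in
a few lines of Python. This is an illustration of the idea only, not
sgrep's actual code; the function name and the returned encoding names
are my own choices.

```python
import codecs


def detect_encoding(data, default="utf-8"):
    """Pick UTF-16 when data starts with a byte order mark, else default."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return default


print(detect_encoding(b"<?xml version='1.0'?>"))
print(detect_encoding(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")))
```

A UTF-16 file without a byte order mark falls through to the default,
which matches the requirement stated above that the mark must be present.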
180
181The SGML-scanner recognizes character entity references currently only
182in character data content. Character entity references in attribute values
183or entity literals are not recognized. No other entity references
184than character entity references are expanded, not even "&amp;", "&gt;"
and "&lt;". I plan to fix this before the next release.
186
187The XML-scanner recognizes the encoding parameter in XML-declarations
188and can switch encoding accordingly (if not overridden with -g encoding
189option). Currently "us-ascii", "iso-8859-1", "utf-8" and "utf-16"
190encodings are recognized. Note that in XML-mode the SGML-parser
interprets all characters classified as "Letter" in the XML-specification
192as word characters by default.
193
194Currently only the SGML-scanner is aware of different encodings. The
195output module does not do any conversions: it just dumps the result
regions from query files exactly as they were encoded there, even when
197different files use different encodings (this probably needs to be
198fixed).
199
Here is an example using Murata Makoto's example XML-documents in Japanese
(see http://www.oasis-open.org/cover/xmlJapaneseExamples.html ).
The Unicode character 0x771f represents the word Murata in Japanese.
203
204% sgrep -o"%f:%l\n" -g xml 'word("\#x771f")' pr-xml-little-endian.xml pr-xml-utf-16.xml pr-xml-utf-8.xml weekly-utf-8.xml
205pr-xml-little-endian.xml:2
206pr-xml-utf-16.xml:2
207pr-xml-utf-8.xml:3
208
209---------------------------------------------------------------------------
210NEW QUERY LANGUAGE FEATURES
211---------------------------------------------------------------------------
212
213The example file "example.sgml" and its DTD "example.dtd" are
214included in this distribution.
215
216New query language features in version 1.91a and later:
217
218* near(distance)
219
Finds regions of the left hand side and the right hand side having at most
'distance' bytes between them.
'A near(0) B' would return regions of A and B which "touch" each other
(in other words, there are no bytes between them). I know that using bytes
is not the best way to measure distance in a text search engine, but the
way sgrep works makes this kind of query very fast. If you really need a
nearness operator with words as a measure of distance, you could use
'join(distance,word("*")) containing A containing B'. However, this query
would take much more time and memory to evaluate.

In this example, I use Jon Bosak's religious texts as example material:
231
232% sgrep -x index -o"%r\n" 'word("jesus") near(20) word("peter")'
233Jesus was come into Peter
234Jesus taketh Peter
235Peter, and said unto Jesus
236Jesus taketh with him Peter
237Peter said unto Jesus
238Jesus unto Peter
239Peter followed Jesus
240Jesus loved saith unto Peter
241Jesus saith to Simon Peter
242Peter, an apostle of Jesus
243% sgrep -x index -o"%r\n" 'word("adam") near(30) word("eve") near(40) word("cain")'
244Adam knew Eve his wife; and she conceived, and bare Cain
245
246* near_before(distance)
247
Works just like 'near', except that 'near_before' requires the regions
in the left hand side to occur before regions in the right hand side.
250
251% sgrep -x index -o"%r\n" 'word("peter") near_before(20) word("jesus")'
252Peter, and said unto Jesus
253Peter said unto Jesus
254Peter followed Jesus
255Peter, an apostle of Jesus
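
The byte distance used by these two operators can be illustrated with a
small Python model. It is a naive sketch and not sgrep's algorithm:
regions are assumed to be (start, end) pairs of inclusive byte offsets,
every pair of regions is compared, and overlapping pairs are simply
ignored. How sgrep itself pairs regions may differ.

```python
def gap(a, b):
    """Number of bytes strictly between region a and a later region b."""
    return b[0] - a[1] - 1


def near_before(left, right, distance):
    """Combine left regions followed by a right region within distance."""
    found = set()
    for a in left:
        for b in right:
            if 0 <= gap(a, b) <= distance:
                found.add((a[0], b[1]))
    return sorted(found)


def near(left, right, distance):
    """Like near_before, but the regions may occur in either order."""
    both = near_before(left, right, distance) + near_before(right, left, distance)
    return sorted(set(both))


# Offsets in the 21-byte text "Peter said unto Jesus".
peter = [(0, 4)]
jesus = [(16, 20)]
print(near(jesus, peter, 20))
print(near_before(jesus, peter, 20))
```

In this model the gap between the two words is 11 bytes, so 'near' finds
the pair in either operand order, while 'near_before' only finds it when
the word "Peter" is on the left hand side.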
256
257New query language features in version 1.90a and later:
258
259* elements
260
261Returns all SGML-elements. This example counts all elements from input
262documents:
263
264% sgrep -c elements sgreptest.sgml
26514
266
267* parenting
268
269Works like old "containing" operator, except that parenting returns
270left hand side regions directly containing right hand side regions
271instead of all regions containing right hand side regions.
272NOTE: parenting works right only if the left hand side expression
273does not contain overlapping regions (which is guaranteed, if the
274left hand side regions correspond to SGML-elements).
275
276% sgrep 'elements parenting word("Peletier")' sgreptest.sgml
277<CITEREF RID="rf38">Peletier et al. (1994)</CITEREF>
278
279* childrening
280
281Works like old "in" operator, except that childrening returns left hand
282side regions directly contained in right hand side regions
283instead of all regions contained in right hand side regions.
This example counts the child elements of the SGREPTEST-element:
285
286% sgrep -c 'elements childrening (
287	stag("SGREPTEST") .. etag("SGREPTEST"))' sgreptest.sgml
28813
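
The difference between direct containment and ordinary containment can
be modelled in Python as follows. This is only my simplified reading of
the two operators, not sgrep's implementation: regions are (start, end)
pairs of inclusive byte offsets, containment is taken to be strict, and
the left hand side is assumed to be properly nested (as SGML-elements
are).

```python
def contains(outer, inner):
    """True if outer contains inner and the two regions are not identical."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def parenting(left, right):
    """Left regions which are the innermost left region around a right region."""
    found = set()
    for r in right:
        around = [l for l in left if contains(l, r)]
        if around:
            # The candidates are nested, so the shortest one is the innermost.
            found.add(min(around, key=lambda l: l[1] - l[0]))
    return sorted(found)


def childrening(left, right):
    """Left regions inside a right region with no left region in between."""
    found = set()
    for l in left:
        for r in right:
            between = any(contains(r, m) and contains(m, l) for m in left)
            if contains(r, l) and not between:
                found.add(l)
    return sorted(found)


# A hypothetical document: a root element with two children, one grandchild.
root, child_a, grandchild, child_b = (0, 99), (10, 40), (15, 20), (50, 80)
elements = [root, child_a, grandchild, child_b]
print(parenting(elements, [(16, 18)]))
print(childrening(elements, [root]))
```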
289
* first(n, expression) and last(n, expression)

The first-operator selects the first n regions of the regions returned
by expression and the last-operator selects the last n regions of
the regions returned by expression.

This query selects the first child element of the last child element of
the third ACT-element from a file containing the word "Hamlet"
(in other words, the TITLE of the last SCENE of the third ACT of Shakespeare's
famous PLAY "The Tragedy of Hamlet, Prince of Denmark"):
300
301% cat test/childrening
302first(1,
303  elements childrening
304    last(1,
305      elements childrening
306        last(1,first(3,
307          stag("ACT") .. etag("ACT") in (file("*") containing word("Hamlet"))
308        )
309      )
310    )
311)
312% sgrep -x hamlet-index -f test/childrening
313<TITLE>SCENE IV.  The Queen's closet.</TITLE>
314%
315
* first_bytes(n, expression) and last_bytes(n, expression)

Operator first_bytes(n,expression) truncates all regions returned from
expression to n-byte length starting from the region's start point.
Operator last_bytes(n,expression) truncates all regions returned from
expression to n-byte length ending at the region's end point.

This example returns the start tags of the SGREPTEST-element's children:
324
325% sgrep 'stag("*") containing first_bytes(1,elements childrening (stag("SGREPTEST") .. etag("SGREPTEST")))' sgreptest.sgml
326<Partno><Partno><Partno><Partno><partno><partno><partno><partno><partno><partno><partno><CITEREF RID="rf38"><AUTHOR>
327
This query returns the end tags of the SGREPTEST-element's children:
329
330% sgrep 'etag("*") containing last_bytes(1,elements childrening (stag("SGREPTEST") .. etag("SGREPTEST")))' sgreptest.sgml
331</partno></partno></partno></partno></partno></partno></partno></partno></PARTNO></PARTNO></PARTNO></CITEREF></AUTHOR>
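
The four selection operators can be modelled with simple list
operations. The Python sketch below is an illustration under my own
assumptions: regions are (start, end) pairs of inclusive byte offsets in
document order, and a region shorter than n bytes is left unchanged
(I have not verified how sgrep treats that case).

```python
def first(n, regions):
    """The first n regions."""
    return regions[:n]


def last(n, regions):
    """The last n regions."""
    return regions[-n:] if n > 0 else []


def first_bytes(n, regions):
    """Each region cut down to its first n bytes."""
    return [(start, min(end, start + n - 1)) for start, end in regions]


def last_bytes(n, regions):
    """Each region cut down to its last n bytes."""
    return [(max(start, end - n + 1), end) for start, end in regions]


acts = [(0, 9), (20, 29), (40, 49), (60, 69)]
print(last(1, first(3, acts)))      # the third region
print(first_bytes(1, [(10, 20)]))   # the first byte of a region
print(last_bytes(1, [(10, 20)]))    # the last byte of a region
```

The one-byte regions explain the two examples above: the first byte of
an element lies inside its start tag and the last byte inside its end
tag, so 'containing' picks exactly those tags.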
332
New query features in version 1.70a and later:
334
335* file("filename")
336
337Returns the region containing the named file.
338
339* pi("PITarget")
340
341Returns the regions containing the processing instructions beginning with
342the given PI target.
343
344% sgrep 'pi("example_pi")' example.sgml
345<?example_pi processing instruction>
346
347* attribute("attribute name")
348
349Returns the regions containing the named attribute.
350
351% sgrep 'attribute("ATT1")' example.sgml
352att1="value1"
353
354* attvalue("attribute value")
355
356Returns the regions containing the given attribute value.
357
358% sgrep 'attvalue("value2")' example.sgml
359"value2"
360% sgrep 'attribute("*") containing attvalue("value2")' example.sgml
361att2="value2"
362
363* stag("GI")
364
365Returns the regions containing the start tags with the given GI.
366
367% sgrep 'stag("EXAMPLE")' example.sgml
368<EXAMPLE att1="value1" att2="value2">
369
370* etag("GI")
371
372Returns the regions containing the end tags with the given GI.
373
374% sgrep 'etag("EXAMPLE")' example.sgml
375</EXAMPLE>
376
377* word("word")
378
379Returns the regions containing the given word.
380N.B: A query word("foo") does not recognize occurrences of word "foo"
381inside comments. (See operators comment and comment_word  below.)
382
383% sgrep 'word("example")' example.sgml
384example
385% sgrep '"\n"_."\n" containing word("example")' example.sgml
386This is an example SGML file to demonstrate new features in sgrep-2.0.
387
388* comments
389
390Returns the regions containing all SGML comments.
391
392% sgrep 'comments' example.sgml
393<!-- comment --><!-- another comment -->
394
395* comment_word("comment word")
396
397Returns the region containing the given word inside comments.
398
399% sgrep 'comment_word("another")' example.sgml
400another
401% sgrep 'comments containing comment_word("another")' example.sgml
402<!-- another comment -->
403
404* cdata
405
406Returns regions containing CDATA marked sections. Sgrep recognizes
407words also inside CDATA marked sections.
408
409% sgrep 'cdata' example.sgml
410<![CDATA[ <CDATA> <marked> &section ]]><![CDATA[ another marked section ]]>
411% sgrep 'cdata containing word("another")' example.sgml
412<![CDATA[ another marked section ]]>
413
414* entity("entity name")
415
416Returns the regions containing references to the given entity.
417(Entity references are currently recognized only in PCDATA.)
418
419% sgrep 'entity("entity1")' example.sgml
420&entity1;
% sgrep 'stag("ELEM1") .. etag("ELEM1") containing entity("entity1")' example.sgml
423<ELEM1>&entity1;</ELEM1>
424
425* doctype("doctype name")
426
427Returns the regions containing given document type name inside the document
428type declaration.
429
430% sgrep 'doctype("*")' example.sgml
431EXAMPLE
432
433* doctype_pid("publicid")
434
435Returns the regions containing the given document type public id inside
436document type declarations.
437
438% sgrep 'doctype_pid("*")' example.sgml
439-//SID//DTD sgrep example//EN
440
441* doctype_sid("systemid")
442
443Returns the regions containing the given document type system id inside
444document type declarations.
445
446% sgrep 'doctype_sid("ex*")' example.sgml
447example.dtd
448
449* entity_declaration("entity name")
450
451Returns the regions containing the declaration of the given entity name.
452
453% sgrep 'entity_declaration("entity1")' example.sgml
454<!ENTITY entity1 "literal value">
455
456* entity_literal("entity name")
457
458Returns the regions containing the literal value of the given entity name.
459
460% sgrep 'entity_literal("entity1")' example.sgml
461literal value
462
463* entity_pid("entity public id")
464
465Returns the regions containing the given public id of an entity within its
466declaration.
467
468% sgrep 'entity_pid("*")' example.sgml
469-//SID//NONSGML entity example//EN
470
471* entity_sid("entity system id")
472
473Returns the regions containing the given system id inside an entity
474declaration.
475
476% sgrep 'entity_sid("*")' example.sgml
477figure.file
478
479* entity_ndata("notation name")
480
481Returns the regions containing the given notation name inside entity
482declarations.
483
484% sgrep 'entity_ndata("*")' example.sgml
485anotation
% sgrep 'entity_declaration("*") containing entity_ndata("ANOTATION")' example.sgml
488<!ENTITY figure PUBLIC "-//SID//NONSGML entity example//EN"
489                "figure.file" NDATA anotation>
490
491
492* prologs
493
494Returns the regions containing document prologs.
495
496% sgrep 'pi("*") in prologs' example.sgml
497<?pi inside prolog>
498
499--------------------------------------------------------------------
500 INDEXING
501--------------------------------------------------------------------
502
503Sgrep supports indexing of both structure and content of SGML, HTML
504and XML documents. Indexing is implemented by creating a separate index
505file, which contains a list of terms and of regions corresponding to these
506terms. This means that if you want to output the actual content of
507result regions you have to keep the original files around.
508
If the query uses only the new SGML query features, the same query should
return the same results regardless of whether an index was used or not.
511
512Indexes are stored in compressed binary files. Depending on the indexed
513material the index file size is 30-60% of the original files. This is
514not so bad, since in theory you can produce the full content of
515the original files from the index file, except for whitespace and
516punctuation marks.
517
The maximum size of the indexed data is currently 2 gigabytes. If you
want to index larger collections, you have to split the index
into multiple index files. However, for optimal performance the
index file size should be smaller than your available RAM.
522
Sgrep is switched to indexing mode by giving "-I" as the first option on
the command line. Command 'sgrep -I -h' will give you a summary of the
available indexing options:
526
527<CLIP>
528Usage: (sgindex | sgrep -I) <options> <files...>
529Use 'sgrep -h' for help on query mode options.
530
531Indexing mode options are:
532  -C              display copyright notice
533  -h              help (means this text)
534  -i              fold all words to lower case when indexing
535  -T              show statistics about created index files
536  -V              display version information
537  -v              verbose mode. Shows what is going on
538  -c <index file> create new index file
539  -F <file>       read list of input files from <file> instead of command line
540  -g <option>     set scanner option:
541      sgml        use SGML scanner
542      html        use HTML scanner (currently same as sgml scanner)
543      xml         use XML scanner
544      sgml-debug  show recognized SGML tokens
545      include-entities  automatically include system entities
546  -l <limit>      make a list of possible stopwords
547  -L <stop file>  write possible stopwords to file
548  -S <stop file>  read stop word list from file
549  -m <megabytes>  main memory available for indexing in megabytes
550  -w <char list>  set the list of characters used to recognize words
551        --              no more options
552
553Copyright (C) 1998 University of Helsinki. Use sgindex -C for details,
554</CLIP>
555
556Options -C, -h, and -V should be self explanatory.
557
558Indexes are created using the -c option and giving a file list to index.
559The file list can be directly in the command line, or with -F option
560(see below).
561
562<CLIP>
563% sgrep -I -c demo.index demo.sgml
564% ls -l demo.*
565-rw-------   1 jjaakkol grpd        53577 Aug 24 12:52 demo.index
566-rw-------   1 jjaakkol grpd        91536 Mar  6 13:40 demo.sgml
567</CLIP>
568
569First 13 lines of the index file contain statistics about the created
570index.
571
572<CLIP>
573% head -13 demo.index
574sgrep-index v0
575
5762441 terms
57711554 entries
5781024 bytes header (1%)
5799764 bytes term index (18%)
58017222 bytes strings (32%)
581  26366 total strings
582  9144 compressed with lcps (-34%)
58325545 bytes postings (47%)
58422 bytes file list (0%)
58553577 total index size
586--
587</CLIP>
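
The figures in this header can be checked with a little arithmetic. The
Python sketch below uses only the numbers printed above and the file
sizes from the earlier 'ls -l' listing. The percentages match the header
when they are truncated rather than rounded; that sgrep truncates is my
inference from these numbers, not something I have checked in the source.

```python
def percent(part, whole):
    """Integer percentage, truncated towards zero."""
    return 100 * part // whole


sections = {
    "header": 1024,
    "term index": 9764,
    "strings": 17222,
    "postings": 25545,
    "file list": 22,
}
total = sum(sections.values())
original_size = 91536  # size of demo.sgml in bytes

for name, size in sections.items():
    print("%d bytes %s (%d%%)" % (size, name, percent(size, total)))
print("%d total index size" % total)
print("index is %d%% of the original file" % percent(total, original_size))
```

The last line puts this index inside the 30-60% range mentioned earlier
in this section.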
588
With the -F <file> option you can give the list of files to be indexed
in the named file, instead of giving it on the command line.
The file names given in the file have to be separated by newlines.
592
593The -v option gives verbose progress reports while indexing:
594
595<CLIP>
596% sgrep -I -v -c demo.index demo.sgml
597Indexing 1/1 files 64/89K (71%)
598Writing index file of 52K
599Writing index 2048/2441 entries (83%)
600</CLIP>
601
602The entries added to the index are case sensitive by default.
603For example, word("Foo") is different from word("foo").
604With option -i you can instruct sgrep to
605fold all words (only words in content or comments, not any structural
606elements like element type names or attribute values) to lowercase.
607
608With option -T you can get some statistics (some useful for debugging,
609some less useful, and some mighty cryptic) about the created index.
610
With option -L <term file> you can create a file containing a list of
all terms added to the index. Each line in the created file will contain the
number of bytes required by the term and the term itself.

This example is using the XML WWW-page from http://www.sil.org/sgml/xml.html:
616<CLIP>
617% sgrep -I -i -L terms -c xml.index xml.html
618% cat terms | sort -n | tail
6192043 eLI
6202094 wa
6212263 wto
6222543 wof
6232713 wand
6243414 eA
6253433 wxml
6264410 wthe
6275076 aHREF
6285390 sA
629</CLIP>
630
With option -S <stop word list file> you can give the indexer a stop word
list to reduce the size of the index. The stop word list consists of one
stop word per line, optionally preceded by the number of bytes
in the term (so you can actually use a part of a file originally created with
the -L option of the indexer).
636
637Here is an example using a simple English stop word list (containing words like
638"to", "and", "the" and so on) to reduce the size of the index:
639
640<CLIP>
641% sgrep -I -i -L terms -c xml.index xml.html
642% ls -l xml.index
643-rw-------   1 jjaakkol grpd       291599 Aug 24 14:13 xml.index
644% sgrep -I -i -S stoplist -c xml.index xml.html
645% ls -l xml.index
646-rw-------   1 jjaakkol grpd       259058 Aug 24 14:14 xml.index
647</CLIP>
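
Since a stop word list can be a part of a file written with -L, picking
the largest terms from such a file is enough to build one. The following
Python sketch does the same job as 'sort -n | tail' in the earlier
example. The function names are my own, and the "<bytes> <term>" line
format is taken from the -L output shown above.

```python
def parse_term_line(line):
    """Split a '<bytes> <term>' line into (bytes, term)."""
    size, term = line.split(None, 1)
    return int(size), term.rstrip("\n")


def largest_terms(lines, count):
    """Return the count largest terms as '<bytes> <term>' lines."""
    parsed = sorted(parse_term_line(line) for line in lines if line.strip())
    return ["%d %s" % (size, term) for size, term in parsed[-count:]]


terms = ["4410 wthe\n", "2043 eLI\n", "5390 sA\n", "2094 wa\n"]
for line in largest_terms(terms, 2):
    print(line)
```

The returned lines could be written to a file and passed to the indexer
with -S.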
648
Using the "-l <number>" option you can obtain information about the impact
of a stop word list on index size if all terms taking more space
than 1/number of the index were considered stop words.
652
653<CLIP>
654% sgrep -I -l 200 -i -c xml.index xml.html
655Possible stop words:
656    4K (1.74%) 'aHREF'
657    3K (1.17%) 'eA'
658    1K (0.70%) 'eLI'
659    5K (1.85%) 'sA'
660    1K (0.70%) 'sLI'
661    2K (0.72%) 'wa'
662    2K (0.93%) 'wand'
663    1K (0.62%) 'wfor'
664    1K (0.67%) 'win'
665    2K (0.87%) 'wof'
666    4K (1.51%) 'wthe'
667    2K (0.78%) 'wto'
668    3K (1.18%) 'wxml'
669-------------
670   38K (13.43%) total
671</CLIP>
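
The 1/number rule is easy to reproduce. In the Python sketch below the
function name is hypothetical. The byte counts come from the -L example
and the index size from the 'ls -l' output above, and they appear to be
consistent with the percentages printed by -l.

```python
def possible_stop_words(term_sizes, index_size, number):
    """Map each term larger than index_size/number to its share in percent."""
    limit = index_size / number
    return {
        term: 100.0 * size / index_size
        for term, size in term_sizes.items()
        if size > limit
    }


sizes = {"wthe": 4410, "sA": 5390, "aHREF": 5076, "wrare": 100}
shares = possible_stop_words(sizes, index_size=291599, number=200)
for term in sorted(shares):
    print("%.2f%% '%s'" % (shares[term], term))
```

With number 200 the limit is 0.5% of the index, which is why no term in
the listing above has a smaller share than that. The term "wrare" is a
made-up small term added to show that it is filtered out.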
672
673Using the "-g sgml" option you can select SGML mode scanner.
674
675Using the "-g xml" option you can select XML mode scanner.
676
677Using the "-g include-entities" option you can automatically include
678defined system entities to indexed files (see SGML scanner above).
679
680Using the "-g sgml-debug" option you can check the tokens which sgrep
681recognized from its input files:
682
683<CLIP>
684% sgrep -I -c demo.index -g sgml-debug sgreptest.sgml
685doctype("SGREPTEST"):dnSGREPTEST:(10,18)
686doctype_pid("-//SID//DTD Just something to test sgrep//EN"):dp-//SID//DTD Just
687something to test sgrep//EN:(29,72)
688doctype_sid("empty"):dsempty:(77,81)
689comment_word("Comment"):cComment:(93,99)
690comment(""):-:(88,110)
691pi("PI"):?PI:(111,115)
692....
693word("third"):wthird:(1391,1395)
694etag("PARTNO"):ePARTNO:(1396,1404)
695etag("SGREPTEST"):eSGREPTEST:(1406,1417)
696</CLIP>
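
If you want to process the sgml-debug output further, its lines are
regular enough to be parsed with one regular expression. This Python
sketch is based only on the output format shown above; the expression
and the function name are my own, and lines that have been wrapped by
the terminal must be joined first.

```python
import re

# Example line: word("third"):wthird:(1391,1395)
DEBUG_LINE = re.compile(r'^(\w+)\("(.*)"\):(.*):\((\d+),(\d+)\)$')


def parse_debug_line(line):
    """Return (primitive, argument, index term, (start, end)) or None."""
    match = DEBUG_LINE.match(line.strip())
    if match is None:
        return None
    primitive, argument, term, start, end = match.groups()
    return primitive, argument, term, (int(start), int(end))


parsed = parse_debug_line('etag("PARTNO"):ePARTNO:(1396,1404)')
print(parsed)
```

The region (1396,1404) is nine bytes long, the length of "</PARTNO>",
which suggests that both offsets are inclusive.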
697
With the "-m <megabytes>" option you can adjust the amount of main memory
the indexer will use for the postings spool before writing a temporary
file. The default value for the -m option is 20 megabytes. Creating the index
completely in main memory is faster than using temporary files. However, if you
give a value larger than the available main memory, sgrep will probably start
thrashing and indexing will be slow.
704
With the "-w <char list>" option you can select which characters make up
words. The default is "-w a-zA-Z". Here in Finland I would use
"-w A-Za-zÅÄÖåäö" (assuming you have an ISO-8859-1 font). If you also want to
index numbers you could use "-w A-Za-z0-9".
709
710---------------------------------------------------------------------------
711QUERIES
712---------------------------------------------------------------------------
713
714Sgrep has some new options when used in the query mode.
715
716Options "-F <file>", "-w <word characters>" and "-g <scanner option>"
717have the same functions as in the indexing mode.
718
Option "-x <index file>" is used to specify the index file when one
is used. If the -x option is used and no file list is specified either
on the command line or with the -F option, sgrep obtains the list of
queried files straight from the index.
723
724---------------------------------------------------------------------------
725EXAMPLE
726---------------------------------------------------------------------------
727
Here the input file xml.html is taken from Robin Cover's excellent
WWW-page at http://www.sil.org/sgml/xml.html

Example: Find all P elements containing the word "newsfax":
732<CLIP>
733% time sgrep -x xml.index 'stag("P") .. etag("P") containing word("newsfax")'
734<p>[August 19, 1998]  <a href="http://www.zenweb.com/robert/tools/">Tools and
735Utilities</a> from Robert Hanson: XML::Parser, LOTE NewsFax to XML Parsers,
736LOTE XML to Kingdom Summaries, XML Script Server Parser.</p>
7370.03user 0.03system 0:00.25elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
738</CLIP>
739
740The same example without using index:
741
742<CLIP>
743% time sgrep -i 'stag("P") .. etag("P") containing word("newsfax")' xml.html
744<p>[August 19, 1998]  <a href="http://www.zenweb.com/robert/tools/">Tools and
745Utilities</a> from Robert Hanson: XML::Parser, LOTE NewsFax to XML Parsers,
746LOTE XML to Kingdom Summaries, XML Script Server Parser.</p>
7470.82user 0.05system 0:01.18elapsed 73%CPU (0avgtext+0avgdata 0maxresident)k
748</CLIP>
749
750---------------------------------------------------------------------------
751ANOTHER EXAMPLE
752---------------------------------------------------------------------------
753
754This example uses Jon Bosak's XML-example material: religious texts
755and Shakespeare's works. Since this query is slightly more complex,
756it has been put together from smaller parts by using m4. File "filelist"
contains the list of all XML-example files from Bosak's collection.

Here is the file "query":
760<CLIP>
761# Finds elements having given name
762define(ELEMENT, (stag($1) .. etag($1)))
763
764# Finds LINE elements
765define(E_LINE, (ELEMENT("LINE")))
766
767# Finds SPEECH elements
768define(E_SPEECH, (ELEMENT("SPEECH")))
769
770# Finds SPEECH elements where HAMLET is speaking
771define(HAMLET_SPEAKING, (E_SPEECH containing (
772                ELEMENT("SPEAKER") containing word("HAMLET"))))
773
774# Finds LINE elements containing words to, be, not and question
775define(TOBENOTQUESTION, (E_LINE containing word("to") containing word("be")
776   containing word("not") containing word("question")))
777
778# Finds the LINE where HAMLET says the famous words
779define(HAMLET_SAYS, (TOBENOTQUESTION in HAMLET_SPEAKING))
780</CLIP>
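
If you would rather not use m4, the same query can be put together with
any language that can build strings. Here is a Python sketch that
composes an equivalent query from the same parts. The helper names are
mine, and the result is not character for character what m4 produces,
because the parentheses are placed slightly differently.

```python
def element(name):
    """Query for elements having the given name."""
    return '(stag("%s") .. etag("%s"))' % (name, name)


def containing_words(region, words):
    """Require every word in words to occur inside region."""
    for word in words:
        region += ' containing word("%s")' % word
    return "(" + region + ")"


def build_hamlet_query():
    """The LINE where HAMLET says the famous words."""
    speaker = element("SPEAKER") + ' containing word("HAMLET")'
    hamlet_speaking = "(%s containing (%s))" % (element("SPEECH"), speaker)
    line = containing_words(element("LINE"), ["to", "be", "not", "question"])
    return "(%s in %s)" % (line, hamlet_speaking)


print(build_hamlet_query())
```

The printed query can be saved to a file and given to sgrep with the -f
option.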
781
782Evaluate the query using plain search:
783
784<CLIP>
785% time sgrep -o "%f:\n %r\n" -f query -e HAMLET_SAYS -F filelist
786/xml/shakespeare.1.10.xml/hamlet.xml:
787 <LINE>To be, or not to be: that is the question:</LINE>
78816.60user 0.78system 0:18.29elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
7890inputs+0outputs (327major+455minor)pagefaults 0swaps
790</CLIP>
791
792Create an index of the input texts:
793
794<CLIP>
795% time sgrep -I -c index -v -F filelist
796Indexing 43/43 files 14957/14958K (99%)
797Writing index file of 5472K
798Writing index 35840/36691 entries (97%)
79923.65user 4.56system 0:32.77elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
8000inputs+0outputs (94major+5928minor)pagefaults 0swaps
801</CLIP>
802
803Evaluate the query using index:
804
805<CLIP>
806% time sgrep -x index -o "%f:\n %r\n" -f query -e HAMLET_SAYS -F filelist
807/xml/shakespeare.1.10.xml/hamlet.xml:
808 <LINE>To be, or not to be: that is the question:</LINE>
8091.24user 0.13system 0:01.43elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
8100inputs+0outputs (536major+728minor)pagefaults 0swaps
811</CLIP>
812
813---------------------------------------------------------------------------
814THAT'S IT! Enjoy!
815---------------------------------------------------------------------------
816
817Please send comments about sgrep-2.0 to
818Jani Jaakkola (jjaakkol@cs.helsinki.fi).
819
820
821