1
More documentation can be found in the doc/ directory or at http://www.recoll.org
3
4
5 Recoll user manual
6
7 Jean-Francois Dockes
8
9 <jfd@recoll.org>
10
11 Copyright (c) 2005-2015 Jean-Francois Dockes
12
 Permission is granted to copy, distribute and/or modify this document
 under the terms of the GNU Free Documentation License, Version 1.3 or any
 later version published by the Free Software Foundation; with no Invariant
 Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
 license can be found on the GNU web site.
18
19 This document introduces full text search notions and describes the
20 installation and use of the Recoll application. This version describes
21 Recoll 1.21.
22
23 ----------------------------------------------------------------------
24
25 Table of Contents
26
27 1. Introduction
28
29 1.1. Giving it a try
30
31 1.2. Full text search
32
33 1.3. Recoll overview
34
35 2. Indexing
36
37 2.1. Introduction
38
39 2.1.1. Indexing modes
40
41 2.1.2. Configurations, multiple indexes
42
43 2.1.3. Document types
44
45 2.1.4. Indexing failures
46
47 2.1.5. Recovery
48
49 2.2. Index storage
50
51 2.2.1. Xapian index formats
52
53 2.2.2. Security aspects
54
55 2.3. Index configuration
56
57 2.3.1. Multiple indexes
58
59 2.3.2. Index case and diacritics sensitivity
60
61 2.3.3. The index configuration GUI
62
 2.4. Indexing Web pages you visit
64
65 2.5. Extended attributes data
66
67 2.6. Importing external tags
68
69 2.7. Periodic indexing
70
71 2.7.1. Running indexing
72
73 2.7.2. Using cron to automate indexing
74
75 2.8. Real time indexing
76
77 2.8.1. Slowing down the reindexing rate for fast
78 changing files
79
80 3. Searching
81
82 3.1. Searching with the Qt graphical user interface
83
84 3.1.1. Simple search
85
86 3.1.2. The default result list
87
88 3.1.3. The result table
89
90 3.1.4. Running arbitrary commands on result
91 files (1.20 and later)
92
93 3.1.5. Displaying thumbnails
94
95 3.1.6. The preview window
96
97 3.1.7. The Query Fragments window
98
99 3.1.8. Complex/advanced search
100
101 3.1.9. The term explorer tool
102
103 3.1.10. Multiple indexes
104
105 3.1.11. Document history
106
107 3.1.12. Sorting search results and collapsing
108 duplicates
109
110 3.1.13. Search tips, shortcuts
111
112 3.1.14. Saving and restoring queries (1.21 and
113 later)
114
115 3.1.15. Customizing the search interface
116
117 3.2. Searching with the KDE KIO slave
118
119 3.2.1. What's this
120
121 3.2.2. Searchable documents
122
123 3.3. Searching on the command line
124
125 3.4. Path translations
126
127 3.5. The query language
128
129 3.5.1. Modifiers
130
131 3.6. Search case and diacritics sensitivity
132
133 3.7. Anchored searches and wildcards
134
135 3.7.1. More about wildcards
136
137 3.7.2. Anchored searches
138
139 3.8. Desktop integration
140
141 3.8.1. Hotkeying recoll
142
143 3.8.2. The KDE Kicker Recoll applet
144
145 4. Programming interface
146
147 4.1. Writing a document input handler
148
149 4.1.1. Simple input handlers
150
151 4.1.2. "Multiple" handlers
152
153 4.1.3. Telling Recoll about the handler
154
155 4.1.4. Input handler HTML output
156
157 4.1.5. Page numbers
158
159 4.2. Field data processing
160
161 4.3. API
162
163 4.3.1. Interface elements
164
165 4.3.2. Python interface
166
167 5. Installation and configuration
168
169 5.1. Installing a binary copy
170
171 5.2. Supporting packages
172
173 5.3. Building from source
174
175 5.3.1. Prerequisites
176
177 5.3.2. Building
178
179 5.3.3. Installation
180
181 5.4. Configuration overview
182
183 5.4.1. Environment variables
184
185 5.4.2. The main configuration file, recoll.conf
186
187 5.4.3. The fields file
188
189 5.4.4. The mimemap file
190
191 5.4.5. The mimeconf file
192
193 5.4.6. The mimeview file
194
195 5.4.7. The ptrans file
196
197 5.4.8. Examples of configuration adjustments
198
Chapter 1. Introduction
200
1.1. Giving it a try
202
203 If you do not like reading manuals (who does?) but wish to give Recoll a
204 try, just install the application and start the recoll graphical user
205 interface (GUI), which will ask permission to index your home directory by
206 default, allowing you to search immediately after indexing completes.
207
208 Do not do this if your home directory contains a huge number of documents
209 and you do not want to wait or are very short on disk space. In this case,
210 you may first want to customize the configuration to restrict the indexed
211 area (for the very impatient with a completed package install, from the
212 recoll GUI: Preferences -> Indexing configuration, then adjust the Top
213 directories section).
214
215 Also be aware that you may need to install the appropriate supporting
216 applications for document types that need them (for example antiword for
217 Microsoft Word files).
218
1.2. Full text search
220
221 Recoll is a full text search application. Full text search finds your data
222 by content rather than by external attributes (like a file name). You
223 specify words (terms) which should or should not appear in the text you
224 are looking for, and receive in return a list of matching documents,
225 ordered so that the most relevant documents will appear first.
226
227 You do not need to remember in what file or email message you stored a
228 given piece of information. You just ask for related terms, and the tool
229 will return a list of documents where these terms are prominent, in a
230 similar way to Internet search engines.
231
232 Full text search applications try to determine which documents are most
233 relevant to the search terms you provide. Computer algorithms for
234 determining relevance can be very complex, and in general are inferior to
235 the power of the human mind to rapidly determine relevance. The quality of
236 relevance guessing is probably the most important aspect when evaluating a
237 search application.
238
239 In many cases, you are looking for all the forms of a word, including
240 plurals, different tenses for a verb, or terms derived from the same root
241 or stem (example: floor, floors, floored, flooring...). Queries are
242 usually automatically expanded to all such related terms (words that
243 reduce to the same stem). This can be prevented for searching for a
244 specific form.
245
 Stemming, by itself, does not accommodate misspellings or phonetic
 searches. A full text search application may also support this form of
 approximation. For example, a search for aliterattion that returns no
 result may propose, depending on index contents, alliteration,
 alteration, alterations or altercation as possible replacement terms.
251
1.3. Recoll overview
253
254 Recoll uses the Xapian information retrieval library as its storage and
255 retrieval engine. Xapian is a very mature package using a sophisticated
256 probabilistic ranking model.
257
258 The Xapian library manages an index database which describes where terms
259 appear in your document files. It efficiently processes the complex
260 queries which are produced by the Recoll query expansion mechanism, and is
261 in charge of the all-important relevance computation task.
262
263 Recoll provides the mechanisms and interface to get data into and out of
264 the index. This includes translating the many possible document formats
265 into pure text, handling term variations (using Xapian stemmers), and
266 spelling approximations (using the aspell speller), interpreting user
267 queries and presenting results.
268
 In short, Recoll does the dirty footwork, while Xapian deals with the
 intelligent parts of the process.
271
272 The Xapian index can be big (roughly the size of the original document
273 set), but it is not a document archive. Recoll can only display documents
274 that still exist at the place from which they were indexed. (Actually,
275 there is a way to reconstruct a document from the information in the
276 index, but the result is not nice, as all formatting, punctuation and
277 capitalization are lost).
278
279 Recoll stores all internal data in Unicode UTF-8 format, and it can index
280 files of many types with different character sets, encodings, and
281 languages into the same index. It can process documents embedded inside
282 other documents (for example a pdf document stored inside a Zip archive
283 sent as an email attachment...), down to an arbitrary depth.
284
285 Stemming is the process by which Recoll reduces words to their radicals so
286 that searching does not depend, for example, on a word being singular or
287 plural (floor, floors), or on a verb tense (flooring, floored). Because
288 the mechanisms used for stemming depend on the specific grammatical rules
289 for each language, there is a separate Xapian stemmer module for most
290 common languages where stemming makes sense.
291
292 Recoll stores the unstemmed versions of terms in the main index and uses
293 auxiliary databases for term expansion (one for each stemming language),
294 which means that you can switch stemming languages between searches, or
295 add a language without needing a full reindex.
296
297 Storing documents written in different languages in the same index is
298 possible, and commonly done. In this situation, you can specify several
299 stemming languages for the index.
300
 Recoll currently makes no attempt at automatic language recognition, which
 means that the stemmer will sometimes be applied to terms from other
 languages with potentially strange results. In practice, even if this
 introduces possibilities of confusion, this approach has proven quite
 useful, and it is much less cumbersome than separating your documents
 according to what language they are written in.
307
 Before version 1.18, Recoll stripped most accents and diacritics from
 terms, and converted them to lower case before either storing them in the
 index or searching for them. As a consequence, it was impossible to search
 for a particular capitalization of a term (US / us), or to discriminate
 two terms based on diacritics (sake / saké, mate / maté).
313
314 As of version 1.18, Recoll can optionally store the raw terms, without
315 accent stripping or case conversion. In this configuration, it is still
316 possible (and most common) for a query to be insensitive to case and/or
317 diacritics. Appropriate term expansions are performed before actually
318 accessing the main index. This is described in more detail in the section
319 about index case and diacritics sensitivity.
320
321 Recoll has many parameters which define exactly what to index, and how to
322 classify and decode the source documents. These are kept in configuration
323 files. A default configuration is copied into a standard location (usually
324 something like /usr/[local/]share/recoll/examples) during installation.
325 The default values set by the configuration files in this directory may be
326 overridden by values that you set inside your personal configuration,
327 found by default in the .recoll sub-directory of your home directory. The
328 default configuration will index your home directory with default
329 parameters and should be sufficient for giving Recoll a try, but you may
330 want to adjust it later, which can be done either by editing the text
331 files or by using configuration menus in the recoll GUI. Some other
332 parameters affecting only the recoll GUI are stored in the standard
333 location defined by Qt.
334
335 The indexing process is started automatically the first time you execute
336 the recoll GUI. Indexing can also be performed by executing the
337 recollindex command. Recoll indexing is multithreaded by default when
338 appropriate hardware resources are available, and can perform in parallel
339 multiple tasks among text extraction, segmentation and index updates.
340
341 Searches are usually performed inside the recoll GUI, which has many
342 options to help you find what you are looking for. However, there are
343 other ways to perform Recoll searches: mostly a command line interface, a
344 Python programming interface, a KDE KIO slave module, and Ubuntu Unity
345 Lens (for older versions) or Scope (for current versions) modules.
346
Chapter 2. Indexing
348
2.1. Introduction
350
351 Indexing is the process by which the set of documents is analyzed and the
352 data entered into the database. Recoll indexing is normally incremental:
353 documents will only be processed if they have been modified since the last
354 run. On the first execution, all documents will need processing. A full
355 index build can be forced later by specifying an option to the indexing
356 command (recollindex -z or -Z).
357
358 recollindex skips files which caused an error during a previous pass. This
359 is a performance optimization, and a new behaviour in version 1.21 (failed
360 files were always retried by previous versions). The command line option
361 -k can be set to retry failed files, for example after updating a filter.
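
 As a rough sketch, the kinds of indexing run mentioned above could be
 wrapped as shell functions. The function names are hypothetical; only the
 recollindex options are Recoll's own. The description of -Z as an in-place
 reset comes from the recollindex man page rather than from this section,
 so check it against your installed version.

```shell
# Hypothetical wrapper functions around the recollindex options.

# Incremental pass: only documents modified since the last run are processed.
index_incremental() { recollindex; }

# Full rebuild: -z resets (erases) the index before indexing.
index_full_rebuild() { recollindex -z; }

# Full rebuild in place: -Z treats all documents as changed
# (assumption taken from the man page).
index_in_place_reset() { recollindex -Z; }

# Retry the files that failed during an earlier pass (skipped by default
# from version 1.21 on), for example after updating a filter.
index_retry_failed() { recollindex -k; }
```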
362
363 The following sections give an overview of different aspects of the
364 indexing processes and configuration, with links to detailed sections.
365
366 Depending on your data, temporary files may be needed during indexing,
367 some of them possibly quite big. You can use the RECOLL_TMPDIR or TMPDIR
368 environment variables to determine where they are created (the default is
369 to use /tmp). Using TMPDIR has the nice property that it may also be taken
370 into account by auxiliary commands executed by recollindex.
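
 A minimal sketch of redirecting the temporary files follows. The function
 name and the example path are assumptions; the TMPDIR variable is the one
 described above.

```shell
# Hypothetical helper: run an indexing pass with temporary files created
# under the directory given as "$1".
index_with_tmpdir() {
    mkdir -p "$1" || return 1
    # TMPDIR is set for this one command only; auxiliary commands started
    # by recollindex inherit it.
    TMPDIR="$1" recollindex
}

# Example call (the path is an assumption, pick a file system with room):
#   index_with_tmpdir /var/tmp/recoll-scratch
```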
371
372 2.1.1. Indexing modes
373
 Recoll indexing can be performed in two different modes:
375
376 o Periodic (or batch) indexing: indexing takes place at discrete times,
377 by executing the recollindex command. The typical usage is to have a
378 nightly indexing run programmed into your cron file.
379
380 o Real time indexing: indexing takes place as soon as a file is created
381 or changed. recollindex runs as a daemon and uses a file system
382 alteration monitor such as inotify, Fam or Gamin to detect file
383 changes.
384
 The choice between the two methods is mostly a matter of preference, and
 they can be combined by setting up multiple indexes (for example, periodic
 indexing on a big documentation directory and real time indexing on a
 small home directory). Monitoring a big file system tree can consume
 significant system resources.
390
391 The choice of method and the parameters used can be configured from the
392 recoll GUI: Preferences -> Indexing schedule
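
 For readers who prefer the command line to the GUI, the two modes might
 be set up roughly as follows. The function names and the 03:00 schedule
 are arbitrary choices for illustration, and the -m option is taken from
 the recollindex man page, not from this section.

```shell
# Periodic indexing: print a crontab line for a nightly run at 03:00.
# Paste the line into the editor opened by "crontab -e".
nightly_cron_line() { printf '%s\n' '0 3 * * * recollindex'; }

# Real time indexing: start recollindex as a monitoring daemon
# (assumption: -m selects monitor mode in your version).
start_realtime_indexing() { recollindex -m; }
```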
393
394 2.1.2. Configurations, multiple indexes
395
396 The parameters describing what is to be indexed and local preferences are
397 defined in text files contained in a configuration directory.
398
399 All parameters have defaults, defined in system-wide files.
400
401 Without further configuration, Recoll will index all appropriate files
402 from your home directory, with a reasonable set of defaults.
403
404 A default personal configuration directory ($HOME/.recoll/) is created
405 when a Recoll program is first executed. It is possible to create other
406 configuration directories, and use them by setting the RECOLL_CONFDIR
407 environment variable, or giving the -c option to any of the Recoll
408 commands.
409
 In some cases, it may be useful to index different areas of the file
 system into separate databases. You can do this by using multiple
 configuration directories, each indexing a file system area to a specific
 database. Typically, this would be done to separate personal and shared
 indexes, or to take advantage of the organization of your data to improve
 search precision.
416
417 The generated indexes can be queried concurrently in a transparent manner.
418
419 For index generation, multiple configurations are totally independent from
420 each other. When multiple indexes need to be used for a single search,
421 some parameters should be consistent among the configurations.
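
 As an illustration, a second configuration directory with its own index
 could be created along these lines. The helper name and the paths are
 hypothetical; topdirs is the configuration variable that selects the
 indexed subtrees, and -c is the option described above.

```shell
# Hypothetical helper: create configuration directory "$1" whose index
# covers the file system area "$2".
make_recoll_config() {
    mkdir -p "$1" || return 1
    printf 'topdirs = %s\n' "$2" > "$1/recoll.conf"
}

# Example (paths are assumptions):
#   make_recoll_config "$HOME/.recoll-shared" /srv/shared/docs
#   recollindex -c "$HOME/.recoll-shared"   # update only that index
#   recoll -c "$HOME/.recoll-shared"        # search it from the GUI
```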
422
423 2.1.3. Document types
424
425 Recoll knows about quite a few different document types. The parameters
426 for document types recognition and processing are set in configuration
427 files.
428
429 Most file types, like HTML or word processing files, only hold one
430 document. Some file types, like email folders or zip archives, can hold
431 many individually indexed documents, which may themselves be compound
432 ones. Such hierarchies can go quite deep, and Recoll can process, for
433 example, a LibreOffice document stored as an attachment to an email
434 message inside an email folder archived in a zip file...
435
436 Recoll indexing processes plain text, HTML, OpenDocument
437 (Open/LibreOffice), email formats, and a few others internally.
438
 Other file types (for example PostScript, PDF, MS Word, RTF...) need
 external applications for preprocessing. The list is in the installation
 section. After every indexing operation, Recoll updates a list of commands
 that would be needed for indexing existing file types. This list can be
 displayed by selecting the menu option File -> Show Missing Helpers in the
 recoll GUI. It is stored in the missing text file inside the configuration
 directory.
446
447 By default, Recoll will try to index any file type that it has a way to
448 read. This is sometimes not desirable, and there are ways to either
449 exclude some types, or on the contrary to define a positive list of types
450 to be indexed. In the latter case, any type not in the list will be
451 ignored.
452
453 Excluding types can be done by adding wildcard name patterns to the
454 skippedNames list, which can be done from the GUI Index configuration
455 menu. For versions 1.20 and later, you can alternatively set the
456 excludedmimetypes list in the configuration file. This can be redefined
457 for subdirectories.
458
459 You can also define an exclusive list of MIME types to be indexed (no
460 others will be indexed), by setting the indexedmimetypes configuration
461 variable. Example:
462
463 indexedmimetypes = text/html application/pdf
464
465
466 It is possible to redefine this parameter for subdirectories. Example:
467
468 [/path/to/my/dir]
469 indexedmimetypes = application/pdf
470
471
472 (When using sections like this, don't forget that they remain in effect
473 until the end of the file or another section indicator).
474
 Both excludedmimetypes and indexedmimetypes can be set either by editing
 the main configuration file (recoll.conf), or from the GUI index
 configuration tool.
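
 Editing the file from a script could look like the sketch below. The
 helper name and the MIME types in the example are assumptions. Appending
 is only appropriate when recoll.conf contains no [directory] sections,
 because a line added after such a section would apply to that
 subdirectory only.

```shell
# Hypothetical helper: append an excludedmimetypes line to the recoll.conf
# in configuration directory "$1". Remaining arguments are the MIME types.
exclude_mime_types() {
    conf="$1/recoll.conf"
    shift
    # "$*" joins the remaining arguments with spaces.
    printf 'excludedmimetypes = %s\n' "$*" >> "$conf"
}

# Example: exclude_mime_types "$HOME/.recoll" audio/mpeg video/mp4
```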
478
479 2.1.4. Indexing failures
480
481 Indexing may fail for some documents, for a number of reasons: a helper
482 program may be missing, the document may be corrupt, we may fail to
483 uncompress a file because no file system space is available, etc.
484
 Recoll versions prior to 1.21 always retried indexing files which had
 previously caused an error. This guaranteed that anything that may have
 become indexable (for example because a helper had been installed) would
 be indexed. However, this was bad for performance because some indexing
 failures may be quite costly (for example failing to uncompress a big file
 because of insufficient disk space).
491
 The indexer in Recoll versions 1.21 and later does not retry failed files
 by default. Retrying will only occur if an explicit option (-k) is set on
 the recollindex command line, or if a script executed when recollindex
 starts up says so. The script is defined by a configuration variable
 (checkneedretryindexscript), and makes a rather lame attempt at deciding
 if a helper command may have been installed, by checking if any of the
 common bin directories have changed.
499
500 2.1.5. Recovery
501
502 In the rare case where the index becomes corrupted (which can signal
503 itself by weird search results or crashes), the index files need to be
504 erased before restarting a clean indexing pass. Just delete the xapiandb
505 directory (see next section), or, alternatively, start the next
506 recollindex with the -z option, which will reset the database before
507 indexing.
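
 The first recovery method could be scripted roughly as follows. The
 helper name is hypothetical, and the sketch assumes that the index lives
 in the default xapiandb subdirectory of the configuration directory.

```shell
# Hypothetical helper: discard the index stored under configuration
# directory "$1" (default: $HOME/.recoll). Only the xapiandb directory is
# removed; the configuration files are left alone.
discard_index() {
    confdir="${1:-$HOME/.recoll}"
    rm -rf -- "$confdir/xapiandb"
}

# Afterwards, run a normal pass:       recollindex -c <confdir>
# Alternative without deleting files:  recollindex -z
```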
508
2.2. Index storage
510
511 The default location for the index data is the xapiandb subdirectory of
512 the Recoll configuration directory, typically $HOME/.recoll/xapiandb/.
513 This can be changed via two different methods (with different purposes):
514
515 o You can specify a different configuration directory by setting the
516 RECOLL_CONFDIR environment variable, or using the -c option to the
517 Recoll commands. This method would typically be used to index
518 different areas of the file system to different indexes. For example,
519 if you were to issue the following commands:
520
521 export RECOLL_CONFDIR=~/.indexes-email
522 recoll
523
524
525 Then Recoll would use configuration files stored in ~/.indexes-email/
526 and, (unless specified otherwise in recoll.conf) would look for the
527 index in ~/.indexes-email/xapiandb/.
528
529 Using multiple configuration directories and configuration options
530 allows you to tailor multiple configurations and indexes to handle
531 whatever subset of the available data you wish to make searchable.
532
533 o For a given configuration directory, you can specify a non-default
534 storage location for the index by setting the dbdir parameter in the
535 configuration file (see the configuration section). This method would
536 mainly be of use if you wanted to keep the configuration directory in
537 its default location, but desired another location for the index,
538 typically out of disk occupation concerns.
539
540 The size of the index is determined by the size of the set of documents,
541 but the ratio can vary a lot. For a typical mixed set of documents, the
542 index size will often be close to the data set size. In specific cases (a
543 set of compressed mbox files for example), the index can become much
544 bigger than the documents. It may also be much smaller if the documents
545 contain a lot of images or other non-indexed data (an extreme example
546 being a set of mp3 files where only the tags would be indexed).
547
548 Of course, images, sound and video do not increase the index size, which
549 means that nowadays (2012), typically, even a big index will be negligible
550 against the total amount of data on the computer.
551
552 The index data directory (xapiandb) only contains data that can be
553 completely rebuilt by an index run (as long as the original documents
554 exist), and it can always be destroyed safely.
555
556 2.2.1. Xapian index formats
557
558 Xapian versions usually support several formats for index storage. A given
559 major Xapian version will have a current format, used to create new
560 indexes, and will also support the format from the previous major version.
561
 Xapian will not automatically convert an existing index from the older
 format to the newer one. If you want to upgrade to the new format, or if a
 very old index needs to be converted because its format is not supported
 any more, you will have to explicitly delete the old index, then run a
 normal indexing process.

 Using the -z option to recollindex is not sufficient to change the format:
 you will have to delete all files inside the index directory (typically
 ~/.recoll/xapiandb) before starting the indexing.
571
572 2.2.2. Security aspects
573
574 The Recoll index does not hold copies of the indexed documents. But it
575 does hold enough data to allow for an almost complete reconstruction. If
576 confidential data is indexed, access to the database directory should be
577 restricted.
578
579 Recoll (since version 1.4) will create the configuration directory with a
580 mode of 0700 (access by owner only). As the index data directory is by
581 default a sub-directory of the configuration directory, this should result
582 in appropriate protection.
583
584 If you use another setup, you should think of the kind of protection you
585 need for your index, set the directory and files access modes
586 appropriately, and also maybe adjust the umask used during index updates.
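
 For a non-default setup, the protection described above might be applied
 as in this sketch. Both helper names are hypothetical; chmod and umask
 are the standard shell tools.

```shell
# Hypothetical helper: create index directory "$1" if needed and restrict
# it to its owner (mode 0700).
protect_index_dir() {
    mkdir -p "$1" || return 1
    chmod 700 "$1"
}

# Hypothetical helper: update the index for configuration directory "$1"
# with a umask that keeps newly created index files private. The body runs
# in a subshell, so the calling shell keeps its own umask.
index_privately() (
    umask 077
    recollindex -c "$1"
)
```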
587
2.3. Index configuration
589
590 Variables set inside the Recoll configuration files control which areas of
591 the file system are indexed, and how files are processed. These variables
592 can be set either by editing the text files or by using the dialogs in the
593 recoll GUI.
594
595 The first time you start recoll, you will be asked whether or not you
596 would like it to build the index. If you want to adjust the configuration
597 before indexing, just click Cancel at this point, which will get you into
598 the configuration interface. If you exit at this point, recoll will have
599 created a ~/.recoll directory containing empty configuration files, which
600 you can edit by hand.
601
 The configuration is documented inside the installation chapter of this
 document, or in the recoll.conf(5) man page, but the most current
 information will most likely be the comments inside the sample file. The
 most immediately useful variable you may be interested in is probably
 topdirs, which determines what subtrees get indexed.

 The applications needed to index file types other than text, HTML or email
 (for example PDF, PostScript, MS Word...) are described in the external
 packages section.
611
612 As of Recoll 1.18 there are two incompatible types of Recoll indexes,
613 depending on the treatment of character case and diacritics. The next
614 section describes the two types in more detail.
615
616 2.3.1. Multiple indexes
617
618 Multiple Recoll indexes can be created by using several configuration
619 directories which are usually set to index different areas of the file
620 system. A specific index can be selected for updating or searching, using
621 the RECOLL_CONFDIR environment variable or the -c option to recoll and
622 recollindex.
623
624 A typical usage scenario for the multiple index feature would be for a
625 system administrator to set up a central index for shared data, that you
626 choose to search or not in addition to your personal data. Of course,
627 there are other possibilities. There are many cases where you know the
628 subset of files that should be searched, and where narrowing the search
629 can improve the results. You can achieve approximately the same effect
630 with the directory filter in advanced search, but multiple indexes will
631 have much better performance and may be worth the trouble.
632
633 A recollindex program instance can only update one specific index.
634
635 The main index (defined by RECOLL_CONFDIR or -c) is always active. If this
636 is undesirable, you can set up your base configuration to index an empty
637 directory.
638
639 The different search interfaces (GUI, command line, ...) have different
640 methods to define the set of indexes to be used, see the appropriate
641 section.
642
 If a set of multiple indexes is to be used together for searches, some
 configuration parameters must be consistent among the set. These are
 parameters which need to be the same when indexing and searching. As the
 parameters come from the main configuration when searching, they need to
 be compatible with what was set when creating the other indexes (which
 came from their respective configuration directories).
649
650 Most importantly, all indexes to be queried concurrently must have the
651 same option concerning character case and diacritics stripping, but there
652 are other constraints. Most of the relevant parameters are described in
653 the linked section.
654
655 2.3.2. Index case and diacritics sensitivity
656
 As of Recoll version 1.18 you have a choice of building an index with
 terms stripped of character case and diacritics, or one with raw terms.
 For a source term of Résumé, the former will store resume, the latter
 Résumé.
661
662 Each type of index allows performing searches insensitive to case and
663 diacritics: with a raw index, the user entry will be expanded to match all
664 case and diacritics variations present in the index. With a stripped
665 index, the search term will be stripped before searching.
666
 A raw index allows for another possibility which a stripped index cannot
 offer: using case and diacritics to discriminate between terms, returning
 different results when searching for US and us or resume and résumé. Read
 the section about search case and diacritics sensitivity for more details.
671
672 The type of index to be created is controlled by the indexStripChars
673 configuration variable which can only be changed by editing the
674 configuration file. Any change implies an index reset (not automated by
675 Recoll), and all indexes in a search must be set in the same way (again,
676 not checked by Recoll).
677
 If indexStripChars is not set, Recoll 1.18 creates a stripped index by
 default, for compatibility with previous versions.
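
 Since Recoll does not reset the index itself when the index type changes,
 a switch to a raw index might be scripted as below. This is a sketch
 under several assumptions: the helper name is hypothetical, a value of 0
 selects a raw index, recoll.conf neither sets indexStripChars already nor
 contains [directory] sections, and the index is in the default xapiandb
 location.

```shell
# Hypothetical helper: switch configuration directory "$1" to a raw index.
switch_to_raw_index() {
    # Assumption: 0 disables stripping of case and diacritics.
    printf 'indexStripChars = 0\n' >> "$1/recoll.conf" || return 1
    # The existing stripped index is incompatible with the new setting,
    # so it is discarded here.
    rm -rf -- "$1/xapiandb"
}

# Then rebuild the index: recollindex -c <confdir>
```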
680
681 As a cost for added capability, a raw index will be slightly bigger than a
682 stripped one (around 10%). Also, searches will be more complex, so
683 probably slightly slower, and the feature is still young, so that a
684 certain amount of weirdness cannot be excluded.
685
 One of the most adverse consequences of using a raw index is that some
 phrase and proximity searches may become impossible: because each term
 needs to be expanded, and all combinations searched for, the
 multiplicative expansion may become unmanageable.
690
691 2.3.3. The index configuration GUI
692
693 Most parameters for a given index configuration can be set from a recoll
694 GUI running on this configuration (either as default, or by setting
695 RECOLL_CONFDIR or the -c option.)
696
 The interface is started from the Preferences -> Index Configuration menu
 entry. It is divided into four tabs: Global parameters, Local parameters,
 Web history (which is explained in the next section) and Search
 parameters.
701
702 The Global parameters tab allows setting global variables, like the lists
703 of top directories, skipped paths, or stemming languages.
704
705 The Local parameters tab allows setting variables that can be redefined
706 for subdirectories. This second tab has an initially empty list of
707 customisation directories, to which you can add. The variables are then
708 set for the currently selected directory (or at the top level if the empty
709 line is selected).
710
 The Search parameters tab defines parameters which are used at query
 time, but are global to an index and affect all search tools, not only the
 GUI.
714
715 The meaning for most entries in the interface is self-evident and
716 documented by a ToolTip popup on the text label. For more detail, you will
717 need to refer to the configuration section of this guide.
718
 The configuration tool normally respects the comments and most of the
 formatting inside the configuration file, so that it is quite possible to
 use it on hand-edited files, which you might nevertheless want to back up
 first.
723
2.4. Indexing Web pages you visit
725
 With the help of a Firefox extension, Recoll can index the Internet pages
 that you visit. The extension was initially designed for the Beagle
 indexer, but it has recently been renamed and better adapted to Recoll.
729
730 The extension works by copying visited WEB pages to an indexing queue
731 directory, which Recoll then processes, indexing the data, storing it into
732 a local cache, then removing the file from the queue.
733
734 This feature can be enabled in the GUI Index configuration panel, or by
735 editing the configuration file (set processwebqueue to 1).
736
737 A current pointer to the extension can be found, along with up-to-date
738 instructions, on the Recoll wiki.
739
740 A copy of the indexed WEB pages is retained by Recoll in a local cache
741 (from which previews can be fetched). The cache size can be adjusted from
742 the Index configuration / Web history panel. Once the maximum size is
743 reached, old pages are purged - both from the cache and the index - to
744 make room for new ones, so you need to explicitly archive in some other
745 place the pages that you want to keep indefinitely.
746
7472.5. Extended attributes data
748
749 User extended attributes are named pieces of information that most modern
750 file systems can attach to any file.
751
752 Recoll versions 1.19 and later process extended attributes as document
753 fields by default. For older versions, this has to be activated at build
754 time.
755
756 A freedesktop standard defines a few special attributes, which are handled
757 as such by Recoll:
758
 mime_type

 If set, this overrides any other determination of the file MIME
 type.

 charset

 If set, this defines the file character set (mostly useful for
 plain text files).
767
768 By default, other attributes are handled as Recoll fields. On Linux, the
769 user prefix is removed from the name. This can be configured more
770 precisely inside the fields configuration file.
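
 On Linux, the two special attributes can be set from the command line with
 the setfattr tool from the attr package. The sketch below assumes that the
 file system is mounted with user extended attribute support. The function
 name tag_file and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: set the two freedesktop attributes on a file.
# Usage: tag_file FILE MIMETYPE CHARSET
tag_file() {
    # On Linux the attributes live in the "user" namespace.
    setfattr -n user.mime_type -v "$2" -- "$1" &&
    setfattr -n user.charset -v "$3" -- "$1"
}

# Example: tag_file notes.txt text/plain iso-8859-1
# "getfattr -d notes.txt" should then list both attributes.
```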
771
7722.6. Importing external tags
773
774 During indexing, it is possible to import metadata for each file by
775 executing commands. For example, this could extract user tag data for the
776 file and store it in a field for indexing.
777
778 See the section about the metadatacmds field in the main configuration
779 chapter for more detail.
780
7812.7. Periodic indexing
782
783 2.7.1. Running indexing
784
785 Indexing is always performed by the recollindex program, which can be
786 started either from the command line or from the File menu in the recoll
787 GUI program. When started from the GUI, the indexing will run on the same
788 configuration recoll was started on. When started from the command line,
789 recollindex will use the RECOLL_CONFDIR variable or accept a -c confdir
790 option to specify a non-default configuration directory.
791
 If the recoll program finds no index when it starts, it will automatically
 start indexing (unless you cancel it).
794
795 The recollindex indexing process can be interrupted by sending an
796 interrupt (Ctrl-C, SIGINT) or terminate (SIGTERM) signal. Some time may
797 elapse before the process exits, because it needs to properly flush and
798 close the index. This can also be done from the recoll GUI File -> Stop
799 Indexing menu entry.
800
801 After such an interruption, the index will be somewhat inconsistent
802 because some operations which are normally performed at the end of the
803 indexing pass will have been skipped (for example, the stemming and
 spelling databases will be missing or out of date). You just need to
805 restart indexing at a later time to restore consistency. The indexing will
806 restart at the interruption point (the full file tree will be traversed,
807 but files that were indexed up to the interruption and for which the index
808 is still up to date will not need to be reindexed).
809
810 recollindex has a number of other options which are described in its man
811 page. Only a few will be described here.
812
813 Option -z will reset the index when starting. This is almost the same as
814 destroying the index files (the nuance is that the Xapian format version
815 will not be changed).
816
817 Option -Z will force the update of all documents without resetting the
818 index first. This will not have the "clean start" aspect of -z, but the
819 advantage is that the index will remain available for querying while it is
820 rebuilt, which can be a significant advantage if it is very big (some
821 installations need days for a full index rebuild).
822
823 Option -k will force retrying files which previously failed to be indexed,
824 for example because of a missing helper program.
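
 The three options above can be summarised in a small wrapper. This is only
 a sketch: the mode names (reset, refresh, retry) and the function names are
 invented for the example, while the -z, -Z and -k options are the ones
 described above.

```shell
# Hypothetical helper: map a rebuild mode to the recollindex option.
reindex_flag() {
    case $1 in
        reset)   printf '%s\n' -z ;;   # erase the index, then rebuild it
        refresh) printf '%s\n' -Z ;;   # update all documents in place
        retry)   printf '%s\n' -k ;;   # retry files which failed earlier
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Run an indexing pass in the chosen mode.
reindex() {
    flag=$(reindex_flag "$1") || return 1
    recollindex "$flag"
}
```

 For example, reindex refresh runs recollindex -Z, which keeps the index
 available for querying during the rebuild.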
825
 The -i and -f options may also be of special interest. -i allows
827 indexing an explicit list of files (given as command line parameters or
828 read on stdin). -f tells recollindex to ignore file selection parameters
829 from the configuration. Together, these options allow building a custom
830 file selection process for some area of the file system, by adding the top
831 directory to the skippedPaths list and using an appropriate file selection
832 method to build the file list to be fed to recollindex -if. Trivial
833 example:
834
 find . -name indexable.txt -print | recollindex -if
836
837
838 recollindex -i will not descend into subdirectories specified as
839 parameters, but just add them as index entries. It is up to the external
840 file selection method to build the complete file list.
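
 Because directories are not descended into, the file list should contain
 regular files only. A minimal sketch, using find with -type f (the function
 name list_files and the /some/area path are placeholders, and file names
 containing newlines are not handled):

```shell
# Hypothetical helper: print every regular file under TOP matching PATTERN.
# Usage: list_files TOP PATTERN
list_files() {
    find "$1" -type f -name "$2" -print
}

# Example, with /some/area listed in skippedPaths:
#   list_files /some/area '*.txt' | recollindex -if
```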
841
842 2.7.2. Using cron to automate indexing
843
844 The most common way to set up indexing is to have a cron task execute it
845 every night. For example the following crontab entry would do it every day
846 at 3:30AM (supposing recollindex is in your PATH):
847
 30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
849
850 Or, using anacron:
851
 1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
853
854 As of version 1.17 the Recoll GUI has dialogs to manage crontab entries
855 for recollindex. You can reach them from the Preferences -> Indexing
856 Schedule menu. They only work with the good old cron, and do not give
857 access to all features of cron scheduling.
858
859 The usual command to edit your crontab is crontab -e (which will usually
860 start the vi editor to edit the file). You may have more sophisticated
861 tools available on your system.
862
 Please be aware that there may be differences between your usual
 interactive command line environment and the one seen by crontab commands.
 The PATH variable in particular is often more restricted under cron.
 Please check the crontab manual pages about possible issues.
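
 One way to sidestep the PATH question is to write the absolute path of
 recollindex into the crontab entry. The sketch below only prints such an
 entry, it does not install it. The function name cron_line is hypothetical.

```shell
# Hypothetical helper: print a daily crontab entry for recollindex.
# Usage: cron_line MINUTE HOUR PROGRAM LOGFILE
cron_line() {
    printf '%s %s * * * %s > %s 2>&1\n' "$1" "$2" "$3" "$4"
}

# Example, resolving the program path in the interactive shell:
#   cron_line 30 3 "$(command -v recollindex)" /some/tmp/dir/recolltrace
```

 The printed line can then be pasted into the file opened by crontab -e.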
867
8682.8. Real time indexing
869
870 Real time monitoring/indexing is performed by starting the recollindex -m
871 command. With this option, recollindex will detach from the terminal and
872 become a daemon, permanently monitoring file changes and updating the
873 index.
874
 Under KDE, Gnome and some other desktop environments, the daemon can be
 automatically started when you log in, by creating a desktop file inside
877 the ~/.config/autostart directory. This can be done for you by the Recoll
878 GUI. Use the Preferences->Indexing Schedule menu.
879
880 With older X11 setups, starting the daemon is normally performed as part
881 of the user session script.
882
883 The rclmon.sh script can be used to easily start and stop the daemon. It
884 can be found in the examples directory (typically
885 /usr/local/[share/]recoll/examples).
886
887 For example, my out of fashion xdm-based session has a .xsession script
888 with the following lines at the end:
889
 recollconf=$HOME/.recoll-home
 recolldata=/usr/local/share/recoll
 RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start

 fvwm
895
896
897 The indexing daemon gets started, then the window manager, for which the
898 session waits.
899
 By default the indexing daemon will monitor the state of the X11 session,
 and exit when it finishes, so it is not necessary to kill it explicitly.
 (The X11 server monitoring can be disabled with option -x to recollindex.)
903
904 If you use the daemon completely out of an X11 session, you need to add
905 option -x to disable X11 session monitoring (else the daemon will not
906 start).
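
 For a machine without an X11 session, the daemon start could be wrapped as
 follows. This is a sketch only: start_monitor is a made-up name, and the
 optional argument is a configuration directory passed through the -c
 option.

```shell
# Hypothetical helper: start the real time indexing daemon outside X11.
# Usage: start_monitor [confdir]
start_monitor() {
    if [ -n "${1:-}" ]; then
        # -m: monitor mode, -x: do not watch the X11 session
        recollindex -c "$1" -m -x
    else
        recollindex -m -x
    fi
}
```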
907
 By default, the messages from the indexing daemon will be sent to the same
909 file as those from the interactive commands (logfilename). You may want to
910 change this by setting the daemlogfilename and daemloglevel configuration
911 parameters. Also the log file will only be truncated when the daemon
912 starts. If the daemon runs permanently, the log file may grow quite big,
913 depending on the log level.
914
915 When building Recoll, the real time indexing support can be customised
916 during package configuration with the --with[out]-fam or
917 --with[out]-inotify options. The default is currently to include inotify
918 monitoring on systems that support it, and, as of Recoll 1.17, gamin
919 support on FreeBSD.
920
921 While it is convenient that data is indexed in real time, repeated
922 indexing can generate a significant load on the system when files such as
923 email folders change. Also, monitoring large file trees by itself
924 significantly taxes system resources. You probably do not want to enable
925 it if your system is short on resources. Periodic indexing is adequate in
926 most cases.
927
928 Increasing resources for inotify
929
930 On Linux systems, monitoring a big tree may need increasing the resources
931 available to inotify, which are normally defined in /etc/sysctl.conf.
932
 ### inotify
 #
 # cat /proc/sys/fs/inotify/max_queued_events - 16384
 # cat /proc/sys/fs/inotify/max_user_instances - 128
 # cat /proc/sys/fs/inotify/max_user_watches - 16384
 #
 # -- Change to:
 #
 fs.inotify.max_queued_events=32768
 fs.inotify.max_user_instances=256
 fs.inotify.max_user_watches=32768
944
945
 In particular, you will need to trim your tree or adjust the
 max_user_watches value if indexing exits with a message about errno
 ENOSPC (28) from inotify_add_watch.
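
 To estimate whether a tree fits within the current limit, you can compare
 its directory count with max_user_watches. The sketch below assumes one
 inotify watch per monitored directory, and the function names are
 hypothetical. The limit is per user, so watches held by other applications
 count against it too.

```shell
# Hypothetical helper: count the directories under TOP (TOP included).
count_dirs() {
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(find "$1" -type d | wc -l) ))
}

# Compare the directory count with the current inotify watch limit (Linux).
check_watches() {
    needed=$(count_dirs "$1")
    limit=$(cat /proc/sys/fs/inotify/max_user_watches)
    if [ "$needed" -gt "$limit" ]; then
        echo "$1: $needed directories, limit $limit: raise max_user_watches"
    else
        echo "$1: $needed directories, limit $limit: ok"
    fi
}
```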
949
950 2.8.1. Slowing down the reindexing rate for fast changing files
951
952 When using the real time monitor, it may happen that some files need to be
953 indexed, but change so often that they impose an excessive load for the
954 system.
955
956 Recoll provides a configuration option to specify the minimum time before
957 which a file, specified by a wildcard pattern, cannot be reindexed. See
958 the mondelaypatterns parameter in the configuration section.
959
960Chapter 3. Searching
961
9623.1. Searching with the Qt graphical user interface
963
964 The recoll program provides the main user interface for searching. It is
965 based on the Qt library.
966
967 recoll has two search modes:
968
969 o Simple search (the default, on the main screen) has a single entry
970 field where you can enter multiple words.
971
972 o Advanced search (a panel accessed through the Tools menu or the
973 toolbox bar icon) has multiple entry fields, which you may use to
974 build a logical condition, with additional filtering on file type,
975 location in the file system, modification date, and size.
976
 In most cases, you can enter the terms as you think of them, even if they
978 contain embedded punctuation or other non-textual characters. For example,
979 Recoll can handle things like email addresses, or arbitrary cut and paste
980 from another text window, punctuation and all.
981
982 The main case where you should enter text differently from how it is
 printed is for East Asian languages (Chinese, Japanese, Korean). Words
984 composed of single or multiple characters should be entered separated by
985 white space in this case (they would typically be printed without white
986 space).
987
988 Some searches can be quite complex, and you may want to re-use them later,
989 perhaps with some tweaking. Recoll versions 1.21 and later can save and
990 restore searches, using XML files. See Saving and restoring queries.
991
992 3.1.1. Simple search
993
994 1. Start the recoll program.
995
996 2. Possibly choose a search mode: Any term, All terms, File name or Query
997 language.
998
999 3. Enter search term(s) in the text field at the top of the window.
1000
1001 4. Click the Search button or hit the Enter key to start the search.
1002
 The initial default search mode is Query language. Without special
 directives, this will look for documents containing all of the search
 terms (the ones with more terms will get better scores), just like the All
 terms mode, which ignores such directives. Any term will search for
 documents where at least one of the terms appears.
1008
1009 The Query Language features are described in a separate section.
1010
1011 All search modes allow wildcards inside terms (*, ?, []). You may want to
1012 have a look at the section about wildcards for more information about
1013 this.
1014
1015 File name will specifically look for file names. The point of having a
1016 separate file name search is that wild card expansion can be performed
1017 more efficiently on a small subset of the index (allowing wild cards on
1018 the left of terms without excessive penalty). Things to know:
1019
1020 o White space in the entry should match white space in the file name,
1021 and is not treated specially.
1022
1023 o The search is insensitive to character case and accents, independently
1024 of the type of index.
1025
1026 o An entry without any wild card character and not capitalized will be
1027 prepended and appended with '*' (ie: etc -> *etc*, but Etc -> etc).
1028
1029 o If you have a big index (many files), excessively generic fragments
1030 may result in inefficient searches.
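
 The wildcard rule from the list above can be illustrated with a few lines
 of shell. This is only a model of the documented behaviour, not the code
 Recoll uses, and the function name fname_pattern is made up. Lowercasing
 the whole capitalized entry is an assumption based on the Etc -> etc
 example.

```shell
# Hypothetical model: show the pattern a File name entry would turn into.
fname_pattern() {
    case $1 in
        # The entry already contains a wildcard character: keep it as is.
        *\**|*\?*|*\[*) printf '%s\n' "$1" ;;
        # Capitalized entry: no wildcards are added.
        [[:upper:]]*)   printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' ;;
        # Anything else is wrapped in '*' on both sides.
        *)              printf '*%s*\n' "$1" ;;
    esac
}
```

 With this model, fname_pattern etc gives *etc*, while fname_pattern Etc
 gives etc.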
1031
1032 You can search for exact phrases (adjacent words in a given order) by
1033 enclosing the input inside double quotes. Ex: "virtual reality".
1034
1035 When using a stripped index, character case has no influence on search,
1036 except that you can disable stem expansion for any term by capitalizing
1037 it. Ie: a search for floor will also normally look for flooring, floored,
1038 etc., but a search for Floor will only look for floor, in any character
1039 case. Stemming can also be disabled globally in the preferences. When
1040 using a raw index, the rules are a bit more complicated.
1041
 Recoll remembers the last few searches that you performed. You can use the
 simple search text entry widget (a combobox) to recall them (click the
 drop-down arrow at the right of the text field). Please note, however,
 that only the search texts are remembered, not the mode (all/any/file
 name).
1046
1047 Typing Esc Space while entering a word in the simple search entry will
1048 open a window with possible completions for the word. The completions are
1049 extracted from the database.
1050
1051 Double-clicking on a word in the result list or a preview window will
1052 insert it into the simple search entry field.
1053
1054 You can cut and paste any text into an All terms or Any term search field,
1055 punctuation, newlines and all - except for wildcard characters (single ?
1056 characters are ok). Recoll will process it and produce a meaningful
1057 search. This is what most differentiates this mode from the Query Language
1058 mode, where you have to care about the syntax.
1059
1060 You can use the Tools -> Advanced search dialog for more complex searches.
1061
1062 3.1.2. The default result list
1063
1064 After starting a search, a list of results will instantly be displayed in
1065 the main list window.
1066
1067 By default, the document list is presented in order of relevance (how well
1068 the system estimates that the document matches the query). You can sort
1069 the result by ascending or descending date by using the vertical arrows in
1070 the toolbar.
1071
1072 Clicking on the Preview link for an entry will open an internal preview
1073 window for the document. Further Preview clicks for the same search will
1074 open tabs in the existing preview window. You can use Shift+Click to force
1075 the creation of another preview window, which may be useful to view the
1076 documents side by side. (You can also browse successive results in a
1077 single preview window by typing Shift+ArrowUp/Down in the window).
1078
1079 Clicking the Open link will start an external viewer for the document. By
1080 default, Recoll lets the desktop choose the appropriate application for
1081 most document types (there is a short list of exceptions, see further). If
1082 you prefer to completely customize the choice of applications, you can
1083 uncheck the Use desktop preferences option in the GUI preferences dialog,
1084 and click the Choose editor applications button to adjust the predefined
1085 Recoll choices. The tool accepts multiple selections of MIME types (e.g.
1086 to set up the editor for the dozens of office file types).
1087
1088 Even when Use desktop preferences is checked, there is a small list of
1089 exceptions, for MIME types where the Recoll choice should override the
1090 desktop one. These are applications which are well integrated with Recoll,
1091 especially evince for viewing PDF and Postscript files because of its
1092 support for opening the document at a specific page and passing a search
1093 string as an argument. Of course, you can edit the list (in the GUI
1094 preferences) if you would prefer to lose the functionality and use the
1095 standard desktop tool.
1096
1097 You may also change the choice of applications by editing the mimeview
1098 configuration file if you find this more convenient.
1099
1100 Each result entry also has a right-click menu with an Open With entry.
1101 This lets you choose an application from the list of those which
1102 registered with the desktop for the document MIME type.
1103
1104 The Preview and Open edit links may not be present for all entries,
1105 meaning that Recoll has no configured way to preview a given file type
1106 (which was indexed by name only), or no configured external editor for the
1107 file type. This can sometimes be adjusted simply by tweaking the mimemap
1108 and mimeview configuration files (the latter can be modified with the user
1109 preferences dialog).
1110
1111 The format of the result list entries is entirely configurable by using
1112 the preference dialog to edit an HTML fragment.
1113
1114 You can click on the Query details link at the top of the results page to
1115 see the query actually performed, after stem expansion and other
1116 processing.
1117
1118 Double-clicking on any word inside the result list or a preview window
1119 will insert it into the simple search text.
1120
1121 The result list is divided into pages (the size of which you can change in
1122 the preferences). Use the arrow buttons in the toolbar or the links at the
1123 bottom of the page to browse the results.
1124
1125 3.1.2.1. No results: the spelling suggestions
1126
1127 When a search yields no result, and if the aspell dictionary is
1128 configured, Recoll will try to check for misspellings among the query
1129 terms, and will propose lists of replacements. Clicking on one of the
1130 suggestions will replace the word and restart the search. You can hold any
1131 of the modifier keys (Ctrl, Shift, etc.) while clicking if you would
1132 rather stay on the suggestion screen because several terms need
1133 replacement.
1134
1135 3.1.2.2. The result list right-click menu
1136
1137 Apart from the preview and edit links, you can display a pop-up menu by
1138 right-clicking over a paragraph in the result list. This menu has the
1139 following entries:
1140
1141 o Preview
1142
1143 o Open
1144
1145 o Open With
1146
1147 o Run Script
1148
1149 o Copy File Name
1150
1151 o Copy Url
1152
1153 o Save to File
1154
1155 o Find similar
1156
1157 o Preview Parent document
1158
1159 o Open Parent document
1160
1161 o Open Snippets Window
1162
1163 The Preview and Open entries do the same thing as the corresponding links.
1164
1165 Open With lets you open the document with one of the applications claiming
1166 to be able to handle its MIME type (the information comes from the
1167 .desktop files in /usr/share/applications).
1168
1169 Run Script allows starting an arbitrary command on the result file. It
1170 will only appear for results which are top-level files. See further for a
1171 more detailed description.
1172
 The Copy File Name and Copy Url entries copy the relevant data to the
 clipboard, for later pasting.
1175
1176 Save to File allows saving the contents of a result document to a chosen
1177 file. This entry will only appear if the document does not correspond to
1178 an existing file, but is a subdocument inside such a file (ie: an email
1179 attachment). It is especially useful to extract attachments with no
1180 associated editor.
1181
1182 The Open/Preview Parent document entries allow working with the higher
1183 level document (e.g. the email message an attachment comes from). Recoll
1184 is sometimes not totally accurate as to what it can or can't do in this
1185 area. For example the Parent entry will also appear for an email which is
1186 part of an mbox folder file, but you can't actually visualize the mbox
1187 (there will be an error dialog if you try).
1188
1189 If the document is a top-level file, Open Parent will start the default
1190 file manager on the enclosing filesystem directory.
1191
 The Find similar entry will select a number of relevant terms from the
1193 current document and enter them into the simple search field. You can then
1194 start a simple search, with a good chance of finding documents related to
1195 the current result. I can't remember a single instance where this function
1196 was actually useful to me...
1197
1198 The Open Snippets Window entry will only appear for documents which
1199 support page breaks (typically PDF, Postscript, DVI). The snippets window
1200 lists extracts from the document, taken around search terms occurrences,
1201 along with the corresponding page number, as links which can be used to
1202 start the native viewer on the appropriate page. If the viewer supports
1203 it, its search function will also be primed with one of the search terms.
1204
1205 3.1.3. The result table
1206
1207 In Recoll 1.15 and newer, the results can be displayed in spreadsheet-like
1208 fashion. You can switch to this presentation by clicking the table-like
1209 icon in the toolbar (this is a toggle, click again to restore the list).
1210
1211 Clicking on the column headers will allow sorting by the values in the
1212 column. You can click again to invert the order, and use the header
1213 right-click menu to reset sorting to the default relevance order (you can
1214 also use the sort-by-date arrows to do this).
1215
1216 Both the list and the table display the same underlying results. The sort
1217 order set from the table is still active if you switch back to the list
1218 mode. You can click twice on a date sort arrow to reset it from there.
1219
1220 The header right-click menu allows adding or deleting columns. The columns
1221 can be resized, and their order can be changed (by dragging). All the
 changes are recorded when you quit recoll.
1223
1224 Hovering over a table row will update the detail area at the bottom of the
1225 window with the corresponding values. You can click the row to freeze the
1226 display. The bottom area is equivalent to a result list paragraph, with
1227 links for starting a preview or a native application, and an equivalent
1228 right-click menu. Typing Esc (the Escape key) will unfreeze the display.
1229
1230 3.1.4. Running arbitrary commands on result files (1.20 and later)
1231
1232 Apart from the Open and Open With operations, which allow starting an
1233 application on a result document (or a temporary copy), based on its MIME
1234 type, it is also possible to run arbitrary commands on results which are
1235 top-level files, using the Run Script entry in the results pop-up menu.
1236
1237 The commands which will appear in the Run Script submenu must be defined
1238 by .desktop files inside the scripts subdirectory of the current
1239 configuration directory.
1240
1241 Here follows an example of a .desktop file, which could be named for
1242 example, ~/.recoll/scripts/myscript.desktop (the exact file name inside
1243 the directory is irrelevant):
1244
 [Desktop Entry]
 Type=Application
 Name=MyFirstScript
 Exec=/home/me/bin/tryscript %F
 MimeType=*/*
1250
1251
1252 The Name attribute defines the label which will appear inside the Run
1253 Script menu. The Exec attribute defines the program to be run, which does
1254 not need to actually be a script, of course. The MimeType attribute is not
1255 used, but needs to exist.
1256
1257 The commands defined this way can also be used from links inside the
1258 result paragraph.
1259
1260 As an example, it might make sense to write a script which would move the
1261 document to the trash and purge it from the Recoll index.
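
 A possible shape for such a script is sketched below. Several things in it
 are assumptions: the function name, the TRASH_DIR variable and the default
 holding directory are invented, and the files are moved to a plain
 directory rather than to a desktop trash can. The -e option of recollindex
 (erase the index data for the listed files) is not described in this
 section and should be checked against the recollindex man page before use.

```shell
#!/bin/sh
# Hypothetical script body: move result files away and purge them
# from the index. File paths are received as arguments (from %F).
trash_and_purge() {
    holding=${TRASH_DIR:-$HOME/recoll-trash}
    for f in "$@"; do
        [ -f "$f" ] || { echo "skipping $f: not a regular file" >&2; continue; }
        # Note: a file with the same name in the holding directory
        # would be overwritten.
        mkdir -p "$holding" &&
        mv -- "$f" "$holding"/ &&
        recollindex -e "$f"    # assumed option, see the lead-in
    done
}

# When saved as a standalone script, end the file with:
#   trash_and_purge "$@"
```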
1262
1263 3.1.5. Displaying thumbnails
1264
1265 The default format for the result list entries and the detail area of the
1266 result table display an icon for each result document. The icon is either
1267 a generic one determined from the MIME type, or a thumbnail of the
1268 document appearance. Thumbnails are only displayed if found in the
1269 standard freedesktop location, where they would typically have been
1270 created by a file manager.
1271
1272 Recoll has no capability to create thumbnails. A relatively simple trick
1273 is to use the Open parent document/folder entry in the result list popup
1274 menu. This should open a file manager window on the containing directory,
1275 which should in turn create the thumbnails (depending on your settings).
1276 Restarting the search should then display the thumbnails.
1277
1278 There are also some pointers about thumbnail generation on the Recoll
1279 wiki.
1280
1281 3.1.6. The preview window
1282
1283 The preview window opens when you first click a Preview link inside the
1284 result list.
1285
1286 Subsequent preview requests for a given search open new tabs in the
1287 existing window (except if you hold the Shift key while clicking which
1288 will open a new window for side by side viewing).
1289
1290 Starting another search and requesting a preview will create a new preview
1291 window. The old one stays open until you close it.
1292
1293 You can close a preview tab by typing Ctrl-W (Ctrl + W) in the window.
1294 Closing the last tab for a window will also close the window.
1295
 Of course you can also close a preview window by using the window manager
 button at the top of the frame.
1298
1299 You can display successive or previous documents from the result list
1300 inside a preview tab by typing Shift+Down or Shift+Up (Down and Up are the
1301 arrow keys).
1302
1303 A right-click menu in the text area allows switching between displaying
 the main text or the contents of fields associated to the document (ie:
 author, abstract, etc.). This is especially useful in cases where the term
1306 match did not occur in the main text but in one of the fields. In the case
1307 of images, you can switch between three displays: the image itself, the
1308 image metadata as extracted by exiftool and the fields, which is the
1309 metadata stored in the index.
1310
1311 You can print the current preview window contents by typing Ctrl-P (Ctrl +
1312 P) in the window text.
1313
1314 3.1.6.1. Searching inside the preview
1315
1316 The preview window has an internal search capability, mostly controlled by
1317 the panel at the bottom of the window, which works in two modes: as a
1318 classical editor incremental search, where we look for the text entered in
1319 the entry zone, or as a way to walk the matches between the document and
1320 the Recoll query that found it.
1321
1322 Incremental text search
1323
1324 The preview tabs have an internal incremental search function. You
 initiate the search either by typing a / (slash) or Ctrl-F inside
1326 the text area or by clicking into the Search for: text field and
1327 entering the search string. You can then use the Next and Previous
1328 buttons to find the next/previous occurrence. You can also type F3
1329 inside the text area to get to the next occurrence.
1330
 If you have a search string entered and you use Shift+Up/Shift+Down
1332 to browse the results, the search is initiated for each successive
1333 document. If the string is found, the cursor will be positioned at
1334 the first occurrence of the search string.
1335
1336 Walking the match lists
1337
1338 If the entry area is empty when you click the Next or Previous
1339 buttons, the editor will be scrolled to show the next match to any
1340 search term (the next highlighted zone). If you select a search
1341 group from the dropdown list and click Next or Previous, the match
1342 list for this group will be walked. This is not the same as a text
1343 search, because the occurrences will include non-exact matches (as
1344 caused by stemming or wildcards). The search will revert to the
1345 text mode as soon as you edit the entry area.
1346
1347 3.1.7. The Query Fragments window
1348
1349 Selecting the Tools -> Query Fragments menu entry will open a window with
1350 radio- and check-buttons which can be used to activate query language
1351 fragments for filtering the current query. This can be useful if you have
1352 frequent reusable selectors, for example, filtering on alternate
1353 directories, or searching just one category of files, not covered by the
1354 standard category selectors.
1355
1356 The contents of the window are entirely customizable, and defined by the
1357 contents of the fragbuts.xml file inside the configuration directory. The
1358 sample file distributed with Recoll (which you should be able to find
1359 under /usr/share/recoll/examples/fragbuts.xml), contains an example which
1360 filters the results from the WEB history.
1361
1362 Here follows an example:
1363
 <?xml version="1.0" encoding="UTF-8"?>

 <fragbuts version="1.0">

   <radiobuttons>

     <fragbut>
       <label>Include Web Results</label>
       <frag></frag>
     </fragbut>

     <fragbut>
       <label>Exclude Web Results</label>
       <frag>-rclbes:BGL</frag>
     </fragbut>

     <fragbut>
       <label>Only Web Results</label>
       <frag>rclbes:BGL</frag>
     </fragbut>

   </radiobuttons>

   <buttons>

     <fragbut>
       <label>Year 2010</label>
       <frag>date:2010-01-01/2010-12-31</frag>
     </fragbut>

     <fragbut>
       <label>My Great Directory Only</label>
       <frag>dir:/my/great/directory</frag>
     </fragbut>

   </buttons>
 </fragbuts>
1401
 Each radiobuttons or buttons section defines a line of radiobuttons or
 checkbuttons, respectively, inside the window. Any number of checkbuttons
 can be selected, but the radiobuttons in a line are exclusive.
1405
1406 Each fragbut section defines the label for a button, and the Query
1407 Language fragment which will be added (as an AND filter) before performing
1408 the query if the button is active.
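
 Since the file is plain XML, repetitive sections can be generated. The
 following sketch prints a buttons section with one date filter per year,
 for pasting inside the fragbuts element. The function names are
 hypothetical, and labels or fragments containing characters such as < or &
 would need XML escaping, which is not done here.

```shell
# Hypothetical helper: print one fragbut element.
# Usage: fragbut LABEL FRAGMENT
fragbut() {
    printf '  <fragbut>\n    <label>%s</label>\n    <frag>%s</frag>\n  </fragbut>\n' "$1" "$2"
}

# Print a buttons section with one checkbutton per year.
# Usage: year_buttons FIRST LAST
year_buttons() {
    echo '<buttons>'
    y=$1
    while [ "$y" -le "$2" ]; do
        fragbut "Year $y" "date:$y-01-01/$y-12-31"
        y=$((y + 1))
    done
    echo '</buttons>'
}
```

 For example, year_buttons 2010 2012 prints three fragbut elements, using
 the same date: syntax as the Year 2010 button in the sample file.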
1409
1410 This feature is new in Recoll 1.20, and will probably be refined depending
1411 on user feedback.
1412
1413 3.1.8. Complex/advanced search
1414
1415 The advanced search dialog helps you build more complex queries without
1416 memorizing the search language constructs. It can be opened through the
1417 Tools menu or through the main toolbar.
1418
1419 Recoll keeps a history of searches. See Advanced search history.
1420
1421 The dialog has two tabs:
1422
1423 1. The first tab lets you specify terms to search for, and permits
1424 specifying multiple clauses which are combined to build the search.
1425
 2. The second tab lets you filter the results according to file size,
 date of modification, MIME type, or location.
1428
1429 Click on the Start Search button in the advanced search dialog, or type
1430 Enter in any text field to start the search. The button in the main window
1431 always performs a simple search.
1432
1433 Click on the Show query details link at the top of the result page to see
1434 the query expansion.
1435
1436 3.1.8.1. Advanced search: the "find" tab
1437
 This part of the dialog lets you construct a query by combining multiple
1439 clauses of different types. Each entry field is configurable for the
1440 following modes:
1441
1442 o All terms.
1443
1444 o Any term.
1445
1446 o None of the terms.
1447
1448 o Phrase (exact terms in order within an adjustable window).
1449
1450 o Proximity (terms in any order within an adjustable window).
1451
1452 o Filename search.
1453
1454 Additional entry fields can be created by clicking the Add clause button.
1455
1456 When searching, the non-empty clauses will be combined either with an AND
1457 or an OR conjunction, depending on the choice made on the left (All
1458 clauses or Any clause).
1459
1460 Entries of all types except "Phrase" and "Near" accept a mix of single
1461 words and phrases enclosed in double quotes. Stemming and wildcard
1462 expansion will be performed as for simple search.
1463
1464 Phrases and Proximity searches. These two clauses work in similar ways,
1465 with the difference that proximity searches do not impose an order on the
1466 words. In both cases, an adjustable number (slack) of non-matched words
1467 may be accepted between the searched ones (use the counter on the left to
adjust this count). For phrases, the default count is zero (exact match).
For proximity it is ten (meaning that two search terms would be matched
if found within a window of twelve words). Examples: a phrase search for
quick fox with a slack of 0 will match quick fox but not quick brown fox.
With a slack of 1 it will match the latter, but not fox quick. A proximity
search for quick fox with the default slack will match fox quick, and
also a fox is a cunning and quick animal.
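
The slack rules can be illustrated with a short Python sketch. This is
only a toy model of the behaviour described in this section, not the
matching code used by Recoll or Xapian: the window_match name, the
whitespace tokenization and the exact-word comparison are all assumptions
made for the example.

```python
from itertools import product


def window_match(text, terms, slack, ordered):
    """Toy model: do all terms occur inside a window of len(terms) + slack words?

    ordered=True imitates a phrase clause (terms must appear in query order),
    ordered=False imitates a proximity clause (any order).
    """
    if not terms:
        return False
    words = text.lower().split()
    window = len(terms) + slack
    # Word positions at which each search term occurs in the text.
    positions = [[i for i, w in enumerate(words) if w == t.lower()]
                 for t in terms]
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # the same word position was used for two terms
        if ordered and list(combo) != sorted(combo):
            continue  # a phrase clause requires the query order
        if max(combo) - min(combo) + 1 <= window:
            return True
    return False


if __name__ == "__main__":
    query = ["quick", "fox"]
    for text, slack, ordered in [
        ("quick fox", 0, True),
        ("quick brown fox", 0, True),
        ("quick brown fox", 1, True),
        ("fox quick", 1, True),
        ("a fox is a cunning and quick animal", 10, False),
    ]:
        kind = "phrase" if ordered else "proximity"
        print(kind, slack, repr(text), window_match(text, query, slack, ordered))
```

With two search terms and a slack of ten, the window computed by the
sketch is twelve words, which is the figure given above.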
1475
1476 3.1.8.2. Advanced search: the "filter" tab
1477
This part of the dialog has several sections which allow filtering the
results of a search according to a number of criteria:
1480
1481 o The first section allows filtering by dates of last modification. You
1482 can specify both a minimum and a maximum date. The initial values are
1483 set according to the oldest and newest documents found in the index.
1484
1485 o The next section allows filtering the results by file size. There are
1486 two entries for minimum and maximum size. Enter decimal numbers. You
1487 can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
1488 respectively.
1489
o The next section allows filtering the results by their MIME types, or
  MIME categories (e.g. media, text, message).
1492
1493 You can transfer the types between two boxes, to define which will be
1494 included or excluded by the search.
1495
1496 The state of the file type selection can be saved as the default (the
1497 file type filter will not be activated at program start-up, but the
1498 lists will be in the restored state).
1499
o The bottom section allows restricting the search results to a sub-tree
  of the indexed area. You can use the Invert checkbox to search for
  files not in the sub-tree instead. If you use directory filtering
  often and on big subsets of the file system, you may want to consider
  setting up multiple indexes instead, as the performance may be better.

  You can use relative/partial paths for filtering. For example, entering
  dirA/dirB would match either /dir1/dirA/dirB/myfile1 or
  /dir2/dirA/dirB/someother/myfile2.
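
The size suffixes and the partial path matching described in this list can
be sketched in Python as follows. The parse_size and path_filter_matches
names are hypothetical, and the code only models the rules stated above
(decimal multipliers, consecutive directory components). It is not taken
from the Recoll sources.

```python
# Decimal multipliers as listed above: k/K, m/M, g/G, t/T.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size entry such as '100k' or '1.5M' into a number of bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        return int(float(text[:-1]) * MULTIPLIERS[suffix])
    return int(float(text))


def path_filter_matches(filter_path, file_path):
    """True if the filter components appear consecutively among the
    directory components of file_path (the file name itself is ignored)."""
    wanted = [c for c in filter_path.split("/") if c]
    dirs = [c for c in file_path.split("/") if c][:-1]
    n = len(wanted)
    return any(dirs[i:i + n] == wanted for i in range(len(dirs) - n + 1))


if __name__ == "__main__":
    print(parse_size("100k"), parse_size("2G"))
    print(path_filter_matches("dirA/dirB", "/dir1/dirA/dirB/myfile1"))
    print(path_filter_matches("dirA/dirB", "/dir2/dirA/dirB/someother/myfile2"))
```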
1509
1510 3.1.8.3. Advanced search history
1511
1512 The advanced search tool memorizes the last 100 searches performed. You
1513 can walk the saved searches by using the up and down arrow keys while the
1514 keyboard focus belongs to the advanced search dialog.
1515
1516 The complex search history can be erased, along with the one for simple
1517 search, by selecting the File -> Erase Search History menu entry.
1518
1519 3.1.9. The term explorer tool
1520
Recoll automatically manages the expansion of search terms to their
derivatives (e.g. plural/singular, verb inflections). But there are other
cases where the exact search term is not known. For example, you may not
remember the exact spelling, or only know the beginning of the name.
1525
The search will only propose replacement terms with spelling variations
when no matching documents were found. In some cases, both proper
spellings and misspellings are present in the index, and it may be
interesting to look for them explicitly.
1530
The term explorer tool (started from the toolbar icon or from the Term
explorer entry of the Tools menu) can be used to search the full index
terms list. It has four modes of operation:
1534
1535 Wildcard
1536
In this mode of operation, you can enter a search string with
shell-like wildcards (*, ?, []). For example, xapi* would display all
index terms beginning with xapi. (See the section about wildcards for
more details.)
1540
1541 Regular expression
1542
This mode will accept a regular expression as input. Example:
word[0-9]+. The expression is implicitly anchored at the
beginning. For example, press will match pression but not expression.
You can use .*press to match the latter, but be aware that this will
cause a full index term list scan, which can take quite a long time.
1548
1549 Stem expansion
1550
This mode will perform the usual stem expansion normally done as
part of user input processing. As such it is probably mostly useful
to demonstrate the process.
1554
1555 Spelling/Phonetic
1556
In this mode, you enter the term as you think it is spelled, and
Recoll will do its best to find index terms that sound like your
entry. This mode uses the Aspell spelling application, which must
be installed on your system for things to work (if your documents
contain non-ASCII characters, Recoll needs an Aspell version newer
than 0.60 for UTF-8 support). The language which is used to build
the dictionary out of the index terms (which is done at the end of
an indexing pass) is the one defined by your NLS environment.
Weird things will probably happen if languages are mixed up.
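
The difference between the wildcard and regular expression modes can be
tried out on a small term list with the Python standard library. This is a
sketch only: the TERMS list is made up, and fnmatch and re merely imitate
the behaviour described above (shell-like wildcards, and an expression
anchored at the beginning of the term). They are not what Recoll uses
internally.

```python
import fnmatch
import re

# A made-up stand-in for the index term list.
TERMS = ["expression", "press", "pression", "word12", "words",
         "xapian", "xapiandb"]


def wildcard_expand(pattern, terms=TERMS):
    """Terms matching a shell-like wildcard pattern (*, ?, [])."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def regex_expand(pattern, terms=TERMS):
    """Terms matching a regular expression anchored at the term start."""
    rx = re.compile(pattern)
    # re.match() only succeeds if the match begins at the first character.
    return [t for t in terms if rx.match(t)]


if __name__ == "__main__":
    print(wildcard_expand("xapi*"))
    print(regex_expand("press"))
    print(regex_expand(".*press"))
    print(regex_expand("word[0-9]+"))
```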
1566
Note that in cases where Recoll does not know the beginning of the string
to search for (e.g. a wildcard expression like *coll), the expansion can
take quite a long time because the full index term list will have to be
processed. The expansion is currently limited to 10000 results for
wildcards and regular expressions. It is possible to change the limit in
the configuration file.
1573
1574 Double-clicking on a term in the result list will insert it into the
1575 simple search entry field. You can also cut/paste between the result list
1576 and any entry field (the end of lines will be taken care of).
1577
1578 3.1.10. Multiple indexes
1579
1580 See the section describing the use of multiple indexes for generalities.
1581 Only the aspects concerning the recoll GUI are described here.
1582
1583 A recoll program instance is always associated with a specific index,
1584 which is the one to be updated when requested from the File menu, but it
1585 can use any number of Recoll indexes for searching. The external indexes
1586 can be selected through the external indexes tab in the preferences
1587 dialog.
1588
Index selection is performed in two phases. A set of all usable indexes
must first be defined, and then the subset of indexes to be used for
searching. These parameters are retained across program executions (they
are kept separately for each Recoll configuration). The set of all indexes
is usually quite stable, while the active ones might typically be adjusted
quite frequently.
1595
1596 The main index (defined by RECOLL_CONFDIR) is always active. If this is
1597 undesirable, you can set up your base configuration to index an empty
1598 directory.
1599
1600 When adding a new index to the set, you can select either a Recoll
1601 configuration directory, or directly a Xapian index directory. In the
1602 first case, the Xapian index directory will be obtained from the selected
1603 configuration.
1604
As building the set of all indexes can be a little tedious when done
through the user interface, you can use the RECOLL_EXTRA_DBS environment
variable to provide an initial set. This might typically be set up by a
system administrator so that every user does not have to do it. The
variable should define a colon-separated list of index directories, for
example:
1610
    export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
1612
Another environment variable, RECOLL_ACTIVE_EXTRA_DBS, allows adding to
the active list of indexes. This variable was suggested and implemented by
a Recoll user. It is mostly useful if you use scripts to mount external
volumes with Recoll indexes. By using RECOLL_EXTRA_DBS and
RECOLL_ACTIVE_EXTRA_DBS, you can add and activate the index for the
mounted volume when starting recoll.
1619
1620 RECOLL_ACTIVE_EXTRA_DBS is available for Recoll versions 1.17.2 and later.
1621 A change was made in the same update so that recoll will automatically
1622 deactivate unreachable indexes when starting up.
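
A mount script could prepare both variables before starting the GUI. The
following Python sketch shows one way to build the environment. The
recoll_env helper and the example paths are hypothetical, and the sketch
assumes that RECOLL_ACTIVE_EXTRA_DBS uses the same colon-separated format
as RECOLL_EXTRA_DBS, which this manual does not state explicitly.

```python
import os


def recoll_env(all_dbs, active_dbs, base_env=None):
    """Return a copy of the environment with both index variables set.

    all_dbs:    index directories to add to the set of usable indexes.
    active_dbs: index directories to activate for searching (assumed to
                use the same colon-separated format).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RECOLL_EXTRA_DBS"] = ":".join(all_dbs)
    env["RECOLL_ACTIVE_EXTRA_DBS"] = ":".join(active_dbs)
    return env


if __name__ == "__main__":
    # Hypothetical index locations on a freshly mounted volume.
    env = recoll_env(
        ["/some/place/xapiandb", "/mnt/volume/xapiandb"],
        ["/mnt/volume/xapiandb"],
    )
    print(env["RECOLL_EXTRA_DBS"])
    print(env["RECOLL_ACTIVE_EXTRA_DBS"])
    # The GUI could then be started with this environment, for example:
    #   subprocess.Popen(["recoll"], env=env)
```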
1623
1624 3.1.11. Document history
1625
1626 Documents that you actually view (with the internal preview or an external
1627 tool) are entered into the document history, which is remembered.
1628
1629 You can display the history list by using the Tools/Doc History menu
1630 entry.
1631
1632 You can erase the document history by using the Erase document history
1633 entry in the File menu.
1634
1635 3.1.12. Sorting search results and collapsing duplicates
1636
The documents in a result list are normally sorted in order of relevance.
It is possible to specify a different sort order, either by using the
vertical arrows in the GUI toolbox to sort by date, or by switching to the
result table display and clicking on any header. The sort order chosen
inside the result table remains active if you switch back to the result
list, until you click the vertical arrows so that both are unchecked (you
are then back to sorting by relevance).
1644
1645 Sort parameters are remembered between program invocations, but result
1646 sorting is normally always inactive when the program starts. It is
1647 possible to keep the sorting activation state between program invocations
1648 by checking the Remember sort activation state option in the preferences.
1649
It is also possible to hide duplicate entries inside the result list
(documents with the exact same contents as the displayed one). The test of
identity is based on an MD5 hash of the document container, not only of
the text contents (so that, for example, a text document with an image
added will not be a duplicate of the text only). Duplicate hiding is
controlled by an entry in the GUI configuration dialog, and is off by
default.
1656
1657 As of release 1.19, when a result document does have undisplayed
1658 duplicates, a Dups link will be shown with the result list entry. Clicking
1659 the link will display the paths (URLs + ipaths) for the duplicate entries.
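
The idea of comparing containers by MD5 hash can be illustrated with the
Python standard library. This is a minimal sketch of the general technique
(hash each file, group equal digests), with hypothetical function names.
It is not the code Recoll runs, and Recoll computes its hashes at indexing
time rather than at display time.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """MD5 hex digest of the whole file (the container, not just its text)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Return lists of paths whose contents hash to the same MD5 value."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Calling group_duplicates() on a list of file paths returns one list per
set of identical files. A file with any added content (such as an
embedded image) gets a different digest and is not reported.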
1660
1661 3.1.13. Search tips, shortcuts
1662
1663 3.1.13.1. Terms and search expansion
1664
1665 Term completion. Typing Esc Space in the simple search entry field while
1666 entering a word will either complete the current word if its beginning
1667 matches a unique term in the index, or open a window to propose a list of
1668 completions.
1669
1670 Picking up new terms from result or preview text. Double-clicking on a
1671 word in the result list or in a preview window will copy it to the simple
1672 search entry field.
1673
Wildcards. Wildcards can be used inside search terms in all forms of
searches. See the section about wildcards for details.
1676
1677 Automatic suffixes. Words like odt or ods can be automatically turned into
1678 query language ext:xxx clauses. This can be enabled in the Search
1679 preferences panel in the GUI.
1680
1681 Disabling stem expansion. Entering a capitalized word in any search field
1682 will prevent stem expansion (no search for gardening if you enter Garden
1683 instead of garden). This is the only case where character case should make
1684 a difference for a Recoll search. You can also disable stem expansion or
1685 change the stemming language in the preferences.
1686
Finding related documents. Selecting the Find similar documents entry in
the result list paragraph right-click menu will select a set of
"interesting" terms from the current result, and insert them into the
simple search entry field. You can then possibly edit the list and start a
search to find documents which may be related to the current result.
1692
1693 File names. File names are added as terms during indexing, and you can
1694 specify them as ordinary terms in normal search fields (Recoll used to
1695 index all directories in the file path as terms. This has been abandoned
1696 as it did not seem really useful). Alternatively, you can use the specific
1697 file name search which will only look for file names, and may be faster
1698 than the generic search especially when using wildcards.
1699
1700 3.1.13.2. Working with phrases and proximity
1701
Phrases and Proximity searches. A phrase can be looked for by enclosing it
in double quotes. Example: "user manual" will look only for occurrences of
user immediately followed by manual. You can use the This phrase field of
the advanced search dialog to the same effect. Phrases can be entered
along with simple terms in all simple or advanced search entry fields
(except This exact phrase).
1708
AutoPhrases. This option can be set in the preferences dialog. If it is
set, a phrase will be automatically built and added to simple searches
when looking for Any terms. This will not radically change the results,
but will give a relevance boost to the results where the search terms
appear as a phrase. For example, searching for virtual reality will still
find all documents where either virtual or reality or both appear, but
those which contain virtual reality should appear sooner in the list.
1716
1717 Phrase searches can strongly slow down a query if most of the terms in the
1718 phrase are common. This is why the autophrase option is off by default for
1719 Recoll versions before 1.17. As of version 1.17, autophrase is on by
1720 default, but very common terms will be removed from the constructed
1721 phrase. The removal threshold can be adjusted from the search preferences.
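
The effect of the removal threshold can be sketched as follows. The
build_autophrase function, the term frequencies and the threshold value
are all made up for the illustration. How Recoll actually computes the
frequencies and assembles the phrase is not described here, so this is a
model of the idea rather than of the implementation.

```python
def build_autophrase(terms, doc_freq, total_docs, max_percent):
    """Build a quoted phrase from the terms which are not too common.

    doc_freq maps a term to the number of documents containing it. A term
    is dropped when it appears in more than max_percent of the documents.
    Returns None when fewer than two terms remain (no phrase to build).
    """
    if total_docs <= 0:
        return None
    kept = [t for t in terms
            if 100.0 * doc_freq.get(t, 0) / total_docs <= max_percent]
    if len(kept) < 2:
        return None
    return '"' + " ".join(kept) + '"'


if __name__ == "__main__":
    freqs = {"the": 950, "virtual": 20, "reality": 30}
    print(build_autophrase(["the", "virtual", "reality"], freqs, 1000, 5.0))
```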
1722
Phrases and abbreviations. As of Recoll version 1.17, dotted abbreviations
like I.B.M. are also automatically indexed as a word without the dots:
IBM. Searching for the word inside a phrase (e.g. "the IBM company") will
only match the dotted abbreviation if you increase the phrase slack (using
the advanced search panel control, or the o query language modifier).
Literal occurrences of the word will be matched normally.
1729
1730 3.1.13.3. Others
1731
1732 Using fields. You can use the query language and field specifications to
1733 only search certain parts of documents. This can be especially helpful
1734 with email, for example only searching emails from a specific originator:
1735 search tips from:helpfulgui
1736
1737 Adjusting the result table columns. When displaying results in table mode,
1738 you can use a right click on the table headers to activate a pop-up menu
1739 which will let you adjust what columns are displayed. You can drag the
1740 column headers to adjust their order. You can click them to sort by the
1741 field displayed in the column. You can also save the result list in CSV
1742 format.
1743
1744 Changing the GUI geometry. It is possible to configure the GUI in wide
1745 form factor by dragging the toolbars to one of the sides (their location
1746 is remembered between sessions), and moving the category filters to a menu
1747 (can be set in the Preferences -> GUI configuration -> User interface
1748 panel).
1749
1750 Query explanation. You can get an exact description of what the query
1751 looked for, including stem expansion, and Boolean operators used, by
1752 clicking on the result list header.
1753
1754 Advanced search history. As of Recoll 1.18, you can display any of the
1755 last 100 complex searches performed by using the up and down arrow keys
1756 while the advanced search panel is active.
1757
1758 Browsing the result list inside a preview window. Entering Shift-Down or
1759 Shift-Up (Shift + an arrow key) in a preview window will display the next
1760 or the previous document from the result list. Any secondary search
1761 currently active will be executed on the new document.
1762
1763 Scrolling the result list from the keyboard. You can use PageUp and
1764 PageDown to scroll the result list, Shift+Home to go back to the first
1765 page. These work even while the focus is in the search entry.
1766
1767 Result table: moving the focus to the table. You can use Ctrl-r to move
1768 the focus from the search entry to the table, and then use the arrow keys
1769 to change the current row. Ctrl-Shift-s returns to the search.
1770
1771 Result table: open / preview. With the focus in the result table, you can
1772 use Ctrl-o to open the document from the current row, Ctrl-Shift-o to open
1773 the document and close recoll, Ctrl-d to preview the document.
1774
1775 Editing a new search while the focus is not in the search entry. You can
1776 use the Ctrl-Shift-S shortcut to return the cursor to the search entry
1777 (and select the current search text), while the focus is anywhere in the
1778 main window.
1779
1780 Forced opening of a preview window. You can use Shift+Click on a result
1781 list Preview link to force the creation of a preview window instead of a
1782 new tab in the existing one.
1783
1784 Closing previews. Entering Ctrl-W in a tab will close it (and, for the
1785 last tab, close the preview window). Entering Esc will close the preview
1786 window and all its tabs.
1787
1788 Printing previews. Entering Ctrl-P in a preview window will print the
1789 currently displayed text.
1790
1791 Quitting. Entering Ctrl-Q almost anywhere will close the application.
1792
1793 3.1.14. Saving and restoring queries (1.21 and later)
1794
1795 Both simple and advanced query dialogs save recent history, but the amount
1796 is limited: old queries will eventually be forgotten. Also, important
1797 queries may be difficult to find among others. This is why both types of
1798 queries can also be explicitly saved to files, from the GUI menus: File
1799 -> Save last query / Load last query
1800
1801 The default location for saved queries is a subdirectory of the current
1802 configuration directory, but saved queries are ordinary files and can be
1803 written or moved anywhere.
1804
Some of the saved query parameters are part of the preferences (e.g.
autophrase or the active external indexes), and their values when the
query is loaded may differ from what they were when it was saved. In this
case, Recoll will warn of the differences, but will not change the user
preferences.
1809
1810 3.1.15. Customizing the search interface
1811
1812 You can customize some aspects of the search interface by using the GUI
1813 configuration entry in the Preferences menu.
1814
1815 There are several tabs in the dialog, dealing with the interface itself,
1816 the parameters used for searching and returning results, and what indexes
1817 are searched.
1818
1819 User interface parameters:
1820
o Highlight color for query terms: Terms from the user query are
  highlighted in the result list samples and the preview window. The
  color can be chosen here. Any Qt color string should work (e.g. red,
  #ff0000). The default is blue.
1825
1826 o Style sheet: The name of a Qt style sheet text file which is applied
1827 to the whole Recoll application on startup. The default value is
1828 empty, but there is a skeleton style sheet (recoll.qss) inside the
1829 /usr/share/recoll/examples directory. Using a style sheet, you can
1830 change most recoll graphical parameters: colors, fonts, etc. See the
1831 sample file for a few simple examples.
1832
1833 You should be aware that parameters (e.g.: the background color) set
1834 inside the Recoll GUI style sheet will override global system
1835 preferences, with possible strange side effects: for example if you
1836 set the foreground to a light color and the background to a dark one
1837 in the desktop preferences, but only the background is set inside the
1838 Recoll style sheet, and it is light too, then text will appear
1839 light-on-light inside the Recoll GUI.
1840
o Maximum text size highlighted for preview: inserting highlights on
  search terms inside the text before inserting it in the preview window
  involves quite a lot of processing, and can be disabled above the given
  text size to speed up loading.
1845
o Prefer HTML to plain text for preview: if set, Recoll will display HTML
  as such inside the preview window. If this causes problems with the Qt
  HTML display, you can uncheck it to display the plain text version
  instead.
1850
o Plain text to HTML line style: when displaying plain text inside the
  preview window, Recoll tries to preserve some of the original text
  line breaks and indentation. It can either use PRE HTML tags, which
  preserve the indentation well but force horizontal scrolling for long
  lines, or use BR tags to break at the original line breaks, which lets
  the editor introduce other line breaks according to the window width,
  but loses some of the original indentation. The third option has been
  available in recent releases and is probably now the best one: use PRE
  tags with line wrapping.
1860
o Choose editor applications: this opens a dialog which allows you to
  select the application to be used to open each MIME type. The default
  is normally to use the xdg-open utility, but you can override it.
1864
o Exceptions: even when xdg-open is used by default for opening
  documents, you can set exceptions for MIME types that will still be
  opened according to Recoll preferences. This is useful for passing
  parameters like page numbers or search strings to applications that
  support them (e.g. evince). This cannot be done with xdg-open which
  only supports passing one parameter.
1871
1872 o Document filter choice style: this will let you choose if the document
1873 categories are displayed as a list or a set of buttons, or a menu.
1874
o Start with simple search mode: this lets you choose the value of the
  simple search type on program startup: either a fixed value (e.g.
  Query Language), or the value in use when the program last exited.
1878
o Auto-start simple search on white space entry: if this is checked, a
  search will be executed each time you enter a space in the simple
  search input field. This lets you look at the result list as you enter
  new terms. This is off by default; you may like it or not...
1883
o Start with advanced search dialog open: if you use this dialog
  frequently, checking this entry will get it to open when recoll
  starts.
1887
o Remember sort activation state: if set, Recoll will remember the sort
  tool state between invocations. It normally starts with sorting
  disabled.
1891
1892 Result list parameters:
1893
1894 o Number of results in a result page
1895
1896 o Result list font: There is quite a lot of information shown in the
1897 result list, and you may want to customize the font and/or font size.
1898 The rest of the fonts used by Recoll are determined by your generic Qt
1899 config (try the qtconfig command).
1900
1901 o Edit result list paragraph format string: allows you to change the
1902 presentation of each result list entry. See the result list
1903 customisation section.
1904
1905 o Edit result page HTML header insert: allows you to define text
1906 inserted at the end of the result page HTML header. More detail in the
1907 result list customisation section.
1908
1909 o Date format: allows specifying the format used for displaying dates
1910 inside the result list. This should be specified as an strftime()
1911 string (man strftime).
1912
1913 o Abstract snippet separator: for synthetic abstracts built from index
1914 data, which are usually made of several snippets from different parts
1915 of the document, this defines the snippet separator, an ellipsis by
1916 default.
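
Date format strings can be tried out before entering them in the dialog.
Python's time.strftime() accepts the same kind of conversion
specifications as the C strftime() function, so a short script is enough.
The format_result_date helper is hypothetical, and the sketch formats in
UTC only to keep the output reproducible.

```python
import time


def format_result_date(epoch_seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format a Unix timestamp with an strftime() format string (UTC)."""
    return time.strftime(fmt, time.gmtime(epoch_seconds))


if __name__ == "__main__":
    stamp = 86400  # one day after the Unix epoch
    for fmt in ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y", "%Y%m%d"):
        print(fmt, "->", format_result_date(stamp, fmt))
```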
1917
1918 Search parameters:
1919
1920 o Hide duplicate results: decides if result list entries are shown for
1921 identical documents found in different places.
1922
o Stemming language: stemming obviously depends on the document's
  language. This listbox will let you choose among the stemming databases
  which were built during indexing (this is set in the main
  configuration file), or later added with recollindex -s (see the
  recollindex manual). Stemming languages which are dynamically added
  will be deleted at the next indexing pass unless they are also added
  in the configuration file.
1930
1931 o Automatically add phrase to simple searches: a phrase will be
1932 automatically built and added to simple searches when looking for Any
1933 terms. This will give a relevance boost to the results where the
1934 search terms appear as a phrase (consecutive and in order).
1935
1936 o Autophrase term frequency threshold percentage: very frequent terms
1937 should not be included in automatic phrase searches for performance
1938 reasons. The parameter defines the cutoff percentage (percentage of
1939 the documents where the term appears).
1940
1941 o Replace abstracts from documents: this decides if we should synthesize
1942 and display an abstract in place of an explicit abstract found within
1943 the document itself.
1944
1945 o Dynamically build abstracts: this decides if Recoll tries to build
1946 document abstracts (lists of snippets) when displaying the result
1947 list. Abstracts are constructed by taking context from the document
1948 information, around the search terms.
1949
1950 o Synthetic abstract size: adjust to taste...
1951
1952 o Synthetic abstract context words: how many words should be displayed
1953 around each term occurrence.
1954
o Query language magic file name suffixes: a list of words which
  automatically get turned into ext:xxx file name suffix clauses when
  starting a query language query (e.g. doc xls xlsx...). This will save
  some typing for people who use file types a lot when querying.
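
The magic suffix substitution can be pictured with the following sketch.
It only models the rule stated above (a word from the configured list
becomes an ext: clause). The apply_magic_suffixes name is hypothetical,
and details such as quoting or case handling in the real query language
processing are not covered.

```python
def apply_magic_suffixes(query, suffixes):
    """Turn words found in the suffix list into ext:xxx clauses."""
    rewritten = []
    for word in query.split():
        if word in suffixes:
            rewritten.append("ext:" + word)
        else:
            rewritten.append(word)
    return " ".join(rewritten)


if __name__ == "__main__":
    print(apply_magic_suffixes("budget xls", {"doc", "xls", "xlsx"}))
```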
1959
External indexes: This panel will let you browse for additional indexes
that you may want to search. External indexes are designated by their
database directory (e.g. /home/someothergui/.recoll/xapiandb,
/usr/local/recollglobal/xapiandb).
1964
Once entered, the indexes will appear in the External indexes list, and
you can choose which ones you want to use at any moment by checking or
unchecking their entries.
1968
Your main database (the one the current configuration indexes to) is
always implicitly active. If this is not desirable, you can set up your
configuration so that it indexes, for example, an empty directory. An
alternative indexer may also need to implement a way of purging the index
from stale data.
1974
1975 3.1.15.1. The result list format
1976
Newer versions of Recoll (from 1.17) normally use WebKit HTML widgets for
the result list and the snippets window (this may be disabled at build
time). Total customisation is possible with full support for CSS and
JavaScript. Conversely, there are limits to what you can do with the older
Qt QTextBrowser, but still, it is possible to decide what data each result
will contain, and how it will be displayed.
1983
1984 The result list presentation can be exhaustively customized by adjusting
1985 two elements:
1986
1987 o The paragraph format
1988
1989 o HTML code inside the header section. For versions 1.21 and later, this
1990 is also used for the snippets window
1991
1992 The paragraph format and the header fragment can be edited from the Result
1993 list tab of the GUI configuration.
1994
1995 The header fragment is used both for the result list and the snippets
1996 window. The snippets list is a table and has a snippets class attribute.
1997 Each paragraph in the result list is a table, with class respar, but this
1998 can be changed by editing the paragraph format.
1999
2000 There are a few examples on the page about customising the result list on
2001 the Recoll web site.
2002
2003 The paragraph format
2004
2005 This is an arbitrary HTML string where the following printf-like %
2006 substitutions will be performed:
2007
2008 o %A. Abstract
2009
2010 o %D. Date
2011
2012 o %I. Icon image name. This is normally determined from the MIME type.
2013 The associations are defined inside the mimeconf configuration file.
2014 If a thumbnail for the file is found at the standard Freedesktop
2015 location, this will be displayed instead.
2016
2017 o %K. Keywords (if any)
2018
2019 o %L. Precooked Preview, Edit, and possibly Snippets links
2020
2021 o %M. MIME type
2022
2023 o %N. result Number inside the result page
2024
2025 o %P. Parent folder Url. In the case of an embedded document, this is
2026 the parent folder for the top level container file.
2027
2028 o %R. Relevance percentage
2029
2030 o %S. Size information
2031
2032 o %T. Title or Filename if not set.
2033
2034 o %t. Title or Filename if not set.
2035
2036 o %U. Url
2037
The format of the Preview, Edit, and Snippets links is <a href="P%N">, <a
href="E%N"> and <a href="A%N"> respectively, where %N expands to the
document number inside the result page.
2041
2042 A link target defined as "F%N" will open the document corresponding to the
2043 %P parent folder expansion, usually creating a file manager window on the
2044 folder where the container file resides. E.g.:
2045
    <a href="F%N">%P</a>
2047
2048 A link target defined as R%N|scriptname will run the corresponding script
2049 on the result file (if the document is embedded, the script will be
2050 started on the top-level parent). See the section about defining scripts.
2051
In addition to the predefined values above, all strings like %(fieldname)
will be replaced by the value of the field named fieldname for this
document. Only stored fields can be accessed in this way: the value of
indexed but not stored fields is not known at this point in the search
process (see field configuration). There are currently very few fields
stored by default, apart from the values above (only author and filename),
so this feature will need some custom local configuration to be useful. An
example candidate would be the recipient field which is generated by the
message input handlers.
2061
2062 The default value for the paragraph format string is:
2063
    "<table class=\"respar\">\n"
    "<tr>\n"
    "<td><a href='%U'><img src='%I' width='64'></a></td>\n"
    "<td>%L <i>%S</i> <b>%T</b><br>\n"
    "<span style='white-space:nowrap'><i>%M</i> %D</span> <i>%U</i> %i<br>\n"
    "%A %K</td>\n"
    "</tr></table>\n"
2071
2072 You may, for example, try the following for a more web-like experience:
2073
    <u><b><a href="P%N">%T</a></b></u><br>
    %A<font color=#008000>%U - %S</font> - %L
2076
2077 Note that the P%N link in the above paragraph makes the title a preview
2078 link. Or the clean looking:
2079
    <img src="%I" align="left">%L <font color="#900000">%R</font>
    <b>%T&</b><br>%S
    <font color="#808080"><i>%U</i></font>
    <table bgcolor="#e0e0e0">
    <tr><td><div>%A</div></td></tr>
    </table>%K
2086
2087 These samples, and some others are on the web site, with pictures to show
2088 how they look.
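
To get a feeling for how a paragraph format expands, the substitution
mechanism can be imitated in a few lines of Python. This is a sketch under
stated assumptions, not Recoll's implementation: the expand_paragraph
function and the sample values are made up, unknown %X letters are left
untouched, and missing %(fieldname) fields expand to an empty string.

```python
import re

# Matches either %(fieldname) or a single-letter substitution like %T.
SUBST = re.compile(r"%(?:\((?P<field>[^)]*)\)|(?P<letter>[A-Za-z]))")


def expand_paragraph(fmt, values, fields=None):
    """Expand %X and %(fieldname) substitutions in a paragraph format.

    values: maps substitution letters (e.g. "T", "U") to strings.
    fields: maps stored field names (e.g. "author") to strings.
    """
    fields = fields or {}

    def replace(match):
        if match.group("field") is not None:
            return fields.get(match.group("field"), "")
        # Leave letters without a known value as they were written.
        return values.get(match.group("letter"), match.group(0))

    return SUBST.sub(replace, fmt)


if __name__ == "__main__":
    # The "web-like" format shown above, with invented sample values.
    fmt = ('<u><b><a href="P%N">%T</a></b></u><br>\n'
           '%A<font color=#008000>%U - %S</font> - %L')
    values = {"N": "3", "T": "Recoll user manual", "A": "Sample abstract. ",
              "U": "file:///home/me/manual.html", "S": "120 KB",
              "L": "Preview Edit"}
    print(expand_paragraph(fmt, values))
```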
2089
2090 It is also possible to define the value of the snippet separator inside
2091 the abstract section.
2092
3.2. Searching with the KDE KIO slave
2094
2095 3.2.1. What's this
2096
2097 The Recoll KIO slave allows performing a Recoll search by entering an
2098 appropriate URL in a KDE open dialog, or with an HTML-based interface
2099 displayed in Konqueror.
2100
2101 The HTML-based interface is similar to the Qt-based interface, but
2102 slightly less powerful for now. Its advantage is that you can perform your
2103 search while staying fully within the KDE framework: drag and drop from
2104 the result list works normally and you have your normal choice of
2105 applications for opening files.
2106
2107 The alternative interface uses a directory view of search results. Due to
2108 limitations in the current KIO slave interface, it is currently not
2109 obviously useful (to me).
2110
2111 The interface is described in more detail inside a help file which you can
2112 access by entering recoll:/ inside the Konqueror URL line (this works only
2113 if the recoll KIO slave has been previously installed).
2114
2115 The instructions for building this module are located in the source tree.
2116 See: kde/kio/recoll/00README.txt. Some Linux distributions do package the
2117 kio-recoll module, so check before diving into the build process: it may
2118 already be available, ready for one-click installation.
2119
2120 3.2.2. Searchable documents
2121
2122 As a sample application, the Recoll KIO slave could allow preparing a set
2123 of HTML documents (for example a manual) so that they become their own
2124 search interface inside Konqueror.
2125
2126 This can be done by either explicitly inserting <a href="recoll://...">
2127 links around some document areas, or automatically by adding a very small
2128 JavaScript program to the documents, like the following example, which
2129 would initiate a search by double-clicking any term:
2130
2131 <script language="JavaScript">
2132 function recollsearch() {
2133 var t = document.getSelection();
2134 window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
2135 encodeURIComponent(t);
2136 }
2137 </script>
2138 ....
2139 <body ondblclick="recollsearch()">
2140
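 The URL assembled by the script above can also be produced outside the
 browser, for example when generating the explicit links mentioned earlier.
 This Python sketch builds the same recoll://search/query URL; the
 parameter names (qtp, p, q) are copied from the example above, and the
 helper name recoll_search_url is hypothetical.

```python
from urllib.parse import quote

def recoll_search_url(terms):
    """Return a recoll:// search URL for the given query text.

    quote(..., safe="") percent-encodes reserved characters, in the same
    spirit as JavaScript's encodeURIComponent().
    """
    return "recoll://search/query?qtp=a&p=0&q=" + quote(terms, safe="")

print(recoll_search_url("eugenie grandet"))
```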
2141
2142 3.3. Searching on the command line
2143
2144 There are several ways to obtain search results as a text stream, without
2145 a graphical interface:
2146
2147 o By passing option -t to the recoll program.
2148
2149 o By using the recollq program.
2150
2151 o By writing a custom Python program, using the Recoll Python API.
2152
2153 The first two methods work in the same way and accept/need the same
2154 arguments (except for the additional -t to recoll). The query to be
2155 executed is specified as command line arguments.
2156
2157 recollq is not built by default. You can use the Makefile in the query
2158 directory to build it. This is a very simple program, and if you can
2159 program a little C++, you may find it useful to tailor its output format
2160 to your needs. Note that recollq is only really useful on systems where the
2161 Qt libraries (or even the X11 ones) are not available. Otherwise, just use
2162 recoll -t, which takes the exact same parameters and options as those
2163 described for recollq.
2164
2165 recollq has a man page (not installed by default, look in the doc/man
2166 directory). The Usage string is as follows:
2167
2168 recollq: usage:
2169 -P: Show the date span for all the documents present in the index
2170 [-o|-a|-f] [-q] <query string>
2171 Runs a recoll query and displays result lines.
2172 Default: will interpret the argument(s) as a xesam query string
2173 query may be like:
2174 implicit AND, Exclusion, field spec: t1 -t2 title:t3
2175 OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
2176 Phrase: "t1 t2" (needs additional quoting on cmd line)
2177 -o Emulate the GUI simple search in ANY TERM mode
2178 -a Emulate the GUI simple search in ALL TERMS mode
2179 -f Emulate the GUI simple search in filename mode
2180 -q is just ignored (compatibility with the recoll GUI command line)
2181 Common options:
2182 -c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
2183 -d also dump file contents
2184 -n [first-]<cnt> define the result slice. The default value for [first]
2185 is 0. Without the option, the default max count is 2000.
2186 Use n=0 for no limit
2187 -b : basic. Just output urls, no mime types or titles
2188 -Q : no result lines, just the processed query and result count
2189 -m : dump the whole document meta[] array for each result
2190 -A : output the document abstracts
2191 -S fld : sort by field <fld>
2192 -s stemlang : set stemming language to use (must exist in index...)
2193 Use -s "" to turn off stem expansion
2194 -D : sort descending
2195 -i <dbdir> : additional index, several can be given
2196 -e use url encoding (%xx) for urls
2197 -F <field name list> : output exactly these fields for each result.
2198 The field values are encoded in base64, output in one line and
2199 separated by one space character. This is the recommended format
2200 for use by other programs. Use a normal query with option -m to
2201 see the field names.
2202
2203 Sample execution:
2204
2205 recollq 'ilur -nautique mime:text/html'
2206 Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11)
2207 OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
2208 4 results
2209 text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
2210 text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
2211 text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
2212 text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
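
 For programs that consume the -F output described in the usage string,
 each result line has to be split and decoded. The Python sketch below
 assumes one result per line, with the values separated by exactly one
 space and listed in the order given to -F. It does not run recollq
 itself: the sample line is built in the code, and the field names url and
 mtype are only sample names. Any header lines printed before the results
 would have to be skipped by the caller.

```python
import base64

def decode_fields(line, names):
    """Decode one line of base64-encoded field values into a dict.

    names is the field list that was passed to -F, in the same order.
    An empty value is assumed to appear as an empty string between two
    separators, hence split(" ") rather than split().
    """
    values = line.rstrip("\n").split(" ")
    return {
        name: base64.b64decode(value).decode("utf-8", errors="replace")
        for name, value in zip(names, values)
    }

# Build an illustrative line instead of calling recollq.
raw = " ".join(
    base64.b64encode(v.encode("utf-8")).decode("ascii")
    for v in ("file:///home/me/a.html", "text/html")
)
record = decode_fields(raw, ["url", "mtype"])
print(record)
```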
2213
2214 3.4. Path translations
2215
2216 In some cases, the document paths stored inside the index do not match the
2217 actual ones, so that document previews and accesses will fail. This can
2218 occur in a number of circumstances:
2219
2220 o When using multiple indexes it is a relatively common occurrence that
2221 some will actually reside on a remote volume, for example mounted via
2222 NFS. In this case, the paths used to access the documents on the local
2223 machine are not necessarily the same as the ones used while indexing
2224 on the remote machine. For example, /home/me may have been used as a
2225 topdirs element while indexing, but the directory might be mounted as
2226 /net/server/home/me on the local machine.
2227
2228 o The case may also occur with removable disks. It is perfectly possible
2229 to configure an index to live with the documents on the removable
2230 disk, but it may happen that the disk is not mounted at the same place
2231 so that the document paths from the index are invalid.
2232
2233 o As a last example, one could imagine that a big directory has been
2234 moved, but that it is currently inconvenient to run the indexer.
2235
2236 More generally, the path translation facility may be useful whenever the
2237 document paths seen by the indexer are not the same as the ones which
2238 should be used at query time.
2239
2240 Recoll has a facility for rewriting access paths when extracting the data
2241 from the index. The translations can be defined for the main index and for
2242 any additional query index.
2243
2244 In the above NFS example, Recoll could be instructed to rewrite any
2245 file:///home/me URL from the index to file:///net/server/home/me, allowing
2246 accesses from the client.
2247
2248 The translations are defined in the ptrans configuration file, which can
2249 be edited by hand or from the GUI external indexes configuration dialog.
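
 The rewriting step itself is a simple prefix substitution. The following
 Python sketch shows the idea for the NFS example above; it is not the
 Recoll code and does not read the ptrans file. The function name, the
 longest-prefix rule and the matching on path component boundaries are
 assumptions of the sketch.

```python
def translate_url(url, translations):
    """Rewrite the prefix of a file:// URL using (old, new) path pairs.

    The longest matching old prefix wins, and a prefix only matches on a
    path component boundary. Both choices are assumptions of this sketch.
    """
    scheme = "file://"
    if not url.startswith(scheme):
        return url
    path = url[len(scheme):]
    for old, new in sorted(translations, key=lambda p: len(p[0]), reverse=True):
        old = old.rstrip("/")
        if path == old or path.startswith(old + "/"):
            return scheme + new.rstrip("/") + path[len(old):]
    return url

print(translate_url("file:///home/me/doc/a.txt",
                    [("/home/me", "/net/server/home/me")]))
```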
2250
2251 3.5. The query language
2252
2253 The query language processor is activated in the GUI simple search entry
2254 when the search mode selector is set to Query Language. It can also be
2255 used with the KIO slave or the command line search. It broadly has the
2256 same capabilities as the complex search interface in the GUI.
2257
2258 The language is based on the (seemingly defunct) Xesam user search
2259 language specification.
2260
2261 If the results of a query language search puzzle you and you doubt what
2262 has been actually searched for, you can use the GUI Show Query link at the
2263 top of the result list to check the exact query which was finally executed
2264 by Xapian.
2265
2266 Here follows a sample request that we are going to explain:
2267
2268 author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
2269
2270
2271 This would search for all documents with John Doe appearing as a phrase in
2272 the author field (exactly what this is would depend on the document type,
2273 e.g. the From: header for an email message), and containing either beatles
2274 or lennon and either live or unplugged but not potatoes (in any part of
2275 the document).
2276
2277 An element is composed of an optional field specification, and a value,
2278 separated by a colon (the field separator is the last colon in the
2279 element). Examples: Eugenie, author:balzac, dc:title:grandet,
2280 dc:title:"eugenie grandet".
2281
2282 The colon, if present, means "contains". Xesam defines other relations,
2283 which are mostly unsupported for now (except in special cases, described
2284 further down).
2285
2286 All elements in the search entry are normally combined with an implicit
2287 AND. It is possible to specify that elements be OR'ed instead, as in
2288 Beatles OR Lennon. The OR must be entered literally (capitals), and it has
2289 priority over the AND associations: word1 word2 OR word3 means word1 AND
2290 (word2 OR word3), not (word1 AND word2) OR word3. Explicit parentheses
2291 were not supported before Recoll 1.21.
2292
2293 As of Recoll 1.21, you can use parentheses to group elements, which will
2294 sometimes make things clearer, and may allow expressing combinations which
2295 would have been difficult otherwise.
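
 The precedence rule for OR can be illustrated with a small Python sketch
 which groups a flat list of terms the way the text describes: OR binds
 the two terms around it, and the resulting groups are AND'ed. This is
 only a model for reasoning about queries (the function name is
 hypothetical). It ignores parentheses, quoting, exclusions and field
 syntax.

```python
def group_query(tokens):
    """Group a flat token list into AND-ed groups of OR-ed terms.

    Returns a list of lists: the outer list is an AND, each inner list an
    OR. Parentheses, quoting and field syntax are ignored in this sketch.
    """
    groups = []
    join_next = False
    for tok in tokens:
        if tok == "OR":
            join_next = bool(groups)   # a leading OR has nothing to join
        elif join_next:
            groups[-1].append(tok)
            join_next = False
        else:
            groups.append([tok])
    return groups

print(group_query("word1 word2 OR word3".split()))
# -> [['word1'], ['word2', 'word3']]
```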
2296
2297 An element preceded by a - specifies a term that should not appear.
2298
2299 As usual, words inside quotes define a phrase (the order of words is
2300 significant), so that title:"prejudice pride" is not the same as
2301 title:prejudice title:pride, and is unlikely to find a result.
2302
2303 Words inside phrases and capitalized words are not stem-expanded.
2304 Wildcards may be used anywhere inside a term. Specifying a wildcard on
2305 the left of a term can produce a very slow search (or even an incorrect
2306 one if the expansion is truncated because of excessive size). Also see
2307 More about wildcards.
2308
2309 To save you some typing, recent Recoll versions (1.20 and later) interpret
2310 a comma-separated list of terms as an AND list inside the field. Use slash
2311 characters ('/') for an OR list. No white space is allowed. So
2312
2313 author:john,lennon
2314
2315 will search for documents with john and lennon inside the author field (in
2316 any order), and
2317
2318 author:john/ringo
2319
2320 would search for john or ringo.
2321
2322 Modifiers can be set on a double-quote value, for example to specify a
2323 proximity search (unordered). See the modifier section. No space must
2324 separate the final double-quote and the modifiers value, e.g. "two
2325 one"po10
2326
2327 Recoll currently manages the following default fields:
2328
2329 o title, subject or caption are synonyms which specify data to be
2330 searched for in the document title or subject.
2331
2332 o author or from for searching the document's originators.
2333
2334 o recipient or to for searching the document's recipients.
2335
2336 o keyword for searching the document-specified keywords (few documents
2337 actually have any).
2338
2339 o filename for the document's file name. This is not necessarily set for
2340 all documents: internal documents contained inside a compound one (for
2341 example an EPUB section) no longer inherit the container file name;
2342 this was replaced by an explicit field (see next). Sub-documents
2343 can still have a specific filename, if it is implied by the document
2344 format, for example the attachment file name for an email attachment.
2345
2346 o containerfilename. This is set for all documents, both top-level and
2347 contained sub-documents, and is always the name of the filesystem
2348 directory entry which contains the data. The terms from this field can
2349 only be matched by an explicit field specification (as opposed to
2350 terms from filename which are also indexed as general document
2351 content). This avoids getting matches for all the sub-documents when
2352 searching for the container file name.
2353
2354 o ext specifies the file name extension (Ex: ext:html).
2355
2356 Recoll 1.20 and later have a way to specify aliases for the field names,
2357 which will save typing, for example by aliasing filename to fn or
2358 containerfilename to cfn. See the section about the fields file.
2359
2360 The field syntax also supports a few field-like, but special, criteria:
2361
2362 o dir for filtering the results on file location (Ex:
2363 dir:/home/me/somedir). -dir also works to find results not in the
2364 specified directory (release >= 1.15.8). Tilde expansion will be
2365 performed as usual (except for a bug in versions 1.19 to 1.19.11p1).
2366 Wildcards will be expanded, but please have a look at an important
2367 limitation of wildcards in path filters.
2368
2369 Relative paths also make sense, for example, dir:share/doc would match
2370 either /usr/share/doc or /usr/local/share/doc
2371
2372 Several dir clauses can be specified, both positive and negative. For
2373 example the following makes sense:
2374
2375 dir:recoll dir:src -dir:utils -dir:common
2376
2377
2378 This would select results which have both recoll and src in the path
2379 (in any order), and which have neither utils nor common.
2380
2381 You can also use OR conjunctions with dir: clauses.
2382
2383 A special aspect of dir clauses is that the values in the index are
2384 not transcoded to UTF-8, and never lower-cased or unaccented, but
2385 stored as binary. This means that you need to enter the values in the
2386 exact lower or upper case, and that searches for names with diacritics
2387 may sometimes be impossible because of character set conversion
2388 issues. Non-ASCII UNIX file paths are an unending source of trouble
2389 and are best avoided.
2390
2391 You need to use double-quotes around the path value if it contains
2392 space characters.
2393
2394 o size for filtering the results on file size. Example: size<10000. You
2395 can use <, > or = as operators. You can specify a range like the
2396 following: size>100 size<1000. The usual k/K, m/M, g/G, t/T can be
2397 used as (decimal) multipliers. Ex: size>1k to search for files bigger
2398 than 1000 bytes.
2399
2400 o date for searching or filtering on dates. The syntax for the argument
2401 is based on the ISO8601 standard for dates and time intervals. Only
2402 dates are supported, no times. The general syntax is 2 elements
2403 separated by a / character. Each element can be a date or a period of
2404 time. Periods are specified as PnYnMnD. The n numbers are the
2405 respective numbers of years, months or days, any of which may be
2406 missing. Dates are specified as YYYY-MM-DD. The days and months parts
2407 may be missing. If the / is present but an element is missing, the
2408 missing element is interpreted as the lowest or highest date in the
2409 index. Examples:
2410
2411 o 2001-03-01/2002-05-01 the basic syntax for an interval of dates.
2412
2413 o 2001-03-01/P1Y2M the same specified with a period.
2414
2415 o 2001/ from the beginning of 2001 to the latest date in the index.
2416
2417 o 2001 the whole year of 2001
2418
2419 o P2D/ means 2 days ago up to now if there are no documents with
2420 dates in the future.
2421
2422 o /2003 all documents from 2003 or older.
2423
2424 Periods can also be specified with small letters (e.g. p2y).
2425
2426 o mime or format for specifying the MIME type. This one is quite special
2427 because you can specify several values which will be OR'ed (the normal
2428 default for the language is AND). Ex: mime:text/plain mime:text/html.
2429 Specifying an explicit boolean operator before a mime specification is
2430 not supported and will produce strange results. You can filter out
2431 certain types by using negation (-mime:some/type), and you can use
2432 wildcards in the value (mime:text/*). Note that mime is the ONLY field
2433 with an OR default. You do need to use OR with ext terms for example.
2434
2435 o type or rclcat for specifying the category (as in
2436 text/media/presentation/etc.). The classification of MIME types in
2437 categories is defined in the Recoll configuration (mimeconf), and can
2438 be modified or extended. The default category names are those which
2439 permit filtering results in the main GUI screen. Categories are OR'ed
2440 like MIME types above. Unlike MIME types, categories cannot be negated with -.
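
 The size and date syntaxes in the list above can be summarized by a small
 parser. The Python sketch below only recognizes the clauses: it does not
 resolve periods or open-ended intervals against an index, and it is not
 the parser used by Recoll. All function names are hypothetical.
 Representing a value without a slash (such as 2001) as the same element
 on both sides is a choice of the sketch.

```python
import re

SIZE_RE = re.compile(r"^size([<>=])(\d+)([kKmMgGtT]?)$")
MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}
DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")
PERIOD_RE = re.compile(r"^[Pp](?:(\d+)[Yy])?(?:(\d+)[Mm])?(?:(\d+)[Dd])?$")

def parse_size_clause(clause):
    """Return (operator, bytes) for a clause such as size>1k."""
    m = SIZE_RE.match(clause)
    if not m:
        raise ValueError("not a size clause: %r" % clause)
    op, number, suffix = m.groups()
    return (op, int(number) * MULTIPLIERS[suffix.lower()])  # decimal units

def parse_element(text):
    """Classify one side of a date value as open, a date, or a period."""
    if text == "":
        return None                      # open end of the interval
    m = DATE_RE.match(text)
    if m:                                # (year, month, day), None if missing
        return ("date", tuple(int(g) if g else None for g in m.groups()))
    m = PERIOD_RE.match(text)
    if m and any(m.groups()):            # (years, months, days), 0 if missing
        return ("period", tuple(int(g) if g else 0 for g in m.groups()))
    raise ValueError("not a date or period: %r" % text)

def parse_date_value(value):
    """Split the value of a date: clause into (first, second) elements."""
    first, slash, second = value.partition("/")
    if not slash:                        # e.g. 2001: the whole year
        element = parse_element(first)
        return (element, element)
    return (parse_element(first), parse_element(second))

print(parse_size_clause("size>1k"))
print(parse_date_value("2001-03-01/P1Y2M"))
```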
2441
2442 The document input handlers used while indexing have the possibility to
2443 create other fields with arbitrary names, and aliases may be defined in
2444 the configuration, so that the exact field search possibilities may be
2445 different for you if someone took care of the customisation.
2446
2447 3.5.1. Modifiers
2448
2449 Some characters are recognized as search modifiers when found immediately
2450 after the closing double quote of a phrase, as in "some
2451 term"modifierchars. The actual "phrase" can be a single term of course.
2452 Supported modifiers:
2453
2454 o l can be used to turn off stemming (mostly makes sense with p because
2455 stemming is off by default for phrases).
2456
2457 o o can be used to specify a "slack" for phrase and proximity searches:
2458 the number of additional terms that may be found between the specified
2459 ones. If o is followed by an integer number, this is the slack, else
2460 the default is 10.
2461
2462 o p can be used to turn the default phrase search into a proximity one
2463 (unordered). Example: "order any in"p
2464
2465 o C will turn on case sensitivity (if the index supports it).
2466
2467 o D will turn on diacritics sensitivity (if the index supports it).
2468
2469 o A weight can be specified for a query element by specifying a decimal
2470 value at the start of the modifiers. Example: "Important"2.5.
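
 To make the modifier syntax concrete, here is a Python sketch which
 parses the characters found after the closing double quote, following
 the list above. It is only a reading aid, not Recoll's parser. The
 function name and the result keys are hypothetical, and leaving the
 weight and slack as None when they are not given is a choice of the
 sketch.

```python
import re

# Optional leading weight, then any sequence of l, p, C, D or o[number].
MODIFIERS_RE = re.compile(r"^(\d+(?:\.\d+)?)?((?:[lpCD]|o\d*)*)$")

def parse_modifiers(text):
    """Parse the characters following a closing double quote."""
    m = MODIFIERS_RE.match(text)
    if not m:
        raise ValueError("unsupported modifiers: %r" % text)
    weight, flags = m.groups()
    result = {"weight": float(weight) if weight else None,
              "nostem": False, "proximity": False, "slack": None,
              "case": False, "diacritics": False}
    for flag, number in re.findall(r"([lpCDo])(\d*)", flags):
        if flag == "l":
            result["nostem"] = True
        elif flag == "p":
            result["proximity"] = True
        elif flag == "C":
            result["case"] = True
        elif flag == "D":
            result["diacritics"] = True
        else:  # "o", optionally followed by the slack value (default 10)
            result["slack"] = int(number) if number else 10
    return result

print(parse_modifiers("po10"))
print(parse_modifiers("2.5"))
```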
2471
2472 3.6. Search case and diacritics sensitivity
2473
2474 For Recoll versions 1.18 and later, and when working with a raw index (not
2475 the default), searches can be made sensitive to character case and
2476 diacritics. How this happens is controlled by configuration variables and
2477 what search data is entered.
2478
2479 The general default is that searches are insensitive to case and
2480 diacritics. An entry of resume will match any of Resume, RESUME, résumé,
2481 Résumé etc.
2482
2483 Two configuration variables can automate switching on sensitivity:
2484
2485 autodiacsens
2486
2487 If this is set, search sensitivity to diacritics will be turned on
2488 as soon as an accented character exists in a search term. When the
2489 variable is set to true, resume will start a
2490 diacritics-insensitive search, but résumé will be matched exactly.
2491 The default value is false.
2492
2493 autocasesens
2494
2495 If this is set, search sensitivity to character case will be
2496 turned on as soon as an upper-case character exists in a search
2497 term except for the first one. When the variable is set to true,
2498 us or Us will start a case-insensitive search, but US will
2499 be matched exactly. The default value is true (contrary to
2500 autodiacsens).
2501
2502 As in the past, capitalizing the first letter of a word will turn off its
2503 stem expansion and have no effect on case-sensitivity.
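
 The rules above can be condensed into a short decision function. The
 Python sketch below models them for a single search term, assuming a raw
 index. It is an illustration rather than the Recoll implementation: the
 function names are hypothetical, and detecting accented characters
 through Unicode decomposition is an assumption of the sketch.

```python
import unicodedata

def has_diacritics(term):
    """True if any character decomposes into a base plus combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)

def term_sensitivity(term, autocasesens=True, autodiacsens=False):
    """Return (case_sensitive, diacritics_sensitive, stem_expanded)."""
    # An upper-case character anywhere but in first position triggers
    # case sensitivity when autocasesens is set.
    case_sensitive = autocasesens and any(ch.isupper() for ch in term[1:])
    diac_sensitive = autodiacsens and has_diacritics(term)
    # Stem expansion is off for capitalized or sensitive terms.
    stem_expanded = not (case_sensitive or diac_sensitive
                         or term[:1].isupper())
    return (case_sensitive, diac_sensitive, stem_expanded)

for word in ("us", "Us", "US"):
    print(word, term_sensitivity(word))
```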
2504
2505 You can also explicitly activate case and diacritics sensitivity by using
2506 modifiers with the query language. C will make the term case-sensitive,
2507 and D will make it diacritics-sensitive. Examples:
2508
2509 "us"C
2510
2511
2512 will search for the term us exactly (Us will not be a match).
2513
2514 "résumé"D
2515
2516
2517 will search for the term résumé exactly (resume will not be a match).
2518
2519 When either case or diacritics sensitivity is activated, stem expansion is
2520 turned off. Having both does not make much sense.
2521
2522 3.7. Anchored searches and wildcards
2523
2524 Some special characters are interpreted by Recoll in search strings to
2525 expand or specialize the search. Wildcards expand a root term in
2526 controlled ways. Anchor characters can restrict a search to succeed only
2527 if the match is found at or near the beginning of the document or one of
2528 its fields.
2529
2530 3.7.1. More about wildcards
2531
2532 All words entered in Recoll search fields will be processed for wildcard
2533 expansion before the request is finally executed.
2534
2535 The wildcard characters are:
2536
2537 o * which matches 0 or more characters.
2538
2539 o ? which matches a single character.
2540
2541 o [] which allow defining sets of characters to be matched (ex: [abc]
2542 matches a single character which may be 'a' or 'b' or 'c', [0-9]
2543 matches any digit).
2544
2545 You should be aware of a few things when using wildcards.
2546
2547 o Using a wildcard character at the beginning of a word can make for a
2548 slow search because Recoll will have to scan the whole index term list
2549 to find the matches. However, this is much less a problem for field
2550 searches, and queries like author:*@domain.com can sometimes be very
2551 useful.
2552
2553 o For Recoll version 1.18 only, when working with a raw index (preserving
2554 character case and diacritics), the literal part of a wildcard
2555 expression will be matched exactly for case and diacritics. This is
2556 not true any more for versions 1.19 and later.
2557
2558 o Using a * at the end of a word can produce more matches than you would
2559 think, and strange search results. You can use the term explorer tool
2560 to check what completions exist for a given term. You can also see
2561 exactly what search was performed by clicking on the link at the top
2562 of the result list. In general, for natural language terms, stem
2563 expansion will produce better results than an ending * (stem expansion
2564 is turned off when any wildcard character appears in the term).
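
 To get a feeling for how wildcard expansion behaves, and for how many
 terms a trailing * can pull in, the following Python sketch expands
 patterns against a small made-up term list. It uses the standard fnmatch
 module as a stand-in for Recoll's own matching, which may differ in
 details. The term list and the function name are hypothetical.

```python
import fnmatch

def expand_wildcards(pattern, index_terms):
    """Return the index terms matched by a *, ? and [] pattern."""
    return sorted(t for t in index_terms if fnmatch.fnmatchcase(t, pattern))

terms = ["doc", "docs", "dock", "docker", "document", "abc1", "abc2"]
print(expand_wildcards("doc*", terms))
# -> ['doc', 'dock', 'docker', 'docs', 'document']
print(expand_wildcards("doc?", terms))
# -> ['dock', 'docs']
print(expand_wildcards("abc[0-9]", terms))
# -> ['abc1', 'abc2']
```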
2565
2566 3.7.1.1. Wildcards and path filtering
2567
2568 Due to the way that Recoll processes wildcards inside dir path filtering
2569 clauses, they will have a multiplicative effect on the query size. A
2570 clause containing wildcards in several path elements, like, for example,
2571 dir:/home/me/*/*/docdir, will almost certainly fail if your indexed tree
2572 is of any realistic size.
2573
2574 Depending on the case, you may be able to work around the issue by
2575 specifying the path elements more narrowly, with a constant prefix, or by
2576 using 2 separate dir: clauses instead of multiple wildcards, as in
2577 dir:/home/me dir:docdir. The latter query is not equivalent to the initial
2578 one because it does not specify a number of directory levels, but that's
2579 the best we can do (and it may be actually more useful in some cases).
2580
2581 3.7.2. Anchored searches
2582
2583 Two characters are used to specify that a search hit should occur at the
2584 beginning or at the end of the text. ^ at the beginning of a term or
2585 phrase constrains the search to happen at the start, $ at the end forces it
2586 to happen at the end.
2587
2588 As this function is implemented as a phrase search it is possible to
2589 specify a maximum distance at which the hit should occur, either through
2590 the controls of the advanced search panel, or using the query language,
2591 for example, as in:
2592
2593 "^someterm"o10
2594
2595 which would force someterm to be found within 10 terms of the start of the
2596 text. This can be combined with a field search as in
2597 somefield:"^someterm"o10 or somefield:someterm$.
2598
2599 This feature can also be used with an actual phrase search, but in this
2600 case, the distance applies to the whole phrase and anchor, so that, for
2601 example, bla bla my unexpected term at the beginning of the text would be
2602 a match for "^my term"o5.
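
 The interaction between a start anchor and the distance can be modelled
 as follows. This Python sketch is a simplified model only: the real
 matching is performed by the index engine as a phrase search and may
 differ in details. Here the anchor is treated as a pseudo-term placed
 before the first word, and the slack counts the extra words found between
 the anchor and the last phrase word. The function name is hypothetical.

```python
def anchored_at_start(words, phrase, slack=0):
    """Check a "^phrase"oN style match against a list of document words."""
    position = 0                      # next document position to examine
    for wanted in phrase:
        try:
            # Earliest in-order occurrence keeps the matched span minimal.
            position = words.index(wanted, position) + 1
        except ValueError:
            return False              # phrase word missing or out of order
    extra = position - len(phrase)    # words that are not part of the phrase
    return extra <= slack

text = "bla bla my unexpected term at the beginning of the text".split()
print(anchored_at_start(text, ["my", "term"], slack=5))
```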
2603
2604 Anchored searches can be very useful for searches inside somewhat
2605 structured documents like scientific articles, in case explicit metadata
2606 has not been supplied (a most frequent case), for example for looking for
2607 matches inside the abstract or the list of authors (which occur at the top
2608 of the document).
2609
2610 3.8. Desktop integration
2611
2612 Being independent of the desktop type has its drawbacks: Recoll desktop
2613 integration is minimal. However there are a few tools available:
2614
2615 o The KDE KIO Slave was described in a previous section.
2616
2617 o If you use a recent version of Ubuntu Linux, you may find the Ubuntu
2618 Unity Lens module useful.
2619
2620 o There is also an independently developed Krunner plugin.
2621
2622 Here follow a few other things that may help.
2623
2624 3.8.1. Hotkeying recoll
2625
2626 It is surprisingly convenient to be able to show or hide the Recoll GUI
2627 with a single keystroke. Recoll comes with a small Python script, based on
2628 the libwnck window manager interface library, which will allow you to do
2629 just this. The detailed instructions are on this wiki page.
2630
2631 3.8.2. The KDE Kicker Recoll applet
2632
2633 This is probably obsolete now. Anyway:
2634
2635 The Recoll source tree contains the source code to the recoll_applet, a
2636 small application derived from the find_applet. This can be used to add a
2637 small Recoll launcher to the KDE panel.
2638
2639 The applet is not automatically built with the main Recoll programs, nor
2640 is it included with the main source distribution (because the KDE build
2641 boilerplate makes it relatively big). You can download its source from the
2642 recoll.org download page. Use the omnipotent configure;make;make install
2643 incantation to build and install.
2644
2645 You can then add the applet to the panel by right-clicking the panel and
2646 choosing the Add applet entry.
2647
2648 The recoll_applet has a small text window where you can type a Recoll
2649 query (in query language form), and an icon which can be used to restrict
2650 the search to certain types of files. It is quite primitive, and launches
2651 a new recoll GUI instance every time (even if it is already running). You
2652 may find it useful anyway.
2653
2654 Chapter 4. Programming interface
2655
2656 Recoll has an Application Programming Interface, usable both for indexing
2657 and searching, currently accessible from the Python language.
2658
2659 Another less radical way to extend the application is to write input
2660 handlers for new types of documents.
2661
2662 The processing of metadata attributes for documents (fields) is highly
2663 configurable.
2664
2665 4.1. Writing a document input handler
2666
2667 Terminology
2668
2669 The small programs or pieces of code which handle the processing of the
2670 different document types for Recoll used to be called filters, which is
2671 still reflected in the name of the directory which holds them and many
2672 configuration variables. They were named this way because one of their
2673 primary functions is to filter out the formatting directives and keep the
2674 text content. However these modules may have other behaviours, and the
2675 term input handler is now progressively substituted in the documentation.
2676 filter is still used in many places though.
2677
2678 Recoll input handlers cooperate to translate from the multitude of input
2679 document formats, simple ones (such as OpenDocument or Acrobat) or compound
2680 ones (such as Zip or Email), into the final Recoll indexing input format,
2681 which is plain text. Most input handlers are executable programs or scripts. A
2682 few handlers are coded in C++ and live inside recollindex. This latter
2683 kind will not be described here.
2684
2685 There are currently (1.18 and since 1.13) two kinds of external executable
2686 input handlers:
2687
2688 o Simple exec handlers run once and exit. They can be bare programs like
2689 antiword, or scripts using other programs. They are very simple to
2690 write, because they just need to print the converted document to the
2691 standard output. Their output can be plain text or HTML. HTML is
2692 usually preferred because it can store metadata fields and it allows
2693 preserving some of the formatting for the GUI preview.
2694
2695 o Multiple execm handlers can process multiple files (sparing the
2696 process startup time which can be very significant), or multiple
2697 documents per file (e.g.: for zip or chm files). They communicate with
2698 the indexer through a simple protocol, but are nevertheless a bit more
2699 complicated than the older kind. Most new handlers are written in
2700 Python, using a common module to handle the protocol. There is an
2701 exception, rclimg which is written in Perl. The subdocuments output by
2702 these handlers can be directly indexable (text or HTML), or they can
2703 be other simple or compound documents that will need to be processed
2704 by another handler.
2705
2706 In both cases, handlers deal with regular file system files, and can
2707 process either a single document, or a linear list of documents in each
2708 file. Recoll is responsible for performing up to date checks, dealing with
2709 more complex embedding and other upper level issues.
2710
2711 A simple handler returning a document in text/plain format can transfer
2712 no metadata to the indexer. Generic metadata, like document size or
2713 modification date, will be gathered and stored by the indexer.
2714
2715 Handlers that produce text/html format can return an arbitrary amount of
2716 metadata inside HTML meta tags. These will be processed according to the
2717 directives found in the fields configuration file.
2718
2719 The handlers that can handle multiple documents per file return a single
2720 piece of data to identify each document inside the file. This piece of
2721 data, called an ipath element, will be sent back by Recoll to extract the
2722 document at query time, for previewing, or for creating a temporary file
2723 to be opened by a viewer.
2724
2725 The following section describes the simple handlers, and the next one
2726 gives a few explanations about the execm ones. You could conceivably write
2727 a simple handler with only the elements in the manual. This will not be
2728 the case for the other ones, for which you will have to look at the code.
2729
2730 4.1.1. Simple input handlers
2731
2732 Recoll simple handlers are usually shell-scripts, but this is in no way
2733 necessary. Extracting the text from the native format is the difficult
2734 part. Outputting the format expected by Recoll is trivial. Happily enough,
2735 most document formats have translators or text extractors which can be
2736 called from the handler. In some cases the output of the translating
2737 program is completely appropriate, and no intermediate shell-script is
2738 needed.
2739
2740 Input handlers are called with a single argument which is the source file
2741 name. They should output the result to stdout.
2742
2743 When writing a handler, you should decide if it will output plain text or
2744 HTML. Plain text is simpler, but you will not be able to add metadata or
2745 vary the output character encoding (this will be defined in a
2746 configuration file). Additionally, some formatting may be easier to
2747 preserve when previewing HTML. Actually the deciding factor is metadata:
2748 Recoll has a way to extract metadata from the HTML header and use it for
2749 field searches.
2750
2751 The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
2752 the handler if the operation is for indexing or previewing. Some handlers
2753 use this to output a slightly different format, for example stripping
2754 uninteresting repeated keywords (e.g. Subject: for email) when indexing.
2755 This is not essential.
2756
2757 You should look at one of the simple handlers, for example rclps for a
2758 starting point.
2759
 Don't forget to make your handler executable before testing!
2761
2762 4.1.2. "Multiple" handlers
2763
2764 If you can program and want to write an execm handler, it should not be
2765 too difficult to make sense of one of the existing modules. For example,
2766 look at rclzip which uses Zip file paths as identifiers (ipath), and
2767 rclics, which uses an integer index. Also have a look at the comments
2768 inside the internfile/mh_execm.h file and possibly at the corresponding
2769 module.
2770
2771 execm handlers sometimes need to make a choice for the nature of the ipath
2772 elements that they use in communication with the indexer. Here are a few
2773 guidelines:
2774
 o Use ASCII or UTF-8 (if the identifier is an integer, print it as
 printf %d would, for example).
2777
2778 o If at all possible, the data should make some kind of sense when
2779 printed to a log file to help with debugging.
2780
2781 o Recoll uses a colon (:) as a separator to store a complex path
2782 internally (for deeper embedding). Colons inside the ipath elements
2783 output by a handler will be escaped, but would be a bad choice as a
2784 handler-specific separator (mostly, again, for debugging issues).
2785
2786 In any case, the main goal is that it should be easy for the handler to
2787 extract the target document, given the file name and the ipath element.
2788
2789 execm handlers will also produce a document with a null ipath element.
2790 Depending on the type of document, this may have some associated data
2791 (e.g. the body of an email message), or none (typical for an archive
2792 file). If it is empty, this document will be useful anyway for some
2793 operations, as the parent of the actual data documents.
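 The guidelines above can be illustrated with a Zip container, where the
 member path is a natural ipath element (this is the choice the text
 attributes to rclzip). The Python 3 sketch below only shows the
 identifier choice and the extraction step. It does not implement the
 execm communication protocol, and all function names are hypothetical.

```python
import io
import zipfile


def list_ipaths(container):
    """Return one ipath element (the member path) per file in the archive."""
    with zipfile.ZipFile(container) as archive:
        return [info.filename for info in archive.infolist()
                if not info.is_dir()]


def extract_member(container, ipath):
    """Return the raw bytes of the subdocument designated by ipath."""
    with zipfile.ZipFile(container) as archive:
        return archive.read(ipath)


def make_demo_archive():
    """Build a small in-memory archive so the sketch can be tried out."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docs/readme.txt", "hello")
        archive.writestr("notes.txt", "world")
    buffer.seek(0)
    return buffer
```

 Member paths are readable in a log file and are enough, together with
 the container file name, to get back to the target document.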
2794
2795 4.1.3. Telling Recoll about the handler
2796
2797 There are two elements that link a file to the handler which should
2798 process it: the association of file to MIME type and the association of a
2799 MIME type with a handler.
2800
2801 The association of files to MIME types is mostly based on name suffixes.
2802 The types are defined inside the mimemap file. Example:
2803
2804
2805 .doc = application/msword
2806
2807 If no suffix association is found for the file name, Recoll will try to
2808 execute the file -i command to determine a MIME type.
2809
2810 The association of file types to handlers is performed in the mimeconf
2811 file. A sample will probably be of better help than a long explanation:
2812
2813
2814 [index]
2815 application/msword = exec antiword -t -i 1 -m UTF-8;\
2816 mimetype = text/plain ; charset=utf-8
2817
2818 application/ogg = exec rclogg
2819
2820 text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
2821
2822 application/x-chm = execm rclchm
2823
2824 The fragment specifies that:
2825
2826 o application/msword files are processed by executing the antiword
2827 program, which outputs text/plain encoded in utf-8.
2828
2829 o application/ogg files are processed by the rclogg script, with default
2830 output type (text/html, with encoding specified in the header, or
2831 utf-8 by default).
2832
2833 o text/rtf is processed by unrtf, which outputs text/html. The
2834 iso-8859-1 encoding is specified because it is not the utf-8 default,
2835 and not output by unrtf in the HTML header section.
2836
2837 o application/x-chm is processed by a persistent handler. This is
2838 determined by the execm keyword.
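 The two-step association (file name to MIME type, then MIME type to
 handler) can be pictured with the following Python 3 sketch. The two
 dictionaries are hypothetical in-memory stand-ins for mimemap and
 mimeconf, and guess_mime stands in for running the file -i command. This
 is not how Recoll itself is implemented.

```python
import os

# Hypothetical copies of the two associations, reduced to a few entries.
SUFFIX_TO_MIME = {".doc": "application/msword"}
MIME_TO_HANDLER = {
    "application/msword": "exec antiword -t -i 1 -m UTF-8",
    "application/x-chm": "execm rclchm",
}


def find_handler(filename, guess_mime=None):
    """Return (MIME type, handler definition), each None if unknown."""
    suffix = os.path.splitext(filename)[1]
    mime = SUFFIX_TO_MIME.get(suffix)
    if mime is None and guess_mime is not None:
        # No suffix association: fall back to content-based detection.
        mime = guess_mime(filename)
    return mime, MIME_TO_HANDLER.get(mime)
```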
2839
2840 4.1.4. Input handler HTML output
2841
2842 The output HTML could be very minimal like the following example:
2843
2844 <html>
2845 <head>
2846 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
2847 </head>
2848 <body>
2849 Some text content
2850 </body>
2851 </html>
2852
2853
2854 You should take care to escape some characters inside the text by
2855 transforming them into appropriate entities. At the very minimum, "&"
 should be transformed into "&amp;", "<" should be transformed into "&lt;".
2857 This is not always properly done by translating programs which output
2858 HTML, and of course never by those which output plain text.
2859
2860 When encapsulating plain text in an HTML body, the display of a preview
2861 may be improved by enclosing the text inside <pre> tags.
2862
2863 The character set needs to be specified in the header. It does not need to
2864 be UTF-8 (Recoll will take care of translating it), but it must be
2865 accurate for good results.
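 A handler written in Python 3 could produce this minimal HTML with the
 standard html module, as sketched below. The function name is
 hypothetical. The charset argument must describe the encoding actually
 used when the result is written out.

```python
import html


def wrap_plain_text(text, charset="UTF-8"):
    """Wrap plain text in a minimal HTML document, escaping &, < and >."""
    body = html.escape(text, quote=False)
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset={0}">\n'
        "</head>\n<body>\n<pre>{1}</pre>\n</body>\n</html>\n"
    ).format(charset, body)
```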
2866
2867 Recoll will process meta tags inside the header as possible document
 field candidates. Document fields can be processed by the indexer in
2869 different ways, for searching or displaying inside query results. This is
2870 described in a following section.
2871
2872 By default, the indexer will process the standard header fields if they
2873 are present: title, meta/description, and meta/keywords are both indexed
2874 and stored for query-time display.
2875
2876 A predefined non-standard meta tag will also be processed by Recoll
2877 without further configuration: if a date tag is present and has the right
2878 format, it will be used as the document date (for display and sorting), in
2879 preference to the file modification date. The date format should be as
2880 follows:
2881
2882 <meta name="date" content="YYYY-mm-dd HH:MM:SS">
2883 or
2884 <meta name="date" content="YYYY-mm-ddTHH:MM:SS">
2885
2886
2887 Example:
2888
2889 <meta name="date" content="2013-02-24 17:50:00">
2890
2891
 Input handlers can also "invent" field names. These should also be output
 as meta tags:
2894
2895 <meta name="somefield" content="Some textual data" />
2896
2897 You can embed HTML markup inside the content of custom fields, for
2898 improving the display inside result lists. In this case, add a (wildly
2899 non-standard) markup attribute to tell Recoll that the value is HTML and
2900 should not be escaped for display.
2901
2902 <meta name="somefield" markup="html" content="Some <i>textual</i> data" />
2903
2904 As written above, the processing of fields is described in a further
2905 section.
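 For a handler written in Python 3, the meta tags discussed in this
 section could be generated as in the sketch below. The helper names are
 hypothetical, and escaping the attribute values as in ordinary HTML is an
 assumption about what the parser on the receiving side expects.

```python
import html
from datetime import datetime


def meta_tag(name, content):
    """Format one meta element for the HTML header."""
    return '<meta name="{0}" content="{1}">'.format(
        html.escape(name, quote=True), html.escape(content, quote=True))


def date_meta(moment):
    """Format the date tag using the YYYY-mm-dd HH:MM:SS form shown above."""
    return meta_tag("date", moment.strftime("%Y-%m-%d %H:%M:%S"))
```

 For example, date_meta(datetime(2013, 2, 24, 17, 50, 0)) yields the date
 example given earlier in this section.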
2906
2907 4.1.5. Page numbers
2908
2909 The indexer will interpret ^L characters in the handler output as
2910 indicating page breaks, and will record them. At query time, this allows
2911 starting a viewer on the right page for a hit or a snippet. Currently,
2912 only the PDF, Postscript and DVI handlers generate page breaks.
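 The principle can be shown in a few lines of Python: counting the ^L
 (form feed) characters that precede a position gives its page number.
 This is only an illustration of the idea, not the indexer's code, and the
 function name is hypothetical.

```python
FORM_FEED = "\f"  # the ^L character


def page_for_offset(text, offset):
    """Return the 1-based page number containing character offset."""
    return text.count(FORM_FEED, 0, offset) + 1
```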
2913
4.2. Field data processing
2915
2916 Fields are named pieces of information in or about documents, like title,
2917 author, abstract.
2918
2919 The field values for documents can appear in several ways during indexing:
2920 either output by input handlers as meta fields in the HTML header section,
2921 or extracted from file extended attributes, or added as attributes of the
 Doc object when using the API, or synthesized internally by Recoll.
2923
2924 The Recoll query language allows searching for text in a specific field.
2925
2926 Recoll defines a number of default fields. Additional ones can be output
2927 by handlers, and described in the fields configuration file.
2928
2929 Fields can be:
2930
2931 o indexed, meaning that their terms are separately stored in inverted
2932 lists (with a specific prefix), and that a field-specific search is
2933 possible.
2934
2935 o stored, meaning that their value is recorded in the index data record
2936 for the document, and can be returned and displayed with search
2937 results.
2938
 A field can be indexed, stored, or both. This and other aspects of field
 handling are defined inside the fields configuration file.
2941
2942 The sequence of events for field processing is as follows:
2943
2944 o During indexing, recollindex scans all meta fields in HTML documents
2945 (most document types are transformed into HTML at some point). It
2946 compares the name for each element to the configuration defining what
 should be done with fields (the fields file).
2948
2949 o If the name for the meta element matches one for a field that should
2950 be indexed, the contents are processed and the terms are entered into
2951 the index with the prefix defined in the fields file.
2952
2953 o If the name for the meta element matches one for a field that should
2954 be stored, the content of the element is stored with the document data
2955 record, from which it can be extracted and displayed at query time.
2956
2957 o At query time, if a field search is performed, the index prefix is
2958 computed and the match is only performed against appropriately
2959 prefixed terms in the index.
2960
2961 o At query time, the field can be displayed inside the result list by
2962 using the appropriate directive in the definition of the result list
2963 paragraph format. All fields are displayed on the fields screen of the
2964 preview window (which you can reach through the right-click menu).
2965 This is independent of the fact that the search which produced the
2966 results used the field or not.
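 The indexing-time part of this sequence can be mimicked with the Python 3
 sketch below. It is a toy model, not Recoll code: the FIELD_CONFIG table
 and its prefixes are illustrative values standing in for the fields file,
 and splitting on white space with lowercasing is a simplification of the
 real term processing.

```python
from html.parser import HTMLParser

# Hypothetical configuration: field name -> (index prefix or None, stored?)
FIELD_CONFIG = {
    "author": ("A", True),
    "keywords": ("K", False),
    "pagecount": (None, True),
}

SAMPLE = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
    '<meta name="author" content="Jane Doe">'
    '<meta name="pagecount" content="12" />'
    '</head><body>text</body></html>'
)


class MetaCollector(HTMLParser):
    """Collect name/content pairs from meta elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.fields[name] = content


def process_fields(html_text):
    """Return (prefixed index terms, stored values) for one document."""
    collector = MetaCollector()
    collector.feed(html_text)
    terms, stored = [], {}
    for name, value in collector.fields.items():
        prefix, is_stored = FIELD_CONFIG.get(name, (None, False))
        if prefix is not None:
            # Indexed field: enter its terms with the configured prefix.
            terms.extend(prefix + word.lower() for word in value.split())
        if is_stored:
            # Stored field: keep the value for display at query time.
            stored[name] = value
    return terms, stored
```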
2967
2968 You can find more information in the section about the fields file, or in
2969 comments inside the file.
2970
2971 You can also have a look at the example on the Wiki, detailing how one
2972 could add a page count field to pdf documents for displaying inside result
2973 lists.
2974
4.3. API
2976
2977 4.3.1. Interface elements
2978
 A few elements in the interface are specific and need an explanation.
2980
2981 udi
2982
 A udi (unique document identifier) identifies a document. Because
 of limitations inside the index engine, it is restricted in length
 (to 200 bytes), which is why a regular URI cannot be used. The
 structure and contents of the udi are defined by the application
 and opaque to the index engine. For example, the internal file
2988 system indexer uses the complete document path (file path +
2989 internal path), truncated to length, the suppressed part being
2990 replaced by a hash value.
2991
2992 ipath
2993
2994 This data value (set as a field in the Doc object) is stored,
2995 along with the URL, but not indexed by Recoll. Its contents are
2996 not interpreted, and its use is up to the application. For
2997 example, the Recoll internal file system indexer stores the part
2998 of the document access path internal to the container file (ipath
2999 in this case is a list of subdocument sequential numbers). url and
3000 ipath are returned in every search result and permit access to the
3001 original document.
3002
3003 Stored and indexed fields
3004
3005 The fields file inside the Recoll configuration defines which
3006 document fields are either "indexed" (searchable), "stored"
3007 (retrievable with search results), or both.
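 An application that needs its own udi values could follow the same
 principle as the one described above for the file system indexer
 (truncate, and replace the suppressed part with a hash). The Python 3
 sketch below is one possible way to do it. The separator, the hash
 function and the function name are assumptions, and the result will not
 match the identifiers computed by Recoll itself.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the text
HASH_HEX_LEN = 40    # length of a SHA-1 hex digest


def make_udi(file_path, ipath=""):
    """Build a udi of at most UDI_MAX_BYTES bytes from a document path."""
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= UDI_MAX_BYTES:
        return full
    keep = UDI_MAX_BYTES - HASH_HEX_LEN
    # Replace the part that does not fit with a hash of that part.
    digest = hashlib.sha1(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest
```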
3008
 Data for an external indexer should be stored in a separate index, not
 the one for the Recoll internal file system indexer (except if the latter
 is not used at all). The reason is that the main document indexer purge
 pass would remove all the other indexer's documents, as they were not seen
 during indexing. The main indexer documents would also probably be a
 problem for the external indexer purge operation.
3015
3016 4.3.2. Python interface
3017
3018 4.3.2.1. Introduction
3019
3020 Recoll versions after 1.11 define a Python programming interface, both for
3021 searching and indexing. The indexing portion has seen little use, but the
3022 searching one is used in the Recoll Ubuntu Unity Lens and Recoll Web UI.
3023
3024 The API is inspired by the Python database API specification. There were
3025 two major changes in recent Recoll versions:
3026
3027 o The basis for the Recoll API changed from Python database API version
3028 1.0 (Recoll versions up to 1.18.1), to version 2.0 (Recoll 1.18.2 and
3029 later).
3030 o The recoll module became a package (with an internal recoll module) as
3031 of Recoll version 1.19, in order to add more functions. For existing
3032 code, this only changes the way the interface must be imported.
3033
3034 We will mostly describe the new API and package structure here. A
3035 paragraph at the end of this section will explain a few differences and
3036 ways to write code compatible with both versions.
3037
3038 The Python interface can be found in the source package, under
3039 python/recoll.
3040
3041 The python/recoll/ directory contains the usual setup.py. After
3042 configuring the main Recoll code, you can use the script to build and
3043 install the Python module:
3044
3045 cd recoll-xxx/python/recoll
3046 python setup.py build
3047 python setup.py install
3048
3049
3050 The normal Recoll installer installs the Python API along with the main
3051 code.
3052
3053 When installing from a repository, and depending on the distribution, the
3054 Python API can sometimes be found in a separate package.
3055
3056 4.3.2.2. Recoll package
3057
3058 The recoll package contains two modules:
3059
3060 o The recoll module contains functions and classes used to query (or
3061 update) the index.
3062
3063 o The rclextract module contains functions and classes used to access
3064 document data.
3065
3066 4.3.2.3. The recoll module
3067
3068 Functions
3069
3070 connect(confdir=None, extra_dbs=None, writable = False)
3071 The connect() function connects to one or several Recoll index(es)
3072 and returns a Db object.
3073 o confdir may specify a configuration directory. The usual
3074 defaults apply.
3075 o extra_dbs is a list of additional indexes (Xapian
3076 directories).
3077 o writable decides if we can index new data through this
3078 connection.
3079 This call initializes the recoll module, and it should always be
3080 performed before any other call or object creation.
3081
3082 Classes
3083
3084 The Db class
3085
3086 A Db object is created by a connect() call and holds a connection to a
3087 Recoll index.
3088
3089 Methods
3090
3091 Db.close()
3092 Closes the connection. You can't do anything with the Db object
3093 after this.
3094
3095 Db.query(), Db.cursor()
3096 These aliases return a blank Query object for this index.
3097
3098 Db.setAbstractParams(maxchars, contextwords)
3099 Set the parameters used to build snippets (sets of keywords in
3100 context text fragments). maxchars defines the maximum total size
3101 of the abstract. contextwords defines how many terms are shown
3102 around the keyword.
3103
3104 Db.termMatch(match_type, expr, field='', maxlen=-1, casesens=False,
3105 diacsens=False, lang='english')
3106 Expand an expression against the index term list. Performs the
3107 basic function from the GUI term explorer tool. match_type can be
3108 either of wildcard, regexp or stem. Returns a list of terms
3109 expanded from the input expression.
3110
3111 The Query class
3112
3113 A Query object (equivalent to a cursor in the Python DB API) is created by
3114 a Db.query() call. It is used to execute index searches.
3115
3116 Methods
3117
3118 Query.sortby(fieldname, ascending=True)
3119 Sort results by fieldname, in ascending or descending order. Must
3120 be called before executing the search.
3121
3122 Query.execute(query_string, stemming=1, stemlang="english")
3123 Starts a search for query_string, a Recoll search language string.
3124
3125 Query.executesd(SearchData)
3126 Starts a search for the query defined by the SearchData object.
3127
3128 Query.fetchmany(size=query.arraysize)
3129 Fetches the next Doc objects in the current search results, and
3130 returns them as an array of the required size, which is by default
3131 the value of the arraysize data member.
3132
3133 Query.fetchone()
3134 Fetches the next Doc object from the current search results.
3135
3136 Query.close()
3137 Closes the query. The object is unusable after the call.
3138
3139 Query.scroll(value, mode='relative')
3140 Adjusts the position in the current result set. mode can be
3141 relative or absolute.
3142
3143 Query.getgroups()
 Retrieves the expanded query terms as a list of pairs. Meaningful
 only after execute() or executesd(). In each pair, the first entry
 is a list of user terms (of size one for simple terms, or more for
 group and phrase clauses), the second a list of query terms as
 derived from the user terms and used in the Xapian Query.
3149
3150 Query.getxquery()
3151 Return the Xapian query description as a Unicode string.
 Meaningful only after execute() or executesd().
3153
3154 Query.highlight(text, ishtml = 0, methods = object)
 Will insert <span class="rclmatch">, </span> tags around the match
 areas in the input text and return the modified text. ishtml can
 be set to indicate that the input text is HTML and that HTML
 special characters should not be escaped. methods, if set, should be
 an object with methods startMatch(i) and endMatch() which will be
 called for each match and should return a begin and end tag.
3161
 Query.makedocabstract(doc, methods = object)
3163 Create a snippets abstract for doc (a Doc object) by selecting
3164 text around the match terms. If methods is set, will also perform
3165 highlighting. See the highlight method.
3166
3167 Query.__iter__() and Query.next()
3168 So that things like for doc in query: will work.
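 The methods parameter of highlight() and makedocabstract() only needs an
 object with the two methods named above. A minimal sketch follows. The
 class name and the bracket markers are arbitrary, and the exact meaning
 of the value passed to startMatch() is not specified here, so the example
 just prints it.

```python
class BracketMarker(object):
    """Mark matches with [[i: ... ]] instead of the default span tags."""

    def startMatch(self, idx):
        # Called at the beginning of a match: return the opening marker.
        return "[[%s:" % (idx,)

    def endMatch(self):
        # Called at the end of a match: return the closing marker.
        return "]]"
```

 An instance would then be passed as in
 query.highlight(text, methods=BracketMarker()).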
3169
3170 Data descriptors
3171
3172 Query.arraysize
3173 Default number of records processed by fetchmany (r/w).
3174
3175 Query.rowcount
3176 Number of records returned by the last execute.
3177
3178 Query.rownumber
3179 Next index to be fetched from results. Normally increments after
3180 each fetchone() call, but can be set/reset before the call to
3181 effect seeking (equivalent to using scroll()). Starts at 0.
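 The following helper shows one way to combine rowcount and fetchone() to
 read a bounded number of results. It relies only on the members listed
 above and receives the Query object as a parameter. FakeQuery is a
 stand-in invented for the example, so that the helper can be tried
 without an index.

```python
def first_docs(query, limit):
    """Fetch at most limit documents from an already executed query."""
    docs = []
    for _ in range(min(limit, query.rowcount)):
        docs.append(query.fetchone())
    return docs


class FakeQuery(object):
    """Stand-in exposing rowcount, rownumber and fetchone() only."""

    def __init__(self, docs):
        self._docs = list(docs)
        self.rowcount = len(self._docs)
        self.rownumber = 0

    def fetchone(self):
        doc = self._docs[self.rownumber]
        self.rownumber += 1
        return doc
```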
3182
3183 The Doc class
3184
3185 A Doc object contains index data for a given document. The data is
3186 extracted from the index when searching, or set by the indexer program
3187 when updating. The Doc object has many attributes to be read or set by its
3188 user. It matches exactly the Rcl::Doc C++ object. Some of the attributes
3189 are predefined, but, especially when indexing, others can be set, the name
3190 of which will be processed as field names by the indexing configuration.
3191 Inputs can be specified as Unicode or strings. Outputs are Unicode
3192 objects. All dates are specified as Unix timestamps, printed as strings.
3193 Please refer to the rcldb/rcldoc.h C++ file for a description of the
3194 predefined attributes.
3195
3196 At query time, only the fields that are defined as stored either by
3197 default or in the fields configuration file will be meaningful in the Doc
 object. In particular, this will not be the case for the document text. See
3199 the rclextract module for accessing document contents.
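 Since dates come back as Unix timestamps printed as strings, a small
 conversion is usually needed before display. A Python 3 sketch, with a
 hypothetical function name, and with UTC chosen only to make the result
 independent of the local time zone:

```python
from datetime import datetime, timezone


def parse_doc_date(value):
    """Convert a timestamp string from a Doc attribute to a datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)
```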
3200
3201 Methods
3202
 get(key), [] operator
 Retrieve the named doc attribute.

 getbinurl()
 Retrieve the URL in byte array format (no transcoding), for use as
 parameter to a system call.

 items()
 Return a dictionary of doc object keys/values.

 keys()
 Return the list of doc object keys (attribute names).
3215
3216 The SearchData class
3217
3218 A SearchData object allows building a query by combining clauses, for
3219 execution by Query.executesd(). It can be used in replacement of the query
3220 language approach. The interface is going to change a little, so no
3221 detailed doc for now...
3222
3223 Methods
3224
3225 addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', qstring=string,
3226 slack=0, field='', stemming=1, subSearch=SearchData)
3227
3228 4.3.2.4. The rclextract module
3229
3230 Index queries do not provide document content (only a partial and
 imprecise reconstruction is performed to show the snippets text). In order
3232 to access the actual document data, the data extraction part of the
3233 indexing process must be performed (subdocument access and format
3234 translation). This is not trivial in general. The rclextract module
3235 currently provides a single class which can be used to access the data
3236 content for result documents.
3237
3238 Classes
3239
3240 The Extractor class
3241
3242 Methods
3243
3244 Extractor(doc)
3245 An Extractor object is built from a Doc object, output from a
3246 query.
3247
3248 Extractor.textextract(ipath)
3249 Extract document defined by ipath and return a Doc object. The
3250 doc.text field has the document text converted to either
3251 text/plain or text/html according to doc.mimetype. The typical use
3252 would be as follows:
3253
 qdoc = query.fetchone()
 extractor = rclextract.Extractor(qdoc)
 doc = extractor.textextract(qdoc.ipath)
 # use doc.text, e.g. for previewing
3258
3259 Extractor.idoctofile(ipath, targetmtype, outfile='')
3260 Extracts document into an output file, which can be given
3261 explicitly or will be created as a temporary file to be deleted by
3262 the caller. Typical use:
3263
 qdoc = query.fetchone()
 extractor = rclextract.Extractor(qdoc)
 filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
3267
3268 4.3.2.5. Example code
3269
3270 The following sample would query the index with a user language string.
3271 See the python/samples directory inside the Recoll source for other
3272 examples. The recollgui subdirectory has a very embryonic GUI which
3273 demonstrates the highlighting and data extraction functions.
3274
 #!/usr/bin/env python
 # Python 2 syntax (print statements).

 from recoll import recoll

 db = recoll.connect()
 db.setAbstractParams(maxchars=80, contextwords=4)

 query = db.query()
 nres = query.execute("some user question")
 print "Result count: ", nres
 if nres > 5:
     nres = 5
 for i in range(nres):
     doc = query.fetchone()
     print "Result #%d" % (query.rownumber,)
     for k in ("title", "size"):
         print k, ":", getattr(doc, k).encode('utf-8')
     abs = db.makeDocAbstract(doc, query).encode('utf-8')
     print abs
     print
3295
3296
3297
3298 4.3.2.6. Compatibility with the previous version
3299
3300 The following code fragments can be used to ensure that code can run with
3301 both the old and the new API (as long as it does not use the new abilities
3302 of the new API of course).
3303
3304 Adapting to the new package structure:
3305
3306
 try:
     from recoll import recoll
     from recoll import rclextract
     hasextract = True
 except ImportError:
     import recoll
     hasextract = False
3314
3315
3316 Adapting to the change of nature of the next Query member. The same test
3317 can be used to choose to use the scroll() method (new) or set the next
3318 value (old).
3319
3320
 rownum = query.next if type(query.next) == int else \
     query.rownumber
3323
3324
Chapter 5. Installation and configuration
3326
5.1. Installing a binary copy
3328
3329 Recoll binary copies are always distributed as regular packages for your
3330 system. They can be obtained either through the system's normal software
3331 distribution framework (e.g. Debian/Ubuntu apt, FreeBSD ports, etc.), or
3332 from some type of "backports" repository providing versions newer than the
 standard ones, or found on the Recoll web site in some cases.
3334
3335 There used to exist another form of binary install, as pre-compiled source
3336 trees, but these are just less convenient than the packages and don't
3337 exist any more.
3338
3339 The package management tools will usually automatically deal with hard
3340 dependencies for packages obtained from a proper package repository. You
3341 will have to deal with them by hand for downloaded packages (for example,
3342 when dpkg complains about missing dependencies).
3343
3344 In all cases, you will have to check or install supporting applications
3345 for the file types that you want to index beyond those that are natively
3346 processed by Recoll (text, HTML, email files, and a few others).
3347
 You may also want to have a look at the configuration section (although
 this may not be necessary for a quick test with default parameters). Most
 parameters can be more conveniently set from the GUI.
3351
5.2. Supporting packages
3353
 Recoll uses external applications to index some file types. You need to
 install them for the file types that you wish to have indexed. These are
 optional run-time dependencies: none is needed for building or running
 Recoll, except for indexing its specific file type.
3358
3359 After an indexing pass, the commands that were found missing can be
3360 displayed from the recoll File menu. The list is stored in the missing
3361 text file inside the configuration directory.
3362
3363 A list of common file types which need external commands follows. Many of
3364 the handlers need the iconv command, which is not always listed as a
3365 dependency.
3366
3367 Please note that, due to the relatively dynamic nature of this
3368 information, the most up to date version is now kept on
3369 http://www.recoll.org/features.html along with links to the home pages or
3370 best source/patches pages, and misc tips. The list below is not updated
3371 often and may be quite stale.
3372
3373 For many Linux distributions, most of the commands listed can be installed
3374 from the package repositories. However, the packages are sometimes
3375 outdated, or not the best version for Recoll, so you should take a look at
3376 http://www.recoll.org/features.html if a file type is important to you.
3377
3378 As of Recoll release 1.14, a number of XML-based formats that were handled
3379 by ad hoc handler code now use the xsltproc command, which usually comes
3380 with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
3381
3382 Now for the list:
3383
3384 o Openoffice files need unzip and xsltproc.
3385
3386 o PDF files need pdftotext which is part of Poppler (usually comes with
3387 the poppler-utils package). Avoid the original one from Xpdf.
3388
 o Postscript files need pstotext. The original version has an issue with
 shell characters in file names, which is corrected in recent packages.
3391 See http://www.recoll.org/features.html for more detail.
3392
 o MS Word needs antiword. It is also useful to have wvWare installed as
 it may be used as a fallback for some files which antiword does not
 handle.
3396
3397 o MS Excel and PowerPoint are processed by internal Python handlers.
3398
3399 o MS Open XML (docx) needs xsltproc.
3400
3401 o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3402 Ubuntu) package.
3403
3404 o RTF files need unrtf, which, in its older versions, has much trouble
3405 with non-western character sets. Many Linux distributions carry
3406 outdated unrtf versions. Check http://www.recoll.org/features.html for
3407 details.
3408
3409 o TeX files need untex or detex. Check
3410 http://www.recoll.org/features.html for sources if it's not packaged
3411 for your distribution.
3412
3413 o dvi files need dvips.
3414
3415 o djvu files need djvutxt and djvused from the DjVuLibre package.
3416
3417 o Audio files: Recoll releases 1.14 and later use a single Python
3418 handler based on mutagen for all audio file types.
3419
3420 o Pictures: Recoll uses the Exiftool Perl package to extract tag
3421 information. Most image file formats are supported. Note that there
3422 may not be much interest in indexing the technical tags (image size,
3423 aperture, etc.). This is only of interest if you store personal tags
3424 or textual descriptions inside the image files.
3425
3426 o chm: files in Microsoft help format need Python and the pychm module
3427 (which needs chmlib).
3428
3429 o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
3430 module. icalendar is not needed for newer versions, which use internal
3431 code.
3432
3433 o Zip archives need Python (and the standard zipfile module).
3434
3435 o Rar archives need Python, the rarfile Python module and the unrar
3436 utility.
3437
 o Midi karaoke files need Python and the Midi module.

 o Konqueror webarchive files need Python (uses the tarfile module).
3441
3442 o Mimehtml web archive format (support based on the email handler, which
3443 introduces some mild weirdness, but still usable).
3444
3445 Text, HTML, email folders, and Scribus files are processed internally. Lyx
3446 is used to index Lyx files. Many handlers need iconv and the standard sed
3447 and awk.
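 Before a first indexing pass, you can check which of the commands listed
 above are present on your system. The Python 3 sketch below does this
 with shutil.which. The HELPERS table only copies a few entries from the
 list, and the function name is hypothetical. This is separate from the
 missing-helpers list that Recoll itself maintains.

```python
import shutil

# A few file types and the commands they need, taken from the list above.
HELPERS = {
    "Openoffice": ["unzip", "xsltproc"],
    "PDF": ["pdftotext"],
    "Postscript": ["pstotext"],
    "MS Word": ["antiword"],
    "RTF": ["unrtf"],
    "djvu": ["djvutxt", "djvused"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {file type: [commands not found in the PATH]}."""
    missing = {}
    for file_type, commands in helpers.items():
        absent = [command for command in commands if which(command) is None]
        if absent:
            missing[file_type] = absent
    return missing
```

 Calling missing_helpers() with no argument checks the table against the
 current PATH.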
3448
5.3. Building from source
3450
3451 5.3.1. Prerequisites
3452
3453 If you can install any or all of the following through the package manager
 for your system, all the better. Qt in particular is a very big piece of
3455 software, but you will most probably be able to find a binary package.
3456
3457 You may have to compile Xapian but this is easy.
3458
3459 The shopping list:
3460
3461 o C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
3462 itself by strange messages about a missing iconv_open.
3463
3464 o Development files for Xapian core.
3465
3466 Important
3467
3468 If you are building Xapian for an older CPU (before Pentium 4 or
3469 Athlon 64), you need to add the --disable-sse flag to the configure
 command. Otherwise, all Xapian applications will crash with an illegal
3471 instruction error.
3472
 o Development files for Qt 4. Recoll has not been tested with Qt 5 yet.
3474 Recoll 1.15.9 was the last version to support Qt 3. If you do not want
3475 to install or build the Qt Webkit module, Recoll has a configuration
3476 option to disable its use (see further).
3477
3478 o Development files for X11 and zlib.
3479
3480 o You may also need libiconv. On Linux systems, the iconv interface is
3481 part of libc and you should not need to do anything special.
3482
3483 Check the Recoll download page for up to date version information.
3484
3485 5.3.2. Building
3486
 Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris. Most
3488 versions after 2005 should be ok, maybe some older ones too (Solaris 8 is
3489 ok). If you build on another system, and need to modify things, I would
3490 very much welcome patches.
3491
3492 Configure options:
3493
3494 o --without-aspell will disable the code for phonetic matching of search
3495 terms.
3496
3497 o --with-fam or --with-inotify will enable the code for real time
3498 indexing. Inotify support is enabled by default on recent Linux
3499 systems.
3500
3501 o --with-qzeitgeist will enable sending Zeitgeist events about the
3502 visited search results, and needs the qzeitgeist package.
3503
3504 o --disable-webkit is available from version 1.17 to implement the
3505 result list with a Qt QTextBrowser instead of a WebKit widget if you
3506 do not or can't depend on the latter.
3507
3508 o --disable-idxthreads is available from version 1.19 to suppress
3509 multithreading inside the indexing process. You can also use the
3510 run-time configuration to restrict recollindex to using a single
3511 thread, but the compile-time option may disable a few more unused
3512 locks. This only applies to the use of multithreading for the core
3513 index processing (data input). The Recoll monitor mode always uses at
3514 least two threads of execution.
3515
3516 o --disable-python-module will avoid building the Python module.
3517
3518 o --disable-xattr will prevent fetching data from file extended
3519 attributes. Beyond a few standard attributes, fetching extended
 attributes data can only be useful if some application stores data in
3521 there, and also needs some simple configuration (see comments in the
3522 fields configuration file).
3523
3524 o --enable-camelcase will enable splitting camelCase words. This is not
3525 enabled by default as it has the unfortunate side-effect of making
 some phrase searches quite confusing: e.g. "MySQL manual" would be
3527 matched by "MySQL manual" and "my sql manual" but not "mysql manual"
3528 (only inside phrase searches).
3529
3530 o --with-file-command Specify the version of the 'file' command to use
 (e.g. --with-file-command=/usr/local/bin/file). Can be useful to enable
 the GNU version on systems where the native one is bad.
3533
3534 o --disable-qtgui Disable the Qt interface. Will allow building the
3535 indexer and the command line search program in absence of a Qt
3536 environment.
3537
3538 o --disable-x11mon Disable X11 connection monitoring inside recollindex.
3539 Together with --disable-qtgui, this allows building recoll without Qt
3540 and X11.
3541
 o --disable-pic will compile Recoll with position-dependent code. This
 is incompatible with building the KIO or the Python or PHP extensions,
 but might yield very marginally faster code.
3545
 o Of course, the usual autoconf configure options, like --prefix, apply.
3547
3548 Normal procedure:
3549
 cd recoll-xxx
 ./configure
 make
 (practices usual hardship-repelling invocations)
3554
3555
3556 There is little auto-configuration. The configure script will mainly link
3557 one of the system-specific files in the mk directory to mk/sysconf. If
3558 your system is not known yet, it will tell you as much, and you may want
3559 to manually copy and modify one of the existing files (the new file name
3560 should be the output of uname -s).
3561
3562 5.3.2.1. Building on Solaris
3563
 We did not test building the GUI on Solaris for recent versions. You will
 need at least Qt 4.4. There are some hints on an old web site page, which
 may still be valid.

 Someone did test the 1.19 indexer and Python module build: they do work,
 with a few minor glitches. Be sure to use GNU make and install.
3570
3571 5.3.3. Installation
3572
3573 Either type make install or execute recollinstall prefix, in the root of
3574 the source tree. This will copy the commands to prefix/bin and the sample
3575 configuration files, scripts and other shared data to prefix/share/recoll.
3576
 If the installation prefix given to recollinstall is different from either
 the system default or the value which was specified when executing
 configure (as in configure --prefix /some/path), you will have to set the
 RECOLL_DATADIR environment variable to indicate where the shared data is
 to be found (e.g. for (ba)sh: export
 RECOLL_DATADIR=/some/path/share/recoll).
3583
3584 You can then proceed to configuration.
3585
5.4. Configuration overview
3587
3588 Most of the parameters specific to the recoll GUI are set through the
3589 Preferences menu and stored in the standard Qt place
3590 ($HOME/.config/Recoll.org/recoll.conf). You probably do not want to edit
3591 this by hand.
3592
3593 Recoll indexing options are set inside text configuration files located in
3594 a configuration directory. There can be several such directories, each of
3595 which defines the parameters for one index.
3596
3597 The configuration files can be edited by hand or through the Index
3598 configuration dialog (Preferences menu). The GUI tool will try to respect
3599 your formatting and comments as much as possible, so it is quite possible
3600 to use both ways.
3601
3602 The most accurate documentation for the configuration parameters is given
3603 by comments inside the default files, and we will just give a general
3604 overview here.
3605
3606 By default, for each index, there are two sets of configuration files.
3607 System-wide configuration files are kept in a directory named like
3608 /usr/[local/]share/recoll/examples, and define default values, shared by
3609 all indexes. For each index, a parallel set of files defines the
3610 customized parameters.
3611
3612 In addition (as of Recoll version 1.19.7), it is possible to specify two
3613 additional configuration directories which will be stacked before and
3614 after the user configuration directory. These are defined by the
3615 RECOLL_CONFTOP and RECOLL_CONFMID environment variables. Values from
3616 configuration files inside the top directory will override user ones,
3617 values from configuration files inside the middle directory will override
3618 system ones and be overridden by user ones. These two variables may be of
3619 use to applications which augment Recoll functionality, and need to add
3620 configuration data without disturbing the user's files. Please note that
3621 the two, currently single, values will probably be interpreted as
3622 colon-separated lists in the future: do not use colon characters inside
3623 the directory paths.
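
 The resulting lookup order can be sketched as follows. This is only an
 illustration in Python, not Recoll code: the lookup helper, the
 dictionaries standing in for parsed configuration directories, and the
 sample values are all hypothetical.

```python
# Hypothetical illustration of the stacking order described above.
# Each dict stands in for the parsed files of one configuration directory.

def lookup(name, top, user, mid, system):
    """Return the first value found, searching from highest to lowest priority."""
    for layer in (top, user, mid, system):
        if name in layer:
            return layer[name]
    return None

system = {"loglevel": "2", "followLinks": "0"}   # system-wide defaults
mid = {"loglevel": "3", "nocjk": "1"}            # RECOLL_CONFMID directory
user = {"loglevel": "4"}                         # user configuration directory
top = {"nocjk": "0"}                             # RECOLL_CONFTOP directory

print(lookup("loglevel", top, user, mid, system))     # user beats mid and system
print(lookup("nocjk", top, user, mid, system))        # top beats everything
print(lookup("followLinks", top, user, mid, system))  # only set system-wide
```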
3624
3625 The default location of the configuration is the .recoll directory in your
3626 home. Most people will only use this directory.
3627
3628 This location can be changed, or others can be added with the
3629 RECOLL_CONFDIR environment variable or the -c option parameter to recoll
3630 and recollindex.
3631
 If the .recoll directory does not exist when recoll or recollindex is
 started, it will be created with a set of empty configuration files.
 recoll will give you a chance to edit the configuration file before
 starting indexing. recollindex will proceed immediately. To avoid
 mistakes, the automatic directory creation will only occur for the default
 location, not if -c or RECOLL_CONFDIR were used (in the latter cases, you
 will have to create the directory).
3639
3640 All configuration files share the same format. For example, a short
3641 extract of the main configuration file might look as follows:
3642
 # Space-separated list of directories to index.
 topdirs = ~/docs /usr/share/doc

 [~/somedirectory-with-utf8-txt-files]
 defaultcharset = utf-8
3648
3649
3650 There are three kinds of lines:
3651
3652 o Comment (starts with #) or empty.
3653
 o Parameter assignment (name = value).
3655
3656 o Section definition ([somedirname]).
3657
 Depending on the type of configuration file, section definitions either
 separate groups of parameters or allow redefining some parameters for a
 directory sub-tree. They stay in effect until another section definition,
 or the end of file, is encountered. Some of the parameters used for
 indexing are looked up hierarchically from the current directory location
 upwards. Not all parameters can be meaningfully redefined; whether they
 can is specified for each in the next section.
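
 As an illustration of the format, here is a minimal Python sketch of a
 reader for such files. It is not the Recoll parser: parse_config and
 lookup are hypothetical helpers, backslash line continuations are not
 handled, and the upward search is only our reading of the hierarchical
 lookup described above.

```python
import posixpath

def parse_config(text):
    """Parse 'name = value' lines grouped by [section]; '' is the top level."""
    sections = {"": {}}
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # comment or empty line
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()  # section definition
            sections.setdefault(current, {})
        elif "=" in line:
            name, value = line.split("=", 1)
            sections[current][name.strip()] = value.strip()
    return sections

def lookup(conf, name, directory):
    """Search from `directory` upwards, then fall back to the top level."""
    d = directory
    while True:
        if name in conf.get(d, {}):
            return conf[d][name]
        parent = posixpath.dirname(d)
        if parent == d:
            break
        d = parent
    return conf[""].get(name)

sample = """\
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
"""
conf = parse_config(sample)
print(lookup(conf, "defaultcharset", "~/somedirectory-with-utf8-txt-files/sub"))
```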
3665
3666 When found at the beginning of a file path, the tilde character (~) is
3667 expanded to the name of the user's home directory, as a shell would do.
3668
3669 White space is used for separation inside lists. List elements with
3670 embedded spaces can be quoted using double-quotes.
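
 The two rules above (tilde expansion and quoted list elements) can be
 approximated in Python as shown below. split_list is a hypothetical
 helper, and shlex follows POSIX shell quoting rules, which may differ in
 details (single quotes, backslashes) from what Recoll itself accepts.

```python
import os
import shlex

def split_list(value):
    """Split on white space, honouring double quotes, then expand a leading ~."""
    return [os.path.expanduser(item) for item in shlex.split(value)]

# The second element contains an embedded space and is therefore quoted.
items = split_list('~/docs "/data/My Documents" /usr/share/doc')
print(items)
```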
3671
3672 Encoding issues. Most of the configuration parameters are plain ASCII. Two
3673 particular sets of values may cause encoding issues:
3674
3675 o File path parameters may contain non-ascii characters and should use
3676 the exact same byte values as found in the file system directory.
3677 Usually, this means that the configuration file should use the system
3678 default locale encoding.
3679
3680 o The unac_except_trans parameter should be encoded in UTF-8. If your
3681 system locale is not UTF-8, and you need to also specify non-ascii
3682 file paths, this poses a difficulty because common text editors cannot
3683 handle multiple encodings in a single file. In this relatively
3684 unlikely case, you can edit the configuration file as two separate
3685 text files with appropriate encodings, and concatenate them to create
3686 the complete configuration.
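
 The concatenation can be done with any tool which copies bytes without
 re-encoding them. A minimal Python sketch, assuming an ISO-8859-1 file
 system locale (the path, the parameter values and the temporary target
 file are made up for the example):

```python
import os
import tempfile

# Part 1: file paths, in the file system (here ISO-8859-1) encoding.
paths_part = "topdirs = /home/me/résumés\n".encode("iso-8859-1")
# Part 2: unac_except_trans, which has to be UTF-8.
unac_part = "unac_except_trans = åå Åå\n".encode("utf-8")

target = os.path.join(tempfile.mkdtemp(), "recoll.conf")
with open(target, "wb") as out:   # binary mode: bytes are copied unchanged
    out.write(paths_part)
    out.write(unac_part)

with open(target, "rb") as f:
    data = f.read()
print(len(data), "bytes written")
```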
3687
3688 5.4.1. Environment variables
3689
3690 RECOLL_CONFDIR
3691
3692 Defines the main configuration directory.
3693
3694 RECOLL_TMPDIR, TMPDIR
3695
3696 Locations for temporary files, in this order of priority. The
3697 default if none of these is set is to use /tmp. Big temporary
3698 files may be created during indexing, mostly for decompressing,
3699 and also for processing, e.g. email attachments.
3700
3701 RECOLL_CONFTOP, RECOLL_CONFMID
3702
 Allow adding configuration directories with priorities below and
 above the user directory (see the Configuration overview section
 above for details).
3706
3707 RECOLL_EXTRA_DBS, RECOLL_ACTIVE_EXTRA_DBS
3708
3709 Help for setting up external indexes. See this paragraph for
3710 explanations.
3711
3712 RECOLL_DATADIR
3713
 Defines a replacement for the default location of Recoll data files,
 normally found in, e.g., /usr/share/recoll.

 RECOLL_FILTERSDIR

 Defines a replacement for the default location of Recoll filters,
 normally found in, e.g., /usr/share/recoll/filters.
3721
3722 ASPELL_PROG
3723
3724 aspell program to use for creating the spelling dictionary. The
3725 result has to be compatible with the libaspell which Recoll is
3726 using.
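
 The priority order given above for the temporary files location
 (RECOLL_TMPDIR, then TMPDIR, then /tmp) amounts to the following Python
 sketch. temp_location is a hypothetical helper, and treating an empty
 value as unset is our assumption.

```python
import os

def temp_location(env=os.environ):
    """Return the temporary directory, following the documented priority."""
    for name in ("RECOLL_TMPDIR", "TMPDIR"):
        value = env.get(name)
        if value:                 # assumption: an empty value counts as unset
            return value
    return "/tmp"

print(temp_location({"TMPDIR": "/var/tmp", "RECOLL_TMPDIR": "/scratch"}))
```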
3727
3732 5.4.2. The main configuration file, recoll.conf
3733
3734 recoll.conf is the main configuration file. It defines things like what to
3735 index (top directories and things to ignore), and the default character
3736 set to use for document types which do not specify it internally.
3737
3738 The default configuration will index your home directory. If this is not
3739 appropriate, start recoll to create a blank configuration, click Cancel,
3740 and edit the configuration file before restarting the command. This will
3741 start the initial indexing, which may take some time.
3742
3743 Most of the following parameters can be changed from the Index
3744 Configuration menu in the recoll interface. Some can only be set by
3745 editing the configuration file.
3746
3747 5.4.2.1. Parameters affecting what documents we index:
3748
3749 topdirs
3750
3751 Specifies the list of directories or files to index (recursively
3752 for directories). You can use symbolic links as elements of this
3753 list. See the followLinks option about following symbolic links
3754 found under the top elements (not followed by default).
3755
3756 skippedNames
3757
3758 A space-separated list of wildcard patterns for names of files or
3759 directories that should be completely ignored. The list defined in
3760 the default file is:
3761
 skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
     *~ .beagle .git .hg .bzr loop.ps .xsession-errors \
     .recoll* xapiandb recollrc recoll.conf
3765
3766 The list can be redefined at any sub-directory in the indexed
3767 area.
3768
3769 The top-level directories are not affected by this list (that is,
3770 a directory in topdirs might match and would still be indexed).
3771
3772 The list in the default configuration does not exclude hidden
3773 directories (names beginning with a dot), which means that it may
3774 index quite a few things that you do not want. On the other hand,
3775 email user agents like thunderbird usually store messages in
3776 hidden directories, and you probably want this indexed. One
3777 possible solution is to have .* in skippedNames, and add things
3778 like ~/.thunderbird or ~/.evolution in topdirs.
3779
3780 Not even the file names are indexed for patterns in this list. See
3781 the noContentSuffixes variable for an alternative approach which
3782 indexes the file names.
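
 To get a feeling for what the default list does, the matching can be
 imitated with Python's fnmatch module, which is close to, but not
 identical with, the C library function. is_skipped is a hypothetical
 helper and the file names are examples.

```python
from fnmatch import fnmatchcase

# The default skippedNames list shown above.
SKIPPED_NAMES = [
    "#*", "bin", "CVS", "Cache", "cache*", "caughtspam", "tmp", ".thumbnails",
    ".svn", "*~", ".beagle", ".git", ".hg", ".bzr", "loop.ps",
    ".xsession-errors", ".recoll*", "xapiandb", "recollrc", "recoll.conf",
]

def is_skipped(name, patterns=SKIPPED_NAMES):
    """True if a file or directory name (not a full path) matches a pattern."""
    return any(fnmatchcase(name, pat) for pat in patterns)

for name in ("notes.txt~", ".git", "report.pdf", ".thunderbird"):
    print(name, is_skipped(name))

# Adding .* to the list also skips hidden directories.
print(is_skipped(".thunderbird", SKIPPED_NAMES + [".*"]))
```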
3783
3784 noContentSuffixes
3785
3786 This is a list of file name endings (not wildcard expressions, nor
3787 dot-delimited suffixes). Only the names of matching files will be
3788 indexed (no attempt at MIME type identification, no decompression,
3789 no content indexing). This can be redefined for subdirectories,
3790 and edited from the GUI. The default value is:
3791
 noContentSuffixes = .md5 .map \
     .o .lib .dll .a .sys .exe .com \
     .mpp .mpt .vsd \
     .img .img.gz .img.bz2 .img.xz .image .image.gz .image.bz2 .image.xz \
     .dat .bak .rdf .log.gz .log .db .msf .pid \
     ,v ~ #
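
 Because the elements are plain name endings, the test is a simple
 suffix comparison, as in this Python sketch. names_only is a
 hypothetical helper, only an excerpt of the default list is used, and
 case-sensitive comparison is our assumption.

```python
# Excerpt of the default noContentSuffixes list.
NO_CONTENT_SUFFIXES = (".md5", ".map", ".o", ".log.gz", ".log", ",v", "~", "#")

def names_only(filename, suffixes=NO_CONTENT_SUFFIXES):
    """True if only the name of this file would be indexed."""
    return filename.endswith(suffixes)

# ",v" and "~" show that an ending does not need to start with a dot.
for name in ("main.c,v", "draft.txt~", "build.log.gz", "photo.jpg", "video"):
    print(name, names_only(name))
```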
3798
3799 skippedPaths and daemSkippedPaths
3800
3801 A space-separated list of patterns for paths of files or
3802 directories that should be skipped. There is no default in the
3803 sample configuration file, but the code always adds the
3804 configuration and database directories in there.
3805
3806 skippedPaths is used both by batch and real time indexing.
3807 daemSkippedPaths can be used to specify things that should be
3808 indexed at startup, but not monitored.
3809
3810 Example of use for skipping text files only in a specific
3811 directory:
3812
 skippedPaths = ~/somedir/*.txt
3814
3815
3816 skippedPathsFnmPathname
3817
3818 The values in the *skippedPaths variables are matched by default
3819 with fnmatch(3), with the FNM_PATHNAME flag. This means that '/'
3820 characters must be matched explicitly. You can set
3821 skippedPathsFnmPathname to 0 to disable the use of FNM_PATHNAME
3822 (meaning that /*/dir3 will match /dir1/dir2/dir3).
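
 The effect of FNM_PATHNAME can be demonstrated with a small Python
 sketch. Python's own fnmatch module has no such flag, so the example
 translates patterns to regular expressions itself. glob_to_regex and
 path_matches are hypothetical helpers which only support * and ?, not
 bracket expressions.

```python
import re

def glob_to_regex(pattern, pathname=True):
    """Translate a simple wildcard pattern (* and ? only) to a regex.

    With pathname=True, wildcards never match '/', as with FNM_PATHNAME.
    """
    any_char = "[^/]" if pathname else "."
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(any_char + "*")
        elif ch == "?":
            parts.append(any_char)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)

def path_matches(path, pattern, pathname=True):
    return glob_to_regex(pattern, pathname).match(path) is not None

# Default behaviour: '*' stops at '/'.
print(path_matches("/dir1/dir2/dir3", "/*/dir3", pathname=True))
# As with skippedPathsFnmPathname = 0: '*' may span several directories.
print(path_matches("/dir1/dir2/dir3", "/*/dir3", pathname=False))
```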
3823
3824 zipSkippedNames
3825
3826 A space-separated list of patterns for names of files or
3827 directories that should be ignored inside zip archives. This is
3828 used directly by the zip handler, and has a function similar to
3829 skippedNames, but works independently. Can be redefined for
3830 filesystem subdirectories. For versions up to 1.19, you will need
3831 to update the Zip handler and install a supplementary Python
3832 module. The details are described on the Recoll wiki.
3833
3834 followLinks
3835
3836 Specifies if the indexer should follow symbolic links while
3837 walking the file tree. The default is to ignore symbolic links to
3838 avoid multiple indexing of linked files. No effort is made to
3839 avoid duplication when this option is set to true. This option can
3840 be set individually for each of the topdirs members by using
 sections. It cannot be changed below the topdirs level.
3842
3843 indexedmimetypes
3844
3845 Recoll normally indexes any file which it knows how to read. This
3846 list lets you restrict the indexed MIME types to what you specify.
3847 If the variable is unspecified or the list empty (the default),
3848 all supported types are processed. Can be redefined for
3849 subdirectories.
3850
3851 excludedmimetypes
3852
3853 This list lets you exclude some MIME types from indexing. Can be
3854 redefined for subdirectories.
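
 One plausible reading of how the two lists combine is sketched below in
 Python. should_index is a hypothetical helper, and the behaviour when
 both lists are set (the exclusion is applied after the restriction) is
 our assumption, not something stated above.

```python
def should_index(mime, indexedmimetypes=(), excludedmimetypes=()):
    """Decide if a document of the given MIME type gets indexed."""
    if indexedmimetypes and mime not in indexedmimetypes:
        return False              # a non-empty list restricts indexing
    return mime not in excludedmimetypes

print(should_index("application/pdf"))                                   # no lists set
print(should_index("application/pdf", indexedmimetypes=["text/plain"]))  # restricted
print(should_index("image/jpeg", excludedmimetypes=["image/jpeg"]))      # excluded
```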
3855
3856 compressedfilemaxkbs
3857
3858 Size limit for compressed (.gz or .bz2) files. These need to be
3859 decompressed in a temporary directory for identification, which
3860 can be very wasteful if 'uninteresting' big compressed files are
3861 present. Negative means no limit, 0 means no processing of any
3862 compressed file. Defaults to -1.
3863
3864 textfilemaxmbs
3865
3866 Maximum size for text files. Very big text files are often
 uninteresting logs. Set to -1 to disable the limit (default: 20 MB).
3868
3869 textfilepagekbs
3870
3871 If set to other than -1, text files will be indexed as multiple
3872 documents of the given page size. This may be useful if you do
3873 want to index very big text files as it will both reduce memory
3874 usage at index time and help with loading data to the preview
3875 window. A size of a few megabytes would seem reasonable (default:
3876 1MB).
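
 The paging idea can be pictured with the following Python sketch.
 split_pages is a hypothetical helper which cuts at fixed byte offsets
 and takes a kilobyte to be 1024 bytes; the real indexer may well choose
 its page boundaries differently (for example at line ends).

```python
def split_pages(data, textfilepagekbs=1000):
    """Cut a text file's content into pages of at most the given size."""
    if textfilepagekbs == -1:
        return [data]             # paging disabled: a single document
    size = textfilepagekbs * 1024
    return [data[i:i + size] for i in range(0, len(data), size)]

pages = split_pages(b"x" * 2500, textfilepagekbs=1)
print([len(p) for p in pages])
```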
3877
3878 membermaxkbs
3879
3880 This defines the maximum size in kilobytes for an archive member
3881 (zip, tar or rar at the moment). Bigger entries will be skipped.
3882
3883 indexallfilenames
3884
3885 Recoll indexes file names in a special section of the database to
3886 allow specific file names searches using wild cards. This
3887 parameter decides if file name indexing is performed only for
3888 files with MIME types that would qualify them for full text
3889 indexing, or for all files inside the selected subtrees,
3890 independently of MIME type.
3891
3892 usesystemfilecommand
3893
3894 Decide if we execute a system command (file -i by default) as a
3895 final step for determining the MIME type for a file (the main
3896 procedure uses suffix associations as defined in the mimemap
3897 file). This can be useful for files with suffix-less names, but it
3898 will also cause the indexing of many bogus "text" files.
3899
3900 systemfilecommand
3901
 Command to use for MIME type determination if
 usesystemfilecommand is set. Recent versions of xdg-mime sometimes
 work better than file.
3905
3906 processwebqueue
3907
3908 If this is set, process the directory where Web browser plugins
3909 copy visited pages for indexing.
3910
3911 webqueuedir
3912
3913 The path to the web indexing queue. This is hard-coded in the
3914 Firefox plugin as ~/.recollweb/ToIndex so there should be no need
3915 to change it.
3916
3917 5.4.2.2. Parameters affecting how we generate terms:
3918
3919 Changing some of these parameters will imply a full reindex. Also, when
3920 using multiple indexes, it may not make sense to search indexes that don't
3921 share the values for these parameters, because they usually affect both
3922 search and index operations.
3923
3924 indexStripChars
3925
3926 Decide if we strip characters of diacritics and convert them to
3927 lower-case before terms are indexed. If we don't, searches
3928 sensitive to case and diacritics can be performed, but the index
3929 will be bigger, and some marginal weirdness may sometimes occur.
3930 The default is a stripped index (indexStripChars = 1) for now.
3931 When using multiple indexes for a search, this parameter must be
3932 defined identically for all. Changing the value implies an index
3933 reset.
3934
3935 maxTermExpand
3936
3937 Maximum expansion count for a single term (e.g.: when using
3938 wildcards). The default of 10000 is reasonable and will avoid
3939 queries that appear frozen while the engine is walking the term
3940 list.
3941
3942 maxXapianClauses
3943
3944 Maximum number of elementary clauses we can add to a single Xapian
3945 query. In some cases, the result of term expansion can be
3946 multiplicative, and we want to avoid using excessive memory. The
3947 default of 100 000 should be both high enough in most cases and
3948 compatible with current typical hardware configurations.
3949
3950 nonumbers
3951
 If this is set to true, no terms will be generated for numbers. For
 example "123", "1.5e6" and "192.168.1.4" would not be indexed
 ("value123" would still be). Numbers are often quite interesting
 to search for, and this should probably not be set except for
 special situations, e.g. scientific documents with huge amounts of
 numbers in them. This can only be set for a whole index, not for a
 subtree.
3959
3960 nocjk
3961
 If this is set to true, specific East Asian (Chinese, Japanese, Korean)
 character/word splitting is turned off. This will save a small
 amount of CPU if you have no CJK documents. If your document base
 does include such text but you are not interested in searching it,
 setting nocjk may be a significant time and space saver.
3967
3968 cjkngramlen
3969
3970 This lets you adjust the size of n-grams used for indexing CJK
3971 text. The default value of 2 is probably appropriate in most
3972 cases. A value of 3 would allow more precision and efficiency on
3973 longer words, but the index will be approximately twice as large.
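
 For readers not familiar with n-grams, this Python sketch shows
 overlapping character n-grams for a short Japanese string. ngrams is a
 hypothetical helper which assumes the usual overlapping scheme; the
 terms actually generated by the Recoll splitter may differ (for example
 by also including shorter n-grams).

```python
def ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("全文検索", 2))   # bigrams, as with the default cjkngramlen = 2
print(ngrams("全文検索", 3))   # trigrams, as with cjkngramlen = 3
```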
3974
3975 indexstemminglanguages
3976
3977 A list of languages for which the stem expansion databases will be
3978 built. See recollindex(1) or use the recollindex -l command for
3979 possible values. You can add a stem expansion database for a
3980 different language by using recollindex -s, but it will be deleted
3981 during the next indexing. Only languages listed in the
3982 configuration file are permanent.
3983
3984 defaultcharset
3985
 The name of the character set used for files that do not contain a
 character set definition (e.g. plain text files). This can be
 redefined for any sub-directory. If it is not set at all, the
 character set used is the one defined by the NLS environment
 (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
3991
3992 unac_except_trans
3993
 This is a list of characters, encoded in UTF-8, which should be
 handled specially when converting text to unaccented lowercase.
 For example, in Swedish, the letter a with diaeresis has full
 alphabet citizenship and should not be turned into an a. Each
 element in the space-separated list has the special character as
 first element and the translation following. The handling of both
 the lowercase and upper-case versions of a character should be
 specified, as membership in the list turns off both standard
 accent and case processing. Example for Swedish:
4003
 unac_except_trans = åå Åå ää Ää öö Öö
4005
4006
 Note that the translation is not limited to a single character,
 you could very well have something like üue in the list.

 The default value set for unac_except_trans can't be listed here
 because I have trouble with SGML and UTF-8, but it only contains
 ligature decompositions: German ss, oe, ae, fi, fl.
4013
4014 This parameter can't be defined for subdirectories, it is global,
4015 because there is no way to do otherwise when querying. If you have
4016 document sets which would need different values, you will have to
4017 index and query them separately.
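
 The effect of the exception list can be sketched in Python as follows,
 with the Swedish list written using the actual characters (å, ä, ö).
 This only imitates the principle with the standard unicodedata module:
 parse_exceptions and fold are hypothetical helpers, not the code used by
 Recoll.

```python
import unicodedata

def parse_exceptions(value):
    """Map each element's first character to the rest of the element."""
    table = {}
    for element in unicodedata.normalize("NFC", value).split():
        table[element[0]] = element[1:]
    return table

def fold(text, exceptions):
    """Strip accents and lowercase, except for characters in the list."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exceptions:
            out.append(exceptions[ch])   # listed: use the translation as is
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(stripped.lower())
    return "".join(out)

swedish = parse_exceptions("åå Åå ää Ää öö Öö")
print(fold("Åre Café", swedish))   # å is preserved, é loses its accent
print(fold("Åre Café", {}))        # without exceptions, å becomes a
```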
4018
4019 maildefcharset
4020
4021 This can be used to define the default character set specifically
4022 for email messages which don't specify it. This is mainly useful
4023 for readpst (libpst) dumps, which are utf-8 but do not say so.
4024
4025 localfields
4026
 This allows setting fields for all documents under a given
 directory. Typical usage would be to set an "rclaptg" field, to be
 used in mimeview to select a specific viewer. If several fields
 are to be set, they should be separated with a semi-colon (';')
 character, which there is currently no way to escape. Also note
 the initial semi-colon. Example: localfields = ;rclaptg=gnus;other
 = val, then select the specific viewer with mimetype|tag=... in
 mimeview.
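
 The structure of the value can be made explicit with a few lines of
 Python. parse_localfields is a hypothetical helper written for this
 example only; it splits on semi-colons exactly as described, with no
 escaping.

```python
def parse_localfields(value):
    """Split a localfields value into a field name to value mapping."""
    fields = {}
    for part in value.split(";"):
        if "=" not in part:
            continue          # skips the empty string before the initial ';'
        name, val = part.split("=", 1)
        fields[name.strip()] = val.strip()
    return fields

print(parse_localfields(";rclaptg=gnus;other = val"))
```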
4035
4036 testmodifusemtime
4037
4038 If true, use mtime instead of default ctime to determine if a file
4039 has been modified (in addition to size, which is always used).
4040 Setting this can reduce re-indexing on systems where extended
4041 attributes are modified (by some other application), but not
4042 indexed (changing extended attributes only affects ctime). Notes:
4043
4044 o This may prevent detection of change in some marginal file
4045 rename cases (the target would need to have the same size and
4046 mtime).
4047
4048 o You should probably also set noxattrfields to 1 in this case,
4049 except if you still prefer to perform xattr indexing, for
4050 example if the local file update pattern makes it of value
4051 (as in general, there is a risk for pure extended attributes
4052 updates without file modification to go undetected).
4053
4054 Perform a full index reset after changing the value of this
4055 parameter.
4056
4057 noxattrfields
4058
4059 Recoll versions 1.19 and later automatically translate file
4060 extended attributes into document fields (to be processed
4061 according to the parameters from the fields file). Setting this
4062 variable to 1 will disable the behaviour.
4063
4064 metadatacmds
4065
4066 This allows executing external commands for each file and storing
4067 the output in Recoll document fields. This could be used for
4068 example to index external tag data. The value is a list of field
4069 names and commands, don't forget an initial semi-colon. Example:
4070
 [/some/area/of/the/fs]
 metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
4073
4074
4075 As a specially disgusting hack brought by Recoll 1.19.7, if a
4076 "field name" begins with rclmulti, the data returned by the
4077 command is expected to contain multiple field values, in
4078 configuration file format. This allows setting several fields by
4079 executing a single command. Example:
4080
 metadatacmds = ; rclmulti1 = somecmd %f
4082
4083
4084 If somecmd returns data in the form of:
4085
 field1 = value1
 field2 = value for field2
4088
4089
4090 field1 and field2 will be set inside the document metadata.
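
 A command used with rclmulti only has to print name = value lines on
 its standard output. As an illustration, here is the output side of
 such a command in Python. emit_fields is a hypothetical helper; a real
 command would compute the values from the file given as %f instead of
 using fixed ones.

```python
import io
import sys

def emit_fields(fields, out=sys.stdout):
    """Write fields in configuration file format, one per line."""
    for name, value in fields.items():
        out.write(f"{name} = {value}\n")

# Capture the output in memory to show what the indexer would receive.
demo = io.StringIO()
emit_fields({"field1": "value1", "field2": "value for field2"}, demo)
print(demo.getvalue(), end="")
```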
4091
4092 5.4.2.3. Parameters affecting where and how we store things:
4093
4094 dbdir
4095
4096 The name of the Xapian data directory. It will be created if
4097 needed when the index is initialized. If this is not an absolute
4098 path, it will be interpreted relative to the configuration
4099 directory. The value can have embedded spaces but starting or
4100 trailing spaces will be trimmed. You cannot use quotes here.
4101
4102 idxstatusfile
4103
4104 The name of the scratch file where the indexer process updates its
4105 status. Default: idxstatus.txt inside the configuration directory.
4106
4107 maxfsoccuppc
4108
4109 Maximum file system occupation before we stop indexing. The value
4110 is a percentage, corresponding to what the "Capacity" df output
4111 column shows. The default value is 0, meaning no checking.
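
 The percentage can be computed in the same way as df does, as in this
 Python sketch for Unix-like systems. fs_occupation_percent and
 should_stop are hypothetical helpers, and stopping when the threshold is
 reached (rather than exceeded) is our assumption.

```python
import os

def fs_occupation_percent(path):
    """Used space as a percentage, like the df "Capacity" column."""
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree
    usable = used + st.f_bavail      # space not reserved for root
    return 100.0 * used / usable if usable else 0.0

def should_stop(path, maxfsoccuppc=0):
    """True if indexing should stop; 0 means that no check is made."""
    if maxfsoccuppc <= 0:
        return False
    return fs_occupation_percent(path) >= maxfsoccuppc

print(round(fs_occupation_percent("/"), 1), should_stop("/", 0))
```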
4112
4113 mboxcachedir
4114
4115 The directory where mbox message offsets cache files are held.
4116 This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
4117 to share a directory between different configurations.
4118
4119 mboxcacheminmbs
4120
4121 The minimum mbox file size over which we cache the offsets. There
4122 is really no sense in caching offsets for small files. The default
4123 is 5 MB.
4124
4125 webcachedir
4126
4127 This is only used by the web browser plugin indexing code, and
4128 defines where the cache for visited pages will live. Default:
4129 $RECOLL_CONFDIR/webcache
4130
4131 webcachemaxmbs
4132
4133 This is only used by the web browser plugin indexing code, and
4134 defines the maximum size for the web page cache. Default: 40 MB.
4135 Quite unfortunately, this is only taken into account when creating
4136 the cache file. You need to delete the file for a change to be
4137 taken into account.
4138
4139 idxflushmb
4140
4141 Threshold (megabytes of new text data) where we flush from memory
4142 to disk index. Setting this can help control memory usage. A value
4143 of 0 means no explicit flushing, letting Xapian use its own
4144 default, which is flushing every 10000 (or XAPIAN_FLUSH_THRESHOLD)
4145 documents, which gives little memory usage control, as memory
4146 usage also depends on average document size. The default value is
4147 10, and it is probably a bit low. If your system usually has free
4148 memory, you can try higher values between 20 and 80. In my
4149 experience, values beyond 100 are always counterproductive.
4150
4151 5.4.2.4. Parameters affecting multithread processing
4152
4153 The Recoll indexing process recollindex can use multiple threads to speed
4154 up indexing on multiprocessor systems. The work done to index files is
4155 divided in several stages and some of the stages can be executed by
4156 multiple threads. The stages are:
4157
4158 1. File system walking: this is always performed by the main thread.
4159 2. File conversion and data extraction.
4160 3. Text processing (splitting, stemming, etc.)
4161 4. Xapian index update.
4162
4163 You can also read a longer document about the transformation of Recoll
4164 indexing to multithreading.
4165
4166 The threads configuration is controlled by two configuration file
4167 parameters.
4168
4169 thrQSizes
4170
4171 This variable defines the job input queues configuration. There
4172 are three possible queues for stages 2, 3 and 4, and this
4173 parameter should give the queue depth for each stage (three
4174 integer values). If a value of -1 is used for a given stage, no
4175 queue is used, and the thread will go on performing the next
 stage. In practice, deep queues have not been shown to increase
4177 performance. A value of 0 for the first queue tells Recoll to
4178 perform autoconfiguration (no need for the two other values in
4179 this case) - this is the default configuration.
4180
4181 thrTCounts
4182
4183 This defines the number of threads used for each stage. If a value
4184 of -1 is used for one of the queue depths, the corresponding
4185 thread count is ignored. It makes no sense to use a value other
4186 than 1 for the last stage because updating the Xapian index is
4187 necessarily single-threaded (and protected by a mutex).
4188
4189 The following example would use three queues (of depth 2), and 4 threads
4190 for converting source documents, 2 for processing their text, and one to
4191 update the index. This was tested to be the best configuration on the test
 system (quad-processor with multiple disks).
4193
 thrQSizes = 2 2 2
 thrTCounts = 4 2 1
4196
 The following example would use a single queue, and the complete
 processing for each document would be performed by a single thread
 (several documents will still be processed in parallel in most cases). The
 threads will use mutual exclusion when entering the index update stage. In
 practice the performance would be close to the previous case in general,
 but worse in certain cases (e.g. a Zip archive would be processed purely
 sequentially), so the previous approach is preferred. YMMV... The last 2
 values for thrTCounts are ignored.
4205
 thrQSizes = 2 -1 -1
 thrTCounts = 6 1 1
4208
4209 The following example would disable multithreading. Indexing will be
4210 performed by a single thread.
4211
 thrQSizes = -1 -1 -1
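
 The rules for reading the two parameters together can be summarized by
 the following Python sketch. thread_plan is a hypothetical helper which
 only restates the rules given in this section; it has nothing to do with
 the actual recollindex implementation.

```python
STAGES = ("conversion", "text processing", "index update")

def thread_plan(thrQSizes, thrTCounts=(1, 1, 1)):
    """Return (stage, queue depth, thread count) tuples.

    None is returned for autoconfiguration. A stage without a queue gets
    (stage, None, None): it is run by the thread of the previous stage.
    """
    if thrQSizes and thrQSizes[0] == 0:
        return None
    plan = []
    for stage, depth, count in zip(STAGES, thrQSizes, thrTCounts):
        if depth == -1:
            plan.append((stage, None, None))   # thread count is ignored
        else:
            plan.append((stage, depth, count))
    return plan

print(thread_plan([2, 2, 2], [4, 2, 1]))    # three queues
print(thread_plan([2, -1, -1], [6, 1, 1]))  # single queue
print(thread_plan([-1, -1, -1]))            # no multithreading
```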
4213
4214 5.4.2.5. Miscellaneous parameters:
4215
4216 autodiacsens
4217
 If the index is not stripped, decide if we automatically trigger
4219 diacritics sensitivity if the search term has accented characters
4220 (not in unac_except_trans). Else you need to use the query
4221 language and the D modifier to specify diacritics sensitivity.
4222 Default is no.
4223
4224 autocasesens
4225
 If the index is not stripped, decide if we automatically trigger
4227 character case sensitivity if the search term has upper-case
4228 characters in any but the first position. Else you need to use the
4229 query language and the C modifier to specify character-case
4230 sensitivity. Default is yes.
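
 The two automatic triggers described above can be written down as
 simple tests on the search term, as in this Python sketch. Both
 functions are hypothetical helpers based on the standard unicodedata
 module, and only approximate what Recoll considers an accented
 character.

```python
import unicodedata

def triggers_case_sensitivity(term):
    """autocasesens: upper-case character in any but the first position."""
    return any(ch.isupper() for ch in term[1:])

def triggers_diac_sensitivity(term, unac_except_chars=""):
    """autodiacsens: accented character which is not in unac_except_trans."""
    for ch in unicodedata.normalize("NFC", term):
        if ch in unac_except_chars:
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(c) for c in decomposed):
            return True
    return False

print(triggers_case_sensitivity("Recoll"), triggers_case_sensitivity("MySQL"))
print(triggers_diac_sensitivity("cafe"), triggers_diac_sensitivity("café"))
```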
4231
 loglevel, daemloglevel

 Verbosity level for recoll and recollindex. A value of 4 lists
 quite a lot of debug/information messages. 2 only lists errors.
 The daem version is specific to the indexing monitor daemon.

 logfilename, daemlogfilename

 Where the messages should go. 'stderr' can be used as a special
 value, and is the default. The daem version is specific to the
 indexing monitor daemon.
4243
4244 checkneedretryindexscript
4245
4246 This defines the name for a command executed by recollindex when
4247 starting indexing. If the exit status of the command is 0,
4248 recollindex retries to index all files which previously could not
4249 be indexed because of data extraction errors. The default value is
4250 a script which checks if any of the common bin directories have
4251 changed (indicating that a helper program may have been
4252 installed).
4253
4254 mondelaypatterns
4255
 This allows specifying wildcard path patterns (processed with
4257 fnmatch(3) with 0 flag), to match files which change too often and
4258 for which a delay should be observed before re-indexing. This is a
4259 space-separated list, each entry being a pattern and a time in
4260 seconds, separated by a colon. You can use double quotes if a path
4261 entry contains white space. Example:
4262
 mondelaypatterns = *.log:20 "this one has spaces*:10"
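
 The example value can be taken apart as shown in this Python sketch.
 parse_mondelaypatterns and delay_for are hypothetical helpers. Python's
 fnmatchcase lets * match '/' characters, like fnmatch(3) with a 0 flag;
 using the first matching entry is our assumption.

```python
import shlex
from fnmatch import fnmatchcase

def parse_mondelaypatterns(value):
    """Return a list of (pattern, delay in seconds) tuples."""
    entries = []
    for item in shlex.split(value):            # honours the double quotes
        pattern, seconds = item.rsplit(":", 1)
        entries.append((pattern, int(seconds)))
    return entries

def delay_for(path, entries):
    """Delay to observe before re-indexing path (0 if nothing matches)."""
    for pattern, seconds in entries:
        if fnmatchcase(path, pattern):
            return seconds
    return 0

entries = parse_mondelaypatterns('*.log:20 "this one has spaces*:10"')
print(entries)
print(delay_for("/var/log/app/server.log", entries))
```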
4264
4265
4266 monixinterval
4267
4268 Minimum interval (seconds) for processing the indexing queue. The
4269 real time monitor does not process each event when it comes in,
4270 but will wait this time for the queue to accumulate to diminish
4271 overhead and in order to aggregate multiple events to the same
 file. Default: 30 seconds.
4273
4274 monauxinterval
4275
4276 Period (in seconds) at which the real time monitor will regenerate
4277 the auxiliary databases (spelling, stemming) if needed. The
4278 default is one hour.
4279
4280 monioniceclass, monioniceclassdata
4281
4282 These allow defining the ionice class and data used by the indexer
4283 (default class 3, no data).
4284
4285 filtermaxseconds
4286
 Maximum handler execution time in seconds, after which the handler
 is aborted. Some PostScript programs just loop forever.
4289
4290 filtermaxmbytes
4291
4292 Recoll 1.20.7 and later. Maximum handler memory utilisation. This
4293 uses setrlimit(RLIMIT_AS) on most systems (total virtual memory
4294 space size limit). Some programs may start with 500 MBytes of
4295 mapped shared libraries, so take this into account when choosing a
4296 value. The default is a liberal 2000MB.
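
As an illustration of what the two limits above mean in practice, here
is a minimal Python sketch which runs a command with a time limit and
an RLIMIT_AS address space limit. This is not how recollindex itself is
implemented: the function names are hypothetical, the 30 second value
is arbitrary, and the code only works on Unix-like systems.

```python
import resource
import subprocess
import sys


def limit_address_space(max_mbytes):
    """Cap the virtual memory of the current process (RLIMIT_AS)."""
    limit = max_mbytes * 1024 * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot exceed the existing hard limit.
        limit = min(limit, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))


def run_handler(cmd, max_seconds, max_mbytes):
    """Run cmd under both limits; return its output, or None on timeout."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            # Runs in the child before exec, so only the handler is limited.
            preexec_fn=lambda: limit_address_space(max_mbytes),
            timeout=max_seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run() has already killed the child
    return done.stdout


# A trivial stand-in for a handler command, with a 2000 MB memory limit.
print(run_handler([sys.executable, "-c", "print('ok')"], 30, 2000))
```

A handler which maps many shared libraries can hit the address space
limit long before it uses that much real memory, which is why the
default is generous.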
4297
4298 filtersdir
4299
4300 A directory to search for the external input handler scripts used
4301 to index some types of files. The value should not be changed,
4302 except if you want to modify one of the default scripts. The value
4303 can be redefined for any sub-directory.
4304
4305 iconsdir
4306
4307 The name of the directory where recoll result list icons are
4308 stored. You can change this if you want different images.
4309
4310 idxabsmlen
4311
4312 Recoll stores an abstract for each indexed file inside the
4313 database. The text can come from an actual 'abstract' section in
4314 the document or will just be the beginning of the document. It is
4315 stored in the index so that it can be displayed inside the result
4316 lists without decoding the original file. The idxabsmlen parameter
4317 defines the size of the stored abstract. The default value is 250
4318 bytes. The search interface gives you the choice to display this
4319 stored text or a synthetic abstract built by extracting text
4320 around the search terms. If you always prefer the synthetic
4321 abstract, you can reduce this value and save a little space.
4322
4323 idxmetastoredlen
4324
4325 Maximum stored length for metadata fields. This does not affect
4326 indexing (the whole field is processed anyway), just the amount of
4327 data stored in the index for the purpose of displaying fields
4328 inside result lists or previews. The default value is 150 bytes
4329 which may be too low if you have custom fields.
4330
4331 aspellLanguage
4332
4333 Language definitions to use when creating the aspell dictionary.
4334 The value must match a set of aspell language definition files.
4335 You can type "aspell config" to see where these are installed
4336 (look for data-dir). The default if the variable is not set is to
4337 use your desktop national language environment to guess the value.
4338
4339 noaspell
4340
4341 If this is set, the aspell dictionary generation is turned off.
4342 Useful for cases where you don't need the functionality or when it
4343 is unusable because aspell crashes during dictionary generation.
4344
4345 mhmboxquirks
4346
4347 This allows defining location-related quirks for the mailbox
4348 handler. Currently only the tbird flag is defined, and it should
4349 be set for directories which hold Thunderbird data, as their
4350 folder format is weird.
4351
4352 5.4.3. The fields file
4353
4354 This file contains information about dynamic fields handling in Recoll.
4355 Some very basic fields have hard-wired behaviour, and, mostly, you should
4356 not change the original data inside the fields file. But you can create
4357 custom fields fitting your data and handle them just like they were native
4358 ones.
4359
4360 The fields file has several sections, which each define an aspect of
4361 fields processing. Quite often, you'll have to modify several sections to
4362 obtain the desired behaviour.
4363
 We only give a short description here. Refer to the comments inside the
 default file for more detailed information.
4366
4367 Field names should be lowercase alphabetic ASCII.
4368
4369 [prefixes]
4370
4371 A field becomes indexed (searchable) by having a prefix defined in
4372 this section.
4373
4374 [stored]
4375
4376 A field becomes stored (displayable inside results) by having its
4377 name listed in this section (typically with an empty value).
4378
4379 [aliases]
4380
 This section defines lists of synonyms for the canonical names
 used inside the [prefixes] and [stored] sections.
4383
4384 [queryaliases]
4385
 This section also defines aliases for the canonical field names,
 with the difference that the substitution will only be used at
 query time, avoiding any possibility that the value would pick up
 random metadata from documents.
4390
4391 handler-specific sections
4392
4393 Some input handlers may need specific configuration for handling
4394 fields. Only the email message handler currently has such a
4395 section (named [mail]). It allows indexing arbitrary email headers
4396 in addition to the ones indexed by default. Other such sections
4397 may appear in the future.
4398
 Here is a small example of a personal fields file. It extracts a
 specific email header and uses it as a searchable field, with data
 displayable inside result lists. (Side note: as the email handler does no
 decoding on the values, only plain ASCII headers can be indexed, and only
 the first occurrence will be used for headers that occur several times.)
4404
    [prefixes]
    # Index mailmytag contents (with the given prefix)
    mailmytag = XMTAG

    [stored]
    # Store mailmytag inside the document data record (so that it can be
    # displayed - as %(mailmytag) - in result lists).
    mailmytag =

    [queryaliases]
    filename = fn
    containerfilename = cfn

    [mail]
    # Extract the X-My-Tag mail header, and use it internally with the
    # mailmytag field name
    x-my-tag = mailmytag
4422
4423 5.4.3.1. Extended attributes in the fields file
4424
 Recoll versions 1.19 and later process user extended file attributes as
 document fields by default.
4427
4428 Attributes are processed as fields of the same name, after removing the
4429 user prefix on Linux.
4430
 The [xattrtofields] section of the fields file allows specifying
 translations from extended attribute names to Recoll field names. An
 empty translation disables use of the corresponding attribute data.
4434
4435 5.4.4. The mimemap file
4436
4437 mimemap specifies the file name extension to MIME type mappings.
4438
4439 For file names without an extension, or with an unknown one, the system's
4440 file -i command will be executed to determine the MIME type (this can be
4441 switched off inside the main configuration file).
4442
4443 The mappings can be specified on a per-subtree basis, which may be useful
4444 in some cases. Example: gaim logs have a .txt extension but should be
4445 handled specially, which is possible because they are usually all located
4446 in one place.
4447
4448 The recoll_noindex mimemap variable has been moved to recoll.conf and
4449 renamed to noContentSuffixes, while keeping the same function, as of
4450 Recoll version 1.21. For older Recoll versions, see the documentation for
4451 noContentSuffixes but use recoll_noindex in mimemap.
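
The lookup order described at the start of this section (extension
mapping first, then the file command) can be sketched in Python as
follows. This is only an illustration: the MIMEMAP dictionary stands in
for the contents of a mimemap file, the function name is hypothetical,
lowercasing the extension is an assumption, and the -b (brief) option
is added to file -i here just to simplify parsing the output.

```python
import os
import subprocess

# Hypothetical mapping, standing in for the contents of a mimemap file.
MIMEMAP = {".blob": "application/x-blobapp", ".txt": "text/plain"}


def mime_type(path, mimemap=MIMEMAP, use_file_command=True):
    """Return the MIME type for path, or None if it cannot be determined."""
    # Assumption: extensions are compared in lower case.
    suffix = os.path.splitext(path)[1].lower()
    if suffix in mimemap:
        return mimemap[suffix]
    if not use_file_command:
        return None
    # Missing or unknown extension: ask the system's file command.
    # Brief output looks like "text/plain; charset=us-ascii".
    done = subprocess.run(
        ["file", "-b", "-i", path], stdout=subprocess.PIPE, text=True
    )
    if done.returncode != 0:
        return None
    return done.stdout.split(";")[0].strip() or None


print(mime_type("report.blob"))
print(mime_type("README", use_file_command=False))
```

The use_file_command argument plays the role of the main configuration
setting which switches off the use of the file command.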
4452
4453 5.4.5. The mimeconf file
4454
4455 mimeconf specifies how the different MIME types are handled for indexing,
4456 and which icons are displayed in the recoll result lists.
4457
4458 Changing the parameters in the [index] section is probably not a good idea
4459 except if you are a Recoll developer.
4460
 The [icons] section allows you to change the icons which are displayed by
 recoll in the result lists. The values are the basenames of the PNG images
 inside the iconsdir directory (specified in recoll.conf).
4464
4465 5.4.6. The mimeview file
4466
 mimeview specifies which programs are started when you click on an Open
 link in a result list. For example, HTML is normally displayed using
 firefox, but you may prefer Konqueror, or your openoffice.org program
 might be named ooffice instead of openoffice.
4471
4472 Changes to this file can be done by direct editing, or through the recoll
4473 GUI preferences dialog.
4474
4475 If Use desktop preferences to choose document editor is checked in the
4476 Recoll GUI preferences, all mimeview entries will be ignored except the
4477 one labelled application/x-all (which is set to use xdg-open by default).
4478
4479 In this case, the xallexcepts top level variable defines a list of MIME
4480 type exceptions which will be processed according to the local entries
4481 instead of being passed to the desktop. This is so that specific Recoll
4482 options such as a page number or a search string can be passed to
4483 applications that support them, such as the evince viewer.
4484
4485 As for the other configuration files, the normal usage is to have a
4486 mimeview inside your own configuration directory, with just the
4487 non-default entries, which will override those from the central
4488 configuration file.
4489
4490 All viewer definition entries must be placed under a [view] section.
4491
4492 The keys in the file are normally MIME types. You can add an application
4493 tag to specialize the choice for an area of the filesystem (using a
4494 localfields specification in mimeconf). The syntax for the key is
4495 mimetype|tag
4496
 The nouncompforviewmts entry (placed at the top level, outside of the
 [view] section) holds a list of MIME types that should not be
 uncompressed before starting the viewer (if they are found compressed,
 for example mydoc.doc.gz).
4501
4502 The right side of each assignment holds a command to be executed for
4503 opening the file. The following substitutions are performed:
4504
 o %D. Document date.

 o %f. File name. This may be the name of a temporary file if it was
   necessary to create one (for example to extract a subdocument from a
   container).

 o %i. Internal path, for subdocuments of containers. The format depends
   on the container type. If this appears in the command line, Recoll
   will not create a temporary file to extract the subdocument, expecting
   the called application (possibly a script) to be able to handle it.

 o %M. MIME type.

 o %p. Page index. Only significant for a subset of document types,
   currently only PDF, PostScript and DVI files. Can be used to start the
   editor at the right page for a match or snippet.

 o %s. Search term. The value will only be set for documents with indexed
   page numbers (for example PDF). The value will be one of the matched
   search terms. It would allow pre-setting the value in the "Find" entry
   inside Evince for example, for easy highlighting of the term.

 o %u. URL.
4528
4529 In addition to the predefined values above, all strings like %(fieldname)
4530 will be replaced by the value of the field named fieldname for the
4531 document. This could be used in combination with field customisation to
4532 help with opening the document.
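
The substitution mechanism can be pictured with the short Python sketch
below. It is a simplified model, not Recoll's implementation: the
function name, the viewer command and its options are hypothetical, and
quoting of values containing spaces is ignored. Missing values are
replaced by an empty string, which is also an assumption.

```python
import re


def expand_command(template, values, fields):
    """Expand %X and %(fieldname) markers in a mimeview command line."""

    def replace(match):
        if match.group("field") is not None:
            # %(fieldname): value of a document field, empty if absent.
            return fields.get(match.group("field"), "")
        # Single-letter substitution such as %f or %p.
        return values.get(match.group("letter"), "")

    pattern = r"%\((?P<field>[^)]+)\)|%(?P<letter>[DfiMpsu])"
    return re.sub(pattern, replace, template)


values = {"f": "/tmp/doc.pdf", "p": "3", "s": "recoll", "M": "application/pdf"}
fields = {"mailmytag": "urgent"}
print(expand_command("viewer --page %p --find %s %f", values, fields))
print(expand_command("viewer --tag %(mailmytag) %f", values, fields))
```

The second call shows how a custom field defined in the fields file
could be passed to the viewer.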
4533
4534 5.4.7. The ptrans file
4535
4536 ptrans specifies query-time path translations. These can be useful in
4537 multiple cases.
4538
 The file has a section for any index which needs translations, either the
 main one or additional query indexes. The sections are named with the
 Xapian index directory names. No slash character should exist at the end
 of the paths (all comparisons are textual). An example should make things
 sufficiently clear:
4544
    [/home/me/.recoll/xapiandb]
    /this/directory/moved = /to/this/place

    [/path/to/additional/xapiandb]
    /server/volume1/docdir = /net/server/volume1/docdir
    /server/volume2/docdir = /net/server/volume2/docdir
4551
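To make the effect of such a file concrete, here is a Python sketch
which applies the translations from the example to a document path. It
only models the behaviour described above (a purely textual prefix
replacement, selected by index directory). The function name is
hypothetical, and using the first matching entry is an assumption.

```python
# The example ptrans file above, as a dictionary keyed by index directory.
PTRANS = {
    "/home/me/.recoll/xapiandb": {
        "/this/directory/moved": "/to/this/place",
    },
    "/path/to/additional/xapiandb": {
        "/server/volume1/docdir": "/net/server/volume1/docdir",
        "/server/volume2/docdir": "/net/server/volume2/docdir",
    },
}


def translate(index_dir, path, ptrans=PTRANS):
    """Apply the first matching textual prefix translation for index_dir."""
    for old, new in ptrans.get(index_dir, {}).items():
        # Plain string comparison: a trailing slash in 'old' would not
        # match a path written without it.
        if path.startswith(old):
            return new + path[len(old):]
    return path  # no translation applies


print(translate("/path/to/additional/xapiandb", "/server/volume2/docdir/a/b.txt"))
```

A path which matches no entry, or which belongs to an index without a
section, is returned unchanged.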
4552
4553 5.4.8. Examples of configuration adjustments
4554
 5.4.8.1. Adding an external viewer for a non-indexed type
4556
4557 Imagine that you have some kind of file which does not have indexable
4558 content, but for which you would like to have a functional Open link in
4559 the result list (when found by file name). The file names end in .blob and
4560 can be displayed by application blobviewer.
4561
4562 You need two entries in the configuration files for this to work:
4563
 o In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
   following line:

       .blob = application/x-blobapp

   Note that the MIME type is made up here, and you could call it
   diesel/oil just the same.

 o In $RECOLL_CONFDIR/mimeview under the [view] section, add:

       application/x-blobapp = blobviewer %f

   We are supposing here that blobviewer wants a file name parameter.
   You would use %u if it liked URLs better.
4578
 If you just wanted to change the application used by Recoll to display a
 MIME type which it already knows, you would just need to edit mimeview.
 The entries you add in your personal file override those in the central
 configuration, which you do not need to alter. mimeview can also be
 modified from the GUI.
4584
4585 5.4.8.2. Adding indexing support for a new file type
4586
4587 Let us now imagine that the above .blob files actually contain indexable
4588 text and that you know how to extract it with a command line program.
4589 Getting Recoll to index the files is easy. You need to perform the above
4590 alteration, and also to add data to the mimeconf file (typically in
4591 ~/.recoll/mimeconf):
4592
 o Under the [index] section, add the following line (more about the
   rclblob indexing script later):

       application/x-blobapp = exec rclblob

 o Under the [icons] section, you should choose an icon to be displayed
   for the files inside the result lists. Icons are normally 64x64 pixel
   PNG files which live in /usr/[local/]share/recoll/images.

 o Under the [categories] section, you should add the MIME type where it
   makes sense (you can also create a category). Categories may be used
   for filtering in advanced search.
4605
 The rclblob handler should be an executable program or script which exists
 inside /usr/[local/]share/recoll/filters. It will be given a file name as
 argument and should output the text or HTML contents on the standard
 output.
4610
4611 The filter programming section describes in more detail how to write an
4612 input handler.
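
As a rough idea of what such a handler could look like, here is a
minimal rclblob sketch in Python. It is not an actual Recoll handler:
the .blob format is imaginary, so the script simply assumes that the
file contains UTF-8 text, and the function names are hypothetical. It
outputs HTML, escaping the text so that it cannot be mistaken for
markup.

```python
#!/usr/bin/env python3
import html
import sys


def text_to_html(text):
    """Wrap plain text in a minimal HTML document."""
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        "</head><body><pre>" + html.escape(text) + "</pre></body></html>"
    )


def blob_to_html(path):
    """Return an HTML rendering of the text found in a .blob file."""
    # Assumption for this sketch: a .blob file is just UTF-8 text.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return text_to_html(f.read())


if __name__ == "__main__" and len(sys.argv) == 2:
    # The file name is the single argument; the result goes to stdout.
    sys.stdout.write(blob_to_html(sys.argv[1]))
```

A real handler would replace the body of blob_to_html with a call to
whatever program or library knows how to extract text from the format.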
4613