Pinot
Copyright 2005-2021 Fabrice Colin <fabrice dot colin at gmail dot com>

Homepage - https://github.com/FabriceColin/pinot
  previously hosted at http://code.google.com/p/pinot-search/
  and http://pinot.berlios.de/
Translations - https://translations.launchpad.net/pinot/trunk/+pots/pinot


1. What is Pinot
2. Available engines
3. Indexes
4. Indexing and monitoring
5. Searching
6. Viewing cached results
7. File formats
8. File patterns
9. Digging deeper
10. Saving results
11. D-Bus service & daemon
12. CJKV support
13. Environment variables and aliases
14. How to reset indexes
15. Compiling


1. What is Pinot


  Pinot combines desktop search and metasearch. It consists of:
  * a D-Bus service daemon that crawls, indexes and monitors your documents,
    and that plugs into the GNOME Shell search system ("pinot-dbus-daemon")
  * a GTK3-based user interface that lets you query the index built by
    the service as well as Web engines, and which can display and analyze
    the results ("pinot")
  * other command-line tools

  It was developed and tested on GNU/Linux and should work on other Unix-like
  systems.


2. Available engines


  One of the main functionalities of Pinot is metasearch. This lets you query
  a variety of sources, including Web-based search engines. By default, the
  list of available engines is hidden and defaults to internal indexes (see
  section "3. Indexes"). To show the list of engines, click on the Show All
  Search Engines button, next to the Query field immediately below the menu
  bar. Click on the same button again to hide the list.

  Any number of engines or engine groups may be selected at any one time.
  Multi-selection works like in any other application. All queries are always
  run against the list of currently selected engines.

  Pinot supports both Sherlock and OpenSearch Description plugins. They are
  installed in $PREFIX/share/pinot/engines/, where PREFIX is usually /usr.
  Additional engines can be installed in that directory or in ~/.pinot/engines.
  Note that ~/.pinot/engines is not created automatically.

  Sherlock is what Firefox and the Mozilla Suite use. Chances are that somebody
  wrote a plugin for the engine you are interested in. Beware that a lot are
  out of date and will require some changes. Use pinot-search on the
  command-line to run a quick check on a plugin, e.g.
    $ pinot-search sherlock $PREFIX/share/pinot/engines/Bozo.src "clowns"

  Plugins are categorized by channels. For Sherlock plugins, the routeType
  element under SEARCH specifies the name of the channel the plugin belongs to.

  As for OpenSearch, Pinot should work with OpenSearch Description 1.0 and 1.1
  (draft 2) plugins. Keep in mind that the spec doesn't describe how to parse
  the results pages returned by search engines, therefore Pinot assumes that
  engines return results formatted according to the OpenSearch Response
  standard.
  In practice, this means that plugins that don't stick to the following rules
  will be ignored or won't show any result:
  * For Description 1.1 plugins, the type attribute on the Url field must be
    set to "application/atom+xml" or "application/rss+xml" (default).
    "text/html" will be rejected.
  * The search engine's results page content type must be some form of XML,
    otherwise Pinot won't attempt parsing it.
  Pinot differs from the Description spec in that it interprets the Tags field
  as a channel name. The standard defines Tags as a "space-delimited set of
  words that are used as keywords to identify and categorize this search
  content".

  The "Xapian Omega" plugin lets you query a locally installed instance of
  Xapian Omega at http://localhost/. If Omega is installed elsewhere, edit
  $PREFIX/share/pinot/engines/OmegaDescription.xml.


3. Indexes


  Pinot has two internal indexes. My Documents is populated by the D-Bus
  service and contains documents found on your computer. My Web Pages is
  populated by the UI whenever you:
  * import an external document, using the Index, Import URL menu
  * index results returned by Web engines, using the Results, Index menu
    or through a Stored Query
  Both indexes may contain any of the file types listed in section
  "7. File formats".

  Indexes built by any other Xapian-based tools can be added to Pinot. To add
  an external index, click the + button at the bottom of the engines list.
  It can either be local, in which case you will have to select the directory
  where it is found, or served from a remote machine by xapian-tcpsrv. See
  the manual page for xapian-tcpsrv(1).

  All indexes are grouped together under the channel Current User in the
  engines list.


4. Indexing and monitoring


  Pinot can index any directory configured under the Indexing tab of the
  Preferences box. Monitoring is optional and should be disabled for
  directories whose contents seldom change, e.g. $PREFIX/share/doc.
  Indexing and monitoring of directories is handled by the D-Bus service.
  The number of files and directories that can be monitored is capped by
  the value of /proc/sys/fs/inotify/max_user_watches - 1024.

  Symlinks are not followed but are still indexed, with the MIME type
  "inode/symlink".

  While Pinot is not currently able to get to and index application-specific
  data held in dot-directories, it can index common file formats as listed
  in section "7. File formats".

  All files and directories with a name that starts with a dot, e.g.
  ".thunderbird", are skipped and their content is not indexed. If you wish
  to include the contents of some dot-directory, create a symlink to a
  directory that is configured in Preferences. For instance, if "~/Documents"
  is configured for indexing, create a symlink from "~/.thunderbird" to
  "~/Documents/TMail". For this to work, the dot-directory must not be in a
  directory configured for indexing.

  If you want to exclude any specific files or directories from indexing, use
  patterns as described in section "8. File patterns".

  Pinot supports stopword removal. No stopword lists are provided by default,
  but they can easily be found on the Internet. Each language has its own
  stopword list. For instance, a stopword list for English should be copied to
    $PREFIX/share/pinot/stopwords/stopwords.en

  Language detection is done with libexttextcat. Ensure that the paths listed
  in /etc/pinot/textcat_conf.txt are correct.

  The pinot-index program allows indexing and peeking at documents' properties
  from the command-line. Using the -i/--index option with the My Documents or
  My Web Pages index is not recommended. For more details, see the manual page
  for pinot-index(1).


5. Searching


  Searches are run differently based on the type of engine being queried.

  When querying a Web engine, Pinot assumes this engine understands the query,
  which is sent as is. No pre-processing is performed on the text of the query,
  and the results list is more or less presented as retrieved from the Web
  engine.

  When querying an index, things are somewhat different. Queries can be
  expressed in a very natural way, using a combination of operators, filters
  and ranges. This query syntax is the syntax supported natively by Xapian's
  QueryParser and is documented at http://www.xapian.org/docs/queryparser.html
  For instance, the query "type:text/html AND lang:en AND (tcp NEAR ip)" will
  look for HTML files in English that mention TCP/IP. Note that all operators
  should be specified in capitals, e.g. "AND" not "and". The latter will be
  treated as a regular term.

  Pinot supports these query filters:
    "site" for host name, e.g. "site:github.com"
    "file" for file name, e.g. "file:index.html"
    "ext" for file extension, e.g. "ext:html"
    "title" for title, e.g. "title:pinot"
    "url" for URL, e.g. "url:https://github.com/"
    "dir" for directory, e.g. "dir:/home/fabrice"
    "inurl" for documents embedded in a URL, e.g.
      "inurl:file:///home/fabrice/Documents/backup.tar.gz"
    "lang" for ISO language code, e.g. "lang:en"
    "type" for MIME type, e.g. "type:text/html"
    "class" for MIME type classification, e.g. "class:text"
    "label" for label, e.g. "label:Important"

  The directory filter is recursive, i.e. it applies to sub-directories.
  Allowed language codes are "da", "nl", "en", "fi", "fr", "de", "hu", "it",
  "nn", "pt", "ro", "ru", "es", "sv" and "tr".

  Stemming is available to stored queries for which a stemming language is
  defined. If such a query doesn't return any exact match, the query terms are
  stemmed and the query is run again. Stopwords are also then removed if a
  stopwords list was found for the stemming language.

  The values of "file", "url", "dir" and "label" may be double-quoted. It's also
  worth pointing out that the query "dir:/X/Y" will return files and directories
  located in /X/Y, but not Y itself, which is what "dir:/X file:Y" would do.

  In addition, these ranges are supported:
    "YYYYMMDD..YYYYMMDD" for date ranges, e.g. "20070801..20070831"
    "HHMMSS..HHMMSS" for time ranges, e.g. "090000..180000"
    "size0..size1b" for size in bytes, e.g. "0..10240b"

  See the manual page for pinot-search(1) for examples.


6. Viewing cached results


  Results returned by search engines can be viewed "live" by selecting the View
  menuitem under Results. This opens whatever application is defined for the
  result's MIME type and/or protocol scheme.
  In addition, Pinot lets you view the page as cached by Google and the Wayback
  Machine. Cache providers are actually configured in globalconfig.xml, located
  in /etc/pinot/. For instance:
    <cache>
      <name>Google</name>
      <location>http://www.google.com/search?q=cache:%url0</location>
      <protocols>http, https</protocols>
    </cache>

  This is self-explanatory :-) Here it configures a cache provider called
  "Google" that handles both http and https. The location field supports
  two parameters that are substituted to obtain the URL to open:
  * %url is the result's URL as displayed by the UI, e.g.
    https://github.com/FabriceColin/pinot
  * %url0 is the result's URL without the protocol, e.g.
    github.com/FabriceColin/pinot


7. File formats


  The following document types are supported internally:
  * plain text
  * HTML
  * XML
  * mbox, including attachments and embedded documents
  * MP3, Ogg Vorbis, FLAC
  * JPEG
  * common archive formats (tar, Z, gz, bzip2, deb)
  * ISO 9660 images

  The following document types are supported through external programs:
  * PDF (pdftotext required)
  * RTF (unrtf required)
  * ReStructured Text (rst2txt required)
  * OpenDocument/StarOffice files (unzip required)
  * MS Word (antiword required)
  * PowerPoint (catppt required)
  * Excel (xls2csv required)
  * DVI (catdvi required)
  * DjVu (djvutxt required)
  * RPM (rpm required)

  For other document types, Pinot will only index metadata such as name,
  location, etc. If you wish to add support for another document type, and
  know of a command-line program that can handle that type, add it to
  external-filters.xml, located in /etc/pinot/.


8. File patterns


  It is possible to skip indexing of files that match glob(3) patterns.
  These patterns are configured in the Indexing tab of the Preferences box,
  and can be used as a blacklist or a whitelist.

  Patterns apply to files and directories. For instance, blacklisting
  "*/Desktop*" will skip "~/Desktop", and neither crawl nor monitor this
  directory's contents. Similarly, a blacklist entry for "*.avi" means that
  Pinot will not attempt indexing the content of AVI files, and will ignore
  all monitor events related to these files.

  If you have never run Pinot before, the list will be pre-configured to skip
  some picture, video and archive file types such as GIF, MPG and RAR.


9. Digging deeper


  Pinot offers two ways you can dig deeper in your documents: More Like This
  suggests terms specific to documents that may help in finding related
  documents, and Search This For lets you search within results.
  Both features are enabled if one or more of the results currently selected
  is indexed, and only operate on those.

  When activated, More Like This will create a new Stored Query prefixed with
  "More Like". For instance, if you run a Stored Query with name "Me", the
  expanded query's name will be "More Like Me".

  Search This For will search those results for the Stored Query selected in
  the sub-menu and will present results in a new tab. For instance, running
  the Stored Query "Me" on a set of results will open a "Me In Results" tab.

  In addition to these, Pinot may suggest alternative spellings for queries
  that don't return any result. If it does, a new Stored Query prefixed with
  "Corrected" will be created.


10. Saving results


  Lists of results can be saved to disk by selecting the Save As menuitem
  under Results. Two output formats are available to choose from in the file
  selector opened by Save As:
  * CSV, a text format
    The semicolon character (';') is used to delimit fields.
  * OpenSearch response, an XML/RSS format
    See https://en.wikipedia.org/wiki/OpenSearch for details.


11. D-Bus service & daemon


  Unless Pinot was built without support for D-Bus, the daemon program
  "pinot-dbus-daemon" implements the D-Bus service and should be
  auto-started through the desktop file installed at
  /etc/xdg/autostart/pinot-dbus-daemon.desktop.

  D-Bus activation makes sure the service is running whenever one of its
  methods is invoked by any consumer application. For instance, clicking
  OK on the Preferences box will call the service's Reload method, which
  should start the service. This method also causes the service to reload
  the configuration file.

  A few things to keep in mind:
  * when starting, the service will first crawl all configured locations
    and (re)index new and modified files. The daemon's scheduling priority
    is set very low (15, can be adjusted with --priority) so that it
    hopefully doesn't prevent other activities. Crawling is suspended
    while the system is on battery.
  * when finished crawling, the service will monitor some locations for
    changes (as per preferences) and should consume few resources, unless
    a huge quantity of files needs its attention.
  * any change detected by the monitor is queued and acted upon as soon as
    possible, e.g. a file that was modified is reindexed.
  * operations that involve communicating with the service, such as editing
    document metadata, may time out if the system is under heavy load and/or
    the daemon is busy. In most cases, the message will have been received
    by the daemon, but the reply may take longer than expected. The Pinot
    UI may report that the operation failed, even though it was queued for
    processing and will be acted upon by the daemon.

  See section "13. Environment variables and aliases" for some tips on how to
  query the D-Bus interface. A list of available D-Bus methods can be found
  in the file pinot-dbus-daemon.xml.
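
  For scripting purposes, a call to the service can be sketched in Python as
  below. The sketch only assembles a dbus-send command line; it reuses the
  bus name, object path and GetStatistics method quoted in section 13, while
  the helper name dbus_send_command is hypothetical. It assumes the method
  takes no arguments, and actually running the command requires dbus-send
  and a session bus.

```python
import shlex

# Names taken from the dbus-send aliases in section 13
BUS_NAME = "com.github.fabricecolin.Pinot"
OBJECT_PATH = "/com/github/fabricecolin/Pinot"


def dbus_send_command(method):
    """Return the dbus-send argument list for a method without arguments."""
    return [
        "dbus-send",
        "--session",
        "--print-reply",
        "--type=method_call",
        f"--dest={BUS_NAME}",
        OBJECT_PATH,
        f"{BUS_NAME}.{method}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; on a system where the
    # service is installed, the list can be passed to subprocess.run()
    print(shlex.join(dbus_send_command("GetStatistics")))
```

  Building an argument list rather than a single string avoids shell quoting
  issues if the command is later run through subprocess.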

  Pinot v1.20 implements the GNOME Shell search provider interface to allow
  searching the contents of files the daemon found at locations it crawled,
  basically the My Documents index. Go to the GNOME Settings' Search screen
  to enable Pinot as a provider. For this to work, the file
  com.github.fabricecolin.Pinot.search-provider.ini should be in the folder
  $PREFIX/share/gnome-shell/search-providers/


12. CJKV support


  Pinot supports indexing and searching CJKV text.

  At search time, queries that include CJKV characters are processed in a manner
  compatible with the CJKV indexing scheme. There is no need to format the query
  in a specific way, i.e. no need to separate characters with spaces.
  For example, the query:
    Fabrice 你好 title:身体好吗
  will be modified internally to:
    Fabrice (你 你好 好) title:身 title:身体 title:体 title:体好 title:好 title:好吗 title:吗

  It is recommended that filters (e.g. "title") be used at the end of the query
  for it to be processed as expected.

  You can get a list of documents in which CJKV characters were detected
  by the indexer with the special filter "tokens:CJKV".


13. Environment variables and aliases


  Pinot tries to provide reasonable defaults for most systems, but there may be
  situations where you want to tweak these values through environment variables:
  * PINOT_SPELLING_DB
    By default, Pinot builds indexes with a spelling database. This spelling
    database may make up as much as a third of the size of the index.
    If your system is low on disk space, you can disable this with
      $ export PINOT_SPELLING_DB=NO
    Make sure this is set for your login session, i.e. whenever the daemon is
    auto-started. You will also have to reset indexes, as described in
    section "14. How to reset indexes".
  * PINOT_MINIMUM_DISK_SPACE
    The daemon will stop crawling and indexing files when the partition on
    which the index resides runs out of free space. By default, this means
    less than 50 MB. To change this value to 100 MB for instance, use
      $ export PINOT_MINIMUM_DISK_SPACE=100
  * PINOT_MAXIMUM_INDEX_THREADS
    This sets the maximum number of concurrent indexing threads used by the
    daemon. The default value is 1.
  * PINOT_MAXIMUM_NESTED_SIZE
    This limits the extraction of documents nested inside others, such as
    archives or mail messages, based on their size. By default, this is
    deactivated and set to 0.
  * PINOT_MAXIMUM_QUERY_RESULTS
    This overrides the number of results returned by queries run through
    the UI's Query field as well as the number of results initially set
    for new stored queries.

  Another environment variable that you may want to tweak comes from Xapian.
  XAPIAN_FLUSH_THRESHOLD can be set to the number of documents after which
  Xapian is to flush changes to the index. The default value is set to 10000
  at the time of writing this.
  Lowering this value should decrease the amount of memory used to cache
  changes to the index.

  Pinot provides a "tagged cd" script that lets you change a shell's
  current directory to the directory that matches the path elements passed
  as parameters. For instance, after setting:
    $ alias pcd='. $PREFIX/share/pinot/pinot-cd.sh'
  if ~/Documents is configured for indexing in Preferences, the following
  command would change the current directory to ~/Documents/Web/Stats:
    $ pcd Documents Stats
  If other directories match the given paths, pinot-cd.sh will display a list
  of matches. Future work will focus on disambiguation.
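
  The matching idea can be pictured with the Python sketch below. It is not
  how pinot-cd.sh is implemented, since the script's internals aren't
  described here. It assumes a plain list of candidate directories and treats
  each parameter as a path element that must appear in the given order. The
  function name match_directories and the sample directories are made up for
  illustration.

```python
from pathlib import PurePosixPath


def match_directories(candidates, elements):
    """Return the candidates whose path contains all elements, in order."""
    matches = []
    for candidate in candidates:
        parts = iter(PurePosixPath(candidate).parts)
        # "element in parts" consumes the iterator up to the match, so
        # later elements must appear further down the path
        if all(element in parts for element in elements):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    # Made-up directories, standing in for whatever is known to the index
    known = [
        "/home/me/Documents/Web/Stats",
        "/home/me/Documents/Web",
        "/home/me/Stats/Documents",
    ]
    for path in match_directories(known, ["Documents", "Stats"]):
        print(path)
```

  When more than one directory matches, a caller would have to list the
  matches rather than change directory, which is the behaviour described
  above for pinot-cd.sh.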

  If you have dbus-send installed, you may also want to set the following
  aliases:
    $ alias pinot-stats='dbus-send --session --print-reply --type=method_call \
      --dest=com.github.fabricecolin.Pinot /com/github/fabricecolin/Pinot com.github.fabricecolin.Pinot.GetStatistics'
    $ alias pinot-stop='dbus-send --session --print-reply --type=method_call \
      --dest=com.github.fabricecolin.Pinot /com/github/fabricecolin/Pinot com.github.fabricecolin.Pinot.Stop'
  The first will start the service daemon by calling its GetStatistics method,
  while the second alias will send it a request to stop and exit.


14. How to reset indexes


  You may wish to reset one of the indexes and start from scratch. There
  are several ways to do this, depending on which index it is.

  If you want to reset My Web Pages, you can either:
  * use Pinot to unindex every single document by selecting them all
    and choosing Unindex in the Index menu
  * or stop Pinot and delete ~/.pinot/index recursively

  If you want to reset My Documents, special considerations apply because
  of the historical data maintained by the daemon. There are two ways to
  proceed, and both require that the daemon be stopped.

  The manual way is to delete the index with
    $ rm -rf ~/.pinot/daemon
  and remove historical data with
    $ sqlite3 ~/.pinot/history-daemon "delete from CrawlHistory; delete from CrawlSources; delete from ActionQueue;"
  If you want to start from scratch and drop metadata (e.g. labels) that may
  exist on some documents, remove the history file altogether with
    $ rm -f ~/.pinot/history-daemon

  The automated way is to tell the daemon to reindex everything by launching
  it with the "--reindex" option, i.e.
    $ pinot-dbus-daemon --reindex
  It may be useful to take a look at the log file located at
  ~/.pinot/pinot-dbus-daemon.log.


15. Compiling


  Pinot's configure understands the following optional switches.

    --enable-debug          enable debug [default=no]
    --enable-dbus           enable DBus support [default=yes]
    --enable-libnotify      enable libnotify support [default=no]
    --enable-mempool        enable memory pool [default=no]
    --enable-libarchive     enable the libarchive filter [default=no]
    --enable-chmlib         enable the chmlib filter [default=no]

  Enable support for libarchive and chmlib if the necessary
  libraries are available. Enable libnotify support when building
  on BSD systems. Other switches should most likely stay unchanged.

  See the list below for dependencies. The version numbers indicate
  the minimum version Pinot has been tested with; older versions may
  or may not work.

---------------------------------------------------------------
Libraries and tools                               Version
---------------------------------------------------------------
SQLite                                            3.3.1
http://www.sqlite.org/

xapian-core                                       1.4.10
http://www.xapian.org/

zlib                                              1.2.0
http://www.gzip.org/zlib/

curl (1)                                          7.13.1
http://curl.haxx.se/
- OR -
neon (1)                                          0.24.7
http://www.webdav.org/neon/

gdbus-codegen-glibmm (2)
https://github.com/Pelagicore/gdbus-codegen-glibmm

gtkmm                                             3.24
http://www.gtkmm.org/

libxml++                                          2.12.0
http://libxmlplusplus.sourceforge.net/

libexttextcat                                     3.2
http://cgit.freedesktop.org/libreoffice/libexttextcat/

gmime (3)                                         2.6.0
http://spruce.sourceforge.net/gmime

boost (4)                                         1.75
http://www.boost.org/

D-Bus with GLib bindings                          0.61
http://www.freedesktop.org/wiki/Software/dbus

shared-mime-info                                  0.17
http://freedesktop.org/Software/shared-mime-info

desktop-file-utils                                0.10
http://www.freedesktop.org/software/desktop-file-utils

TagLib                                            1.4
http://ktown.kde.org/~wheeler/taglib/

libarchive (5)                                    2.6.2
http://people.freebsd.org/~kientzle/libarchive/

exiv2                                             0.21
http://www.exiv2.org/

chmlib (6)                                        0.40
http://www.jedrea.com/chmlib/

openssh-askpass (7)                               4.3
http://www.openssh.com/portable.html

---------------------------------------------------------------
External filter programs
---------------------------------------------------------------
unzip
http://www.info-zip.org/pub/infozip/UnZip.html

pdftotext
http://www.foolabs.com/xpdf/
http://poppler.freedesktop.org/

antiword
http://www.winfield.demon.nl/

unrtf
http://www.gnu.org/software/unrtf/unrtf.html

rst2txt
https://github.com/stephenfin/rst2txt

djvutxt
http://djvu.sourceforge.net/

catdvi
http://catdvi.sourceforge.net/

catppt
xls2csv
http://www.wagner.pp.ru/~vitus/software/catdoc/

---------------------------------------------------------------
Notes:
(1) enabled with "./configure --with-http=neon|curl"
(2) only to regenerate DBus code, with "make dbus-code"
(3) for gmime 2.4.0 support, edit configure.in
(4) for building only
    with boost > 1.48 and < 1.54, turning off memory pooling with
    "./configure --enable-mempool=no" may be preferable
(5) optional - enabled with "./configure --enable-libarchive=yes"
(6) optional - enabled with "./configure --enable-chmlib=yes"
(7) experimental - required only if _SSH_TUNNEL is set
---------------------------------------------------------------