1 2 .:. LIBPUZZLE .:. 3 4 http://libpuzzle.pureftpd.org 5 6 7 ------------------------ BLURB ------------------------ 8 9 10The Puzzle library is designed to quickly find visually similar images (gif, 11png, jpg), even if they have been resized, recompressed, recolored or slightly 12modified. 13 14The library is free, lightweight yet very fast, configurable, easy to use and 15it has been designed with security in mind. This is a C library, but is also 16comes with a command-line tool and PHP bindings. 17 18 19 ------------------------ REFERENCE ------------------------ 20 21 22The Puzzle library is a implementation of "An image signature for any kind of 23image", by H. CHI WONG, Marschall BERN and David GOLDBERG. 24 25 26 ------------------------ COMPILATION ------------------------ 27 28 29In order to load images, the library relies on the GD2 library. 30You need to install gdlib2 and its development headers before compiling 31libpuzzle. 32The GD2 library is available as a pre-built package for most operating systems. 33Debian and Ubuntu users should install the "libgd2-dev" or the "libgd2-xpm-dev" 34package. 35Gentoo users should install "media-libs/gd". 36OpenBSD, NetBSD and DragonflyBSD users should install the "gd" package. 37MacPorts users should install the "gd2" package. 38X11 support is not required for the Puzzle library. 39 40Once GD2 has been installed, configure the Puzzle library as usual: 41 42./configure 43 44This is a standard autoconf script, if you're not familiar with it, please 45have a look at the INSTALL file. 46 47Compile the beast: 48 49make 50 51Try the built-in tests: 52 53make check 54 55If everything looks fine, install the software: 56 57make install 58 59If anything goes wrong, please submit a bug report to: 60 libpuzzle [at] pureftpd [dot] org 61 62 63 ------------------------ USAGE ------------------------ 64 65 66The API is documented in the libpuzzle(3) and puzzle_set(3) man pages. 67You can also play with the puzzle-diff test application. 68See puzzle-diff(8) for more info about the puzzle-diff application. 69 70In order to be thread-safe, every exported function of the library requires a 71PuzzleContext object. That object stores various run-time tunables. 72 73Out of a bitmap picture, the Puzzle library can fill a PuzzleCVec object : 74 75 PuzzleContext context; 76 PuzzleCVec cvec; 77 78 puzzle_init_context(&context); 79 puzzle_init_cvec(&context, &cvec); 80 puzzle_fill_cvec_from_file(&context, &cvec, "directory/filename.jpg"); 81 82The PuzzleCvec structure holds two fields: 83 signed char *vec: a pointer to the first element of the vector 84 size_t sizeof_vec: the number of elements 85 86The size depends on the "lambdas" value (see puzzle_set(3)). 87 88PuzzleCvec structures can be compared: 89 90 d = puzzle_vector_normalized_distance(&context, &cvec1, &cvec2, 1); 91 92d is the normalized distance between both vectors. If d is below 0.6, pictures 93are probably similar. 94 95If you need further help, feel free to subscribe to the mailing-list (see 96below). 97 98 99 ------------------------ INDEXING ------------------------ 100 101 102How to quickly find similar pictures, if they are millions of records? 103 104The original paper has a simple, yet efficient answer. 105 106Cut the vector in fixed-length words. For instance, let's consider the 107following vector: 108 109[ a b c d e f g h i j k l m n o p q r s t u v w x y z ] 110 111With a word length (K) of 10, you can get the following words: 112 113[ a b c d e f g h i j ] found at position 0 114[ b c d e f g h i j k ] found at position 1 115[ c d e f g h i j k l ] found at position 2 116etc. until position N-1 117 118Then, index your vector with a compound index of (word + position). 119 120Even with millions of images, K = 10 and N = 100 should be enough to have very 121little entries sharing the same index. 122 123Here's a very basic sample database schema: 124 125+-----------------------------+ 126| signatures | 127+-----------------------------+ 128| sig_id | signature | pic_id | 129+--------+-----------+--------+ 130 131+--------------------------+ 132| words | 133+--------------------------+ 134| pos_and_word | fk_sig_id | 135+--------------+-----------+ 136 137I'd recommend splitting at least the "words" table into multiple tables and/or 138servers. 139 140By default (lambas=9) signatures are 544 bytes long. In order to save storage 141space, they can be compressed to 1/third of their original size through the 142puzzle_compress_cvec() function. Before use, they must be uncompressed with 143puzzle_uncompress_cvec(). 144 145 146 ------------------------ PUZZLE-DIFF ------------------------ 147 148 149A command-line tool is also available for scripting or testing. 150 151It is installed as "puzzle-diff" and comes with a man page. 152 153Sample usage: 154 155- Output distance between two images: 156 157$ puzzle-diff pic-a-0.jpg pics-a-1.jpg 1580.102286 159 160- Compare two images, exit with 10 if they look the same, exit with 20 if 161they don't (may be useful for scripts): 162 163$ puzzle-diff -e pic-a-0.jpg pics-a-1.jpg 164$ echo $? 16510 166 167- Compute distance, without cropping and with computing the average intensity 168of the whole blocks: 169 170$ puzzle-diff -p 1.0 -c pic-a-0.jpg pic-a-1.jpg 1710.0523151 172 173 174 ------------------------ COMPARING IMAGES WITH PHP ------------------------ 175 176 177A PHP extension is bundled with the Libpuzzle package, and it provides PHP 178bindings to most functions of the library. 179 180Documentation for the Libpuzzle PHP extension is available in the README-PHP 181file. 182 183 184 ------------------------ APPS USING LIBPUZZLE ------------------------ 185 186 187Here are third-party projects using libpuzzle: 188 189* ftwin - http://jok.is-a-geek.net/ftwin.php 190 ftwin is a tool useful to find duplicate files according to their content on 191your file system. 192 193 194 ------------------------ CONTACT ------------------------ 195 196 197The main web site for the project is: http://libpuzzle.pureftpd.org 198 199If you need to share ideas with other users, or if you need help, feel free to 200subscribe to the mailing-list. 201 202In order to subscribe, just send a mail with random content to: 203 204listpuzzle-subscribe at pureftpd dot org 205 206For anything else, you can get in touch with me at: 207 208libpuzzle at pureftpd dot org 209 210If you are interested in bindings for Ruby, Python, PHP, etc. just ask! 211 212 213Thank you, 214 215 -Frank. 216