1
2                               .:. LIBPUZZLE .:.
3
4                         http://libpuzzle.pureftpd.org
5
6
7            ------------------------ BLURB ------------------------
8
9
10The Puzzle library is designed to quickly find visually similar images (gif,
11png, jpg), even if they have been resized, recompressed, recolored or slightly
12modified.
13
14The library is free, lightweight yet very fast, configurable, easy to use and
15it has been designed with security in mind. This is a C library, but is also
16comes with a command-line tool and PHP bindings.
17
18
19          ------------------------ REFERENCE ------------------------
20
21
22The Puzzle library is a implementation of "An image signature for any kind of
23image", by H. CHI WONG, Marschall BERN and David GOLDBERG.
24
25
26         ------------------------ COMPILATION ------------------------
27
28
29In order to load images, the library relies on the GD2 library.
30You need to install gdlib2 and its development headers before compiling
31libpuzzle.
32The GD2 library is available as a pre-built package for most operating systems.
33Debian and Ubuntu users should install the "libgd2-dev" or the "libgd2-xpm-dev"
34package.
35Gentoo users should install "media-libs/gd".
36OpenBSD, NetBSD and DragonflyBSD users should install the "gd" package.
37MacPorts users should install the "gd2" package.
38X11 support is not required for the Puzzle library.
39
40Once GD2 has been installed, configure the Puzzle library as usual:
41
42./configure
43
44This is a standard autoconf script, if you're not familiar with it, please
45have a look at the INSTALL file.
46
47Compile the beast:
48
49make
50
51Try the built-in tests:
52
53make check
54
55If everything looks fine, install the software:
56
57make install
58
59If anything goes wrong, please submit a bug report to:
60                       libpuzzle [at] pureftpd [dot] org
61
62
63            ------------------------ USAGE ------------------------
64
65
66The API is documented in the libpuzzle(3) and puzzle_set(3) man pages.
67You can also play with the puzzle-diff test application.
68See puzzle-diff(8) for more info about the puzzle-diff application.
69
70In order to be thread-safe, every exported function of the library requires a
71PuzzleContext object. That object stores various run-time tunables.
72
73Out of a bitmap picture, the Puzzle library can fill a PuzzleCVec object :
74
75  PuzzleContext context;
76  PuzzleCVec cvec;
77
78  puzzle_init_context(&context);
79  puzzle_init_cvec(&context, &cvec);
80  puzzle_fill_cvec_from_file(&context, &cvec, "directory/filename.jpg");
81
82The PuzzleCvec structure holds two fields:
83  signed char *vec:  a pointer to the first element of the vector
84  size_t sizeof_vec: the number of elements
85
86The size depends on the "lambdas" value (see puzzle_set(3)).
87
88PuzzleCvec structures can be compared:
89
90  d = puzzle_vector_normalized_distance(&context, &cvec1, &cvec2, 1);
91
92d is the normalized distance between both vectors. If d is below 0.6, pictures
93are probably similar.
94
95If you need further help, feel free to subscribe to the mailing-list (see
96below).
97
98
99          ------------------------ INDEXING ------------------------
100
101
102How to quickly find similar pictures, if they are millions of records?
103
104The original paper has a simple, yet efficient answer.
105
106Cut the vector in fixed-length words. For instance, let's consider the
107following vector:
108
109[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
110
111With a word length (K) of 10, you can get the following words:
112
113[ a b c d e f g h i j ] found at position 0
114[ b c d e f g h i j k ] found at position 1
115[ c d e f g h i j k l ] found at position 2
116etc. until position N-1
117
118Then, index your vector with a compound index of (word + position).
119
120Even with millions of images, K = 10 and N = 100 should be enough to have very
121little entries sharing the same index.
122
123Here's a very basic sample database schema:
124
125+-----------------------------+
126|          signatures         |
127+-----------------------------+
128| sig_id | signature | pic_id |
129+--------+-----------+--------+
130
131+--------------------------+
132|           words          |
133+--------------------------+
134| pos_and_word | fk_sig_id |
135+--------------+-----------+
136
137I'd recommend splitting at least the "words" table into multiple tables and/or
138servers.
139
140By default (lambas=9) signatures are 544 bytes long. In order to save storage
141space, they can be compressed to 1/third of their original size through the
142puzzle_compress_cvec() function. Before use, they must be uncompressed with
143puzzle_uncompress_cvec().
144
145
146         ------------------------ PUZZLE-DIFF ------------------------
147
148
149A command-line tool is also available for scripting or testing.
150
151It is installed as "puzzle-diff" and comes with a man page.
152
153Sample usage:
154
155- Output distance between two images:
156
157$ puzzle-diff pic-a-0.jpg pics-a-1.jpg
1580.102286
159
160- Compare two images, exit with 10 if they look the same, exit with 20 if
161they don't (may be useful for scripts):
162
163$ puzzle-diff -e pic-a-0.jpg pics-a-1.jpg
164$ echo $?
16510
166
167- Compute distance, without cropping and with computing the average intensity
168of the whole blocks:
169
170$ puzzle-diff -p 1.0 -c pic-a-0.jpg pic-a-1.jpg
1710.0523151
172
173
174  ------------------------ COMPARING IMAGES WITH PHP ------------------------
175
176
177A PHP extension is bundled with the Libpuzzle package, and it provides PHP
178bindings to most functions of the library.
179
180Documentation for the Libpuzzle PHP extension is available in the README-PHP
181file.
182
183
184    ------------------------ APPS USING LIBPUZZLE ------------------------
185
186
187Here are third-party projects using libpuzzle:
188
189* ftwin - http://jok.is-a-geek.net/ftwin.php
190  ftwin is a tool useful to find duplicate files according to their content on
191your file system.
192
193
194           ------------------------ CONTACT ------------------------
195
196
197The main web site for the project is: http://libpuzzle.pureftpd.org
198
199If you need to share ideas with other users, or if you need help, feel free to
200subscribe to the mailing-list.
201
202In order to subscribe, just send a mail with random content to:
203
204listpuzzle-subscribe at pureftpd dot org
205
206For anything else, you can get in touch with me at:
207
208libpuzzle at pureftpd dot org
209
210If you are interested in bindings for Ruby, Python, PHP, etc. just ask!
211
212
213Thank you,
214
215               -Frank.
216