Introduction
-------------------------------------------------------------------------------
jdupes is a program for identifying and taking actions upon duplicate files.

A WORD OF WARNING: jdupes IS NOT a drop-in compatible replacement for fdupes!
Do not blindly replace fdupes with jdupes in scripts and expect everything to
work the same way. Option availability and meanings differ between the two
programs. For example, the `-I` switch in jdupes means "isolate" and blocks
intra-argument matching, while in fdupes it means "immediately delete files
during scanning without prompting the user."

Please consider financially supporting continued development of jdupes:

https://www.subscribestar.com/JodyBruchon


v1.19.0 specific: extfilter behavior has changed, check your scripts!
-------------------------------------------------------------------------------
There were some inconsistencies in the behavior of the extfilter framework that
stemmed from its origins in the exclusion option `-x`. These inconsistencies
have been resolved and extfilters now work correctly. Unfortunately, this also
means that the meaning of several filters has changed, particularly the size
filters. The `-X size[+-=]` option now includes files matching the specified
size criteria rather than excluding them, which will break existing shell
scripts that rely on the old exclusion behavior. It is extremely important that
any shell scripts currently using the size extfilter be revised to take the new
meaning into account. Use `jdupes -v` output in your script to do a version
check if needed.
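
For instance, here is a minimal version-check sketch in POSIX sh. The exact
`-v` banner format is an assumption here, so verify it against your own build;
`somedir/` is a placeholder:

```
#!/bin/sh
# Hypothetical sketch: refuse to run size extfilters on a jdupes older than
# 1.19.0, the release where -X size switched from exclusion to inclusion.
# ASSUMPTION: the first number in the first line of 'jdupes -v' output is the
# version (e.g. "jdupes 1.19.0"); check your build's banner before relying
# on this.
VER=$(jdupes -v 2>/dev/null | head -n 1 | grep -o '[0-9][0-9.]*' | head -n 1)
[ -z "$VER" ] && { echo "could not detect jdupes version" >&2; exit 1; }
MAJOR=${VER%%.*}; REST=${VER#*.}; MINOR=${REST%%.*}
if [ "$MAJOR" -lt 1 ] || { [ "$MAJOR" -eq 1 ] && [ "$MINOR" -lt 19 ]; }; then
    echo "jdupes $VER predates the new -X size semantics" >&2
    exit 1
fi
jdupes -r -X size+=:100k somedir/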


v1.15+ specific: Why is the addition of single files not working?
-------------------------------------------------------------------------------
If a file was added through recursion and also added explicitly, that file
would end up matching itself. This issue can be seen in v1.14.1 or older
versions that support single file addition, using a command like this in the
jdupes source code directory:

/usr/src/jdupes$ jdupes -rH testdir/isolate/1/ testdir/isolate/1/1.txt
testdir/isolate/1/1.txt
testdir/isolate/1/1.txt
testdir/isolate/1/2.txt

Even worse, using the special dot directory will make it happen without the -H
option, which is how I discovered this bug:

/usr/src/jdupes/testdir/isolate/1$ jdupes . 1.txt
./1.txt
./2.txt
1.txt

This works for any path with a single dot directory anywhere in the path, so it
has a good deal of potential for data loss in some use cases. As such, the best
option was to push out a new minor release with this feature turned off until
some additional checking can be done, e.g. by making sure the canonical paths
aren't identical between any two files.

A future release will fix this safely.


Why use jdupes instead of the original fdupes or other duplicate finders?
-------------------------------------------------------------------------------
The biggest reason is raw speed. In testing on various data sets, jdupes is
over 7 times faster than fdupes-1.51 on average.

jdupes provides a native Windows port. Most duplicate scanners built on Linux
and other UNIX-like systems do not compile for Windows out-of-the-box and even
if they do, they don't support Unicode and other Windows-specific quirks and
features.

jdupes is generally stable. All releases of jdupes are compared against
known-working reference versions of fdupes or jdupes to be certain that output
does not change. You get the benefits of an aggressive development process
without putting your data at increased risk.

Code in jdupes is written with data loss avoidance as the highest priority. If
a choice must be made between being aggressive or careful, the careful way is
always chosen.

jdupes includes features that are not always found elsewhere. Examples of such
features include block-level data deduplication and control over which file is
kept when a match set is automatically deleted. jdupes is not afraid of
dropping features of low value; a prime example is the `-1` switch, which
printed all matches in a set on one line, a feature that was found to be
useless in real-world tests and was therefore thrown out.

While jdupes maintains some degree of compatibility with fdupes, from which it
was originally derived, there is no guarantee that it will continue to maintain
such compatibility in the future. However, compatibility will be retained
between minor versions, i.e. jdupes-1.6 and jdupes-1.6.1 should not have any
significant differences in results with identical command lines.

If the program eats your dog or sets fire to your lawn, the authors cannot be
held responsible. If you notice a bug, please report it.


What jdupes is not: a similar (but not identical) file finding tool
-------------------------------------------------------------------------------
Please note that jdupes ONLY works on 100% exact matches. It does not have any
sort of "similarity" matching, nor does it know anything about any specific
file formats such as images or sounds. Something as simple as a change in
embedded metadata such as the ID3 tags in an MP3 file or the EXIF information
in a JPEG image will not change the sound or image presented to the user when
opened, but technically it makes the file no longer identical to the original.

Plenty of excellent tools already exist to "fuzzy match" specific file types
using knowledge of their file formats to help. There are no plans to add this
type of matching to jdupes.

There are some match options available in jdupes that enable dangerous file
matching based on partial or likely but not 100% certain matching. These are
considered expert options for special situations and are clearly and loudly
documented as being dangerous. The `-Q` and `-T` options are notable examples,
and the extreme danger of the `-T` option is safeguarded by a requirement to
specify it twice so it can't be used accidentally.


How can I do stuff with jdupes that isn't supported by jdupes?
-------------------------------------------------------------------------------
The standard output format of jdupes is extremely simple. Match sets are
presented with one file path per line, and match sets are separated by a blank
line. This is easy to process with fairly simple shell scripts. You can find
example shell scripts in the "example scripts" directory in the jdupes source
code. The main example script, "example.sh", is easy to modify to take basic
actions on each file in a match set. These scripts are used by piping the
standard jdupes output to them:

jdupes dir1 dir2 dir3 | example.sh scriptparameters
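
To illustrate how simple the format is to consume, here is a hypothetical
stand-in for such a script (not the bundled example.sh) that labels the first
file of each match set as kept and the rest as duplicates:

```
#!/bin/sh
# Hypothetical consumer of standard jdupes output: one path per line,
# match sets separated by a blank line.
first=1
while IFS= read -r path; do
    if [ -z "$path" ]; then
        first=1            # blank line: the next path starts a new set
        continue
    fi
    if [ "$first" -eq 1 ]; then
        printf 'keep: %s\n' "$path"
        first=0
    else
        printf 'dupe: %s\n' "$path"
    fi
done
```

Invoke it the same way as the bundled scripts, e.g.
`jdupes dir1 dir2 | sh label-dupes.sh` (the script name is a placeholder).
It will misbehave on file names containing newlines; see the notes on unusual
filename characters near the end of this document.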


Usage
-------------------------------------------------------------------------------
```
Usage: jdupes [options] DIRECTORY...
```

Duplicate file sets will be printed by default unless a different action
option is specified (delete, summarize, link, dedupe, etc.)

```
 -@ --loud              output annoying low-level debug info while running
 -0 --printnull         output nulls instead of CR/LF (like 'find -print0')
 -1 --one-file-system   do not match files on different filesystems/devices
 -A --nohidden          exclude hidden files from consideration
 -B --dedupe            do a copy-on-write (reflink/clone) deduplication
 -C --chunksize=#       override I/O chunk size (min 4096, max 16777216)
 -d --delete            prompt user for files to preserve and delete all
                        others; important: under particular circumstances,
                        data may be lost when using this option together
                        with -s or --symlinks, or when specifying a
                        particular directory more than once; refer to the
                        documentation for additional information
 -D --debug             output debug statistics after completion
 -f --omitfirst         omit the first file in each set of matches
 -h --help              display this help message
 -H --hardlinks         treat any linked files as duplicate files. Normally
                        linked files are treated as non-duplicates for safety
 -i --reverse           reverse (invert) the match sort order
 -I --isolate           files in the same specified directory won't match
 -j --json              produce JSON (machine-readable) output
 -l --linksoft          make relative symlinks for duplicates w/o prompting
 -L --linkhard          hard link all duplicate files without prompting
                        Windows allows a maximum of 1023 hard links per file
 -m --summarize         summarize dupe information
 -M --printwithsummary  will print matches and --summarize at the end
 -N --noprompt          together with --delete, preserve the first file in
                        each set of duplicates and delete the rest without
                        prompting the user
 -o --order=BY          select sort order for output, linking and deleting:
                        by mtime (BY=time) or filename (BY=name, the default)
 -O --paramorder        sort output files in order of command line parameter
                        sequence
                        Parameter order is more important than the sort
                        selected with -o, which applies when several files
                        share the same parameter order
 -p --permissions       don't consider files with different owner/group or
                        permission bits as duplicates
 -P --print=type        print extra info (partial, early, fullhash)
 -q --quiet             hide progress indicator
 -Q --quick             skip byte-by-byte duplicate verification. WARNING:
                        this may delete non-duplicates! Read the manual first!
 -r --recurse           for every directory, process its subdirectories too
 -R --recurse:          for each directory given after this option follow
                        subdirectories encountered within (note the ':' at
                        the end of the option, manpage for more details)
 -s --symlinks          follow symlinks
 -S --size              show size of duplicate files
 -t --nochangecheck     disable security check for file changes (aka TOCTTOU)
 -T --partial-only      match based on partial hashes only. WARNING:
                        EXTREMELY DANGEROUS paired with destructive actions!
                        -T must be specified twice to work. Read the manual!
 -u --printunique       print only a list of unique (non-matched) files
 -U --notravcheck       disable double-traversal safety check (BE VERY CAREFUL)
                        This fixes a Google Drive File Stream recursion issue
 -v --version           display jdupes version and license information
 -X --extfilter=x:y     filter files based on specified criteria
                        Use '-X help' for detailed extfilter help
 -z --zeromatch         consider zero-length files to be duplicates
 -Z --softabort         If the user aborts (i.e. CTRL-C) act on matches so far
                        You can send SIGUSR1 to the program to toggle this


Detailed help for jdupes -X/--extfilter options
General format: jdupes -X filter[:value][size_suffix]

noext:ext1[,ext2,...]       Exclude files with certain extension(s)

onlyext:ext1[,ext2,...]     Only include files with certain extension(s)

size[+-=]:size[suffix]      Only include files matching size criteria
                            Size specs: + larger, - smaller, = equal to
                            Specs can be mixed, i.e. size+=:100k will
                            only include files 100KiB or more in size.

nostr:text_string           Exclude all paths containing the string
onlystr:text_string         Only allow paths containing the string
                            HINT: you can use these for directories:
                            -X nostr:/dir_x/ or -X onlystr:/dir_x/
newer:datetime              Only include files newer than specified date
older:datetime              Only include files older than specified date
                            Date/time format: "YYYY-MM-DD HH:MM:SS"
                            Time is optional (remember to escape spaces!)

Some filters take no value or multiple values. Filters that can take
a numeric option generally support the size multipliers K/M/G/T/P/E
with or without an added iB or B. Multipliers are binary-style unless
the 'B' suffix is used, which switches to decimal multipliers. For
example, 16k or 16kib = 16384; 16kb = 16000. Multipliers are
case-insensitive.

Filters have cumulative effects: jdupes -X size+:99 -X size-:101 will
cause only files of exactly 100 bytes in size to be included.

Extension matching is case-insensitive.
Path substring matching is case-sensitive.
```
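
A few worked invocations assembled from the help text above (all directory
names are placeholders):

```
# Only consider MP3/FLAC files of at least 10 MiB anywhere under music/.
jdupes -r -X onlyext:mp3,flac -X size+=:10M music/

# Skip anything in a directory named /cache/ and anything under 4 KiB.
jdupes -r -X nostr:/cache/ -X size+:4095 somedir/

# Only consider files newer than a given date (quote to protect the space).
jdupes -r -X newer:"2020-01-01 00:00:00" somedir/
```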

The `-U`/`--notravcheck` option disables the double-traversal prevention tree.
In the VAST MAJORITY of circumstances, this SHOULD NOT BE DONE, as it protects
against several dangerous user errors, including specifying the same files or
directories twice, causing them to match themselves and potentially be lost or
irreversibly damaged, or a symbolic link to a directory making an endless loop
of recursion that will cause the program to hang indefinitely. This option was
added because Google Drive File Stream presents directories in the virtual hard
drive used by GDFS with identical device:inode pairs despite the directories
actually being different. This triggers double-traversal prevention against
every directory, effectively blocking all recursion. Disabling this check will
reduce safety, but will allow duplicate scanning inside Google Drive File
Stream drives. This also results in a very minor speed boost during recursion,
but the boost is unlikely to be noticeable.

The `-t`/`--nochangecheck` option disables file change checks during/after
scanning. This opens a security vulnerability that is called a TOCTTOU (time of
check to time of use) vulnerability. The program normally runs checks
immediately before scanning or taking action upon a file to see if the file has
changed in some way since it was last checked. With this option enabled, the
program will not run any of these checks, making the algorithm slightly faster,
but also increasing the risk that the program scans a file, the file is changed
after the scan, and the program still acts as if the file was in its previous
state. This is particularly dangerous when considering actions such as linking
and deleting. In the most extreme case, a file could be deleted during scanning
but match other files prior to that deletion; if the file is the first in the
list of duplicates and auto-delete is used, all of the remaining matched files
will be deleted as well. This option was added due to user reports of some
filesystems (particularly network filesystems) changing the reported file
information inappropriately, rendering the entire program unusable on such
filesystems.

The `-n`/`--noempty` option was removed for safety. Matching zero-length files
as duplicates now requires explicit use of the `-z`/`--zeromatch` option
instead.

Duplicate files are listed together in groups with each file displayed on a
separate line. The groups are then separated from each other by blank lines.

The `-s`/`--symlinks` option will treat symlinked files as regular files, but
direct symlinks will be treated as if they are hard linked files and the
`-H`/`--hardlinks` option will apply to them in the same manner.

When using `-d` or `--delete`, care should be taken to ensure against
accidental data loss. While no information will be immediately lost, using this
option together with `-s` or `--symlinks` can lead to confusing information
being presented to the user when prompted for files to preserve. Specifically,
a user could accidentally preserve a symlink while deleting the file it points
to. A similar problem arises when specifying a particular directory more than
once. All files within that directory will be listed as their own duplicates,
leading to data loss should a user preserve a file without its "duplicate" (the
file itself!)

Using `-1` or `--one-file-system` prevents matches that cross filesystems, but
a more relaxed form of this option may be added that allows cross-matching for
all filesystems that each parameter is present on.

`-Z` or `--softabort` used to be `--hardabort` in jdupes prior to v1.5 and had
the opposite behavior. Defaulting to taking action on abort is probably not
what most users would expect. The decision to invert rather than reassign to a
different option was made because this feature was still fairly new at the time
of the change.

On non-Windows platforms that support SIGUSR1, you can toggle the state of the
`-Z` option by sending a SIGUSR1 to the program. This is handy if you want to
abort jdupes, didn't specify `-Z`, and changed your mind and don't want to lose
all the work that was done so far. Just do '`killall -USR1 jdupes`' and you will
be able to abort with `-Z`. This works in reverse: if you want to prevent a
`-Z` from happening, a SIGUSR1 will toggle it back off. That's a lot less
useful because you can just stop and kill the program to get the same effect,
but it's there if you want it for some reason. Sending the signal twice while
the program is stopped will behave as if it was only sent once, as per normal
POSIX signal behavior.
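
For example, from a second terminal (a sketch that assumes only one jdupes
process is running):

```
# Turn -Z (softabort) on for the running scan, then interrupt it so that
# jdupes acts on the matches found so far instead of discarding them.
killall -USR1 jdupes
killall -INT jdupes    # equivalent to pressing CTRL-C in its terminal
```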

The `-O` or `--paramorder` option allows the user greater control over what
appears in the first position of a match set, specifically for keeping the `-N`
option from deleting all but one file in a set in a seemingly random way. All
directories specified on the command line will be used as the sorting order of
result sets first, followed by the sorting algorithm set by the `-o` or
`--order` option. This means that the order of all match pairs for a single
directory specification will retain the old sorting behavior even if this
option is specified.
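
A sketch of the intended usage, with placeholder directory names; files under
masters/ sort first in each match set, so `-d -N` preserves them:

```
# With -O, files from masters/ sort before files from extras/ in each match
# set; with -d -N, the first file in each set is preserved and the rest are
# deleted. Review the plain output before trusting any automatic deletion!
jdupes -r -O masters/ extras/
jdupes -r -O -d -N masters/ extras/
```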

The `-Q` or `--quick` option only reads each file once, hashes it, and performs
comparisons based solely on the hashes. There is a small but significant risk
of a hash collision, which is the reason for the failsafe byte-for-byte
comparison that this option explicitly bypasses. Do not use it on ANY data set
for which any amount of data loss is unacceptable. You have been warned!

The `-T` or `--partial-only` option produces results based on a hash of the
first block of file data in each file, ignoring everything else in the file.
Partial hash checks have always been an important exclusion step in the jdupes
algorithm, usually hashing the first 4096 bytes of data and allowing files that
are different at the start to be rejected early. In certain scenarios it may be
a useful heuristic for a user to see that a set of files has the same size and
the same starting data, even if the remaining data does not match; one example
of this would be comparing files with data blocks that are damaged or missing,
such as an incomplete file transfer, or checking a data recovery against
known-good copies to see what damaged data can be deleted in favor of restoring
the known-good copy. This option is meant to be used with informational actions
and can result in EXTREME DATA LOSS if used with options that delete files,
create hard links, or perform other destructive actions on data based on the
matching output. Because of the potential for massive data destruction, this
option MUST BE SPECIFIED TWICE to take effect and will error out if it is only
specified once.
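
A sketch of the informational use case described above, with placeholder
paths; `-T` is given twice and no destructive option is present:

```
# List sets of files that share size and starting data, e.g. to compare an
# interrupted transfer against known-good copies. NEVER combine -T -T with
# -d, -N, -L, or any other destructive option.
jdupes -T -T -r -S recovered/ known-good/
```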

The `-I`/`--isolate` option attempts to block matches that are contained in the
same specified directory parameter on the command line. Due to the underlying
nature of the jdupes algorithm, this option also blocks many matches that
probably should not be blocked. This code could use improvement.

The `-C`/`--chunksize` option overrides the size of the I/O "chunk" used for
all file operations. Larger numbers will increase the amount of data read at
once from each file and may improve performance when scanning lots of files
that are larger than the default chunk size by reducing "thrashing" of the hard
disk heads. Smaller numbers may increase algorithm speed depending on the
characteristics of your CPU but will usually increase I/O and system call
overhead as well. The number also directly affects memory usage: the I/O chunk
size is used for at least three allocations in the program, so using a chunk
size of 16777216 (16 MiB) will require 48 MiB of RAM. The default is usually
between 32768 and 65536, which results in the fastest raw speed of the
algorithm and generally good all-around performance. Feel free to experiment
with the number on your data set and report your experiences (preferably with
benchmarks and info on your data set.)

Using `-P`/`--print` will cause the program to print extra information that may
be useful but will pollute the output in a way that makes scripted handling
difficult. Its current purpose is to reveal more information about the file
matching process by printing match pairs that pass certain steps of the process
prior to full file comparison. This can be useful if you have two files that
are passing early checks but failing after full checks.


Hard and soft (symbolic) linking status symbols and behavior
-------------------------------------------------------------------------------
A set of arrows are used in file linking to show what action was taken on each
link candidate. These arrows are as follows:

`---->` File was hard linked to the first file in the duplicate chain

`-@@->` File was symlinked to the first file in the chain

`-==->` Already a hard link to the first file in the chain

`-//->` File linking failed due to an error during the linking process

If your data set has linked files and you do not use `-H` to always consider
them as duplicates, you may still see linked files appear together in match
sets. This is caused by a separate file that matches with the linked files
independently and is the correct behavior. See the notes below on the
"triangle problem" in jdupes for technical details.


Microsoft Windows platform-specific notes
-------------------------------------------------------------------------------
Windows has a hard limit of 1024 hard links per file. There is no way to change
this. The documentation for CreateHardLink() states: "The maximum number of
hard links that can be created with this function is 1023 per file. If more
than 1023 links are created for a file, an error results." (The number is
actually 1024, but they're not counting the original file.)


The current jdupes algorithm's "triangle problem"
-------------------------------------------------------------------------------
Pairs of files are excluded individually based on how the two files compare.
For example, if `--hardlinks` is not specified then two files which are hard
linked will not match one another for duplicate scanning purposes. The problem
with only examining files in pairs is that certain circumstances will lead to
the exclusion being overridden.

Let's say we have three files with identical contents:

```
a/file1
a/file2
a/file3
```

and `a/file1` is hard linked to `a/file3`. Here's how `jdupes a/` sees them:

---
 Are 'a/file1' and 'a/file2' matches? Yes
 [point a/file1->duplicates to a/file2]

 Are 'a/file1' and 'a/file3' matches? No (hard linked already, `-H` off)

 Are 'a/file2' and 'a/file3' matches? Yes
 [point a/file2->duplicates to a/file3]
---

Now you have the following duplicate list:

```
a/file1->duplicates ==> a/file2->duplicates ==> a/file3
```

Even though `a/file1` and `a/file3` were excluded as a pair, they end up in the
same match set by way of `a/file2`. The solution is to split match sets into
multiple sets, but doing this will also remove the guarantee that files will
only ever appear in one match set and could result in data loss if handled
improperly. In the future, options for "greedy" and "sparse" may be introduced
to switch between allowing triangle matches to be in the same set vs. splitting
sets after matching finishes without the "only ever appears once" guarantee.
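
The scenario is easy to reproduce from scratch (a sketch with hypothetical
paths):

```
# Reproduce the triangle scenario, then observe that all three names can
# appear together in one match set even without -H.
mkdir -p a
echo "identical contents" > a/file1
cp a/file1 a/file2         # a/file2 is a separate, identical file
ln a/file1 a/file3         # a/file3 is a hard link to a/file1
jdupes a/
```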


Does jdupes meet the "Good Practice when Deleting Duplicates" by rmlint?
-------------------------------------------------------------------------------
Yes. If you've not read this list of cautions, it is available at
http://rmlint.readthedocs.io/en/latest/cautions.html

Here's a breakdown of how jdupes addresses each of the items listed.

### "Backup your data"/"Measure twice, cut once"
These guidelines are for the user of duplicate scanning software, not the
software itself. Back up your files regularly. Use jdupes to print a list of
what is found as duplicated and check that list very carefully before
automatically deleting the files.

### "Beware of unusual filename characters"
The only character that poses a concern in jdupes is a newline `\n` and that is
only a problem because the duplicate set printer uses them to separate file
names. Actions taken by jdupes are not parsed like a command line, so spaces
and other weird characters in names aren't a problem. Escaping the names
properly if acting on the printed output is a problem for the user's shell
script or other external program.
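
For names that may contain newlines, the `-0`/`--printnull` option substitutes
nulls for CR/LF in the style of `find -print0`. Here is a bash sketch for
consuming that output; it assumes (without verification) that an empty record
marks the boundary between match sets, mirroring the blank line in normal
output:

```
#!/bin/bash
# Consume null-separated jdupes output safely, even with newlines in names.
jdupes -r -0 somedir/ | while IFS= read -r -d '' name; do
    if [ -z "$name" ]; then
        echo '--- end of match set ---'   # assumed set boundary
    else
        printf 'duplicate: %q\n' "$name"
    fi
done
```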

### "Consider safe removal options"
This is also an exercise for the user.

### "Traversal Robustness"
jdupes tracks each directory traversed by dev:inode pair to avoid adding the
contents of the same directory twice. This prevents the user from being able to
register all of their files twice by duplicating an entry on the command line.
Symlinked directories are only followed if they weren't already followed
earlier. Files are renamed to a temporary name before any linking is done and
if the link operation fails they are renamed back to the original name.

### "Collision Robustness"
jdupes uses xxHash for file data hashing. This hash is extremely fast with a
low collision rate, but it still encounters collisions as any hash function
will ("secure" or otherwise) due to the pigeonhole principle. This is why
jdupes performs a full-file verification before declaring a match. It's slower
than matching by hash only, but the pigeonhole principle puts all data sets
larger than the hash at risk of collision, meaning a false duplicate detection
and data loss. The slower completion time is not as important as data
integrity. Checking for a match based on hashes alone is irresponsible, and
using cryptographic hashes like MD5 or the SHA families is orders of magnitude
slower than xxHash while still suffering from the same pigeonhole risk. An
example of this problem is as follows: if you have 365 days in a year and 366
people, at least two of those people are guaranteed to share a birthday;
likewise, even though SHA-512 is a 512-bit (64-byte) wide hash, collisions are
guaranteed to exist once the data streams being hashed for comparison are 65
bytes (520 bits) or larger, because there are 2^520 possible streams but only
2^512 possible hash values.

### "Unusual Characters Robustness"
jdupes does not protect the user from putting ASCII control characters in their
file names; they will mangle the output if printed, but they can still be
operated upon by the actions (delete, link, etc.) in jdupes.

### "Seek Thrash Robustness"
jdupes uses an I/O chunk size that is optimized for reading as much as possible
from disk at once to take advantage of high sequential read speeds in
traditional rotating media drives while balancing against the significantly
higher rate of CPU cache misses triggered by an excessively large I/O buffer
size. Enlarging the I/O buffer further may allow for lots of large files to be
read with less head seeking, but the CPU cache misses slow the algorithm down
and memory usage increases to hold these large buffers. jdupes is benchmarked
periodically to make sure that the chosen I/O chunk size is the best compromise
for a wide variety of data sets.

### "Memory Usage Robustness"
This is a very subjective concern considering that even a cell phone in
someone's pocket has at least 1GB of RAM, but it still applies in the embedded
device world where 32MB of RAM might be all that you can have. Even when
processing a data set with over a million files, jdupes memory usage (tested on
Linux x86-64 with -O3 optimization) doesn't exceed 2GB. A low memory mode can
be chosen at compile time to reduce overall memory usage with a small
performance penalty.


Contact information
-------------------------------------------------------------------------------
For all jdupes inquiries, contact Jody Bruchon <jody@jodybruchon.com>
Please DO NOT contact Adrian Lopez about issues with jdupes.


Legal information and software license
-------------------------------------------------------------------------------
jdupes is Copyright (C) 2015-2020 by Jody Bruchon <jody@jodybruchon.com>
Derived from the original 'fdupes' 1.51 (C) 1999-2014 by Adrian Lopez
Includes other code libraries which are (C) 2015-2020 by Jody Bruchon

The MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.