
How to find duplicate files in GNU/Linux


Thanks to http://tips4linux.com/, I've found out how to track down duplicate files on my GNU/Linux system. I've modified the proposed solution to suit my needs. In short, the command retrieves the size of every file and compares sizes to spot potential duplicates; whenever two or more files share the same size, an MD5 hash is computed to verify that the files are exactly identical.

Command

We set a SEARCH variable, which will contain the path where we wish to search for duplicate files:

root@host:~# SEARCH=/data
root@host:~# find $SEARCH -not -empty -type f -printf %s\\n | sort -rn | uniq -d | xargs -I{} -n1 find $SEARCH -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
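If you run this search regularly, the same pipeline can be wrapped in a small script. The following is just a sketch of my own (the script name finddups.sh and the /data default are arbitrary choices, not part of the original tip):

#!/bin/sh
# finddups.sh - list groups of duplicate files under a directory.
# Usage: ./finddups.sh [directory]   (defaults to /data)
SEARCH="${1:-/data}"

# 1. print the size of every non-empty regular file
# 2. keep only the sizes that occur more than once
# 3. find the files having those candidate sizes again
# 4. hash the candidates, sort by hash, and group identical hashes
find "$SEARCH" -not -empty -type f -printf '%s\n' \
  | sort -rn \
  | uniq -d \
  | xargs -I{} -n1 find "$SEARCH" -type f -size {}c -print0 \
  | xargs -0 md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate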

Explanations

find $SEARCH -not -empty -type f -printf %s\\n
  • -not: negates the test that follows (same as ! expr)
  • -empty: matches empty files; combined with -not, empty files are skipped
  • -type f: searches for regular files only
  • -printf %s\\n: prints the file size in bytes, followed by a newline (see the sample run below)
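Run on its own, this first stage simply lists one size per file. The sizes below are made up, purely to illustrate the output format:

root@host:~# find $SEARCH -not -empty -type f -printf %s\\n
1048576
524288
1048576
524288
4096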
sort -rn
  • -rn: sorts the sizes numerically (-n) rather than lexicographically, in reverse order (-r), so that identical sizes end up on adjacent lines
uniq -d
  • -d: print only duplicate lines, one for each group
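Chained together on the hypothetical sizes above, these two filters leave only the sizes that appear more than once:

root@host:~# find $SEARCH -not -empty -type f -printf %s\\n | sort -rn | uniq -d
1048576
524288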
xargs -I{} -n1 find $SEARCH -type f -size {}c -print0
  • xargs -I{} -n1: replaces {} in the command with the value read from standard input, using at most one argument per command line
  • find $SEARCH -type f -size {}c -print0: prints the names of the files whose size in bytes (the c suffix) equals {} (supplied by xargs), terminated by null characters
xargs -0 md5sum
  • -0: input elements are terminated by a null character instead of whitespace, which is useful when file names may contain spaces, quotation marks or backslashes
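Continuing the made-up example, this stage prefixes each candidate file with its MD5 checksum (the hashes and paths below are fictitious):

root@host:~# find $SEARCH -type f -size 1048576c -print0 | xargs -0 md5sum
a3b9c2d1e4f5061728394a5b6c7d8e9f  /data/photos/img001.jpg
a3b9c2d1e4f5061728394a5b6c7d8e9f  /data/backup/img001.jpg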
uniq -w32 --all-repeated=separate
  • -w32: compares only the first 32 characters of each line (the length of an MD5 hash), so files are grouped by checksum rather than by name
  • --all-repeated=separate: prints all duplicate lines, with each group separated by a blank line
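With the fictitious data from above, the final output therefore shows each set of identical files as a block, one blank line between blocks:

a3b9c2d1e4f5061728394a5b6c7d8e9f  /data/backup/img001.jpg
a3b9c2d1e4f5061728394a5b6c7d8e9f  /data/photos/img001.jpg

f0e1d2c3b4a5968778695a4b3c2d1e0f  /data/docs/report.pdf
f0e1d2c3b4a5968778695a4b3c2d1e0f  /data/docs/report-copy.pdf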
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
