SYNOPSIS
samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]
DESCRIPTION
samefile reads a list of filenames (one filename per line) from stdin.
For each filename pair with identical contents, a line consisting of
six fields is output: The size in bytes, two filenames, the character
``='' if the two files are on the same device, ``X'' otherwise, and the
link counts of the two files. The output is sorted in reverse order by
size as the primary key and the filenames as the secondary key.
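
The six-field line format described above can be read back by a script. The following is a minimal sketch in Python, assuming the default tab separator; the names SamefilePair and parse_pair are made up for this illustration and are not part of samefile. It will misparse filenames that themselves contain the separator.

```python
from typing import NamedTuple


class SamefilePair(NamedTuple):
    size: int          # size in bytes
    first: str         # first filename
    second: str        # second filename
    same_device: bool  # True for "=", False for "X"
    links_first: int   # link count of the first file
    links_second: int  # link count of the second file


def parse_pair(line: str, sep: str = "\t") -> SamefilePair:
    """Split one six-field output line into a SamefilePair."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    size, first, second, device, n1, n2 = fields
    if device not in ("=", "X"):
        raise ValueError(f"unexpected device flag: {device!r}")
    return SamefilePair(int(size), first, second, device == "=", int(n1), int(n2))
```

If the output was produced with -s, pass the same string as sep.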
OPTIONS
-0 Indicates that the input list of file names is NUL terminated,
for example as generated by implementations of find(1) that
support the -print0 option. Without this option, the file names
are assumed to be newline terminated.
-a Do not sort files of the same size alphabetically.
-g size
Compare only files with size greater than size bytes. Default is
0.
-i Allow files with the same device/i-node pair to be added to the
binary tree. This might be useful if output will be fed into
some other program. If this option is used, the statistics
displayed when using -v will not contain the ``You have a total of
x bytes in identical files'' line because -i prohibits proper
calculation of this value.
-l Do not check if files with identical contents are hard links
created by ln(1). By default, samefile checks if files with
identical contents are hard linked and, if they are, does not
write a name pair to stdout. A slight speedup is gained when
using this option. This option is incompatible with the -r
option.
-q Do not issue warning messages when open(2) fails. When you
encounter such a warning, open probably failed due to a
'permission denied' error on files or directories for which you
have no read permission. Useful if you are not root and want to
compare your files against files in a system directory like /etc.
-r Report whether identical files are hard linked. The separator
string followed by the [bracketed] link count is appended to
each name pair if they are hard links created with ln. This
option is incompatible with the -l option. Note that this kind
of output has only four fields and will appear unsorted before
the actual output of samefile.
-s sep Use string sep as the output field separator, defaults to a tab
character. Useful if filenames contain tab characters and output
[...]
[...] in identical files'' line because -x prohibits proper
calculation of this value.
INTERNALS
samefile uses two stages to give optimum performance.
In the first stage, all non-plain files are skipped (directories,
devices, FIFOs, sockets, symbolic links) as well as files for which
stat(2) fails and files that have a size less than or equal to the
size given with -g.
Output of the first stage (the filenames) is written into a binary tree
with one node for every file size. It is also at this early stage
where checks for hard links are done. If hard links are found, and -r
is requested, the name pairs are output immediately. The whole list of
hard linked name pairs will therefore appear before any output of the
second stage.
For any i-node only one filename will be added to the binary tree
(unless -i was requested.)
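
The first stage described above can be sketched in Python as follows. This is an illustration under stated assumptions, not samefile's actual code: a dictionary stands in for the binary tree, lstat is assumed so that symbolic links are seen as links, and the names first_stage, min_size (the role of -g) and keep_all_names (the role of -i) are hypothetical. The -l option is not modelled.

```python
import os
import stat
from collections import defaultdict


def first_stage(filenames, min_size=0, keep_all_names=False):
    """Group candidate files by size and collect hard-linked name pairs."""
    by_size = defaultdict(list)   # stand-in for the binary tree keyed by size
    seen_inodes = {}              # (device, i-node) -> first filename seen
    hard_links = []               # (first name, later name) pairs
    for name in filenames:
        try:
            st = os.lstat(name)   # assumption: lstat, so symlinks are not followed
        except OSError:
            continue              # skip files for which stat fails
        if not stat.S_ISREG(st.st_mode):
            continue              # skip directories, devices, FIFOs, sockets, symlinks
        if st.st_size <= min_size:
            continue              # skip files not larger than min_size
        key = (st.st_dev, st.st_ino)
        if key in seen_inodes:
            hard_links.append((seen_inodes[key], name))
            if not keep_all_names:
                continue          # only one name per i-node is kept
        else:
            seen_inodes[key] = name
        by_size[st.st_size].append(name)
    return by_size, hard_links
```

Only sizes with more than one name in by_size need any content comparison in the second stage.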
In the second stage all files having the same size are compared against
each other. The rules of mathematical logic are applied to reduce work
and output noise (unless -x is requested): if files a, b, and c have
the same size and samefile finds that a = b and a = c, then it will not
compare b against c and will output lines only for a = b and a = c, not
for b and c. Note, however, that because only the first filename per
i-node gets into the second stage, the output for a group of identical
files with different i-node numbers is also minimized. Suppose you have
six identical files of size 100 in an i-node group consisting of the
three i-nodes with numbers 10, 20 and 30 (the term 'i-node group' has
nothing to do with the i-node group notion of some file systems; it
merely refers to a set of i-nodes addressing files with identical
contents):
% ls -i
10 file1 20 file4 30 file6
10 file2 20 file5
10 file3
% ls | samefile
100 file1 file4 = 3 2
100 file1 file6 = 3 1
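
The a = b, a = c shortcut of the second stage can be sketched in Python as below. This is one possible way to get the behaviour described, not necessarily how samefile implements it; the function name second_stage is made up, and the standard filecmp module is used for the byte-by-byte comparison.

```python
import filecmp


def second_stage(same_size_files):
    """Return (representative, duplicate) pairs for files of one size.

    Each file is compared only against the first file of every content
    class found so far. Once a == b is known, b is never used as a
    comparison partner again, so only a == b and a == c are reported.
    """
    representatives = []  # one filename per distinct content seen so far
    pairs = []
    for name in same_size_files:
        for rep in representatives:
            # shallow=False forces a comparison of the file contents
            if filecmp.cmp(rep, name, shallow=False):
                pairs.append((rep, name))
                break
        else:
            representatives.append(name)  # new content: becomes a representative
    return pairs
```

Called with the three names that survive the first stage in the example above (file1, file4, file6), it would pair file1 with file4 and file1 with file6, matching the two output lines.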
The sum of the sizes in the first column is the amount of disk space
you could gain by making all 6 files links to only one file or by
removing all but one of the files. To be precise, disk space is
allocated in blocks - you will probably gain two blocks here, rather
than 200 bytes. Note that it is not enough to just remove file4 and
file6 (you would gain only 100 bytes because file5 still exists). The
proper way is to use the -i option. The output will look like
100 file1 file2 = 3 3
100 file1 file3 = 3 3
100 file1 file4 = 3 2
100 file1 file6 = 3 1
100 file2 file3 = 3 3
100 file2 file4 = 3 2
100 file2 file5 = 3 2
100 file2 file6 = 3 1
100 file3 file4 = 3 2
100 file3 file5 = 3 2
100 file3 file6 = 3 1
100 file4 file5 = 2 2
100 file4 file6 = 2 1
100 file5 file6 = 2 1
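
The disk space estimate mentioned earlier (the sum of the first column) applies to the default output, not to the -i listing above, where the same file appears in many pairs. A small Python sketch of that arithmetic follows. The function names are hypothetical, and the 512-byte block size is only an assumption for illustration; real file systems use various block sizes.

```python
def reclaimable_bytes(lines, sep="\t"):
    """Sum the size column (first field) of default samefile output lines."""
    return sum(int(line.split(sep, 1)[0]) for line in lines if line.strip())


def reclaimable_blocks(lines, block_size=512, sep="\t"):
    """Estimate freed blocks, rounding each listed file up to whole blocks."""
    total = 0
    for line in lines:
        if not line.strip():
            continue                      # ignore blank lines
        size = int(line.split(sep, 1)[0])
        total += -(-size // block_size)   # ceiling division
    return total
```

For the two-line default output of the example (two files of 100 bytes each) this gives 200 bytes, or two blocks under the assumed block size.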
EXAMPLES
Find all identical files in the current working directory:
% ls | samefile
Find all identical files in my HOME directory and subdirectories and
also tell me if there are hard links:
% find "$HOME" -type f -print | samefile -r
Find all identical files in the /usr directory tree that are bigger
than 10000 bytes and write the result to /tmp/usr (that one is for the
sysadmin folks; you may want to put this command in the background
with the ampersand & because it takes a few minutes):
% find /usr -type f -print | samefile -g 10000 >/tmp/usr
DIAGNOSTICS
You will see a short usage message if you use illegal options.
malloc - free = xxxx
I didn't free the memory I've malloc(3)ed. You found a bug.
Please report it to the author.
Allocation failed for 'expr' ...
Oops! You ran out of virtual memory. You must have a really big
filename list. Try to use a smaller one or increase the resources
available to your processes. For more information see ulimit(1)
or the equivalent builtin of your shell.
SEE ALSO
ln(1), find(1), rm(1), df(1)
NOTES
Input filenames must not have leading or trailing white space unless
the white space is part of the filename.
BUGS