SYNOPSIS

       samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]


DESCRIPTION

       samefile reads a list of filenames (one filename per line) from stdin.
       For each pair of files with identical contents, it writes a line of
       six fields: the size in bytes, the two filenames, the character ``=''
       if the two files are on the same device or ``X'' otherwise, and the
       link counts of the two files.  The output is sorted in reverse order
       by size as the primary key and by filename as the secondary key.
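
       The six-field line format described above can be split mechanically.
       The following is a minimal Python sketch, not part of samefile; the
       names Pair and parse_line are hypothetical, and it assumes the default
       tab separator and filenames that do not contain the separator.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """One samefile output line (hypothetical helper type)."""
    size: int          # size in bytes
    name1: str         # first filename
    name2: str         # second filename
    same_device: bool  # True for ``='', False for ``X''
    links1: int        # link count of the first file
    links2: int        # link count of the second file


def parse_line(line: str, sep: str = "\t") -> Pair:
    """Split one default-mode output line into its six fields."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 6:
        # Filenames containing the separator end up here; see -s.
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    size, name1, name2, device, links1, links2 = fields
    return Pair(int(size), name1, name2, device == "=",
                int(links1), int(links2))
```

       A line that does not split into exactly six fields raises ValueError,
       which is the case the -s option is meant to avoid.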


OPTIONS

       -0     Indicates that the input list of file names is  NUL  terminated,
              for example as generated by implementations of find(1) that sup-
              port the -print0 option.  Without this option,  the  file  names
              are assumed to be newline terminated.

       -a     Do not sort files with the same size alphabetically.

       -g size
              Compare only files with size greater than size bytes. Default is
              0.

       -i     Allow files with the same device/i-node pair to be added to  the
              binary  tree.  This  might  be useful if output will be fed into
              some other program.  If this option is used, the statistics dis-
              played  when using -v will not contain the ``You have a total of
              x bytes in identical files'' line because  -i  prohibits  proper
              calculation of this value.

       -l     Do  not  check  if  files with identical contents are hard links
              created by ln(1).  By default, samefile  checks  if  files  with
              identical  contents  are  hard linked and, if they are, does not
              write a name pair to stdout. A slight  speedup  is  gained  when
              using  this  option.   This  option  is incompatible with the -r
              option.

       -q     Do not issue warning messages when open(2) fails.  Such a
              warning usually means that open failed with a 'permission
              denied' error on files or directories for which you have no
              read permission.  Useful if you are not root and want to compare
              your files against files in a system directory like /etc.

       -r     Report whether identical files are hard linked.   The  separator
              string  followed  by  the  [bracketed] link count is appended to
              each name pair if they are hard links  created  with  ln.   This
              option  is  incompatible with the -l option. Note that this kind
              of output has only four fields and will appear  unsorted  before
              the actual output of samefile.

       -s sep Use string sep as the output field separator; the default is a
              tab character.  Useful if filenames contain tab characters and
              output ...

              ... in identical files'' line because -x prohibits proper
              calculation of this value.


INTERNALS

       samefile uses two stages to give optimum performance.

       In the first stage,  all  non-plain  files  are  skipped  (directories,
       devices,  FIFOs,  sockets,  symbolic  links) as well as files for which
       stat(2) fails and files that have a size less than or  equal  to  size.
       The output of the first stage (the filenames) is written into a binary
       tree with one node for every file size.  Checks for hard links are
       also done at this early stage.  If hard links are found and -r is
       requested, the name pairs are output immediately.  The whole list of
       hard  linked  name pairs will therefore appear before any output of the
       second stage.

       For any i-node only one filename will  be  added  to  the  binary  tree
       (unless -i was requested.)
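
       The first stage described above can be sketched in Python as follows.
       This is an illustration under stated assumptions, not samefile's
       actual code: the function name stage_one and its parameters are
       hypothetical, a dictionary stands in for the binary tree, and
       os.lstat is used so that symbolic links are seen as links and skipped.

```python
import os
import stat
from collections import defaultdict


def stage_one(names, min_size=0, keep_same_inode=False):
    """Group candidate filenames by file size.

    min_size mirrors -g and keep_same_inode mirrors -i (both hypothetical
    parameter names). A dict replaces the binary tree of the real program.
    """
    by_size = defaultdict(list)
    seen = set()  # (device, i-node) pairs already added
    for name in names:
        try:
            st = os.lstat(name)  # assumption: lstat, so symlinks are not followed
        except OSError:
            continue             # stat failed: skip the file
        if not stat.S_ISREG(st.st_mode):
            continue             # directories, devices, FIFOs, sockets, symlinks
        if st.st_size <= min_size:
            continue             # size less than or equal to the -g limit
        key = (st.st_dev, st.st_ino)
        if key in seen and not keep_same_inode:
            continue             # only one filename per i-node
        seen.add(key)
        by_size[st.st_size].append(name)
    return by_size
```

       Only size groups holding at least two names need any work in the
       second stage.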

       In the second stage, all files having the same size are compared
       against each other.  The rules of mathematical logic are applied to
       reduce work and output noise (unless -x is requested): if files a, b,
       and c have the same size and samefile finds that a = b and a = c, then
       it will not compare b against c.  It will output lines only for a = b
       and a = c, not for b and c.  Note however, that because only the first filename
       per  i-node gets into the second stage, the output for a group of iden-
       tical files with different i-node numbers is  also  minimized.  Suppose
       you  have six identical files of size 100 in an i-node group consisting
       of the three i-nodes with numbers 10,  20  and  30  (the  term  'i-node
       group' has nothing to do with the i-node group notion of some file sys-
       tems - it merely refers to a set of i-nodes addressing files with iden-
       tical contents):

       % ls -i
          10 file1     20 file4     30 file6
          10 file2     20 file5
          10 file3
       % ls | samefile
       100     file1   file4   =       3       2
       100     file1   file6   =       3       1

       The sum of the sizes in the first column is the amount of disk space
       you could gain by making all six files links to a single file or by
       removing all but one of the files.  To be precise, disk space is
       allocated in blocks, so you will probably gain two blocks here rather
       than 200 bytes.  Note that it is not enough to just remove file4 and
       file6 (you would gain only 100 bytes because file5 still exists).  The
       proper way is to use the -i option.  The output will look like

       100     file1   file2   =       3       3
       100     file1   file3   =       3       3
       100     file1   file4   =       3       2
       100     file1   file6   =       3       1
       100     file2   file3   =       3       3
       100     file2   file4   =       3       2
       100     file2   file5   =       3       2
       100     file2   file6   =       3       1
       100     file3   file4   =       3       2
       100     file3   file5   =       3       2
       100     file3   file6   =       3       1
       100     file4   file5   =       2       2
       100     file4   file6   =       2       1
       100     file5   file6   =       2       1
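
       The shortcut the second stage takes when -x is not requested (a = b
       and a = c implies b = c, so b and c are never compared) can be
       sketched in Python.  This is a simplified illustration, not
       samefile's implementation: stage_two and its same parameter are
       hypothetical names, and filecmp.cmp stands in for whatever
       comparison the program actually performs.

```python
import filecmp


def stage_two(names, same=None):
    """Return (a, b) pairs of identical files among names of one size.

    same is an injectable comparison function (hypothetical parameter);
    by default file contents are compared byte by byte.
    """
    if same is None:
        def same(a, b):
            return filecmp.cmp(a, b, shallow=False)
    pairs = []
    matched = set()  # files already known to equal an earlier file
    for i, a in enumerate(names):
        if a in matched:
            continue  # a equals an earlier file; its pairs are implied
        for b in names[i + 1:]:
            if b in matched:
                continue  # b is already paired; skip the comparison
            if same(a, b):
                pairs.append((a, b))
                matched.add(b)
    return pairs
```

       Each file that matches an earlier one is reported once and then left
       out of all further comparisons, which keeps both the work and the
       output small.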



EXAMPLES

       Find all identical files in the current working directory:

       % ls | samefile

       Find  all  identical  files in my HOME directory and subdirectories and
       also tell me if there are hard links:

       % find $HOME -type f -print | samefile -r

       Find all identical files in the /usr directory tree that are bigger
       than 10000 bytes and write the result to /tmp/usr.  (This one is for
       the sysadmin folks.  It takes a few minutes, so you may want to put
       the command in the background with the ampersand &.)

       % find /usr -type f -print | samefile -g 10000 >/tmp/usr
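
       When find(1) is not convenient, the input list can also be built by a
       script.  The Python sketch below produces a NUL-terminated list
       suitable for the -0 option; the function name nul_terminated_list is
       hypothetical, and the sketch relies on samefile itself to skip
       anything that is not a plain file.

```python
import os


def nul_terminated_list(root):
    """Return every filename under root as NUL-terminated bytes."""
    chunks = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # fsencode keeps names that are not valid text intact
            chunks.append(os.fsencode(path) + b"\0")
    return b"".join(chunks)
```

       The resulting bytes can be written to the standard input of
       samefile -0, for example through the input argument of Python's
       subprocess.run.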



DIAGNOSTICS

       You will see a short usage message if you use illegal options.

       malloc - free = xxxx
              I  didn't  free  the  memory I've malloc(3)ed.  You found a bug.
              Please report it to the author.

       Allocation failed for 'expr' ...
              Oops! You ran out of virtual memory. You must have a really big
              filename list. Try to use a smaller one or increase the
              resources available to your processes.  For more information
              see ulimit(1) or the equivalent builtin of your shell.


SEE ALSO

       ln(1), find(1), rm(1), df(1)


NOTES

       Input  filenames  must  not have leading or trailing white space unless
       the white space is part of the filename.


BUGS


