[LUNI] variation on a theme suggested by SyL

Martin Maney maney at two14.net
Wed Dec 13 18:39:26 CST 2006


Syl has so much disk space that he doesn't know what media files he
might have dupes of.  So he wanted to find likely candidates, and
today's notion was just to troll the output of, e.g., "du -a" for
matching sizes and basenames.  Since you need the full path names, I
figured it was easier to knock up something using whichever scripting
language was handy, and it turned out to look something like this:

<file name="dupes.py">
# usage: du [-a] <target> | python dupes.py
# prints sets of files with the same sizes and basename

import sys

sizes = {}
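for line in sys.stdin:
    # each line of du output is "<size><whitespace><path>"; split only
    # once so that paths containing spaces stay in one piece
    size, path = line.strip().split(None, 1)
    # group on (size, basename)
    key = (size, path.split('/')[-1])
    sizes.setdefault(key, []).append(path)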
candidates = [paths for paths in sizes.values() if len(paths) > 1]
# parenthesized single-argument print runs under both Python 2 and 3
print("found %d candidate sets" % len(candidates))
for cs in candidates:
    print(', '.join(cs))
</file>

Prolly should have called it filedupes, or filedupes1, since it's
really a fairly crude approximation: it misses even trivially different
names and reports coincidental matches (which may be less unlikely for
other uses than this particular case)...
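
One way to weed out the coincidental matches would be to compare the
actual file contents within each candidate set.  What follows is only a
minimal sketch, not part of dupes.py: confirm_dupes and chunk_size are
names made up for illustration, SHA-256 is just one reasonable choice of
hash, and it assumes the files are still readable at the paths du
reported.

```python
import hashlib
import os
from collections import defaultdict


def confirm_dupes(paths, chunk_size=1 << 20):
    """Split one candidate set into groups of files with identical content.

    Hypothetical helper: takes the list of paths from one candidate set
    and returns only the groups (two or more paths) whose contents hash
    to the same SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for path in paths:
        # "du -a" lists directories as well as files; skip anything
        # that is not a regular file
        if not os.path.isfile(path):
            continue
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            # read in chunks so large media files are not loaded whole
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        by_digest[h.hexdigest()].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Each list in candidates could be passed through confirm_dupes before
printing.  This does nothing for the other weakness (dupes stored under
different names); catching those would mean grouping on size alone
before hashing.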

-- 
We've all heard that a million monkeys banging on
a million typewriters will eventually reproduce
the entire works of Shakespeare.  Now, thanks to the
Internet, we know this is not true.  -- Robert Wilensky


