[LUNI] variation on a theme suggested by SyL

Williamson, Brad Brad.Williamson at uop.com
Thu Dec 14 13:32:47 CST 2006


I pulled this out of a "SpaceHog" script that I had lying around:

find /home -type f -print0 | xargs -0 md5sum | tee /root/checksums.txt
sort /root/checksums.txt | uniq -w 32 -D | tee /root/duplicates.txt
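
For anyone who would rather post-process checksums.txt in a script, here
is a rough Python sketch of what the "uniq -w 32 -D" step does: it groups
md5sum output lines by their first 32 characters (the hex digest) and
keeps only the digests that occur more than once. The function name
duplicate_groups is made up for illustration, and the sketch assumes the
usual md5sum line layout of digest, two separator characters, then path.

```python
from collections import defaultdict


def duplicate_groups(lines, width=32):
    """Map each repeated digest to the list of paths that share it.

    Assumes md5sum-style lines: <digest><2 separator chars><path>.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if len(line) <= width + 2:
            continue  # skip blank or malformed lines
        digest, path = line[:width], line[width + 2:]
        groups[digest].append(path)
    # keep only digests seen more than once, like uniq -D
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Feeding it the open checksums.txt file object should give roughly the
same groups as duplicates.txt, without needing the sort step.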

The script isn't finished, but the idea is to trawl through gigantic
shares, collect checksums, filenames, paths, sizes, etc., and dump them
into a searchable MySQL database. Once the data is there, you can find
duplicate files, the largest and smallest files, abusive users,
'illegal' files, backup files, etc.
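
To make the database idea concrete, here is a minimal sketch of the kind
of table and queries it implies. This is not the SpaceHog code: it uses
Python's built-in sqlite3 module as a stand-in for MySQL so that it runs
without a server, and the table name, column names, and function names
are all assumptions made up for illustration.

```python
import sqlite3


def build_index(rows):
    """Load (path, size_bytes, md5_hex) tuples into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)"
    )
    db.execute("CREATE INDEX files_md5 ON files (md5)")
    db.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def duplicates(db):
    """Return (md5, path) rows for every checksum that occurs more than once."""
    return db.execute(
        "SELECT md5, path FROM files "
        "WHERE md5 IN (SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1) "
        "ORDER BY md5, path"
    ).fetchall()


def largest(db, n=10):
    """Return the n largest files as (path, size) rows."""
    return db.execute(
        "SELECT path, size FROM files ORDER BY size DESC, path LIMIT ?", (n,)
    ).fetchall()
```

The same two SELECT statements should carry over to MySQL more or less
unchanged; only the connection setup would differ.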

It still needs incremental updating capability, speed issues addressed,
etc., but these two lines work fine on their own. Expect the CPU and disk
to be thrashed while they run...

Brad 


-----Original Message-----
From: luni-bounces at luni.org [mailto:luni-bounces at luni.org] On Behalf Of
John Mason
Sent: Thursday, December 14, 2006 12:30 PM
To: Linux Users Of Northern Illinois - Technical Discussion
Subject: Re: [LUNI] variation on a theme suggested by SyL

On Wed, Dec 13, 2006 at 06:39:26PM -0600, Martin Maney wrote:
> 
> SyL has so much disk space that he doesn't know what media files he
> might have dupes of.  So he wanted to find likely candidates, and
> today's notion was just to troll the output of, e.g., "du -a" for
> matching sizes and basenames.  Since you need the full path names, I
> figured it was easier to knock up something using whichever scripting
> language was handy, and it turned out to look something like this:
> 
> <file name="dupes.py">
> # usage: du [-a] <target> | python dupes.py
> # prints sets of paths that share the same size and basename
> 
> import sys
> 
> sizes = {}
> for line in sys.stdin:
>     fields = line.rstrip('\n').split(None, 1)
>     if len(fields) != 2:
>         continue  # skip blank or malformed lines
>     size, path = fields
>     # group paths under a (size, basename) key
>     key = (size, path.split('/')[-1])
>     sizes.setdefault(key, []).append(path)
> # only keys with more than one path are duplicate candidates
> candidates = [paths for paths in sizes.values() if len(paths) > 1]
> print("found %d candidate sets" % len(candidates))
> for cs in candidates:
>     print(', '.join(cs))
> </file>
> 

Consider md5sums of the candidate dups.
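
A rough sketch of that step, assuming the candidate sets are available
as lists of paths the way dupes.py builds them: hash only the files in
each candidate set and keep the paths whose MD5 sums actually match. The
names md5_of and confirm are hypothetical helpers, not part of the
script above.

```python
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def confirm(candidates):
    """Split each candidate set into groups of paths with identical MD5 sums."""
    confirmed = []
    for paths in candidates:
        by_sum = defaultdict(list)
        for path in paths:
            try:
                by_sum[md5_of(path)].append(path)
            except OSError:
                continue  # directory, unreadable, or vanished: skip it
        confirmed.extend(g for g in by_sum.values() if len(g) > 1)
    return confirmed
```

Since only files that already match on size and basename get read, this
should touch far less data than checksumming the whole tree.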
-- 
%40 <- Ceci n'est pas une @.              John Mason - jlm at uic.edu
University of Illinois at Chicago - Academic Computing and Communications Center
   Usenet Administrator, Listserv Administrator, Sun Software Contact et al.
-- 
Linux Users Of Northern Illinois - Technical Discussion 
http://luni.org/mailman/listinfo/luni


