[LUNI] multi-dimensional hash

Jay Strauss me at heyjay.com
Sun Nov 1 20:01:05 CST 2009


Well, I'm actually reading a file, and so I don't have all the values in
memory.  I need to keep track of what I've seen.

While I'd love to use pre/post processing, my script may run on ms-dos as
well as *nix.  Furthermore, I need to report the duplicate values to the
user.  (I don't think sort -u will write a file of duplicates)

Thanks
Jay

On Sat, Oct 31, 2009 at 5:31 PM, Martin Maney <maney at two14.net> wrote:

> On Sat, Oct 31, 2009 at 12:38:43PM -0500, Keith T. Garner wrote:
> > On Oct 31, 2009, at 10:01 AM, Jay Strauss wrote:
> > > I have an array of keys, and need to weed out duplicate key
> > > sequences.  I figured I'd use a hash to determine if I've seen the
> > > key before.  I don't know how many fields are in the key prior to
> > > execution.
>
> <probably-unfair-paraphrase>
> Some people, when faced with a problem, think "I'll use a hash!"  Now
> they have two problems.  :-)
> </probably-unfair-paraphrase>
>
> Well, you did say "array", which suggests you've got all the data in
> memory anyway.  In that case hashing is surely unnecessary.  Sort it in
> key order, then do a simple "uniq" pass, eliminating all but the first
> of the (now adjacent) duplicate keys.  For that matter you can probably
> do this using the CLI sort and uniq commands with maybe a little pre
> and post processing, and that will scale to about half the size of your
> filesystem, albeit with more disk traffic than if you only needed to
> read, process in memory, then write...
>
> > But your concatenation idea might be easier, if you can guarantee a :
> > is never part of a key.
>
> What difference does that make?  Sure, it *could* cause a false
> positive, but the of flase positives is inherent in using hashes
> anyway, so you have to do two passes if all you can hold in memory at
> once are the hashes (but you may want to lookup "trie" before you give
> up on holding all the key data in RAM).  First pass finds hashes with
> collisions, 2nds pass compares the actual keys that hash to those
> values and tosses the ones that actually match.
>
> --
> Education makes people easy to lead, but difficult to drive;
> easy to govern, but impossible to enslave.  -- Henry Peter Brougham
>
> --
> Linux Users Of Northern Illinois (Chicago) - Technical Discussion
> http://luni.org/mailman/listinfo/luni
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://luni.org/pipermail/luni/attachments/20091101/fdb90a5b/attachment.html 


More information about the LUNI mailing list