[LUNI] multi-dimensional hash
Jay Strauss
me at heyjay.com
Sun Nov 1 20:01:05 CST 2009
Well, I'm actually reading a file, and so I don't have all the values in
memory. I need to keep track of what I've seen.
While I'd love to use pre/post processing, my script may run on ms-dos as
well as *nix. Furthermore, I need to report the duplicate values to the
user. (I don't think sort -u will write a file of duplicates)
Thanks
Jay
On Sat, Oct 31, 2009 at 5:31 PM, Martin Maney <maney at two14.net> wrote:
> On Sat, Oct 31, 2009 at 12:38:43PM -0500, Keith T. Garner wrote:
> > On Oct 31, 2009, at 10:01 AM, Jay Strauss wrote:
> > > I have an array of keys, and need to weed out duplicate key
> > > sequences. I figured I'd use a hash to determine if I've seen the
> > > key before. I don't know how many fields are in the key prior to
> > > execution.
>
> <probably-unfair-paraphrase>
> Some people, when faced with a problem, think "I'll use a hash!" Now
> they have two problems. :-)
> </probably-unfair-paraphrase>
>
> Well, you did say "array", which suggests you've got all the data in
> memory anyway. In that case hashing is surely unnecessary. Sort it in
> key order, then do a simple "uniq" pass, eliminating all but the first
> of the (now adjacent) duplicate keys. For that matter you can probably
> do this using the CLI sort and uniq commands with maybe a little pre
> and post processing, and that will scale to about half the size of your
> filesystem, albeit with more disk traffic than if you only needed to
> read, process in memory, then write...
>
> > But your concatenation idea might be easier, if you can guarantee a :
> > is never part of a key.
>
> What difference does that make? Sure, it *could* cause a false
> positive, but the of flase positives is inherent in using hashes
> anyway, so you have to do two passes if all you can hold in memory at
> once are the hashes (but you may want to lookup "trie" before you give
> up on holding all the key data in RAM). First pass finds hashes with
> collisions, 2nds pass compares the actual keys that hash to those
> values and tosses the ones that actually match.
>
> --
> Education makes people easy to lead, but difficult to drive;
> easy to govern, but impossible to enslave. -- Henry Peter Brougham
>
> --
> Linux Users Of Northern Illinois (Chicago) - Technical Discussion
> http://luni.org/mailman/listinfo/luni
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://luni.org/pipermail/luni/attachments/20091101/fdb90a5b/attachment.html
More information about the LUNI
mailing list