The hash table is a wonderful data structure. Unfortunately, no one wraps it in the right abstraction. Typically, hash-table implementations do some hashing internally on top of the user-supplied hash; that internal mixing is insufficient unless you're hashing pointers, and it wastes time if you're already providing a good hash function.
But how do you know if your hash function is good enough? Hash-table implementations make it hard to diagnose when a cheap hash function is causing clustering, because set and map abstractions hide the bucket array from the user. Of course, clustering can also be a denial-of-service vulnerability.
Hence, a modest proposal: all set and map abstractions wrapped around a hash table should provide a method for measuring clustering. When I teach hash tables, I suggest one reasonable way to measure it, which boils down to computing the sample variance of the bucket sizes. Under a uniform hash function, bucket sizes are approximately Poisson-distributed, so the sample variance should be close to the mean bucket size; a sample variance far above the mean probably means a bad hash function is causing clustering.
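The metric above can be sketched in a few lines. This is an illustrative implementation, not from the original post: the function names, the identity hash as the "bad" example, and MD5 standing in for a strong hash are all my choices. It normalizes the sample variance by the mean, so a score near 1 is what a uniform hash produces and a much larger score signals clustering.

```python
import hashlib
from collections import Counter

def clustering_score(bucket_sizes):
    """Sample variance of bucket sizes, divided by the mean bucket size.

    Under a uniform hash, bucket sizes are roughly Poisson-distributed,
    so variance ~= mean and the score hovers near 1.  A score well above
    1 suggests the hash function is clustering keys."""
    m = len(bucket_sizes)
    mean = sum(bucket_sizes) / m
    var = sum((b - mean) ** 2 for b in bucket_sizes) / (m - 1)
    return var / mean

def bucket_sizes(keys, m, h):
    """Size of each of the m buckets when keys are placed by hash h."""
    counts = Counter(h(k) % m for k in keys)
    return [counts.get(i, 0) for i in range(m)]

# Hypothetical hashes for illustration: the identity is a classic bad
# choice for structured keys; MD5 stands in for a well-mixing hash.
def identity_hash(k):
    return k

def strong_hash(k):
    return int.from_bytes(hashlib.md5(str(k).encode()).digest()[:8], "big")

keys = range(0, 8000, 8)   # 1000 keys that all share their low three bits
bad = clustering_score(bucket_sizes(keys, 512, identity_hash))
good = clustering_score(bucket_sizes(keys, 512, strong_hash))
# bad lands far above 1 (only 64 of 512 buckets are ever used);
# good stays close to 1
```

Note that a hash-table library could compute this directly from its own bucket array; the point of the proposal is that users cannot, because the bucket array is hidden behind the abstraction.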