123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195(** [Hash_intf.S] is the interface which a hash function must support.
The functions of [Hash_intf.S] are only allowed to be used in specific sequence:
[alloc], [reset ?seed], [fold_..*], [get_hash_value], [reset ?seed], [fold_..*],
[get_hash_value], ...
(The optional [seed]s passed to each reset may differ.)
The chain of applications from [reset] to [get_hash_value] must be done in a
single-threaded manner (you can't use [fold_*] on a state that's been used
before). More precisely, [alloc ()] creates a new family of states. All functions that
take [t] and produce [t] return a new state from the same family.
At any point in time, at most one state in the family is "valid". The other states are
"invalid".
- The state returned by [alloc] is invalid.
- The state returned by [reset] is valid (all of the other states become invalid).
- The [fold_*] family of functions requires a valid state and produces a valid state
(thereby making the input state invalid).
- [get_hash_value] requires a valid state and makes it invalid.
These requirements are currently formally encoded in the [Check_initialized_correctly]
module in bench/bench.ml. *)open!Import0moduletypeS=sig(** Name of the hash-function, e.g., "internalhash", "siphash" *)valdescription:string(** [state] is the internal hash-state used by the hash function. *)typestate(** [fold_<T> state v] incorporates a value [v] of type <T> into the hash-state,
returning a modified hash-state. Implementations of the [fold_<T>] functions may
mutate the [state] argument in place, and return a reference to it. Implementations
of the fold_<T> functions should not allocate. *)valfold_int:state->int->statevalfold_int64:state->int64->statevalfold_float:state->float->statevalfold_string:state->string->state(** [seed] is the type used to seed the initial hash-state. *)typeseed(** [alloc ()] returns a fresh uninitialized hash-state. May allocate. *)valalloc:unit->state(** [reset ?seed state] initializes/resets a hash-state with the given [seed], or else a
default-seed. Argument [state] may be mutated. Should not allocate. *)valreset:?seed:seed->state->state(** [hash_value] The type of hash values, returned by [get_hash_value]. *)typehash_value(** [get_hash_value] extracts a hash-value from the hash-state. *)valget_hash_value:state->hash_valuemoduleFor_tests:sigvalcompare_state:state->state->intvalstate_to_string:state->stringendendmoduletypeBuiltin_hash_fold_intf=sigtypestatetype'afolder=state->'a->statevalhash_fold_nativeint:nativeintfoldervalhash_fold_int64:int64foldervalhash_fold_int32:int32foldervalhash_fold_char:charfoldervalhash_fold_int:intfoldervalhash_fold_bool:boolfoldervalhash_fold_string:stringfoldervalhash_fold_float:floatfoldervalhash_fold_unit:unitfoldervalhash_fold_option:'afolder->'aoptionfoldervalhash_fold_list:'afolder->'alistfoldervalhash_fold_lazy_t:'afolder->'alazy_tfolder(** Hash support for [array] and [ref] is provided, but is potentially DANGEROUS, since
it incorporates the current contents of the array/ref into the hash value. Because
of this we add a [_frozen] suffix to the function name.
Hash support for [string] is also potentially DANGEROUS, but strings are mutated
less often, so we don't append [_frozen] to it.
Also note that we don't support [bytes]. *)valhash_fold_ref_frozen:'afolder->'areffoldervalhash_fold_array_frozen:'afolder->'aarrayfolderendmoduletypeBuiltin_hash_intf=sigtypehash_valuevalhash_nativeint:nativeint->hash_valuevalhash_int64:int64->hash_valuevalhash_int32:int32->hash_valuevalhash_char:char->hash_valuevalhash_int:int->hash_valuevalhash_bool:bool->hash_valuevalhash_string:string->hash_valuevalhash_float:float->hash_valuevalhash_unit:unit->hash_valueendmoduletypeBuiltin_intf=sigincludeBuiltin_hash_fold_intfincludeBuiltin_hash_intfendmoduletypeFull=sigincludeS(** @inline *)type'afolder=state->'a->state(** [create ?seed ()] is a convenience. Equivalent to [reset ?seed (alloc ())]. *)valcreate:?seed:seed->unit->state(** [of_fold fold] constructs a standard hash function from an existing fold
function. *)valof_fold:(state->'a->state)->'a->hash_valuemoduleBuiltin:Builtin_intfwithtypestate:=stateandtype'afolder:='afolderandtypehash_value:=hash_value(** [run ?seed folder x] runs [folder] on [x] in a newly allocated hash-state,
initialized using optional [seed] or a default-seed.
The following identity exists: [run [%hash_fold: T]] == [[%hash: T]]
[run] can be used if we wish to run a hash-folder with a non-default seed. *)valrun:?seed:seed->'afolder->'a->hash_valueendmoduletypeHash=sigmoduletypeFull=FullmoduletypeS=SmoduleF(Hash:S):Fullwithtypehash_value=Hash.hash_valueandtypestate=Hash.stateandtypeseed=Hash.seed(** The code of [ppx_hash] is agnostic to the choice of hash algorithm that is
used. However, it is not currently possible to mix various choices of hash algorithms
in a given code base.
We experimented with:
- (a) custom hash algorithms implemented in OCaml and
- (b) in C;
- (c) OCaml's internal hash function (which is a custom version of Murmur3,
implemented in C);
- (d) siphash, a modern hash function implemented in C.
Our findings were as follows:
- Implementing our own custom hash algorithms in OCaml and C yielded very little
performance improvement over the (c) proposal, without providing the benefit of being
a peer-reviewed, widely used hash function.
- Siphash (a modern hash function with an internal state of 32 bytes) has a worse
performance profile than (a,b,c) above (hashing takes more time). Since its internal
state is bigger than an OCaml immediate value, one must either manage allocation of
such state explicitly, or paying the cost of allocation each time a hash is computed.
While being a supposedly good hash function (with good hash quality), this quality was
not translated in measurable improvements in our macro benchmarks. (Also, based on
the data available at the time of writing, it's unclear that other hash algorithms in
this class would be more than marginally faster.)
- By contrast, using the internal combinators of OCaml hash function means that we do
not allocate (the internal state of this hash function is 32 bit) and have the same
quality and performance as Hashtbl.hash.
Hence, we are here making the choice of using this Internalhash (that is, Murmur3, the
OCaml hash algorithm as of 4.03) as our hash algorithm. It means that the state of the
hash function does not need to be preallocated, and makes for simpler use in hash
tables and other structures. *)(** @open *)includeFullwithtypestate=Base_internalhash_types.stateandtypeseed=Base_internalhash_types.seedandtypehash_value=Base_internalhash_types.hash_valueend