This provides a scalar implementation and one using the SHA1
instruction set extensions.
For the scalar implementation, the w array is kept in registers,
which speeds up the whole operation.

For a 10 GiB file on my Windows 2023 Dev Kit (ARM Cortex A78C /
ARM Cortex X1C):

Performance core:

	pre	43.1s (238 MB/s)
	generic	41.3s (247 MB/s)
	scalar	35.0s (293 MB/s)
	sha1	12.8s (800 MB/s)

Efficiency core:

	pre	54.2s (189 MB/s)
	generic	55.9s (183 MB/s)
	scalar	43.0s (238 MB/s)
	sha1	16.2s (632 MB/s)
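
The remark above about keeping the w array in registers can be illustrated
with a rough C sketch. It is not the committed assembly; the function names
sha1_block_sketch and sha1_sketch_abc_ok are hypothetical. It assumes the
benefit comes from the standard observation that the SHA-1 message schedule
only ever needs its 16 most recent words, so a 16-entry rolling window is
small enough for the AArch64 general-purpose registers, whereas a full
80-word array is not. Whether a C compiler actually keeps the window in
registers is up to the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/*
 * Hypothetical sketch: process one 64-byte block while holding only a
 * 16-word rolling window of the message schedule instead of all 80 words.
 */
static void sha1_block_sketch(uint32_t state[5], const unsigned char block[64])
{
	uint32_t w[16];
	uint32_t a = state[0], b = state[1], c = state[2];
	uint32_t d = state[3], e = state[4];

	for (int t = 0; t < 80; t++) {
		uint32_t f, k, tmp;

		if (t < 16) {
			/* Load the message words in big-endian order. */
			w[t] = (uint32_t)block[4 * t] << 24 |
			    (uint32_t)block[4 * t + 1] << 16 |
			    (uint32_t)block[4 * t + 2] << 8 |
			    (uint32_t)block[4 * t + 3];
		} else {
			/* W[t] = rol(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1), indices mod 16. */
			w[t & 15] = rol32(w[(t - 3) & 15] ^ w[(t - 8) & 15] ^
			    w[(t - 14) & 15] ^ w[t & 15], 1);
		}

		if (t < 20) {
			f = (b & c) | (~b & d);
			k = 0x5a827999;
		} else if (t < 40) {
			f = b ^ c ^ d;
			k = 0x6ed9eba1;
		} else if (t < 60) {
			f = (b & c) | (b & d) | (c & d);
			k = 0x8f1bbcdc;
		} else {
			f = b ^ c ^ d;
			k = 0xca62c1d6;
		}

		tmp = rol32(a, 5) + f + e + k + w[t & 15];
		e = d;
		d = c;
		c = rol32(b, 30);
		b = a;
		a = tmp;
	}

	state[0] += a;
	state[1] += b;
	state[2] += c;
	state[3] += d;
	state[4] += e;
}

/* Check the sketch against the well-known SHA-1 digest of "abc". */
int sha1_sketch_abc_ok(void)
{
	static const uint32_t expect[5] = {
		0xa9993e36, 0x4706816a, 0xba3e2571, 0x7850c26c, 0x9cd0d89d
	};
	uint32_t state[5] = {
		0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
	};
	/* "abc", the 0x80 padding byte, zeros, then the bit length (24). */
	unsigned char block[64] = { 'a', 'b', 'c', 0x80 };
	int ok;

	block[63] = 24;
	sha1_block_sketch(state, block);
	ok = memcmp(state, expect, sizeof(state)) == 0;
	assert(ok);
	return ok;
}
```

Calling sha1_sketch_abc_ok() runs a single padded block through the sketch
and compares the result with the standard test vector for "abc". An
assembly version can go further than this loop by unrolling all 80 rounds
so that each window slot is a fixed register.
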
Reviewed by: getz
Differential Revision: https://reviews.freebsd.org/D45444