短Hash算法方案

最近在需求中遇到一个问题：在对hash值长度敏感的场景下如何选择合适的hash算法，生成的hash值要尽量短的同时要保证安全hash算法的安全侧基本要求。

如生成短链接目前常用的转换算法进制算法、随机数算法、hash算法三个方案中其中包含7/11位短Hash值的诉求，而目前常见的安全哈希算法的hash值基本为128/256bit。

hash算法概念

hash算法的解释用wikipedia上对TDF使用的一张图片可以做更好的解释 hash算法就是一个建立在压缩函数上的单向函数（One-way function），即根据输入x可以快速计算输出\(f(x)\)，但是根据输出\(f(x)\)很难根据该值推算出对应的输入\(x\)。

从hash算法的特性中可以提炼两个特征

可以抵抗原像攻击：根据hash值\(f(x)\)很难获取输入\(x\)
可以抵抗第二原像攻击（hash碰撞）：给定\(x\)，对于\(x\) != \(x\)'，对于输出应遵循\(f(x)\) != \(f(x')\)

常见Hash算法

目前常用的hash算法主要包括MD5、SHA家族，简单的对比：

算法名称	输出散列值长度-bits	是否存在碰撞
CRC-32	32	是
MD5	128	是
SHA-1	160	是
SHA-256	256	否
SHA-512	512	否

hash算法抗碰撞性

抗碰撞性或在n长度下两个\(x\),\(x'\)经过hash算法获取输出\(f(x)\)=\(f(x')\)的概率是多少？这个问题其实也是生日问题（在>=23个人中至少有两个人生日相同的概率大于50%）的一种具体体现，即如果随机给定N个均小于K的非负整数，其中至少两个相等的概率为\(p(N|K)\)，根据生日问题的数学论证方式分配0.5的碰撞概率，可以得到预期论证公示为 \(\approx\) \(1-e^{ {-k^2/(2N)} }\) - hash长度为16bit做hash碰撞计算

In [14]: import math
In [15]: N = math.pow(2, 16)
In [16]: for i in range(1, 11):
    ...:     K = 50 * i
    ...:     p = 1 - math.exp(-0.5 * K * (K -1) / N)
    ...:     print("N:{}, K:{}, p:{}".format(N, K, p))
    ...:
N:65536.0, K:50, p:0.01851840426367346
N:65536.0, K:100, p:0.07274902012520568
N:65536.0, K:150, p:0.1567712182076978
N:65536.0, K:200, p:0.26188076496494694
N:65536.0, K:250, p:0.3780712916236951
N:65536.0, K:300, p:0.4955853795330586
N:65536.0, K:350, p:0.6062072573580228
N:65536.0, K:400, p:0.7040756309248388
N:65536.0, K:450, p:0.7859443315915255
N:65536.0, K:500, p:0.850959013777904

\(p(300|2^16)\) \(\approx\) 50%

hash长度32bit做hash碰撞计算

In [1]: import math
In [2]: N = math.pow(2, 32)
In [7]: for i in range(1, 11):
   ...:     K = 20000 * i
   ...:     p = 1 - math.exp(-0.5 * K * (K -1) / N)
   ...:     print("N:{}, K:{}, p:{}".format(N, K, p))
   ...:
N:4294967296.0, K:20000, p:0.04549633911131801
N:4294967296.0, K:40000, p:0.16994213048124063
N:4294967296.0, K:60000, p:0.3423537943601189
N:4294967296.0, K:80000, p:0.5252888411157739
N:4294967296.0, K:100000, p:0.687809461338744
N:4294967296.0, K:120000, p:0.8129476331069347
N:4294967296.0, K:140000, p:0.8978919803274777
N:4294967296.0, K:160000, p:0.9492180149513215
N:4294967296.0, K:180000, p:0.9769902093325205
N:4294967296.0, K:200000, p:0.990501197966016

\(p(77000|2^32)\) \(\approx\) 50%

hash长度为64bit做hash碰撞计算

In [23]: for i in range(1, 11):
    ...:     K = math.pow(10, i)
    ...:     p = 1 - math.exp(-0.5 * K * (K -1) / N)
    ...:     print("N:{}, K:{}, p:{}".format(N, K, p))
    ...:
N:1.8446744073709552e+19, K:10.0, p:0.0
N:1.8446744073709552e+19, K:100.0, p:2.220446049250313e-16
N:1.8446744073709552e+19, K:1000.0, p:2.708944180085382e-14
N:1.8446744073709552e+19, K:10000.0, p:2.710276447714932e-12
N:1.8446744073709552e+19, K:100000.0, p:2.710478508305414e-10
N:1.8446744073709552e+19, K:1000000.0, p:2.7105026889628903e-08
N:1.8446744073709552e+19, K:10000000.0, p:2.710501486702377e-06
N:1.8446744073709552e+19, K:100000000.0, p:0.0002710138095318726
N:1.8446744073709552e+19, K:1000000000.0, p:0.026741008873954697
N:1.8446744073709552e+19, K:10000000000.0, p:0.9334968145485968

\(p(5.1*10^9|2^32)\) \(\approx\) 50%

短hash值方案

由上面可知输出\(x\)经过hash算法输出的\(f(x)\)长度越长做相同\(x\)组合下产生hash碰撞的几率越低，但如文章一开始所说实际业务场景对\(f(x)\)hash长度敏感且要尽量减少\(f(x)\) = \(f(x')\)碰撞概率应该怎么选择合适且安全的Hash算法？以生成\(f(x)\)长度为32bit为目的反推合适的hash算法，自己认为可行的方案有三种：
- 选择CRC-32短hash算法。
- 选择目前相对安全的，且hash算法\(f(x)\) 长度为32*2n，n的长度不等于1且越小越好，如MD5 \(f(x)\)=128bit，(1)然后以64bit做分割 \(f(x)\) = <\(L_64\), \(R_64\)>, \(L_64\)和\(R_64\)做异或运算后再做32bit截取后做异或运算生成目标32bit。
- 选择目前没有安全缺陷、hash碰撞低的安全Hash算法如sha-256生成256bit后截取前32bit分割\(f(x) = <L_32, R_224>\)，直接获取\(L_32\)作为结果，\(R_224\)直接丢弃。

方案对比
- CRC-32在一定输入\(x\)组合下产生hash碰撞的几率非常高。
- 使用XOR可以对输入的bit做相对均匀的0、1分布，但是MD5算法目前为非安全Hash安全，且两次的异或运算是否会增加Hash碰撞几率。
- sha-256目前为安全的hash算法，其具有均匀分布性，截取前32bit做短hash值应该是相对可行的方案，但是如何证明可行性？

从NIST中找到一篇和hash值截取的文章¹，其中提到

Some applications may require a value that is shorter than the (full-length) message
digest provided by an approved hash function as specified in FIPS 180-4. In such cases, it
may be appropriate to use a subset of the bits produced by the hash function as the
(shortened) message digest.
Let the (shortened) message digest be called a truncated message digest, and let λ be its
desired length in bits. A truncated message digest may be used if the following
requirements are met:
1. The length of the output block of the approved hash function to be used shall be
greater than λ (i.e., L > λ).
2. The λ left-most bits of the full-length message digest shall be selected as the
truncated message digest.
For example, if a truncated message digest of 96 bits is desired, the SHA-256
hash function could be used (e.g., because it is available to the application, and
provides an output larger than 96 bits). The leftmost 96 bits of the 256-bit
message digest generated by SHA-256 are selected as the truncated message
digest, and the rightmost 160 bits of the message digest are discarded.
3. If collision resistance is required, λ shall be at least twice the required collision
resistance strength s (in bits) for the truncated message digest (i.e., λ ≥ 2s).

即需要的hash值长度为\(λ\)，应选择目前安全的hash算法且其\(f(x)\) hash值长度L应满足\(L>λ\)，然后选择\(f(x)\)摘要的最左边\(λ\) bits作为所需的消息摘要即可，但如需要抗hash碰撞性，\(λ\)应该为抗碰撞的\(s\) bit两倍。

参考

SP 800-107 Rev. 1- Recommendation for Applications Using Approved Hash Algorithms