最近在需求中遇到一个问题:在对hash值长度敏感的场景下如何选择合适的hash算法,生成的hash值要尽量短的同时要保证安全hash算法的安全侧基本要求。

如生成短链接目前常用的转换算法进制算法、随机数算法、hash算法三个方案中其中包含7/11位短Hash值的诉求,而目前常见的安全哈希算法的hash值基本为128/256bit。

hash算法概念

hash算法的解释用wikipedia上对TDF使用的一张图片可以做更好的解释 hash算法就是一个建立在压缩函数上的单向函数(One-way function),即根据输入x可以快速计算输出\(f(x)\),但是根据输出\(f(x)\)很难根据该值推算出对应的输入\(x\)

从hash算法的特性中可以提炼两个特征

  • 可以抵抗原像攻击:根据hash值\(f(x)\)很难获取输入\(x\)
  • 可以抵抗第二原像攻击(hash碰撞):给定\(x\),对于\(x\) != \(x\)',对于输出应遵循\(f(x)\) != \(f(x')\)

常见Hash算法

目前常用的hash算法主要包括MD5、SHA家族,简单的对比:

算法名称 输出散列值长度-bits 是否存在碰撞
CRC-32 32
MD5 128
SHA-1 160
SHA-256 256
SHA-512 512

hash算法抗碰撞性

抗碰撞性或在n长度下两个\(x\),\(x'\)经过hash算法获取输出\(f(x)\)=\(f(x')\)的概率是多少?这个问题其实也是生日问题(在>=23个人中至少有两个人生日相同的概率大于50%)的一种具体体现,即如果随机给定N个均小于K的非负整数,其中至少两个相等的概率为\(p(N|K)\),根据生日问题的数学论证方式分配0.5的碰撞概率,可以得到预期论证公示为 \(\approx\) \(1-e^{ {-k^2/(2N)} }\) - hash长度为16bit做hash碰撞计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
In [14]: import math
In [15]: N = math.pow(2, 16)
In [16]: for i in range(1, 11):
...: K = 50 * i
...: p = 1 - math.exp(-0.5 * K * (K -1) / N)
...: print("N:{}, K:{}, p:{}".format(N, K, p))
...:
N:65536.0, K:50, p:0.01851840426367346
N:65536.0, K:100, p:0.07274902012520568
N:65536.0, K:150, p:0.1567712182076978
N:65536.0, K:200, p:0.26188076496494694
N:65536.0, K:250, p:0.3780712916236951
N:65536.0, K:300, p:0.4955853795330586
N:65536.0, K:350, p:0.6062072573580228
N:65536.0, K:400, p:0.7040756309248388
N:65536.0, K:450, p:0.7859443315915255
N:65536.0, K:500, p:0.850959013777904
\(p(300|2^16)\) \(\approx\) 50%

  • hash长度32bit做hash碰撞计算

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    In [1]: import math
    In [2]: N = math.pow(2, 32)
    In [7]: for i in range(1, 11):
    ...: K = 20000 * i
    ...: p = 1 - math.exp(-0.5 * K * (K -1) / N)
    ...: print("N:{}, K:{}, p:{}".format(N, K, p))
    ...:
    N:4294967296.0, K:20000, p:0.04549633911131801
    N:4294967296.0, K:40000, p:0.16994213048124063
    N:4294967296.0, K:60000, p:0.3423537943601189
    N:4294967296.0, K:80000, p:0.5252888411157739
    N:4294967296.0, K:100000, p:0.687809461338744
    N:4294967296.0, K:120000, p:0.8129476331069347
    N:4294967296.0, K:140000, p:0.8978919803274777
    N:4294967296.0, K:160000, p:0.9492180149513215
    N:4294967296.0, K:180000, p:0.9769902093325205
    N:4294967296.0, K:200000, p:0.990501197966016
    \(p(77000|2^32)\) \(\approx\) 50%

  • hash长度为64bit做hash碰撞计算

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    In [23]: for i in range(1, 11):
    ...: K = math.pow(10, i)
    ...: p = 1 - math.exp(-0.5 * K * (K -1) / N)
    ...: print("N:{}, K:{}, p:{}".format(N, K, p))
    ...:
    N:1.8446744073709552e+19, K:10.0, p:0.0
    N:1.8446744073709552e+19, K:100.0, p:2.220446049250313e-16
    N:1.8446744073709552e+19, K:1000.0, p:2.708944180085382e-14
    N:1.8446744073709552e+19, K:10000.0, p:2.710276447714932e-12
    N:1.8446744073709552e+19, K:100000.0, p:2.710478508305414e-10
    N:1.8446744073709552e+19, K:1000000.0, p:2.7105026889628903e-08
    N:1.8446744073709552e+19, K:10000000.0, p:2.710501486702377e-06
    N:1.8446744073709552e+19, K:100000000.0, p:0.0002710138095318726
    N:1.8446744073709552e+19, K:1000000000.0, p:0.026741008873954697
    N:1.8446744073709552e+19, K:10000000000.0, p:0.9334968145485968
    \(p(5.1*10^9|2^32)\) \(\approx\) 50%

短hash值方案

由上面可知输出\(x\)经过hash算法输出的\(f(x)\)长度越长做相同\(x\)组合下产生hash碰撞的几率越低,但如文章一开始所说实际业务场景对\(f(x)\)hash长度敏感且要尽量减少\(f(x)\) = \(f(x')\)碰撞概率应该怎么选择合适且安全的Hash算法?以生成\(f(x)\)长度为32bit为目的反推合适的hash算法,自己认为可行的方案有三种:
- 选择CRC-32短hash算法。
- 选择目前相对安全的,且hash算法\(f(x)\) 长度为32*2n,n的长度不等于1且越小越好,如MD5 \(f(x)\)=128bit,(1)然后以64bit做分割 \(f(x)\) = <\(L_64\), \(R_64\)>, \(L_64\)\(R_64\)做异或运算后再做32bit截取后做异或运算生成目标32bit。
- 选择目前没有安全缺陷、hash碰撞低的安全Hash算法如sha-256生成256bit后截取前32bit分割\(f(x) = <L_32, R_224>\),直接获取\(L_32\)作为结果,\(R_224\)直接丢弃。

方案对比
- CRC-32在一定输入\(x\)组合下产生hash碰撞的几率非常高。
- 使用XOR可以对输入的bit做相对均匀的0、1分布,但是MD5算法目前为非安全Hash安全,且两次的异或运算是否会增加Hash碰撞几率。
- sha-256目前为安全的hash算法,其具有均匀分布性,截取前32bit做短hash值应该是相对可行的方案,但是如何证明可行性?

从NIST中找到一篇和hash值截取的文章1,其中提到

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Some applications may require a value that is shorter than the (full-length) message
digest provided by an approved hash function as specified in FIPS 180-4. In such cases, it
may be appropriate to use a subset of the bits produced by the hash function as the
(shortened) message digest.
Let the (shortened) message digest be called a truncated message digest, and let λ be its
desired length in bits. A truncated message digest may be used if the following
requirements are met:
1. The length of the output block of the approved hash function to be used shall be
greater than λ (i.e., L > λ).
2. The λ left-most bits of the full-length message digest shall be selected as the
truncated message digest.
For example, if a truncated message digest of 96 bits is desired, the SHA-256
hash function could be used (e.g., because it is available to the application, and
provides an output larger than 96 bits). The leftmost 96 bits of the 256-bit
message digest generated by SHA-256 are selected as the truncated message
digest, and the rightmost 160 bits of the message digest are discarded.
3. If collision resistance is required, λ shall be at least twice the required collision
resistance strength s (in bits) for the truncated message digest (i.e., λ ≥ 2s).
即需要的hash值长度为\(λ\),应选择目前安全的hash算法且其\(f(x)\) hash值长度L应满足\(L>λ\),然后选择\(f(x)\)摘要的最左边\(λ\) bits作为所需的消息摘要即可,但如需要抗hash碰撞性,\(λ\)应该为抗碰撞的\(s\) bit两倍。

参考