网页长URL分享时,对用户不友好,同时给同一个长URL生成多个短URL,每个短URL对应一个引流渠道,提升用户体验的同时,便于商业数据分析。
为长URL生成不重复的短URL,使用62进制编码,使用0-9, A-Z, a-Z表示0-61。为保持短URL的简洁性,假定短URL长度为6,此时可以生成URL。虽然可支持的URL数据已经足够多,但考虑到长URL中可以包含查询字符串和路径参数,导致长URL本身可变范围较大,需要的短URL规模也相应地增大,同时短URL具有一定的时效性,所以短URL需要支持过期回收重用。
长URL 利用 MD5 或者 SHA256 等单项散列算法,得到 128bit 或者 256bit 的 Hash 值。然后对该 Hash 值进行 Base64 编码,得到 22 个或者 43 个 Base64 字符,再根据某种规则得到 6 个字符作为短URL 。
该方式可能会发生 Hash 冲突,即不同的长 URL,计算得到相同的短URL。对此需要事先校验该短URL 是否已经映射为其他的长 URL,如果是,则需要重新计算得到新的短 URL, 再重复上述查重过程,直至得到不重复的短URL。该方法需要在生成短URL时,实时查重验证并重试,存在性能瓶颈。
离线预先生成全部可选的6位短URL,获取短URL时直接返回预先生成的缓存结果。假设需要生成n=20000000000个短URL,可以通过生成200亿个不重复的1到q=40000000000范围内的随机数,并将其转换为62进制编码,得到200亿个短URL,同时需要保证生成的随机数序列乱序,防止短URL被外界预测。离线生成短URL可以有下面两种方式:
每次生成一个[1,p]之间的随机数r,并通过布隆过滤器查询r是否已经生成过,如若未被生成过,r为有效随机数,转换为62进制编码得到短url,并将r加入布隆过滤器;如若布隆过滤器反馈r已经存在,则尝试重新生成随机数,重复上述过程,直到生成n个不重复短URL。
1public List<String> bloomFilterApproach() {2 // 布隆过滤器的预期插入量为n,误报率为p3 BloomFilter<Long> bloomFilter = BloomFilter.create(Funnels.longFunnel(), n, p);4 Random random = new Random();5 List<String> urls = new LinkedList<>();6 int count = 0;7 while (count < n) {8 // [1,q]间随机数9 long r = (long) (random.nextDouble() * q + 1);10 // 判重11 if (!bloomFilter.mightContain(r)) {12 bloomFilter.put(r);13 String url = long2Base62String(r);14 urls.add(url);15 count++;16 }17 }18 return urls;19}该方法生成的随机数间完全独立,保证了生成随机序列{r}的无序性,保证了随机数在[1,p]间均匀分布;同时利用布隆过滤器保证了随机序列中无重复值。
时间与空间复杂度
在生成第i(i从0开始)个随机数时,生成的随机数在之前未出现过的概率是i个不重复随机数的尝试次数的期望为n个不重复随机数一共需要尝试的次数是
对于布隆过滤器,假设其误判概率为p,插入数据规模为n,用于记录数据的比特数组长度为m,hash函数个数为k。则在插入n个数据后,比特数组的任一下标处仍然为0的概率为
此时判断某个数x是否存在时,如果产生误判,即x经过k个hash函数指向的k位置都被置为1的概率为
将p随着k的增大,先减小再增大。使得p最小的
需要的比特数组长度
每次判断随机数是否存在以及将随机数加入布隆过滤器都需要进行k次哈希运算,所以总的时间复杂度为
时间复杂度为
空间复杂度与比特数组长度相关, 空间复杂度为
在[1, p]间生成n个不重复的随机数,如果将得到的随机数序列{x}进行排序,则两相邻随机数间间隔的期望为
--> 首先定义第一个随机数URL,
-->生成URL,
-->生成URL,
以此类推,直到生成全部的随机数序列
x1urls = []2curr = "000001"3for i from 0 up to n-1:4 urls.append(curr)5 r ← random integer such that 1 ≤ j < 2 * p/n 6 curr = base62_plus(curr, r)7
对于随机递增值[1, p]内均匀分布。并且通过递增方式生成的序列保证了随机数间无重复,不需要判重处理。
最后使用Fisher-Yates shuffle1洗牌算法将序列
xxxxxxxxxx31for i from n−1 down to 1 do2 j ← random integer such that 0 ≤ j ≤ i3 exchange a[j] and a[i]
注意:由于Fisher-Yates shuffle要求集合支持随机下标访问,所以需要使用数组保存数据,但是由于数据量超过数组的长度限制,所以需要将随机序列Fisher-Yates shuffle也需要做出变化。
xxxxxxxxxx471
2private List<List<String>> randomIncreaseApproach() {3 // 每段数据长度4 int len = Integer.MAX_VALUE / 1024;5 // 数据段数6 int m = (int) Math.ceil(1.0 * n / len);7 List<List<String>> urls = new ArrayList<>(m);8 for (int i = 0; i < m; ++i) {9 urls.add(new ArrayList<>(len));10 }11 Random random = new Random();12 // 初始值13 StringBuilder builder = new StringBuilder();14 for (int i = 0; i < l - 1; i++) {15 builder.append('0');16 }17 builder.append('1');18 String curr = builder.toString();19
20 for (long i = 0; i < n; i++) {21 urls.get((int) (i / len)).add(curr);22 // [1,2b] 间随机数23 int r = 1 + random.nextInt(2 * b - 1);24 // 62进制加法25 curr = base62Plus(curr, r);26 }27 // 打乱顺序28 fisherYatesShuffle(urls, random, n, len);29 return urls;30}31
32public static void fisherYatesShuffle(List<List<String>> urls, Random random, long n, int l) {33 random.nextLong();34 for (long i = n - 1; i > 0; i--) {35 // [0,i] 间随机数36 long j = (long) (random.nextDouble() * (i + 1));37 // 确定i和j分别属于哪一段38 int segmentI = (int) (i / l);39 int indexI = (int) (i % l);40 int segmentJ = (int) (j / l);41 int indexJ = (int) (j % l);42 // 交换元素43 String temp = urls.get(segmentI).get(indexI);44 urls.get(segmentI).set(indexI, urls.get(segmentJ).get(indexJ));45 urls.get(segmentJ).set(indexJ, temp);46 }47}完整代码将附录小结。
时间复杂度:
空间复杂度:
时间:递增序列+随机乱序方法耗时约为重复随机生成+布隆过滤器去重方法的四分之一。
空间:递增序列+随机乱序方法无需额外空间,重复随机生成+布隆过滤器去重方法需要额外
分布是否均匀:使用卡方检验2检测随机序列分布是否在[0, p]内均匀分布。重复随机生成+布隆过滤器去重方法的p-value为0.96;递增序列+随机乱序方法的p-value为1.0,均满足随机分布假设。
分布顺序是否随机:使用排列测试3检测随机序内部数据顺序是否随机。重复随机生成+布隆过滤器去重方法和递增序列+随机乱序方法的p-value都为1.0,均满足随机排列假设。
xxxxxxxxxx1821import com.google.common.hash.BloomFilter;2import com.google.common.hash.Funnels;3
4import java.io.BufferedWriter;5import java.io.FileWriter;6import java.io.IOException;7import java.util.*;8
9public class RandomGenerator {10 // 短url长度11 private static final int l = 3;12 // 数据范围[1, q]13 private static final long q = (long) Math.pow(62, l);14 // 预期插入量15 private static final long n = q / 10;16 // 误报率17 private static final double p = 0.01;18 // 随机数间间隔期望19 private static final int b = (int) (q / n);20
21
22 private static final Map<Character, Integer> CHAR_TO_INT_MAP = new HashMap<>();23 private static final Map<Integer, Character> INT_TO_CHAR_MAP = new HashMap<>();24
25 static {26 String base62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";27 for (int i = 0; i < base62.length(); i++) {28 char c = base62.charAt(i);29 CHAR_TO_INT_MAP.put(c, i);30 INT_TO_CHAR_MAP.put(i, c);31 }32 }33
34 public static void main(String[] args) throws IOException {35 RandomGenerator generator = new RandomGenerator();36 // warm up+jit37 for (int i = 0; i < 32; ++i) {38 generator.bloomFilterApproach();39 generator.randomIncreaseApproach();40 }41
42 long totalTime = 0;43 int iterations = 32;44 List<String> bloomFilterUrls = null;45 for (int i = 0; i < iterations; i++) {46 System.gc();47 long startTime = System.nanoTime();48 bloomFilterUrls = generator.bloomFilterApproach();49 long endTime = System.nanoTime();50 totalTime += (endTime - startTime);51 }52 long averageTime = totalTime / iterations;53 System.out.println("bf:" + averageTime / 1000 + "ms");54
55 // write to .txt56 try (BufferedWriter writer = new BufferedWriter(new FileWriter("./out/bloomFilterUrls.txt"))) {57 for (String url : bloomFilterUrls) {58 writer.write(url);59 writer.newLine();60 }61 }62
63 totalTime = 0;64 List<List<String>> randomIncreaseUrls = null;65 for (int i = 0; i < iterations; i++) {66 System.gc();67 long startTime = System.nanoTime();68 randomIncreaseUrls = generator.randomIncreaseApproach();69 long endTime = System.nanoTime();70 totalTime += (endTime - startTime);71 }72 averageTime = totalTime / iterations;73 System.out.println("ri:" + averageTime / 1000 + "ms");74
75
76 // write to .txt77 try (BufferedWriter writer = new BufferedWriter(new FileWriter("./out/randomIncreaseUrls.txt"))) {78 for (List<String> urls : randomIncreaseUrls) {79 for (String url : urls) {80 writer.write(url);81 writer.newLine();82 }83 }84 }85 }86
87 public List<String> bloomFilterApproach() {88 // 布隆过滤器的预期插入量为n,误报率为p89 BloomFilter<Long> bloomFilter = BloomFilter.create(Funnels.longFunnel(), n, p);90 Random random = new Random();91 List<String> urls = new LinkedList<>();92 int count = 0;93 while (count < n) {94 // [1,q]间随机数95 long r = (long) (random.nextDouble() * q + 1);96 // 判重97 if (!bloomFilter.mightContain(r)) {98 bloomFilter.put(r);99 String url = long2Base62String(r);100 urls.add(url);101 count++;102 }103 }104 return urls;105 }106
107 private static String long2Base62String(long value) {108 StringBuilder builder = new StringBuilder();109 for (int i = 0; i < l; i++) {110 builder.append('0');111 }112 for (int i = l - 1; i >= 0 && value > 0; i--) {113 builder.setCharAt(i, INT_TO_CHAR_MAP.get((int) (value % 62)));114 value /= 62;115 }116 return builder.toString();117 }118
119 private List<List<String>> randomIncreaseApproach() {120 // 每段数据长度121 int len = Integer.MAX_VALUE / 1024;122 // 数据段数123 int m = (int) Math.ceil(1.0 * n / len);124 List<List<String>> urls = new ArrayList<>(m);125 for (int i = 0; i < m; ++i) {126 urls.add(new ArrayList<>(len));127 }128 Random random = new Random();129 // 初始值130 StringBuilder builder = new StringBuilder();131 for (int i = 0; i < l - 1; i++) {132 builder.append('0');133 }134 builder.append('1');135 String curr = builder.toString();136
137 for (long i = 0; i < n; i++) {138 urls.get((int) (i / len)).add(curr);139 // [1,2b] 间随机数140 int r = 1 + random.nextInt(2 * b - 1);141 // 62进制加法142 curr = base62Plus(curr, r);143 }144 // 打乱顺序145 fisherYatesShuffle(urls, random, n, len);146 return urls;147 }148
149 public static String base62Plus(String curr, int r) {150 StringBuilder result = new StringBuilder(curr);151 // 初始化进位为r152 int carry = r;153 for (int i = l - 1; i >= 0 && carry > 0; i--) {154 // 计算当前位的和155 int sum = CHAR_TO_INT_MAP.get(curr.charAt(i)) + carry;156 // 更新当前位157 result.setCharAt(i, INT_TO_CHAR_MAP.get(sum % 62));158 // 更新进位159 carry = sum / 62;160 }161 return result.toString();162 }163
164 public static void fisherYatesShuffle(List<List<String>> urls, Random random, long n, int l) {165 random.nextLong();166 for (long i = n - 1; i > 0; i--) {167 // [0,i] 间随机数168 long j = (long) (random.nextDouble() * (i + 1));169 // 确定i和j分别属于哪一段170 int segmentI = (int) (i / l);171 int indexI = (int) (i % l);172 int segmentJ = (int) (j / l);173 int indexJ = (int) (j % l);174 // 交换元素175 String temp = urls.get(segmentI).get(indexI);176 urls.get(segmentI).set(indexI, urls.get(segmentJ).get(indexJ));177 urls.get(segmentJ).set(indexJ, temp);178 }179 }180}181
182
xxxxxxxxxx791from scipy.stats import rankdata2from scipy.stats import norm3import numpy as np4from scipy.stats import chisquare5import numpy as np6
7
8def chi_squared_test(sequence, m, bound):9 n = len(sequence)10 # 每个区间的期望出现次数11 expected = n / m 12
13 # 划分区间并计算每个区间的实际出现次数14 counts = np.histogram(sequence, bins=m, range=(0, bound))[0]15
16 chi_squared_stat, p_value = chisquare(counts, expected * np.ones(m))17
18 return chi_squared_stat, p_value19
20
21def permutation_test(sequence, num_permutations=10000):22 n = len(sequence)23 original_rank_sum = sum(rankdata(sequence))24
25 count = 026 for _ in range(num_permutations):27 permuted_sequence = np.random.permutation(sequence)28 permuted_rank_sum = sum(rankdata(permuted_sequence))29
30 if permuted_rank_sum >= original_rank_sum:31 count += 132
33 p_value = (count + 1) / (num_permutations + 1)34 return p_value35
36
37def base62_to_decimal(base62):38 decimal = 039 base = 6240 for i, char in enumerate(base62):41 if '0' <= char <= '9':42 value = ord(char) - ord('0')43 elif 'A' <= char <= 'Z':44 value = ord(char) - ord('A') + 1045 elif 'a' <= char <= 'z':46 value = ord(char) - ord('a') + 3647 else:48 raise ValueError(49 "Invalid character in base62 string: {}".format(char))50 decimal = decimal*base + value51 return decimal52
53
54if __name__ == '__main__':55 bound = 62**356 sequence = []57 with open('./out/bloomFilterUrls.txt', mode='r', encoding='utf8') as f:58 for line in f.readlines():59 sequence.append(base62_to_decimal(line[:-1]))60 m = 1024 61 chi_squared_stat, p = chi_squared_test(sequence, m, bound)62 # p较小(0.05、0.01 和 0.001),则表明序列的分布与均匀分布显著不同63 print(f"P-value: {p}")64
65 p = permutation_test(sequence)66 print(f"P-value: {p}")67
68 sequence = []69 with open('./out/randomIncreaseUrls.txt', mode='r', encoding='utf8') as f:70 for line in f.readlines():71 sequence.append(base62_to_decimal(line[:-1]))72 m = 1024 73 chi_squared_stat, p = chi_squared_test(sequence, m, bound)74 # p较小(0.05、0.01 和 0.001),则表明序列的分布与均匀分布显著不同75 print(f"P-value: {p}")76
77 p = permutation_test(sequence)78 print(f"P-value: {p}")79