captcha: add wp

2020-11-07 10:19:50 +08:00
parent a6ff5b9ea5
commit 540ec71c84
2 changed files with 136 additions and 0 deletions
@@ -1 +1,72 @@
 # 证验码
+
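+First, let's go over how the challenge is set up:
+
+- In normal mode, the captcha is displayed normally but the hint messages are shuffled; submitting a correct captcha returns a shuffled flag.
+- In shuffle mode, the captcha image is shuffled; whatever you submit, the hint messages come back in normal order.
+- The answer is submitted as a count for each character, so the order of the characters does not matter.
+- According to the attached script, everything used to generate the captcha is known: font, character set, length, and so on.
+- The captcha has 16 characters and 10 random colored noise lines.
+- Every random generator used in the attachment is SystemRandom.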
+
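+From the challenge description and the setup above, we can infer:
+
+- In shuffle mode, a correct captcha yields the flag in normal order.
+- We only need to recover which characters appear, not their order (which probably cannot be recovered anyway).
+- There is no need to consider predicting the random numbers.
+
+So the task is: recover the count of each character from the shuffled captcha image.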
+
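+Looking at the text rendered in the captcha, the glyph pixels are not just pure black and pure white: they fade from black to white at the edges, so a pixel can take any of 256 gray values. Ignoring pure white leaves 255 values.
+
+Let pix(x) be the histogram of pixel values in the image rendered for a character (or string) x, ignoring white, so its dimension is 255. The approach relies on shuffling only rearranging pixels, which leaves this histogram unchanged. We then have the following identity:
+
+forall x, y, pix(x . y) = pix(x) + pix(y),
+
+where `.` denotes string concatenation. For a captcha this gives
+
+pix(captcha) = sum(pix(c) for c in captcha) = sum(n(captcha, x) * pix(x) for x in alphabet),
+
+where captcha is the captcha string, alphabet is the character set (62 characters), and n(captcha, x) is the number of times character x occurs in the captcha.
+
+Computing pix() for every character in the alphabet gives a font pixel matrix A of shape (62, 255), where a_ij is the number of pixels with value j in the image of the i-th character. Computing pix() over the whole shuffled captcha gives a vector b of dimension 255. The unknown is the vector of character counts, x = (n(captcha, 'a'), n(captcha, 'b'), ...), of dimension 62. Since the rows of A are indexed by character, x satisfies
+
+A^T x = b.
+
+Well. That is just a linear system, so we simply solve it.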
+
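+## Handling the noise
+
+Because of the noise, the equation above no longer holds exactly, so exact solvers cannot be applied directly. When computing pix(captcha) we first discard every colored pixel. The colored noise covers part of the glyphs, so the observed b is slightly smaller than the true b_true, and the equation becomes
+
+A^T x = b_true = b + noise,
+
+where noise is a non-negative vector.
+
+## Solving the linear system
+
+The noise is unknown (it is randomly generated) but small, so we rearrange the equation and minimize the noise with an optimization method: x* = argmin_x(noise^2) = argmin_x((A^T x - b)^2).
+
+There are many ways to solve this; one reference approach is given below.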
+
+```python
+# char-pix matrix, shape: (62, 255)
+A = np.array([count_pix(img_generate(c)) for c in alphabet])
+
+# pix sum vector, shape: (255, )
+b = count_pix(img)
+
+# Solve A^T.x = b using the least-squares method
+xf, *_ = np.linalg.lstsq(A.T, b, rcond=None)
+
+# count vector, shape: (62, )  (np.int was removed in NumPy 1.24, so use int)
+x = xf.round().astype(int).tolist()
+```
+
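+## Other approaches
+
+This challenge can also be solved with linear regression, a neural network, or similar models that directly fit the mapping pix(captcha) -> x; the accuracy is high enough to pass.
+
+## Full program
+
+See [payload.py](payload.py).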
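
The least-squares step described in the write-up can be sanity-checked without the challenge server. The sketch below is only an illustration on synthetic data: the matrix `A` is a random stand-in for the real font matrix (the real one comes from `img_generate` in the attachment), and the noise model (a few pixels lost at random gray levels) is an assumption, not the challenge's actual noise.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CHARS, N_LEVELS, CAPTCHA_LEN = 62, 255, 16

# Hypothetical stand-in for the font matrix A: row i plays the role of pix(char i).
A = rng.integers(0, 30, size=(N_CHARS, N_LEVELS)).astype(float)

# Ground-truth character counts of a random 16-character captcha.
x_true = np.bincount(rng.integers(0, N_CHARS, size=CAPTCHA_LEN), minlength=N_CHARS)

# Additivity of pix(): the captcha histogram is the count-weighted sum of rows of A.
b_true = A.T @ x_true

# Assumed noise model: colored strokes hide a few glyph pixels,
# so the observed histogram b is slightly smaller than b_true.
noise = (rng.random(N_LEVELS) < 0.1).astype(float)
b = np.maximum(b_true - noise, 0.0)

# Least-squares solution of A^T x = b, rounded to integer counts.
x_float, *_ = np.linalg.lstsq(A.T, b, rcond=None)
x_rec = np.rint(x_float).astype(int)

print("all counts recovered:", np.array_equal(x_rec, x_true))
```

A random matrix is much better conditioned than real glyph histograms are likely to be, so this only shows that the method is sound in principle; how well rounding works on the real font depends on how distinct the glyph histograms are.
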
@@ -0,0 +1,65 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# filename: captcha_payload.py
+
+# STEP 0: pip3 install requests numpy matplotlib Pillow
+# STEP 1: run `wget https://github.com/adobe-fonts/source-code-pro/raw/release/TTF/SourceCodePro-Light.ttf`
+# STEP 2: download attachment `shuffle.py`
+# STEP 3: fill in your `TOKEN`
+# STEP 4: run `python3 captcha_payload.py`
+
+BASE = "http://202.38.93.111:10150/"
+TOKEN = "123:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
+
+import requests
+from io import BytesIO
+from shuffle import *  # the attachment is relied on for np, Image, alphabet and img_generate
+
+
+def solve(img):
+    
+    def count_pix(img):
+        pix = np.array(img)
+        x, y, z = pix.shape
+
+        cnt = np.zeros(256, dtype=int)  # np.int was removed in NumPy 1.24
+
+        for i in range(x):
+            for j in range(y):
+                if len(set(pix[i, j])) == 1:
+                    # all channels equal: a gray pixel, so not colored noise
+                    cnt[pix[i, j, 0]] += 1
+
+        # ignore pure white
+        return cnt[:-1]
+    
+    # char-pix matrix, shape: (62, 255)
+    A = np.array([count_pix(img_generate(c)) for c in alphabet])
+    
+    # pix sum vector, shape: (255, )
+    b = count_pix(img)
+    
+    # Solve A^T.x = b using the least-squares method
+    xf, *_ = np.linalg.lstsq(A.T, b, rcond=None)
+    
+    # count vector, shape: (62, )
+    x = xf.round().astype(int).tolist()
+    
+    return x
+
+
+s = requests.Session()
+
+r = s.get(BASE, params={'token': TOKEN}) # pass token
+r = s.get(BASE + "captcha_shuffled.bmp") # request shuffled captcha
+img = Image.open(BytesIO(r.content))     # load image data
+
+x = solve(img)
+
+# construct url parameters
+url = "result?"
+for idx, n in enumerate(x):
+    url += "&r_{c}={n}".format(c=alphabet[idx], n=n)
+
+r = s.get(BASE + url)
+print(r.text)
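
The per-pixel Python loop in `count_pix` can be slow on larger images. As a possible alternative, here is a vectorized sketch that should compute the same histogram; the name `count_pix_fast` is hypothetical, and it assumes the image converts to an array of shape (height, width, channels) with 8-bit values, as the original loop does.

```python
import numpy as np


def count_pix_fast(img):
    """Histogram of gray pixel values (0..254), skipping colored pixels and pure white."""
    pix = np.asarray(img)
    # A pixel is gray when every channel equals the first channel.
    gray = (pix == pix[..., :1]).all(axis=-1)
    # Count how often each gray level occurs among the gray pixels.
    cnt = np.bincount(pix[..., 0][gray], minlength=256)
    # Drop the last bin (255, pure white), as in the original count_pix.
    return cnt[:-1]


# Tiny synthetic 2x3 RGB image: black, white, two grays of value 10, and two colored pixels.
demo = np.array(
    [
        [[0, 0, 0], [255, 255, 255], [10, 10, 10]],
        [[10, 10, 10], [255, 0, 0], [200, 200, 201]],
    ],
    dtype=np.uint8,
)
hist = count_pix_fast(demo)
print(hist[0], hist[10], hist.sum())
```

Because a PIL image can be passed to `np.asarray` directly, this function could be swapped in for `count_pix` inside `solve` without other changes, under the assumptions stated above.
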