爬虫之路: 字体文件反爬二(动态字体文件)

上一篇解决了但页面的字体反爬, 这篇记录下如何解决动态字体文件, 编码不同, 文字顺序不同的情况

源码在最后

冷静分析页面

打开一个页面, 发现字体文件地址是动态的, 这个倒是好说, 写个正则, 就可以动态匹配出来

先下载下来一个新页面的字体文件, 做一下对比, 如图

爬虫之路: 字体文件反爬二(动态字体文件)

mmp, 发现编码, 字体顺序那那都不一样, 这可就过分了, 心里一万个xxx在奔腾

头脑风暴ing.gif

(与伙伴对话ing...)

不着急, 还是要冷静下来, 再想想哪里还有突破点

同一个页面的字体文件地址是动态的, 但是, 里面的字体编码和顺序是不会变的呀

可以使用某一个页面的字体文件做一个标准的字体映射表呀!

好像发现了新世界的大门, 可门还没开开, 就被自己堵死了, 就想 做出来映射表然后呢!(又要奔腾了)

想呀想呀想呀想, 最后叫上小伙伴一起想

突然就想到了, 虽然那么多不一样, 但是, 但是, 相同文字的坐标点相同呀! 突然又打开了大门

首先排除特别的文字的情况下, 只是在这个字体文件的情况下, 60%的字坐标点一样

那剩下的怎么办呢! 先不管了, 先把这60%给弄出来

提取60%的字体映射表

 制作标准编码文字映射表(某页字体文件为准)

def extract_ttf_file(self, file_name, get_word_map=True):
        _font = TTFont(file_name)
        uni_list = _font.getGlyphOrder()[1:]

        # 被替换的字体的列表
        word_list = [
            "坏", "少", "远", "大", "九", "左", "近", "呢", "十", "高", "着",
            "矮", "八", "二", "右", "是", "得", "的", "小", "短", "很", "一", "了",
            "地", "好", "多", "七", "不", "长", "低", "三", "五", "六", "下", "更",
            "和", "四", "上"
        ]
        utf_word_map = {}
        utf_coordinates_map = {}

        for index, uni_code in enumerate(uni_list):
            utf_word_map[uni_code] = word_list[index]
            utf_coordinates_map[uni_code] = list(_font[‘glyf‘][uni_code].coordinates)

        if get_word_map:
            return utf_word_map, utf_coordinates_map
        return utf_coordinates_map

    # self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name)

制作新标准编码映射表

下载要破解的字体文件, 并替换标准编码字体映射表

def replace_ttf_map(self):
        unicode_mlist_map = []
        new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
        for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
            coordinate_equal_list = []
            for new_unicode, new_coordinate in new_utf_coordinates_map.items():
                if len(new_coordinate) == len(local_coordinate):
                    coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate})

            if len(coordinate_equal_list) == 1:
                unicode_mlist_map.append(coordinate_equal_list[0])for unicode_dict in unicode_mlist_map:
            self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]

        print("new unicode map extract success\n", self.new_unicode_map)

会得到22个字体的映射表, 共38个:

{‘uniED23‘: ‘坏‘, ‘uniED75‘: ‘少‘, ‘uniEC18‘: ‘九‘, ‘uniECB0‘: ‘左‘, ‘uniECF7‘: ‘呢‘, ‘uniEC7A‘: ‘高‘, ‘uniECA7‘: ‘矮‘, ‘uniEDD6‘: ‘八‘, ‘uniEDA0‘: ‘二‘, ‘uniEDF2‘: ‘右‘, ‘uniEC96‘: ‘的‘, ‘uniEDF0‘: ‘小‘, ‘uniEC8B‘: ‘短‘, ‘uniECDB‘: ‘一‘, ‘uniEDD5‘: ‘地‘, ‘uniED8F‘: ‘好‘, ‘uniECC1‘: ‘多‘, ‘uniEDBA‘: ‘七‘, ‘uniEC2A‘: ‘长‘, ‘uniED59‘: ‘下‘, ‘uniEC94‘: ‘和‘, ‘uniED73‘: ‘四‘}

替换40%的字体映射表

重组新标准映射表

接下来, 就用坐标点来解决, 以下为思路

使用两点坐标差来判断, 但是这个偏差值拿不准

相同文字, 坐标点几乎一致, 即所有坐标点相差的绝对值的和最小的就为同一个字

来先试试

def get_distence(self, norm_coordinate, new_coordinate):
        distance_total = 0
        for index, coordinate_point in enumerate(norm_coordinate):
            distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1])
        return distance_total

然后在重组标准编码, 标准坐标, 新的编码, 和新坐标

(这是想, 找出最相近的坐标, 使用新坐标提取出标准编码, 然后用标准编码提取对应的文字, 在替换成使用本页用的编码映射表)

# 准备替换的编码坐标映射表
{"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}

提取所有坐标点加起来最小的元素

def handle_subtraction(self, coordinate_equal_list):
        coordinate_min_list = []
        for coordinate_equal in coordinate_equal_list:
            n = self.get_distence(coordinate_equal.get(‘norm_coordinate‘), coordinate_equal.get(‘new_coordinate‘))
            coordinate_min_list.append(n)

        return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))]

替换, 生成新的标准映射表

self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]

加入判断

在以上替换60%的字体映射表再加入一个判断, 改成如下

def replace_ttf_map(self):
        unicode_mlist_map = []
        new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
        for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
            coordinate_equal_list = []
            for new_unicode, new_coordinate in new_utf_coordinates_map.items():
                if len(new_coordinate) == len(local_coordinate):
                    coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate})

            if len(coordinate_equal_list) == 1:
                unicode_mlist_map.append(coordinate_equal_list[0])
            elif len(coordinate_equal_list) > 1:
                min_word = self.handle_subtraction(coordinate_equal_list)
                unicode_mlist_map.append(min_word)
        for unicode_dict in unicode_mlist_map:
            self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]

        print("new unicode map extract success\n", self.new_unicode_map)

输出一个标准的坐标值, 这里我就不上图进行对比了, 经过对比, 发现没什么问题

{‘uniED23‘: ‘坏‘, ‘uniED75‘: ‘少‘, ‘uniEDC5‘: ‘远‘, ‘uniED5A‘: ‘大‘, ‘uniEC18‘: ‘九‘, ‘uniECB0‘: ‘左‘, ‘uniECA5‘: ‘近‘, ‘uniECF7‘: ‘呢‘, ‘uniED3F‘: ‘十‘, ‘uniEC7A‘: ‘高‘, ‘uniEC44‘: ‘着‘, ‘uniECA7‘: ‘矮‘, ‘uniEDD6‘: ‘八‘, ‘uniEDA0‘: ‘二‘, ‘uniEDF2‘: ‘右‘, ‘uniED09‘: ‘是‘, ‘uniEC32‘: ‘得‘, ‘uniEC96‘: ‘的‘, ‘uniEDF0‘: ‘小‘, ‘uniEC8B‘: ‘短‘, ‘uniED3D‘: ‘很‘, ‘uniECDB‘: ‘一‘, ‘uniEC60‘: ‘了‘, ‘uniEDD5‘: ‘地‘, ‘uniED8F‘: ‘好‘, ‘uniECC1‘: ‘多‘, ‘uniEDBA‘: ‘七‘, ‘uniED2D‘: ‘不‘, ‘uniEC2A‘: ‘长‘, ‘uniED11‘: ‘低‘, ‘uniEC5E‘: ‘三‘, ‘uniECDD‘: ‘五‘, ‘uniEDBC‘: ‘六‘, ‘uniED59‘: ‘下‘, ‘uniEE02‘: ‘更‘, ‘uniEC94‘: ‘和‘, ‘uniED73‘: ‘四‘, ‘uniED6A‘: ‘上‘}

补充

如有以上有错误, 恳求大神指出

其余的就和上篇的一致了, 点击这里查看

源码

# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date:   2020-01-10 14:51:53
# @Last Modified by:   Mehaei
# @Last Modified time: 2020-01-13 10:10:13
import re
import os
import requests
from lxml import etree
from fontTools.ttLib import TTFont


class NotFoundFontFileUrl(Exception):
    pass


class CarHomeFont(object):
    def __init__(self, url, *args, **kwargs):
        self.local_ttf_name = "norm_font.ttf"
        self.download_ttf_name = ‘new_font.ttf‘
        self.new_unicode_map = {}
        self._making_local_code_map()
        self._download_ttf_file(url, self.download_ttf_name)

    def _download_ttf_file(self, url, file_name):
        self.page_html = self.download(url) or ""
        # 获取字体的连接文件
        font_file_name = (re.findall(r",url\(‘(//.*\.ttf)?‘\) format", self.page_html) or [""])[0]
        if not font_file_name:
            raise NotFoundFontFileUrl("not found font file name")
        # 下载字体文件
        file_content = self.download("https:%s" % font_file_name, content=True)
        # 讲字体文件保存到本地
        with open(file_name, ‘wb‘) as f:
            f.write(file_content)
        print("font file download success")

    def _making_local_code_map(self):
        if not os.path.exists(self.local_ttf_name):
            # 这个url为标准字体文件地址, 如要更改, 请手动更改字体列表
            url = "https://club.autohome.com.cn/bbs/thread/62c48ae0f0ae73ef/75904283-1.html"
            self._download_ttf_file(url, self.local_ttf_name)
        self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name)
        print("local ttf load done")

    def get_distence(self, norm_coordinate, new_coordinate):
        distance_total = 0
        for index, coordinate_point in enumerate(norm_coordinate):
            distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1])
        return distance_total

    def handle_subtraction(self, coordinate_equal_list):
        coordinate_min_list = []
        for coordinate_equal in coordinate_equal_list:
            n = self.get_distence(coordinate_equal.get(‘norm_coordinate‘), coordinate_equal.get(‘new_coordinate‘))
            coordinate_min_list.append(n)

        return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))]

    def replace_ttf_map(self):
        unicode_mlist_map = []
        new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
        for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
            coordinate_equal_list = []
            for new_unicode, new_coordinate in new_utf_coordinates_map.items():
                if len(new_coordinate) == len(local_coordinate):
                    coordinate_equal_list.append({"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate})

            if len(coordinate_equal_list) == 1:
                unicode_mlist_map.append(coordinate_equal_list[0])
            elif len(coordinate_equal_list) > 1:
                min_word = self.handle_subtraction(coordinate_equal_list)
                unicode_mlist_map.append(min_word)
        for unicode_dict in unicode_mlist_map:
            self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]

        print("new unicode map extract success\n", self.new_unicode_map)

    def extract_ttf_file(self, file_name, get_word_map=True):
        _font = TTFont(file_name)
        uni_list = _font.getGlyphOrder()[1:]

        # 被替换的字体的列表
        word_list = [
            "坏", "少", "远", "大", "九", "左", "近", "呢", "十", "高", "着",
            "矮", "八", "二", "右", "是", "得", "的", "小", "短", "很", "一", "了",
            "地", "好", "多", "七", "不", "长", "低", "三", "五", "六", "下", "更",
            "和", "四", "上"
        ]
        utf_word_map = {}
        utf_coordinates_map = {}

        for index, uni_code in enumerate(uni_list):
            utf_word_map[uni_code] = word_list[index]
            utf_coordinates_map[uni_code] = list(_font[‘glyf‘][uni_code].coordinates)

        if get_word_map:
            return utf_word_map, utf_coordinates_map
        return utf_coordinates_map

    def repalce_source_code(self):
        replaced_html = self.page_html
        for utf_code, word in self.new_unicode_map.items():
            replaced_html = replaced_html.replace("&#x%s;" % utf_code[3:].lower(), word)
        return replaced_html

    def get_subject_content(self):
        normal_html = self.repalce_source_code()
        # 使用xpath 获取 主贴
        xp_html = etree.HTML(normal_html)
        subject_text = ‘‘.join(xp_html.xpath(‘//div[@xname="content"]//div[@class="tz-paragraph"]//text()‘))
        return subject_text

    def download(self, url, *args, try_time=5, method="GET", content=False, **kwargs):
        kwargs.setdefault("headers", {})
        kwargs["headers"].update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"})
        while try_time:
            try:
                response = requests.request(method.upper(), url, *args, **kwargs)
                if response.ok:
                    if content:
                        return response.content
                    return response.text
                else:
                    continue
            except Exception as e:
                try_time -= 1
                print("download error: %s" % e)


if __name__ == "__main__":
    url = "https://club.autohome.com.cn/bbs/thread/34d6bcc159b717a9/85794510-1.html#pvareaid=6830286"
    car = CarHomeFont(url)
    car.replace_ttf_map()
    text = car.get_subject_content()
    print(text)

相关推荐