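The Deterministic Finite Automaton (DFA) algorithm. Its basic idea is to find sensitive words by following state transitions: the text under inspection is scanned only once, and that single scan checks for every sensitive word at the same time instead of searching for each word separately.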
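
To make the state-transition idea concrete, here is a minimal standalone sketch (Python 3, no Telegram dependency). The names `build_trie`, `mask`, and `END` are illustrative assumptions and are not part of the bot code in this post. Like that code, it lowercases the text and masks the shortest keyword that matches at each position. Note that after a failed partial match it restarts from the next character, so some characters may be looked at more than once.

```python
END = '\x00'  # hypothetical end-of-keyword marker, same role as `delimit` in the post


def build_trie(keywords):
    """Build nested dicts: each level maps one character to the next state."""
    root = {}
    for word in keywords:
        if not word:
            continue
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True  # a keyword ends at this state
    return root


def mask(text, trie, rep='*'):
    """Replace every keyword found in `text` with `rep` characters."""
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        node = trie
        j = i
        matched = 0
        # Follow transitions for as long as the next character is a valid edge.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                matched = j - i  # stop at the shortest keyword
                break
        if matched:
            out.append(rep * matched)
            i += matched
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)


trie = build_trie(['bad', 'worse'])
print(mask('This is BAD, even worse!', trie))
```

The nested-dict layout means the cost of checking one position depends on the length of the longest keyword, not on how many keywords are loaded.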

 Telegram group: https://t.me/joinchat/FZtb9xTePTsfNr8Wi44JPg
# -*- coding: utf-8 -*-
# telegram_bot_dfafilter.py

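import io
import json
import logging
from telegram.ext import MessageHandler
from telegram.ext import Filters

# Only messages from this chat are filtered.
g_dfa_filter_chat_id = -350108987


class DFAFilter(object):
    def __init__(self):
        self.keyword_chains = {}  # nested dicts, one level per character
        self.delimit = '\x00'     # marks the end of a keyword

    def parse(self, path):
        # The word list is a UTF-8 text file with keywords separated by '@'.
        # io.open decodes to text on both Python 2 and Python 3.
        with io.open(path, encoding='utf-8') as f:
            content = f.read()
            keywords = content.lower().strip().split('@')
            for item in keywords:
                self.add(item)
        logging.info(json.dumps(self.keyword_chains, ensure_ascii=False))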

    def add(self, kw):
        if not kw:
            return
        level = self.keyword_chains
        for i in range(len(kw)):
            if kw[i] in level:
                level = level[kw[i]]
            else:
                if not isinstance(level, dict):
                    break
                for j in range(i, len(kw)):
                    level[kw[j]] = {}
                    last_level, last_char = level, kw[j]
                    level = level[kw[j]]
                last_level[last_char] = {self.delimit: 0}
                break
            if i == len(kw) - 1:
                level[self.delimit] = 0

    def filter(self, message, rep='*'):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            is_handled = False
            for char in message[start:]:
                # logging.info("start : %s char : %s " % (message[start], char))
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        ret.append(rep * step_ins)
                        is_handled = True
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    is_handled = True
                    # logging.info("append start : %s char : %s " % (message[start], char))
                    break
            if not is_handled:
                ret.append(message[start])
            start += 1
        return ''.join(ret)


def dfa_filter(bot, update):
    if update.message.chat_id == g_dfa_filter_chat_id:
        g_dfa_filter = DFAFilter()
        g_dfa_filter.parse("sensitive_words.txt")
        result = g_dfa_filter.filter(update.message.text)
        bot.send_message(chat_id=update.message.chat_id, text=result)
    else:
        logging.info("bot name : %s chat_id:%s text:%s" % (bot.name, update.message.chat_id, update.message.text))


def start_dfa_filter(dispatcher):
    logging.info('start_dfa_filter')

    dfa_filter_handler = MessageHandler(Filters.text, dfa_filter)
    dispatcher.add_handler(dfa_filter_handler)

# -*- coding: utf-8 -*-
# telegram_bot_jobs.py

import logging
from telegram.ext import Updater
from datetime import datetime

import telegram_bot_dfafilter


def error(bot, update, err):
    logging.error('bot : %s update : %s error : %s' % (bot, update, err))


def echo(bot, update):
    logging.info("bot : %s chat_id:%d text:%s" % (bot, update.message.chat_id, update.message.text))


def start_bot():
    try:
        start_time = datetime.now()  # record the start time
        logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)

        # 温顺的莱昂
        updater = Updater(token='739344882:AAF_BMyjc7S45nado1dK5E6sMt-0jYH5VMA')
        dp = updater.dispatcher
        dp.add_error_handler(error)

        # #5 Telegram bot: DFA sensitive-word filtering
        telegram_bot_dfafilter.start_dfa_filter(dp)

        updater.start_polling()
        logging.info('telegram bot updater start_polling...')
        updater.idle()

        end_time = datetime.now()  # record the end time
        logging.info('Elapsed time: %f seconds' % (end_time - start_time).total_seconds())
    except BaseException as start_bot_ex:
        logging.error("telegram_bot_jobs ex : %s" % start_bot_ex)
