敏感词过滤算法

前言

在游戏设计中一个最基本的功能就是处理屏蔽字、敏感字,至于为什么这个需求这么重要?你懂的。在网上搜了很多资料,发现主要有三种实现方式:
对于输入的一句话message,

  • 1、循环使用所有的屏蔽词去message中查找,看是否存在,如果存在则判定为敏感词。
    优点:简单,只要学过几个月编程的都会;
    缺点:效率太低,而且效果不是太好;
  • 2、将有共同起始内容的屏蔽词分为一组,然后再使用方式1。
    优点:比方案1效率高一些;
    缺点:效率仍然很低,而且效果太差;
  • 3、使用DFA算法。
    优点:效率高,内存消耗低;
    缺点:与前两者相比,实现复杂。

综上:为了应对越来越多的敏感词,寻找一个高效率的敏感词过滤算法就摆在了各个程序员的面前。而DFA是所有方法里面效率最高的。
在网上看了很多现有的实现方式,觉得存在各种各样的问题,于是自己就决定来写一个。

数据结构

算法与数据结构是密不可分的。数据结构决定了在其之上的算法。DFA算法采用了Trie树这种数据结构来存储数据。
Trie树的详细介绍

算法实现

Golang Version

  • trieNode类型的定义
const (
    INIT_TRIE_CHILDREN_NUM = 128 // Since we need to deal all kinds of language, so we use 128 instead of 26
)

// trieNode data structure
// trieNode itself doesn't have any value. The value is represented on the path
type trieNode struct {
    // if this node is the end of a word
    isEndOfWord bool

    // the collection of children of this node
    children map[rune]*trieNode
}

// Create new trieNode
func newtrieNode() *trieNode {
    return &trieNode{
        isEndOfWord: false,
        children:    make(map[rune]*trieNode, INIT_TRIE_CHILDREN_NUM),
    }
}
  • 匹配时上、下界索引的定义
// Match index object
type matchIndex struct {
    start int // start index
    end   int // end index
}

// Construct from scratch
func newMatchIndex(start, end int) *matchIndex {
    return &matchIndex{
        start: start,
        end:   end,
    }
}

// Construct from existing match index object
func buildMatchIndex(obj *matchIndex) *matchIndex {
    return &matchIndex{
        start: obj.start,
        end:   obj.end,
    }
}
  • 核心算法实现
// dfa util
type DFAUtil struct {
    // The root node
    root *trieNode
}

func (this *DFAUtil) insertWord(word []rune) {
    currNode := this.root
    for _, c := range word {
        if cildNode, exist := currNode.children[c]; !exist {
            cildNode = newtrieNode()
            currNode.children[c] = cildNode
            currNode = cildNode
        } else {
            currNode = cildNode
        }
    }

    currNode.isEndOfWord = true
}

// Check if there is any word in the trie that starts with the given prefix.
func (this *DFAUtil) startsWith(prefix []rune) bool {
    currNode := this.root
    for _, c := range prefix {
        if cildNode, exist := currNode.children[c]; !exist {
            return false
        } else {
            currNode = cildNode
        }
    }

    return true
}

// Searc and make sure if a word is existed in the underlying trie.
func (this *DFAUtil) searcWord(word []rune) bool {
    currNode := this.root
    for _, c := range word {
        if cildNode, exist := currNode.children[c]; !exist {
            return false
        } else {
            currNode = cildNode
        }
    }

    return currNode.isEndOfWord
}

// Searc a whole sentence and get all the matcing words and their indices
// Return:
// A list of all the matc index object
func (this *DFAUtil) searcSentence(sentence string) (matchIndexList []*matchIndex) {
    start, end := 0, 1
    sentenceRuneList := []rune(sentence)

    // Iterate the sentence from the beginning to the end.
    startsWith := false
    for end <= len(sentenceRuneList) {
        // Check if a sensitive word starts with word range from [start:end)
        // We find the longest possible path
        // Then we check any sub word is the sensitive word from long to short
        if this.startsWith(sentenceRuneList[start:end]) {
            startsWith = true
            end += 1
        } else {
            if startsWith == true {
                // Check any sub word is the sensitive word from long to short
                for index := end - 1; index > start; index-- {
                    if this.searcWord(sentenceRuneList[start:index]) {
                        matchIndexList = append(matchIndexList, newMatchIndex(start, index-1))
                        break
                    }
                }
            }
            start, end = end-1, end+1
            startsWith = false
        }
    }

    // If finishing not because of unmatching, but reaching the end, we need to
    // check if the previous startsWith is true or not.
    // If it's true, we need to check if there is any candidate?
    if startsWith {
        for index := end - 1; index > start; index-- {
            if this.searcWord(sentenceRuneList[start:index]) {
                matchIndexList = append(matchIndexList, newMatchIndex(start, index-1))
                break
            }
        }
    }

    return
}

// Judge if input sentence contains some special caracter
// Return:
// Matc or not
func (this *DFAUtil) IsMatch(sentence string) bool {
    return len(this.searcSentence(sentence)) > 0
}

// Handle sentence. Use specified caracter to replace those sensitive caracters.
// input: Input sentence
// replaceCh: candidate
// Return:
// Sentence after manipulation
func (this *DFAUtil) HandleWord(sentence string, replaceCh rune) string {
    matchIndexList := this.searcSentence(sentence)
    if len(matchIndexList) == 0 {
        return sentence
    }

    // Manipulate
    sentenceList := []rune(sentence)
    for _, matchIndexObj := range matchIndexList {
        for index := matchIndexObj.start; index <= matchIndexObj.end; index++ {
            sentenceList[index] = replaceCh
        }
    }

    return string(sentenceList)
}

// Create new DfaUtil object
// wordList:word list
func NewDFAUtil(wordList []string) *DFAUtil {
    this := &DFAUtil{
        root: newtrieNode(),
    }

    for _, word := range wordList {
        wordRuneList := []rune(word)
        if len(wordRuneList) > 0 {
            this.insertWord(wordRuneList)
        }
    }

    return this
}
  • 测试用例
import (
    "testing"
)

func TestIsMatch(t *testing.T) {
    sensitiveList := []string{"中国", "中国人"}
    input := "我来自中国cd"

    util := NewDFAUtil(sensitiveList)
    if util.IsMatch(input) == false {
        t.Errorf("Expected true, but got false")
    }
}

func TestHandleWord(t *testing.T) {
    sensitiveList := []string{"中国", "中国人"}
    input := "我来自中国cd"

    util := NewDFAUtil(sensitiveList)
    newInput := util.HandleWord(input, '*')
    expected := "我来自**cd"
    if newInput != expected {
        t.Errorf("Expected %s, but got %s", expected, newInput)
    }
}

C# Version

  • TrieNode类型的定义
    /// <summary>
    /// TrieNode data structure
    /// TrieNode itself doesn't have any value. The value is represented on the path
    /// </summary>
    internal sealed class TrieNode
    {
        /// <summary>
        /// if this node is the end of a word
        /// </summary>
        public Boolean IsEndOfWord { get; set; }

        /// <summary>
        /// the collection of children of this node
        /// </summary>
        public Dictionary<Char, TrieNode> Children = new Dictionary<Char, TrieNode>();
    }
  • 匹配时上、下界索引的定义
    /// <summary>
    /// Match index object
    /// </summary>
    internal sealed class MatchIndex
    {
        /// <summary>
        /// start index
        /// </summary>
        public Int32 Start { get; set; }

        /// <summary>
        /// end index
        /// </summary>
        public Int32 End { get; set; }

        /// <summary>
        /// construct from scratch
        /// </summary>
        /// <param name="start">start index</param>
        /// <param name="end">end index</param>
        public MatchIndex(Int32 start, Int32 end)
        {
            this.Start = start;
            this.End = end;
        }

        /// <summary>
        /// construct from existing match index object
        /// </summary>
        /// <param name="other">existing match index object</param>
        public MatchIndex(MatchIndex other)
        {
            this.Start = other.Start;
            this.End = other.End;
        }
    }
  • 核心算法实现
    /// <summary>
    /// Dfa util
    /// </summary>
    public sealed class DFAUtil
    {
        /// <summary>
        /// The root node
        /// </summary>
        private readonly TrieNode Root = new TrieNode();

        /// <summary>
        /// Create new DfaUtil object
        /// </summary>
        /// <param name="wordList">word list</param>
        public DFAUtil(IEnumerable<String> wordList)
        {
            foreach (var word in wordList)
            {
                Char[] chArr = word.ToCharArray();
                if (chArr.Length > 0)
                {
                    InsertWord(chArr);
                }
            }
        }

        private void InsertWord(Char[] word)
        {
            var currNode = this.Root;
            foreach (var ch in word)
            {
                if (currNode.Children.ContainsKey(ch) == false)
                {
                    var childNode = new TrieNode();
                    currNode.Children[ch] = childNode;
                    currNode = childNode;
                }
                else
                {
                    currNode = currNode.Children[ch];
                }
            }

            currNode.IsEndOfWord = true;
        }

        /// <summary>
        /// Check if there is any word in the trie that starts with the given prefix.
        /// </summary>
        /// <param name="prefix">prefix</param>
        /// <returns></returns>
        private Boolean StartsWith(Char[] prefix)
        {
            var currNode = this.Root;
            foreach (var c in prefix)
            {
                if (currNode.Children.ContainsKey(c) == false)
                {
                    return false;
                }
                else
                {
                    currNode = currNode.Children[c];
                }
            }

            return true;
        }

        /// <summary>
        /// Search and make sure if a word is existed in the underlying trie.
        /// </summary>
        /// <param name="word">The char array of a word</param>
        /// <returns></returns>
        private Boolean SearchWord(Char[] word)
        {
            var currNode = this.Root;
            foreach (var c in word)
            {
                if (currNode.Children.ContainsKey(c) == false)
                {
                    return false;
                }
                else
                {
                    currNode = currNode.Children[c];
                }
            }

            return currNode.IsEndOfWord;
        }

        /// <summary>
        /// Search a whole sentence and get all the matching words and their indices
        /// </summary>
        /// <param name="sentence">Input sentence</param>
        /// <returns>A list of all the match index object</returns>
        private List<MatchIndex> SearchSentence(String sentence)
        {
            var start = 0;
            var end = 1;
            List<MatchIndex> matchIndexList = new List<MatchIndex>();

            // Iterate the sentence from the beginning to the end.
            var startsWith = false;
            Char[] sentenceChArr = sentence.ToCharArray();
            while (end <= sentenceChArr.Length)
            {
                // Check if a sensitive word starts with word range from [start:end)
                // We find the longest possible path
                // Then we check any sub word is the sensitive word from long to short
                if (this.StartsWith(sentenceChArr.Skip(start).Take(end - start).ToArray()))
                {
                    startsWith = true;
                    end++;
                }
                else
                {
                    if (startsWith)
                    {
                        // Check any sub word is the sensitive word from long to short
                        for (var index = end - 1; index > start; index--)
                        {
                            if (this.SearchWord(sentenceChArr.Skip(start).Take(index - start).ToArray()))
                            {
                                matchIndexList.Add(new MatchIndex(start, index - 1));
                                break;
                            }
                        }
                    }
                    start = end - 1;
                    end = end + 1;
                    startsWith = false;
                }
            }

            // If finishing not because of unmatching, but reaching the end, we need to 
            // check if the previous startsWith is true or not.
            // If it's true, we need to check if there is any candidate?
            if (startsWith)
            {
                for (var index = end - 1; index > start; index--)
                {
                    if (this.SearchWord(sentenceChArr.Skip(start).Take(index - start).ToArray()))
                    {
                        matchIndexList.Add(new MatchIndex(start, index - 1));
                        break;
                    }
                }
            }

            return matchIndexList;
        }

        /// <summary>
        /// Judge if input sentence contains some special character
        /// </summary>
        /// <param name="sentence">Input sentence</param>
        /// <returns>Match or not</returns>
        public Boolean IsMatch(String sentence)
        {
            //最终的索引范围列表<索引下限、索引上限>
            List<MatchIndex> matchIndexList = this.SearchSentence(sentence);
            if (matchIndexList.Count > 0)
            {
                return true;
            }

            // Filter some fixed sensitive character
            return StringUtil.IfHaveSpecialChar(sentence);
        }

        /// <summary>
        /// Handle sentence. Use specified character to replace those sensitive characters.
        /// </summary>
        /// <param name="sentence">Input sentence</param>
        /// <param name="replaceCh">candidate</param>
        /// <returns>Sentence after manipulation</returns>
        public String HandleWord(String sentence, Char replaceCh)
        {
            List<MatchIndex> matchIndexList = this.SearchSentence(sentence);
            if (matchIndexList.Count == 0)
            {
                return sentence;
            }

            // Manipulate
            Char[] chArr = sentence.ToCharArray();
            foreach (var item in matchIndexList)
            {
                for (var i = item.Start; i <= item.End; i++)
                {
                    chArr[i] = replaceCh;
                }
            }

            return new String(chArr);
        }
    }
  • 测试用例
public static void Test1()
{
        String[] sensitiveWordList = { "中国", "中国人" };
        String input = "我来自中国cd";

        DFAUtil util = new DFAUtil(sensitiveWordList);
        Console.WriteLine(util.IsMatch(input));
        Console.WriteLine(util.HandleWord(input, '*'));
}

复杂度分析

  • 时间复杂度
    由于底层使用的是Trie树,所有分支的存储是使用哈希表,所以每次查找字符所需要的是一次操作;但是对于输入的字符串,需要进行一次完整的遍历,所以对应的时间复杂度为O(n)。
  • 空间复杂度
    初始化时,每一个不同的字符都会被存储一次,所以所需的空间复杂度为O(n);而查询时,无需额外的空间。所以总的空间复杂度为O(n)。
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 194,088评论 5 459
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 81,715评论 2 371
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 141,361评论 0 319
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 52,099评论 1 263
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 60,987评论 4 355
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 46,063评论 1 272
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 36,486评论 3 381
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 35,175评论 0 253
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 39,440评论 1 290
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 34,518评论 2 309
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 36,305评论 1 326
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 32,190评论 3 312
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 37,550评论 3 298
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 28,880评论 0 17
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,152评论 1 250
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 41,451评论 2 341
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 40,637评论 2 335

推荐阅读更多精彩内容