Author: Richard Zhang. Mail: 89205975@qq.com

This library filters sensitive phrases according to the user's configuration. Currently, only UTF-8 and ANSI encoded strings are supported.

The matching rule is max-length matching: the library tries to match the longest possible sensitive phrase. For example, if both "damn fucker" and "damn" are in the sensitive dictionary, the sentence "he's a damn fucker" is processed to "he's a ***********".

The library also copes with spaces or non-letter characters inserted between sensitive words. For example:
If "Bad boy" is in the sensitive dictionary, "Bad.boy", "Bad boy" and "Bad/boy" are also filtered.
If "你去死" is in the sensitive dictionary, "你 去 死", "你/去 死" and "你 去 .死" are also filtered.

Compiling requirements:
1. STL C++11
2. BOOST multi_index_container

Performance test conditions:
1. A sentence of around 100 bytes (English and Chinese mixed)
2. Around 10,000 dirty phrases
3. A loop test of 1,000 iterations
4. Intel i7 CPU

Test result: each iteration costs around 100 us.
2022-04-02 17:47:14 4KB profanity, sensitive words, chat, filtering
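The max-length matching rule described for this library, including the skipping of spaces and punctuation between letters, can be sketched as follows. This is only an illustration in Python, not the library's C++ code or API: the function name `mask_sensitive`, the case-insensitive comparison, and the choice to treat every non-alphanumeric character as a skippable separator are all assumptions.

```python
def mask_sensitive(text, phrases, mask="*"):
    """Mask the longest dictionary phrase found at each position.

    Hypothetical helper: letters and digits are compared case-insensitively,
    and any other character (space, '.', '/', ...) is skipped while matching.
    """
    # Normalise dictionary entries the same way the text is normalised.
    keys = {"".join(c.lower() for c in p if c.isalnum()) for p in phrases}
    keys.discard("")
    longest = max((len(k) for k in keys), default=0)

    # Keep only significant characters, remembering their original positions.
    sig = [(i, c.lower()) for i, c in enumerate(text) if c.isalnum()]
    out = list(text)
    pos = 0
    while pos < len(sig):
        matched = 0
        # Try the longest candidate first so that the longest phrase wins.
        for length in range(min(longest, len(sig) - pos), 0, -1):
            candidate = "".join(c for _, c in sig[pos:pos + length])
            if candidate in keys:
                matched = length
                break
        if matched:
            start, end = sig[pos][0], sig[pos + matched - 1][0]
            # Mask the whole original span, including skipped separators.
            for i in range(start, end + 1):
                out[i] = mask
            pos += matched
        else:
            pos += 1
    return "".join(out)
```

Because separators are ignored, this sketch does not respect word boundaries, so a dictionary entry can also match across two adjacent words; a real filter would need to decide whether that is acceptable.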
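Wikipedia word vectors. After decompressing sgns.wiki.char.bz2, the resulting file has a .char extension and can be converted to a file ending in .txt. It contains more than 350,000 characters and symbols, each represented by a 300-dimensional vector. Using the vectors as an embedding layer requires loading all of them into memory, and a machine without enough RAM will simply run out of memory. For that reason, subsets of 8,000 and 20,000 vocabulary entries have been extracted so that they also run on commonly configured devices. The original project provides more than 100 sets of Chinese word vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, etc.) and corpora. Pre-trained vectors with different properties can easily be obtained and used for downstream tasks.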
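The truncation step described above could look roughly like the sketch below. It assumes the decompressed file uses the word2vec text layout (a header line "count dimension", then one "token v1 v2 ..." line per entry) and that the entries to keep are simply the first ones in the file; neither assumption is confirmed by the description, and `truncate_vectors` is a hypothetical name.

```python
from itertools import islice

def truncate_vectors(src, dst, keep):
    """Copy at most `keep` vectors from src to dst, rewriting the header.

    Assumes word2vec text format: the first line is "<count> <dim>".
    `src` and `dst` are text-mode file objects.
    """
    dim = int(src.readline().split()[1])
    # Only the kept rows are held in memory, never the full vocabulary.
    rows = [line.rstrip("\n") for line in islice(src, keep)]
    dst.write(f"{len(rows)} {dim}\n")
    for row in rows:
        dst.write(row + "\n")
    return len(rows)

# Usage sketch (paths are placeholders):
# import bz2
# with bz2.open("sgns.wiki.char.bz2", "rt", encoding="utf-8") as src, \
#         open("sgns.wiki.char.8000.txt", "w", encoding="utf-8") as dst:
#     truncate_vectors(src, dst, 8000)
```

Reading with `islice` stops after the requested number of lines, so the 350,000-entry file is never fully loaded.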
HowNet sentiment word collection.
2022-04-01 16:05:36 83KB sentiment words
C++ source code for on-screen word capture, implemented with nhw32.dll. Provided for reference.
2022-03-31 22:14:25 3.47MB screen word capture
An implementation of the HMM algorithm for isolated words; it works reasonably well. Shared for everyone to use.
2022-03-30 22:33:45 14KB HMM
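Sensitive-word filtering in Java using the DFA algorithm, which the uploader describes as the most efficient approach. A sensitive-word dictionary is included, so sensitive-word filtering for forums and websites can be handled with little effort.
2022-03-30 13:14:46 1.39MB sensitive-word filtering, dfa, Java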
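DFA-based sensitive-word filters are commonly built as a nested dictionary in which each character is a state transition. The sketch below shows that idea in Python rather than Java; it is not the code from this download, and the names `build_dfa`, `find_sensitive` and `END` are hypothetical.

```python
END = object()  # hypothetical marker key: "a dictionary word ends at this state"

def build_dfa(words):
    """Build the nested-dict state machine from the dictionary words."""
    root = {}
    for word in words:
        if not word:
            continue  # ignore empty entries
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_sensitive(text, root):
    """Return (start, end) index pairs, taking the longest match at each start."""
    hits = []
    i = 0
    while i < len(text):
        node, end = root, -1
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break  # no transition for this character
            if END in node:
                end = j + 1  # remember the longest accepted prefix so far
        if end > 0:
            hits.append((i, end))
            i = end  # continue scanning after the match
        else:
            i += 1
    return hits
```

Each character of the text is followed through the dictionary at most once per start position, so lookup cost depends on the length of the longest entry rather than on the number of entries.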
1,208 stop words for Chinese text classification.
2022-03-30 11:47:56 3KB stop words
A self-implemented sentence similarity calculation based on the Cilin (synonym thesaurus), HowNet, SimHash (fingerprint), character/word vector and vector space model (VSM) approaches.
2022-03-29 17:13:03 7.51MB Python development, natural language processing
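Of the approaches listed, the vector space model is the simplest to illustrate: each sentence becomes a term-count vector and similarity is the cosine of the angle between the two vectors. The sketch below is a generic version of that idea, not code from this project; `cosine_similarity` is a hypothetical name, and tokenization (for example into characters or segmented words) is assumed to be done by the caller.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists under a bag-of-terms model."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Usage sketch: character-level tokens for two short sentences.
# cosine_similarity(list("我喜欢你"), list("我爱你"))
```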
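An answer-assist script for 词达人 on WeChat. While studying words in 词达人, hints are shown on the answer options. The operating steps are described in the Word document inside the archive folder.
2022-03-29 13:47:27 6.34MB 词达人, script, answer assistance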