Uses SVM (support vector machine), one of the ten classic machine-learning algorithms, to implement text classification for natural language processing.
2019-12-21 21:51:32 7KB SVM text classification
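As a rough illustration of this kind of SVM pipeline (not the packaged code itself), here is a minimal sketch assuming scikit-learn and jieba for Chinese word segmentation, with placeholder texts and labels:

```python
# Minimal SVM text-classification sketch (assumed: scikit-learn + jieba;
# not the packaged implementation, only an illustration of the approach).
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder corpus: (text, label) pairs stand in for the real data.
texts = ["这部电影非常好看", "股市今天大幅下跌", "球队赢得了比赛"]
labels = ["entertainment", "finance", "sports"]

# Segment Chinese text into space-separated tokens so TfidfVectorizer can split on whitespace.
segmented = [" ".join(jieba.cut(t)) for t in texts]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(segmented, labels)

query = " ".join(jieba.cut("比赛结果出来了"))
print(model.predict([query]))  # -> predicted label, e.g. "sports"
```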
Chinese text classification corpus (Fudan University): training set and test set. The test corpus contains 9,833 documents; the training corpus contains 9,804 documents. Please credit the source when using it (Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University).
2019-12-21 21:50:45 106.15MB Chinese text classification, corpus, test set, training set
AG's News Topic Classification Dataset, Version 3, updated 09/09/2015.
ORIGIN: AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2,000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html . The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
DESCRIPTION: The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for 120,000 training samples and 7,600 testing samples in total. The file classes.txt contains the list of classes corresponding to each label. The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
2019-12-21 21:45:42 11.25MB dataset, text classification
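Given the column layout described above, loading train.csv could look like the following sketch; pandas is an assumed dependency and the file paths are placeholders:

```python
# Sketch of loading AG's News train.csv as described above
# (3 columns: class index 1-4, title, description; no header row).
# pandas is an assumed dependency; the paths are placeholders.
import pandas as pd

train = pd.read_csv(
    "train.csv",
    header=None,
    names=["class_index", "title", "description"],
)

# Undo the literal "\n" escaping used for embedded newlines.
train["description"] = train["description"].str.replace(r"\n", " ", regex=False)

# Map numeric labels (1-4) to names from classes.txt.
with open("classes.txt") as f:
    class_names = [line.strip() for line in f if line.strip()]
train["class_name"] = train["class_index"].map(lambda i: class_names[i - 1])

print(train.head())
```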
(Complete set) Includes the HowNet sentiment lexicon, the NTUSD simplified-Chinese sentiment lexicon (National Taiwan University), the Affective Lexicon Ontology, sentiment lexicons with their categories, Tsinghua University Li Jun's Chinese commendatory/derogatory lexicon, a table of Chinese sentiment-word polarity values, a negation-word lexicon, and commendatory/derogatory words with their synonyms.
2019-12-21 21:45:05 1.63MB natural language processing, sentiment words, text classification
Keras implementation of Chinese text classification: handles Chinese text analysis with word-vector embeddings, and performs convolution over semantic features (textCNN) to classify text.
2019-12-21 21:42:31 6KB textCNN
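A compact textCNN sketch in Keras along the lines described above (embedding, parallel convolutions with several kernel sizes, max pooling, softmax); the vocabulary size, sequence length, and class count are placeholder assumptions rather than values from this resource:

```python
# textCNN sketch in Keras: embedding -> parallel Conv1D branches -> max pooling -> softmax.
# Hyperparameters (vocab size, sequence length, number of classes) are placeholders.
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMBED_DIM, NUM_CLASSES = 50000, 200, 128, 10

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # could be initialized with pretrained word vectors

# One convolution branch per kernel size, each followed by global max pooling.
branches = []
for kernel_size in (3, 4, 5):
    c = layers.Conv1D(filters=128, kernel_size=kernel_size, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(c))

merged = layers.Concatenate()(branches)
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```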
Problem: This case comes from a real problem at work. When preparing corpora for speech recognition, a fairly large amount of relevant text had to be crawled from the web, and some irrelevant content was crawled along with it; the question was how to filter that irrelevant content out.
Initial approach: The first idea was to segment the texts, vectorize them, and use clustering to inspect the distribution. However, with sklearn's CountVectorizer, which simply turns all term frequencies into a vector with no fixed reference, the vectors change whenever the training samples change; performance on the test set was mediocre, and in real use the approach was almost useless. Another blog post suggested first computing word-frequency statistics over the target-topic texts and using that result as a fixed vocabulary template for vectorization. In practice this works well, so the method is shared here.
2019-12-21 21:41:53 2.71MB natural language processing, SVM, text classification, Gaussian naive Bayes
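A minimal sketch of the idea, assuming scikit-learn: count word frequencies over the target-topic texts first, keep the most frequent terms as a fixed vocabulary, and vectorize everything against that template so the feature space no longer shifts with the training sample:

```python
# Sketch: build a fixed vocabulary from the target-topic texts, then vectorize
# crawled texts against it (scikit-learn assumed; the texts are placeholders).
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

target_texts = ["语音 识别 语料 准备", "语音 识别 模型 训练 语料"]   # already segmented, space-separated
crawled_texts = ["今日 股市 行情", "语音 识别 数据 标注"]

# 1. Count word frequencies over the target-topic texts only.
freq = Counter(token for text in target_texts for token in text.split())

# 2. Keep the top-N words as a fixed vocabulary "template".
top_words = [w for w, _ in freq.most_common(1000)]

# 3. Vectorize all texts against that fixed vocabulary; the feature space
#    no longer changes when the training sample changes.
vectorizer = CountVectorizer(vocabulary=top_words)
X_crawled = vectorizer.transform(crawled_texts)
print(vectorizer.get_feature_names_out())
print(X_crawled.toarray())
```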
To address the sparse features and heavy noise of short texts, this work proposes an LDA-based high-frequency-word expansion method: the high-frequency words of each class are extracted as the feature space of a vector space model, short texts are represented as vectors with TF-IDF, and LDA is then used to obtain each text's latent-topic features; the high-frequency words of latent topics whose probability exceeds a threshold are expanded into the text, reducing the impact of noise and sparsity. Experiments show that this method outperforms conventional classification methods.
2019-12-21 21:41:21 624KB LDA, short-text classification
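A rough sketch of the expansion step, assuming scikit-learn's LatentDirichletAllocation and using each topic's top words as its high-frequency words; the threshold and word counts are illustrative, not the paper's settings:

```python
# Sketch of LDA-based high-frequency-word expansion for short texts
# (scikit-learn assumed; threshold and counts are illustrative, not the paper's values).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["手机 电池 续航", "股票 市场 上涨", "手机 屏幕 很大", "基金 收益 下跌"]  # segmented short texts

# Fit LDA on term counts to get per-document topic distributions.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)           # shape: (n_docs, n_topics)
vocab = count_vec.get_feature_names_out()

THRESHOLD, TOP_WORDS = 0.5, 3
expanded_docs = []
for doc, topic_probs in zip(docs, doc_topics):
    extra = []
    for topic_idx, prob in enumerate(topic_probs):
        if prob > THRESHOLD:                      # only topics the document clearly belongs to
            top_idx = np.argsort(lda.components_[topic_idx])[::-1][:TOP_WORDS]
            extra.extend(vocab[i] for i in top_idx)
    expanded_docs.append(doc + " " + " ".join(extra))

# Final TF-IDF representation of the expanded texts.
X = TfidfVectorizer().fit_transform(expanded_docs)
print(expanded_docs)
```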
This resource further simplifies Google's open-source BERT code, making it easier to generate sentence vectors and perform text classification.
2019-12-21 21:40:14 2.96MB Python development - natural language processing
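Purely to illustrate the sentence-vector idea (using the Hugging Face transformers library rather than the simplified Google BERT code packaged here), one might take the [CLS] hidden state as a sentence vector:

```python
# Sentence-vector sketch using Hugging Face transformers (NOT the simplified
# Google BERT code in this resource; shown only to illustrate the idea).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

sentences = ["今天天气不错", "股市出现大幅波动"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a simple sentence vector (768-dim for bert-base).
sentence_vectors = outputs.last_hidden_state[:, 0, :]
print(sentence_vectors.shape)   # torch.Size([2, 768])
```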
RNN text classification (course assignment): BasicRNN, BasicLSTM, GRU.
2019-12-21 21:38:27 18.17MB RNN AI
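For an assignment like this, the three recurrent cells can typically sit behind the same classifier head; a minimal Keras sketch with placeholder hyperparameters:

```python
# Minimal recurrent text classifier sketch in Keras; SimpleRNN / LSTM / GRU
# can be swapped via the `cell` argument. Hyperparameters are placeholders.
from tensorflow.keras import layers, models

def build_rnn_classifier(cell="gru", vocab_size=20000, seq_len=100,
                         embed_dim=128, hidden=64, num_classes=5):
    rnn_layer = {"rnn": layers.SimpleRNN, "lstm": layers.LSTM, "gru": layers.GRU}[cell]
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, embed_dim),
        rnn_layer(hidden),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_rnn_classifier("lstm")
model.summary()
```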
Simple and effective text classification.
2019-12-21 21:25:04 7KB svm, lda, text classification