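An MNIST dataset containing only two classes (labels 0 and 1), intended for binary classification problems.
2020-01-03 11:17:24 312KB MNIST data, binary classification dataset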
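Provided by Li Ronglu of Fudan University. answer.rar is the test corpus (9,833 documents); train.rar is the training corpus (9,804 documents), divided into 20 categories. The training and test corpora are split at a ratio of roughly 1:1. Collecting the corpus took considerable manpower and resources, so please cite the source when using it (Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University).
2019-12-25 11:15:53 103.28MB dataset, Chinese corpus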
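SogouCS dataset, containing nearly 100,000 Sohu news texts in 11 categories. The data provided by Sogou is uncategorized XML. This resource has already parsed the XML and sorted the texts into categories for convenience.
2019-12-21 22:23:09 94.29MB NLP, natural language processing, text classification, Sogou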
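This dataset contains 1,600,000 tweets crawled from Twitter and can be used for sentiment-analysis training. It consists of two data files: a test set (test) and a training set (training). The data files have no header row; from left to right, the columns are:
(1) Tweet polarity label: 0 = negative, 2 = neutral, 4 = positive
(2) Tweet id
(3) Timestamp, e.g. Sat May 16 23:58:44 UTC 2009
(4) Query, e.g. lyx; if there is no query, the value is NO_QUERY
(5) User who posted the tweet, e.g. robotickilldozr
(6) Tweet text
2019-12-21 22:23:09 86.3MB text classification, natural language processing, NLP, sentiment classification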
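The six-column layout of the tweet dataset described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the files are comma-separated with double-quoted fields, which the description does not state. The column names, the `read_tweets` helper, and the in-memory sample rows (including the tweet text and the second user name) are made up for illustration and are not part of the dataset.

```python
import csv
import io

# Column order follows the description; the names themselves are illustrative.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(stream):
    """Yield one dict per row from a header-less, comma-separated stream."""
    for row in csv.reader(stream):
        if len(row) != len(COLUMNS):
            continue  # skip rows that do not have exactly six fields
        record = dict(zip(COLUMNS, row))
        record["polarity"] = int(record["polarity"])
        record["label"] = POLARITY_NAMES.get(record["polarity"], "unknown")
        yield record


# Assumption: a tiny in-memory sample stands in for the real data file.
sample = io.StringIO(
    '"0","1","Sat May 16 23:58:44 UTC 2009","NO_QUERY","robotickilldozr","example tweet, with a comma"\n'
    '"4","2","Sat May 16 23:58:45 UTC 2009","NO_QUERY","someuser","another example"\n'
)
tweets = list(read_tweets(sample))
for tweet in tweets:
    print(tweet["label"], tweet["user"], tweet["text"])
```

For a real file, the same function could be given a file object opened with `newline=""`; the correct text encoding would need to be checked against the downloaded files.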
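A spam email classification dataset from Carnegie Mellon University (USA), in English, already split into positive and negative samples. It contains just over 5,000 records and is suitable for data mining and machine learning applications such as Bayesian classification models.
2019-12-21 22:20:38 1.72MB spam classification, dataset, data mining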
A subset of the data selected from 101_ObjectCategories, for use as test data for image classification.
2019-12-21 22:08:12 2.85MB image, classification
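Chinese text classification corpus (Fudan): training and test sets. This link is the training set. The corpus is provided by Li Ronglu of Fudan University. test_corpus is the test corpus (9,833 documents); train_corpus is the training corpus (9,804 documents); both corpora are divided into the same 20 categories. The training and test corpora are split at a ratio of roughly 1:1. Please cite the source when using it (Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University). The file is large, so please be patient while it downloads.
2019-12-21 22:04:21 101.81MB text classification, dataset, Fudan, Chinese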
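Contains a dataset of Chinese mobile-phone reviews (product ID and review text) together with Bayesian code for classifying Chinese reviews; dataset plus code.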
2019-12-21 22:04:10 17.84MB bayes
Classification data.
2019-12-21 21:55:41 5KB Iris classification
AG's News Topic Classification Dataset
Version 3, Updated 09/09/2015

ORIGIN
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.

The file classes.txt contains a list of classes corresponding to each label. The files train.csv and test.csv contain the training and testing samples as comma-separated values. They have 3 columns: class index (1 to 4), title, and description. The title and description are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
2019-12-21 21:45:42 11.25MB dataset, text classification
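The escaping rules in the AG's News description (doubled double quotes inside quoted fields, and a literal backslash followed by "n" for line breaks) can be handled as in the sketch below. It is a minimal illustration, not the dataset authors' loader: the `read_ag_news` helper and the sample row are hypothetical, and only Python's standard `csv` module is used.

```python
import csv
import io


def read_ag_news(stream):
    """Yield dicts with label, title, and description from a 3-column CSV stream."""
    # csv.reader already turns a doubled quote ("") inside a quoted field into ".
    for row in csv.reader(stream):
        if len(row) != 3:
            continue  # skip rows that do not have exactly three fields
        label, title, description = row
        yield {
            "label": int(label),  # class index, 1 to 4
            # Turn the two-character sequence backslash + n into a real newline.
            "title": title.replace("\\n", "\n"),
            "description": description.replace("\\n", "\n"),
        }


# Assumption: one made-up row stands in for train.csv or test.csv.
sample = io.StringIO('"3","A ""quoted"" title","First line\\nsecond line"\n')
rows = list(read_ag_news(sample))
print(rows[0]["title"])
print(rows[0]["description"])
```

With the real files, the stream would be a file object for train.csv or test.csv opened with `newline=""`, and the class names could be looked up by index in classes.txt.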