Paper: An Emotion Cause Corpus for Chinese Microblogs with Multiple-User Structures
2021-04-27 13:01:53 774KB microblog emotion analysis, emotion cause detection
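The MSR dataset is Microsoft's public dataset for text similarity computation. The training set has 4,076 sentence pairs, of which 2,753 are positive examples (similarity label 1); the test set has 1,725 sentence pairs, of which 1,147 are positive examples.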
2021-04-26 17:12:27 485KB MSR dataset, text similarity computation
Kaggle dataset for named entity recognition, downloaded by Fan Qiang. Abhinav Walia • updated 3 years ago (Version 4). Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus, intended for entity classification, with enhanced and popular features produced by natural language processing applied to the dataset.
2021-04-23 17:20:16 26.42MB dataset, named entity recognition, deep learning, nlp
TI46 dataset: corpus files for the spoken digits 0-9
2021-04-18 22:06:15 193.61MB TI46, dataset
SIGIL - R for Corpus Data.pdf
2021-03-28 09:07:49 124KB R language, corpus
PaddlePaddle-DeepSpeech Chinese speech recognition model (trained on the free_st_chinese_mandarin_corpus dataset). Project URL: https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/tree/release/1.0
ace05-data-prep: preprocessing for the ACE 2005 Corpus (tips on how to run mgormley/ace-data-prep)
2021-03-17 15:40:44 12KB java maven makefile stanford-corenlp
PPASR Chinese speech recognition model, entry-level version (trained on the free_st_chinese_mandarin_corpus dataset). Source code URL: https://github.com/yeyupiaoling/PPASR/tree/%E5%85%A5%E9%97%A8%E7%BA%A7
The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora of modern British and American English. It is suitable both for monolingual research into modern Mandarin Chinese and for cross-linguistic contrast between Chinese and British/American English. The corpus samples 15 written text categories, including news, literary texts, academic prose and official documents, published in P. R. China in the early 1990s, for a total of approximately 1 million words. LCMC uses the same sampling frame and period as FLOB/FROWN. The corpus is marked up for text categories, sample file numbers, paragraphs, sentences and tokens. Linguistic annotation consists of tokenization and part-of-speech tagging; the whole corpus is annotated at the word level and includes orthographic and morphological annotations. Tagging was done with the Chinese Lexical Analysis System (ICTCLAS) developed at the Institute of Computing Technology, Chinese Academy of Sciences, and the part-of-speech tags of all aspect markers were checked manually. The corpus is encoded in Unicode (UTF-8) and marked up in XML; its XML structure was validated with the parser built into Xaira. It comes with a User Manual detailing the corpus design specifications and the part-of-speech tags.
2021-02-18 20:17:08 5.15MB LCMC
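A consolidation of the NER corpora that can currently be found, converted to a unified format so they can be used for training directly.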
2020-01-03 11:17:01 23.02MB NLP, corpus, corpus collection