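Short-text sentiment analysis corpus: user reviews collected from a food-delivery platform, 8,000 positive and 8,000 negative, 16,000 in total.
2019-12-21 19:36:36 386KB Chinese sentiment analysis, corpus, short-text classification, NLP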
The document contains the cloud-drive address; the data totals 319 MB. Suitable for NLP work such as text summarization and text classification.
The LCSTS dataset includes two parts:
/DATA:
1. PART I: the main content of LCSTS, containing 2,400,591 (short text, summary) pairs. It can be used to train supervised learning models for summary generation.
2. PART II: contains 10,666 human-labeled (short text, summary) pairs, which can be used to train a classifier to filter the noise in PART I.
3. PART III: contains 1,106 (short text, summary) pairs; this part is labeled by 3 people with the same labels. The pairs with scores 3, 4 and 5 can be used as a test set for evaluating summary generation systems.
/Result:
1. sumary.generated.char.context.txt: the summaries generated by RNN+context on the character-based input.
2. sumary.generated.char.nocontext.txt: the summaries generated by RNN+nocontext on the character-based input.
3. sumary.generated.word.context.txt: the summaries generated by RNN+context on the word-based input.
4. sumary.generated.word.nocontext.txt: the summaries generated by RNN+nocontext on the word-based input.
5. weibo.txt: the Weibo posts of the test set.
6. sumary.human: the human-written summaries corresponding to 'weibo.txt'. This part is the test set of the paper.
7. rouge.char_context.txt: the ROUGE metric on sumary.generated.char.context
8. rouge.char_nocontext.txt: the ROUGE metric on sumary.generated.char.nocontext
9. rouge.word_context.txt: the ROUGE metric on sumary.generated.word.context
10. rouge.word_nocontext.txt: the ROUGE metric on sumary.generated.word.nocontext
2019-12-21 19:26:22 66B nlp
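To address the short length and sparse features of Chinese short texts, a short-text classification method based on feature expansion with the latent Dirichlet allocation (LDA) model is proposed. Starting from the original features of a short text, the LDA topic model is used to infer the text's topic distribution; words from those topics are taken as additional features and appended to the original features, and an SVM classifier then classifies the short text. Experiments show that, compared with the traditional approach of representing short-text features directly with the VSM model, the method improves performance to varying degrees across the different categories of short text, indicating that supplementing short texts with LDA feature information is feasible.
2019-12-21 18:56:42 1.14MB LDA, short-text classification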
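The LDA feature-expansion idea in the entry above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method's actual implementation: it uses scikit-learn, a toy English corpus that stands in for already word-segmented Chinese text, and hypothetical settings (`n_topics_per_doc`, `n_words_per_topic`, 2 topics) that do not come from the source.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def expand_with_topic_words(counts, lda, n_topics_per_doc=1, n_words_per_topic=5):
    """Add the top words of each document's dominant topic(s) to its count vector.

    The parameter names and default values are assumptions made for this sketch.
    """
    doc_topic = lda.transform(counts)  # (n_docs, n_topics) topic distribution
    # Column indices of the highest-weight words for every topic.
    top_words = np.argsort(lda.components_, axis=1)[:, ::-1][:, :n_words_per_topic]
    expanded = counts.toarray().astype(float)  # dense copy, fine for a toy corpus
    for i, dist in enumerate(doc_topic):
        for topic in np.argsort(dist)[::-1][:n_topics_per_doc]:
            expanded[i, top_words[topic]] += 1.0  # append topic words as features
    return expanded


# Toy, pre-tokenized corpus (hypothetical data; 0 = negative, 1 = positive).
texts = [
    "food arrived cold late",
    "delivery late rider rude",
    "tasty food fast delivery",
    "fresh tasty meal great",
    "cold soup bad taste",
    "great service fast rider",
]
labels = [0, 0, 1, 1, 0, 1]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)  # plain VSM (bag-of-words) features
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
expanded = expand_with_topic_words(counts, lda)  # VSM + LDA topic words
clf = LinearSVC().fit(expanded, labels)  # SVM on the expanded features
predictions = clf.predict(expanded)
```

In practice the LDA model would be trained on a larger corpus than the labeled short texts, and Chinese input would first need a word segmenter; both choices are left out of this sketch.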