DocumentAnalysis: Wikipedia document analysis using Hadoop (source code)

Uploader: 42137032 | Uploaded: 2021-07-06 17:06:49 | File size: 84KB | File type: ZIP
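DocumentAnalysis
Wikipedia document analysis using Hadoop

Each Map input is one span of the XML document, running from an opening tag to the corresponding closing tag. The key carries no meaning. The value is the content between the two tags, including the tags themselves, and can be converted to a string with .toString() for further processing.

src/documentParser
TextParser.java: processes a String with regular expressions and removes most punctuation. Still needs to be completed.
XMLHandler.java: processes an XML-formatted string as a SAX stream.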
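The two parsing steps described above can be sketched with the JDK alone. This is a minimal illustration, not the project's actual TextParser or XMLHandler code: the class name FragmentSketch, the methods textOf and tokenize, and the tag names page and text used in the sample are all assumptions, since the description does not name the tags. In a real Mapper the input string would come from calling .toString() on the value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative and not taken from the repository.
final class FragmentSketch {

    // Streams the XML fragment through SAX and collects the character data
    // found inside every element named targetTag.
    static String textOf(String xmlFragment, String targetTag) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // greater than 0 while inside targetTag

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(targetTag)) depth++;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(targetTag)) depth--;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xmlFragment)), handler);
        return text.toString();
    }

    // Replaces every run of characters that are neither letters nor digits
    // with a single space, then splits the result into words.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
        List<String> tokens = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            for (String token : cleaned.split(" ")) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample record; the real tag names are not given in the description.
        String value = "<page><title>T</title><text>Hadoop: Map, then Reduce!</text></page>";
        System.out.println(tokenize(textOf(value, "text")));
    }
}
```

The regular expression here drops all punctuation, which is stricter than the "most punctuation" behaviour the description attributes to TextParser.java. A real implementation may want to keep apostrophes or hyphens inside words.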

Resource details

DocumentAnalysis: Wikipedia document analysis using Hadoop (source code), 53 files, 84KB

DocumentAnalysis-master/
    .gitignore (21B)
    packup.sh~ (51B)
    packup.sh (42B)
    BigData.jar (34.45KB)
    src/
        categoryTFIDF/
            CategoryTFIDF.java (2.46KB)
            CategoryTFIDF2.java (2.55KB)
        xmlDriver/
            XMLDriver.java (4.44KB)
        documentParser/
            TextParser.java (1.51KB)
            XMLParser.java (868B)
        Writables/
            Map2ValueWritable.java (1.00KB)
            Map1ValueWritable.java (977B)
            Reduce1KeyWritable.java (842B)
        wordTFIDF/
            CountDocNum.java (1.40KB)
            WordTFIDF.java (3.13KB)
            WordTFIDF2.java (2.69KB)
            CountWordInitial.java (1.67KB)
        DocumentAnalysis.java (6.44KB)
        addPosition/
            AddPosition.java (3.01KB)
    bin/
        categoryTFIDF/
            CategoryTFIDF2.class (428B)
            CategoryTFIDF2$Reduce.class (2.25KB)
            CategoryTFIDF$Reduce.class (2.93KB)
            CategoryTFIDF.class (423B)
            CategoryTFIDF2$Map.class (3.63KB)
            CategoryTFIDF$Map.class (3.10KB)
        xmlDriver/
            XMLDriver$XmlInputFormat1$XmlRecordReader.class (4.16KB)
            XMLDriver.class (362B)
            XMLDriver$XmlInputFormat1.class (1.18KB)
        documentParser/
            XMLParser.class (1.16KB)
            TextParser.class (2.12KB)
        Writables/
            Map2ValueWritable.class (1.77KB)
            Map1ValueWritable.class (1.73KB)
            Reduce1KeyWritable.class (1.65KB)
        wordTFIDF/
            CountDocNum$PPartition.class (973B)
            CountWordInitial$Reduce.class (2.09KB)
            WordTFIDF2.class (392B)
            CountDocNum.class (456B)
            CountDocNum$Map.class (1.98KB)
            CountDocNum$Reduce.class (2.25KB)
            CountWordInitial.class (486B)
            WordTFIDF.class (387B)
            WordTFIDF$Reduce.class (2.90KB)
            CountWordInitial$PPartition.class (989B)
            WordTFIDF2$Map.class (3.57KB)
            WordTFIDF$Map.class (3.75KB)
            CountWordInitial$Map.class (2.52KB)
            WordTFIDF2$Reduce.class (2.57KB)
        addPosition/
            AddPosition$Map.class (4.00KB)
            AddPosition.class (405B)
            AddPosition$Reduce.class (2.37KB)
        DocumentAnalysis.class (5.12KB)
    .classpath (16.77KB)
    README.md (488B)
    .project (427B)
