matlab k-means clustering code — Assignment 7

Requirements
Implement the K-Means algorithm on MapReduce and test it on a small dataset. You may use the attached dataset, or randomly generate a set of scattered two-dimensional points (x, y). Try different values of K and different iteration counts, and visualize the clustering results. Submission requirements are the same as for Assignment 5; attach screenshots of the visualizations.

Implementation approach
I ran the provided example code directly, creating a Maven project named KMeansExample from it. Because the original code was not managed by Maven and was written against Hadoop 1.2, a few small changes were needed: each Java file needs the appropriate package declaration added at the top, and Job objects must be created with the static Job.getInstance method rather than with "new Job" directly.

I worked through the code of the entire algorithm; below is a brief description of how the example code is organized.

Main program: KMeansDriver.main()
KMeansDriver.main() is the entry point of the whole algorithm. From the command line it takes the parameters k (the number of clusters to form), iterationNum (the number of iterations), inputpath, and outputpath, and then invokes three main steps in sequence:
generateInitialCluster(): randomly generate k clusters
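A minimal sketch of a driver in the shape described above. The names KMeansDriver, generateInitialCluster, k, iterationNum, inputpath, and outputpath come from the description, and Job.getInstance is the actual Hadoop 2.x+ API; everything else, including generateInitialCluster's signature and the per-iteration job loop, is an assumed, generic K-Means-on-MapReduce structure, not the example code itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KMeansDriver {

        // Hypothetical stub: the example code picks k random points as the
        // initial cluster centers and stores them where each iteration's
        // mapper can read them.
        private static void generateInitialCluster(int k, String inputPath) {
            // ... omitted ...
        }

        public static void main(String[] args) throws Exception {
            int k = Integer.parseInt(args[0]);            // number of clusters
            int iterationNum = Integer.parseInt(args[1]); // number of iterations
            String inputPath = args[2];
            String outputPath = args[3];

            generateInitialCluster(k, inputPath);

            // One MapReduce job per K-Means iteration (a generic pattern;
            // the example code's exact job structure may differ).
            for (int i = 0; i < iterationNum; i++) {
                Configuration conf = new Configuration();
                // Hadoop 2.x+ API: the static factory replaces the
                // deprecated Hadoop 1.2 style "new Job(conf)".
                Job job = Job.getInstance(conf, "kmeans-iteration-" + i);
                job.setJarByClass(KMeansDriver.class);
                // Mapper/Reducer classes omitted; each iteration assigns
                // points to the nearest center and emits updated centers.
                FileInputFormat.addInputPath(job, new Path(inputPath));
                FileOutputFormat.setOutputPath(job, new Path(outputPath + "/iter-" + i));
                job.waitForCompletion(true);
            }
        }
    }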
Harvard CS153 compiler HW7
CSE 414 Homework 7: Parallel Data Processing and Spark

Objectives: to write distributed queries, and to learn about Spark and running distributed data processing in the cloud using AWS.
What to turn in: your Spark code in the sparkapp.py file.

Spark Programming Assignment (75 points)
In this homework you will write Spark and Spark SQL code, to be executed both locally on your machine and on Amazon Web Services. We will use a flight dataset similar to the one from previous homework; this time, however, it is the entire data dump from the US Bureau of Transportation Statistics, covering all domestic US flights from roughly 1987 to 2011. The data is in Parquet format. Your local runs/tests will use a subset of the data (in the flights_small directory), and your cloud jobs will use the full data (stored on Amazon S3).
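The graded solution itself goes in sparkapp.py in Python; the sketch below only illustrates the setup the assignment describes (a local Spark session reading the flights_small Parquet subset and querying it with Spark SQL), here via Spark's Java API. The class name, app name, and query are illustrative assumptions, not part of the assignment.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class FlightsSparkSketch {
        public static void main(String[] args) {
            // master("local[*]") is for the local test run only; when the
            // job is submitted to the AWS cluster, this setting is omitted.
            SparkSession spark = SparkSession.builder()
                    .appName("hw7-flights")
                    .master("local[*]")
                    .getOrCreate();

            // Local runs read the small subset; the cloud job would point
            // at the full dataset on S3 (an s3:// path) instead.
            Dataset<Row> flights = spark.read().parquet("flights_small");
            flights.createOrReplaceTempView("flights");

            // Placeholder query, not one of the graded homework queries.
            spark.sql("SELECT COUNT(*) AS num_flights FROM flights").show();

            spark.stop();
        }
    }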