CSE 414 Homework 7: Parallel Data Processing and Spark
Objectives: To write distributed queries. To learn about Spark and running distributed data
processing in the cloud using AWS.
What to turn in:
Your Spark code in the sparkapp.py file.
Spark Programming Assignment (75 points)
In this homework, you will be writing Spark and Spark SQL code, to be executed both locally on
your machine and in the cloud using Amazon Web Services (AWS).
We will be using a flight dataset similar to the one used in a previous homework. This time, however,
we will be using the entire data dump from the US Bureau of Transportation Statistics, which consists
of information about all domestic US flights from 1987 through roughly 2011. The data is in Parquet format.
Your local runs/tests will use a subset of the data (in the flights_small directory), and your cloud
jobs will use the full data (stored on Amazon S3).
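As a starting point, here is a minimal sketch of how sparkapp.py might load the Parquet data with PySpark. The SparkSession setup, the example column count query, and the S3 path are assumptions for illustration; use the paths and schema given in the assignment materials.

```python
# Minimal sketch for loading the flight data in sparkapp.py.
# Assumes PySpark is installed; the S3 path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hw7-flights").getOrCreate()

# Local runs: read the small subset shipped with the homework.
flights = spark.read.parquet("flights_small")

# Cloud runs: point at the full dataset on S3 instead (hypothetical path).
# flights = spark.read.parquet("s3://<bucket-provided-in-assignment>/flights/")

flights.printSchema()                        # inspect the available columns
flights.createOrReplaceTempView("flights")   # enables Spark SQL queries

# Example Spark SQL query over the registered view.
spark.sql("SELECT COUNT(*) AS num_flights FROM flights").show()
```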