NAME: Arpita Seth
STUDENT NO: 014739396
There were five different articles about: MapReduce, Spark, Tensorflow, Powergraph and Petuum. MapReduce is a software framework, which divides input data into various chunks and these chunks are processed parellely with help of a map. The output received from these maps in sent as an input to the reduce task. The framework performs these tasks, monitors them and repeats the failed tasks. Next topic was Spark, in this we got to know about RDD which is a collection of memory that helps Programmers to use it in a fault tolerant manner. Spark provides a program to connect to the cluster and stores the results in a distributed memory to make the system faster. Whenever the user runs any program on RDD, Spark’s Scheduler assigns to the machine based on data location and if that task fails it is run on another node. Scala / Python interpreter complies, loads JVM and calls the function. Now comparing MapReduce and Spark, works on processing in-memory data, which is a faster process. Thus very less time is taken in input and output of data whereas MapReduce takes a lot of time in input and output activities. In real time , data processing speed for Spark is very high as compared to MapReduce. Next was tensorflow which is used for large scale data machine learning. Input and output are in the form of tensors, which contain a small number of Primitive data types and nodes are Mathematical operations. It consists of several layers: First is a client that explains how the computation works as a dataflow graph and starts it with the help of a session. Second is distributed master that prunes a particular subgraph from a graph as per session defined. Further it partitions the subgraphs and distributes it to worker services. Third is Worker services that schedules these executions for the kernels using CPUs, GPUs, etc parallely. These kernel implementations further compute the graph operations. Now comparing Spark and Tensorflow, Spark is a big data framework which helps use a large amount of data while Tensorflow is a machine learning framework which helps to improve the performance of neural networks and numerical computations by generating graphs in form of data flows. Next was PowerGraph, which has three phases: the first is collect, which collects information about the neighborhood, the second is apply which applies this accumulated value to center vertex and last is scatter, which updates the neighboring vertices and edges. PowerGraph contains features of both Pregel and Graphlab. The feature of shared-memory, which helps users to eliminate the need for architectural movement of information. It also features computational efficiency from asynchronous Graphlab. Graphlab uses serializability, which helps preventing neighboring vertex-programs from running using a fine-grained locking protocol, which uses sequentially grabbing locks on all the neighboring vertices. PowerGraph retains these features while improving on limitations in Graphlab. Next is Petuum, which consists of the first parameter server, which provides data-parallelism to users for accessing model parameters through shared shared memory, second scheduler, which provides model-parallelism by helping users to keep control of model parameters that are updated by worker machines and third workers, which receives update from the scheduler and then parallely updates the functions. Comparing Petuum and Spark, Petuum is rated 10 to 100 times faster as compared to Spark. Petuum works on high level machine learning clusters while Spark works for large machine learning clusters. Spark works on Java Virtual Machine while Petuum works on C++.
Other than reading these papers we also did three projects. First project was about learning Spark. It was implemented on AWS. Amazon S3 was used for storing external data files. Next step was to create Elastic MApReduce(EMR) cluster with EMR release 5.7.0 or up. The application type selected was Spark with three instances of type m4.large. After this EC2 key pair was selected and cluster was created. But after doing all this we did not get Web Connection Enabled, so we need to add InBound Rules in EMR Master-Node, we needed to add a rule for SSH and Port 22 for TCP protocol. We also added a rule for All Traffic for All Ports. This enabled web connection and Zeppelin notebook was available for use. Next RDDs were created and the data files uploaded on Amazon S3 for performing actions. Data was converted to float and it was pretty easy to find minimum, maximum, average and variance by performing actions on RDD data. It was slightly tricky to find the median. There were two main restrictions on calculating median. First was to avoid approximation and the second was to avoid sorting of full data set. If there was no restriction regarding approximation, then we could have used percentil_approx function. There exists a function ‘takeOrdered(k:int)’, which gives the first ‘k’ ordered elements of the dataset. We used this function to get the lower half of the sorted numbers and in case the number of entries was even, we took the average of the last two entries of the lower half as median, else took the last entry to lower half as median. But apparently, ‘takeOrdered(k:int)’ function sorts the whole dataset, hence the solution was incorrect. Next we also presented an algorithm for calculating the mode. It was a good exercise to gain exposure to AWS and the various configurations available, and also learn about PySpark and Zeppelin notebook and gain practical knowledge about RDDs.
The second project was about using Graphlab to analyze datasets. Using Graphlab on laptop was very heavy, so we chose to use Graphlab on AWS. Graphlab instance was created with type t2.micro. After it was created we could use it through Juypter notebook having a public address. The three provided datasets were uploaded on Amazon S3. Two of the datasets movies_movielens.csv and ratings_movielens.csv have the same data source ‘MovieLens’. So, it is easier to merge these two datasets based on the ‘movieId’. The third data set ‘imdb.csv’ has a different datasource and is scraped from IMDB and needs to be joined based on ‘movieName’ and ‘Year’. This requires preprocessing in order to match the formats of movieName from the two different datasets. First all the three datasets were imported as S Frames. Then preprocessing and cleaning of the data was done. ‘Year’ was removed from Title in movies_movielens.csv and added as a new column. Unwanted columns were removed from the datasets. Next from ratings_movielens.csv dataset, average ratings for each of the movies were calculated by grouping based on the movie Ids. Ratings and Movie datasets was merged on the basis of movie Ids. Next this was merged with Imdb dataset. This was a bit difficult to do on AWS due to big size of merged file obtained. Another difficulty was to get the ‘title’ for both the data sources in the similar format. After merging, the task was easy to filter out by genres, ratings, cast, movie year, etc and generate various statistics like top rated movies in certain genre in certain years, top casts in certain genres, etc. Finally, the statistics were visualized on graph and the graph visualization was quite interesting and interactive in GraphLab.
Last project was about experimenting with various machining learning parameters on MNIST dataset with Tensorflow. This project was comparatively easier than the previous projects. It was about understanding workflow of tensorflow and visualization on tensorboard. It involved training the MNIST dataset with softmax regression. Initially, Gradient Descent Optimizer was used, and we experimented with the learning rate, and analyzed the effect of learning rate on the accuracy of prediction on test dataset. Next, the effect of number of training steps, was observed on the accuracy of prediction. Lastly, we chose two optimizers other than Gradient Descent Optimizer for training, and observed the effect of the same on the accuracy of prediction. After this, we learnt the utility of tensorboard for visualization of various variables like accuracy, cross entropy, etc. for both training and testing against the number of increasing training steps. Tensorboard also provides us the ability to visualize the coded algorithm in form of a graph. In order to visualize through Tensorboard, we needed to modify the code by specifying variables in tf.summary, and filewriter.
All the projects were quite interesting and helped to gain exposure to latest tools used for analysis of big data sets. AWS and Spark is good to use in Big Data processing as Spark uses RDD (Resilient Distributed Dataset) thus less number of discs have to read/write and applications run very fast on Hadoop clusters. AWS provides a lot of options: Hadoop Cluster, Spark or Graphlab for use. The services are charged on hourly basis, and whenever the server is terminated/stopped, there will be no charges from the next hour. Graphlab is also a good option especially for data visualization. Graphlab canvas provides interactive GUI for tabular data, summary statistics and bi-variate plots. It helps in saving time used for data exploration.
Tensorflow has a powerful library and has good performance and customizability. Tensorflow is very flexible and portable. It can be used to train models and serve them in live to real customers. This means that, it is not necessary to rewrite the codes and the people can use their ideas on products quickly. Also people from, academic research areas can provide codes directly with greater reproducibility. In this way, it is useful and supports research and production to process faster. Another good reason for using tensorflow could be the visualization obtained through tensorboard.
Overall, the course was really useful to gain theoretical knowledge by understanding various important papers in the domain, and also providing hands-on knowledge and experience with the tools for both big data processing, machine learning and data visualization.