📄️ Apache Spark Basics
What is Apache Spark?
📄️ Origin of Spark
Big Data and Distributed Computing at Google
📄️ Hadoop Basics
Apache Hadoop is an open-source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Instead of relying on one large computer to store and process the data, Hadoop clusters multiple computers together so that massive datasets can be analyzed in parallel, and therefore more quickly.
📄️ Map Reduce
Since understanding big data is largely a matter of answering "how" questions, the first one we need to answer is: how is the data actually stored? What makes it different from conventional, non-big-data storage?
📄️ Hadoop vs Spark
📄️ Interpreting a Spark DAG
A DAG (Directed Acyclic Graph) is a graph with nodes and directed edges but no cycles or loops. To interpret a Spark DAG, we first have to understand where the DAG comes into the picture during the execution of a Spark job.
📄️ Spark RDDs
An RDD (Resilient Distributed Dataset) is the primary data structure in Spark. It is a distributed collection of data that can be processed in parallel. RDDs are immutable, meaning that once an RDD is created, it cannot be modified. Instead, any transformations applied to an RDD will return a new RDD.
📄️ Quiz: Spark basics
Below are a few questions that should come in handy on a first pass: