Lab: Spark Optimizations for Analytics Workloads

Optimizations in Apache Spark play a crucial role when building big data solutions. Knowledge and experience in tuning Spark-based workloads help organizations save cost and time when running these workloads in the cloud. In this lab, we will learn about various optimization techniques for Spark DataFrames and for big data analytics in general. We will learn about the limitations of the collect() method and of inferSchema when reading data. This will be followed by an overview of best practices for working with CSV files, Parquet files, pandas, and Koalas. We will also learn about some powerful optimization techniques, such as column predicate pushdown, column pruning, and partitioning strategies.
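
As a quick taste of what is ahead, here is a minimal PySpark sketch touching three of these ideas: supplying an explicit schema instead of inferSchema, pruning columns and pushing down filters on Parquet, and avoiding collect(). The file paths and column names are hypothetical placeholders, not part of the lab's assets:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("spark-optimizations").getOrCreate()

# Supplying an explicit schema avoids the extra pass over the data
# that inferSchema triggers when reading CSV files.
schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("country", StringType()),
    StructField("amount", IntegerType()),
])
orders = spark.read.csv("/data/orders.csv", schema=schema, header=True)

# With a columnar format such as Parquet, selecting only the needed
# columns (column pruning) and filtering early (predicate pushdown)
# lets Spark skip reading irrelevant data from storage.
parquet_orders = spark.read.parquet("/data/orders.parquet")
result = parquet_orders.select("country", "amount").where("country = 'US'")

# Prefer show()/take() over collect(), which pulls the entire result
# set onto the driver and can cause out-of-memory errors.
result.show(5)
```

Each of these points is developed in detail in the sections that follow.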

The topics covered in this lab are as follows:

  • Understanding the collect() method
  • Understanding the use of inferSchema
  • Learning to differentiate between CSV and Parquet
  • Learning to differentiate between pandas and Koalas
  • Understanding built-in Spark functions
  • Learning column predicate pushdown
  • Learning partitioning strategies in Spark
  • Understanding Spark SQL optimizations
  • Understanding bucketing in Spark

The code is in the assets folder. You will also find a .dbc file that can be imported directly into Databricks.