Lab: Apache Beam Getting Started
Pipeline 1 - Simple Ingest Data Pipeline
Notebook: 01-sa-ingest-data-pipeline.ipynb
Pipeline 2 - Wordcount
This notebook demonstrates a simple pipeline that uses the Direct Runner to read a text file, apply transforms to tokenize and count the words, and write the results to an output text file.
Key Concepts:
- Creating the Pipeline
- Applying transforms to the Pipeline
- Reading input
- Applying ParDo transforms
- Applying SDK-provided transforms (in this example: Count)
- Writing output
- Running the Pipeline
Notebook: 02-sa-wordcount-pipeline.ipynb
Apache Beam Basic Operations
In this tutorial, we will learn how to:
- Create and print input data
- Read data from files
- Write data to files
- Read data from a SQLite database
- Use the Map, FlatMap, Reduce, and Combine functions
Notebook: 03-sa-basic-operations.ipynb
Windowing
In this tutorial, we will learn about:
- Global windows
- Fixed-time windows
- Sliding-time windows
- Session windows
Notebook: 04-sa-windowing.ipynb
Dataframes
In this tutorial, we will learn about:
- Pandas DataFrame to Beam DataFrame
- Pandas DataFrame to PCollection
- Beam DataFrame to Pandas DataFrame
- PCollection to Pandas DataFrame
- Beam DataFrame to PCollection
- PCollection to Beam DataFrame
Notebook: 05-sa-dataframes.ipynb
Files
├── [ 22K] 01-sa-ingest-data-pipeline.ipynb
├── [ 21K] 02-sa-wordcount-pipeline.ipynb
├── [ 26K] 03-sa-basic-operations.ipynb
├── [ 29K] 04-sa-windowing.ipynb
├── [ 18K] 05-sa-dataframes.ipynb
├── [ 113] Makefile
├── [2.0K] README.md
├── [158K] data
│ ├── [ 62K] kinglear.txt.zip
│ ├── [8.0K] moon-phases.db
│ ├── [ 529] penguins.csv
│ ├── [ 121] sample1.txt
│ ├── [ 72] sample2.txt
│ ├── [ 160] solar_events.csv
│ └── [ 87K] sp500.csv.zip
├── [115K] output
│ ├── [ 66K] pipe2-00000-of-00001
│ ├── [ 76] result.txt-00000-of-00001
│ ├── [ 153] sample-00000-of-00001.txt
│ └── [ 48K] wordcount-00000-of-00001
└── [5.4K] src
├── [3.5K] pipeline1.py
└── [1.9K] pipeline2.py
397K used in 3 directories, 20 files