Apache Beam
What is Apache Beam?
Apache Beam is a library for parallel data processing. Beam is commonly used for Extract, Transform, and Load (ETL) jobs and for both batch and stream processing. It works particularly well with large amounts of data, since it can spread the work across multiple machines and process everything at the same time.
Apache Beam comprises four basic concepts, shown together in the sketch after this list:
- Pipeline - A Pipeline encapsulates the entire data processing task: reading the input data, transforming it, and writing the output. Every Beam program starts by creating a Pipeline.
- PCollection - The Beam equivalent of an RDD or DataFrame in Spark. The pipeline creates an initial PCollection by reading data from a source, and new PCollections are produced as PTransforms are applied to it.
- PTransform - Applying a PTransform to a PCollection produces a new PCollection and leaves the original unchanged: PCollections are immutable, so once constructed you cannot modify individual elements. The elements of a PCollection can be of any type, but all must be of the same type. To support distributed processing, Beam encodes each element as a byte string so that elements can be passed around to distributed workers.
- Runner - Determines where the pipeline will run (for example, locally with the DirectRunner or on a distributed backend such as Spark, Flink, or Dataflow).
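To make these concepts concrete, here is a minimal sketch of a complete pipeline, assuming the apache-beam Python package is installed. The sample data and doubling step are illustrative choices; with no runner specified, Beam falls back to the local DirectRunner:

```python
import apache_beam as beam

# Pipeline: wraps the whole read-transform-write cycle.
# No runner is specified, so Beam uses the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([1, 2, 3, 4])   # PTransform that builds the initial PCollection
        | beam.Map(lambda x: x * 2)   # each PTransform yields a new PCollection
        | beam.Map(print)             # inspect the results
    )
```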
In Beam, your data lives in a PCollection, which stands for Parallel Collection. A PCollection is like a list of elements, but without any order guarantees. This allows Beam to easily parallelize and distribute the PCollection's elements.
Once you have your data, the next step is to transform it. In Beam, you transform data using PTransforms, which stands for Parallel Transform. A PTransform is like a function: it takes some input, transforms it, and produces some output.
We pass the elements from step1 through step3 and save the results into outputs.
```python
outputs = pipeline | step1 | step2 | step3
```
This is equivalent to the example above.
```python
outputs = (
    pipeline
    | step1
    | step2
    | step3
)
```
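As a concrete, runnable variant of the abstract step1 through step3 pipeline above (the steps and sample data here are illustrative, not from the original):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    outputs = (
        pipeline
        | beam.Create(['apache beam', 'parallel data processing'])  # step1: create input
        | beam.FlatMap(lambda line: line.split())                   # step2: split into words
        | beam.Map(str.title)                                       # step3: title-case each word
    )
    outputs | beam.Map(print)
```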
Also, Beam expects each transform, or step, to have a unique label, or description. This makes debugging a lot easier, and it's a good practice to adopt from the start. You can use the right shift operator >> to add a label to your transforms, like 'My description' >> MyTransform.
Try to give short but descriptive labels.
```python
outputs = (
    pipeline
    | 'First step' >> step1
    | 'Second step' >> step2
    | 'Third step' >> step3
)
```
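For example, a labeled version of a small concrete pipeline might look like this (the transforms, data, and labels are chosen here for illustration). Note that labels must be unique within the pipeline:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create words' >> beam.Create(['hello', 'beam', 'labels'])
        | 'Keep long words' >> beam.Filter(lambda word: len(word) > 4)
        | 'Uppercase' >> beam.Map(str.upper)
        | 'Print results' >> beam.Map(print)
    )
```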
Python transform catalog overview
Element-wise
Transform | Description |
---|---|
Filter | Given a predicate, filter out all elements that don't satisfy the predicate. |
FlatMap | Applies a function that returns a collection to every element in the input and outputs all resulting elements. |
Keys | Extracts the key from each element in a collection of key-value pairs. |
KvSwap | Swaps the key and value of each element in a collection of key-value pairs. |
Map | Applies a function to every element in the input and outputs the result. |
ParDo | The most-general mechanism for applying a user-defined DoFn to every element in the input collection. |
Partition | Routes each input element to a specific output collection based on some partition function. |
Regex | Filters input string elements based on a regex. May also transform them based on the matching groups. |
Reify | Transforms for converting between explicit and implicit form of various Beam values. |
RunInference | Uses machine learning (ML) models to do local and remote inference. |
ToString | Transforms every element in an input collection to a string. |
WithTimestamps | Applies a function to determine a timestamp to each element in the output collection, and updates the implicit timestamp associated with each input. Note that it is only safe to adjust timestamps forwards. |
Values | Extracts the value from each element in a collection of key-value pairs. |
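A short sketch combining a few of these element-wise transforms (the sample data and the SplitWords DoFn are illustrative):

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """User-defined DoFn for ParDo: emits one word per input line."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(['the quick brown fox', 'jumps over the lazy dog'])
        | beam.ParDo(SplitWords())            # ParDo: the most general element-wise transform
        | beam.Filter(lambda w: len(w) > 3)   # Filter: keep words longer than 3 characters
        | beam.Map(lambda w: (w, len(w)))     # Map: turn each word into a key-value pair
        | beam.Map(print)
    )
```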
Aggregation
Transform | Description |
---|---|
ApproximateQuantiles | Not available. See BEAM-6694 for updates. |
ApproximateUnique | Not available. See BEAM-6693 for updates. |
CoGroupByKey | Takes several keyed collections of elements and produces a collection where each element consists of a key and all values associated with that key. |
CombineGlobally | Transforms to combine elements. |
CombinePerKey | Transforms to combine elements for each key. |
CombineValues | Transforms to combine keyed iterables. |
CombineWithContext | Not available. |
Count | Counts the number of elements within each aggregation. |
Distinct | Produces a collection containing distinct elements from the input collection. |
GroupByKey | Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key. |
GroupBy | Takes a collection of elements and produces a collection grouped by properties of those elements. Unlike GroupByKey, the key is dynamically created from the elements themselves. |
GroupIntoBatches | Batches the input into batches of the desired size. |
Latest | Gets the element with the latest timestamp. |
Max | Gets the element with the maximum value within each aggregation. |
Mean | Computes the average within each aggregation. |
Min | Gets the element with the minimum value within each aggregation. |
Sample | Randomly selects some number of elements from each aggregation. |
Sum | Sums all the elements within each aggregation. |
Top | Computes the largest element(s) in each aggregation. |
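A sketch of the two most common aggregation patterns, grouping and combining per key (the fruit data is illustrative):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    fruits = pipeline | beam.Create([
        ('apple', 3), ('banana', 2), ('apple', 1), ('banana', 5),
    ])
    # GroupByKey: ('apple', <iterable of [3, 1]>), ('banana', <iterable of [2, 5]>)
    grouped = fruits | 'Group values' >> beam.GroupByKey()
    # CombinePerKey: ('apple', 4), ('banana', 7)
    totals = fruits | 'Sum per key' >> beam.CombinePerKey(sum)
    totals | beam.Map(print)
```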
Other
Transform | Description |
---|---|
Create | Creates a collection from an in-memory list. |
Flatten | Given multiple input collections, produces a single output collection containing all elements from all of the input collections. |
PAssert | Not available. |
Reshuffle | Given an input collection, redistributes the elements between workers. This is most useful for adjusting parallelism or preventing coupled failures. |
View | Not available. |
WindowInto | Logically divides up or groups the elements of a collection into finite windows according to a function. |
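A sketch showing Flatten and WindowInto together (the event tuples and timestamps are illustrative; the second field is treated as a Unix timestamp in seconds):

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as pipeline:
    first = pipeline | 'First source' >> beam.Create([('login', 5)])
    second = pipeline | 'Second source' >> beam.Create([('click', 65)])
    # Flatten: merge both PCollections into one.
    events = (first, second) | beam.Flatten()
    windowed = (
        events
        | beam.Map(lambda e: window.TimestampedValue(e, e[1]))  # attach timestamps
        | beam.WindowInto(window.FixedWindows(60))              # 60-second fixed windows
    )
    windowed | beam.Map(print)
```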
Refer to the official Beam documentation for an up-to-date list.