Apache Beam

What is Apache Beam?

Apache Beam is a library for parallel data processing. Beam is commonly used for Extract, Transform, and Load (ETL), batch, and stream processing. It is particularly well suited to large datasets, since it can spread the work across multiple machines and process everything at the same time.

Apache Beam comprises four basic features:

  1. Pipeline - A Pipeline encapsulates the entire data processing task, from reading the input, through transforming it, to writing the output. Every Beam program starts by creating a Pipeline.
  2. PCollection - The Beam equivalent of an RDD or DataFrame in Spark. The pipeline creates an initial PCollection by reading data from a data source; further PCollections are produced as PTransforms are applied to it.
  3. PTransform - A processing step applied to a PCollection. Each PTransform produces a new PCollection, because PCollections are immutable: once constructed, you cannot modify individual elements. The elements of a PCollection can be of any type, but they must all be of the same type. To support distributed processing, Beam encodes each element as a byte string so that elements can be passed around to distributed workers.
  4. Runner - Determines where the pipeline will run, for example locally with the DirectRunner or on a distributed backend such as Flink, Spark, or Google Cloud Dataflow. A minimal example tying these four pieces together follows this list.
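
To make these four pieces concrete, here is a minimal sketch using the Beam Python SDK. It assumes the apache-beam package is installed; the element values and the print step are made up for illustration.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The Runner is chosen through pipeline options; DirectRunner executes locally.
options = PipelineOptions(runner='DirectRunner')

# The Pipeline owns the whole read-process-write cycle.
with beam.Pipeline(options=options) as pipeline:
    greetings = pipeline | beam.Create(['hello', 'world'])  # creates a PCollection
    shouted = greetings | beam.Map(str.upper)               # a PTransform yields a new PCollection
    shouted | beam.Map(print)                               # HELLO, WORLD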

In Beam, your data lives in a PCollection, which stands for Parallel Collection. A PCollection is like a list of elements, but without any order guarantees. This allows Beam to easily parallelize and distribute the PCollection's elements.

Once you have your data, the next step is to transform it. In Beam, you transform data using PTransforms, which stands for Parallel Transform. A PTransform is like a function: it takes some input, transforms it, and produces some output.
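
For example, beam.Map wraps an ordinary Python function as a PTransform. This is a small, self-contained sketch; the values and the conversion are illustrative only.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    miles = pipeline | beam.Create([1.0, 5.0, 10.0])
    # Map applies the function to each element and emits the result.
    kilometers = miles | beam.Map(lambda m: m * 1.60934)
    kilometers | beam.Map(print)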

We pass the elements through step1, step2, and step3, and save the results into outputs.

outputs = pipeline | step1 | step2 | step3

This is equivalent to the example above.

outputs = (
    pipeline
    | step1
    | step2
    | step3
)

Also, Beam expects each transform, or step, to have a unique label, or description. This makes debugging much easier, and it's good practice in general. You can use the right shift operator >> to add a label to your transforms, like 'My description' >> MyTransform.

Try to give short but descriptive labels.

outputs = (
    pipeline
    | 'First step' >> step1
    | 'Second step' >> step2
    | 'Third step' >> step3
)
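
Here is the same shape with real transforms filled in for step1 through step3. This is just an illustrative sketch; the numbers and the filtering logic are invented.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    outputs = (
        pipeline
        | 'Create numbers' >> beam.Create([1, 2, 3, 4])
        | 'Keep evens' >> beam.Filter(lambda n: n % 2 == 0)
        | 'Square them' >> beam.Map(lambda n: n * n)
    )
    outputs | 'Print results' >> beam.Map(print)  # 4, 16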

Python transform catalog overview

Element-wise

Filter - Given a predicate, filter out all elements that don't satisfy the predicate.
FlatMap - Applies a function that returns a collection to every element in the input and outputs all resulting elements.
Keys - Extracts the key from each element in a collection of key-value pairs.
KvSwap - Swaps the key and value of each element in a collection of key-value pairs.
Map - Applies a function to every element in the input and outputs the result.
ParDo - The most general mechanism for applying a user-defined DoFn to every element in the input collection.
Partition - Routes each input element to a specific output collection based on some partition function.
Regex - Filters input string elements based on a regex. May also transform them based on the matching groups.
Reify - Transforms for converting between explicit and implicit forms of various Beam values.
RunInference - Uses machine learning (ML) models to do local and remote inference.
ToString - Transforms every element in an input collection to a string.
WithTimestamps - Applies a function to determine a timestamp for each element in the output collection, and updates the implicit timestamp associated with each input. Note that it is only safe to adjust timestamps forwards.
Values - Extracts the value from each element in a collection of key-value pairs.
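
As a sketch of how a few of these element-wise transforms compose (the input strings and the length threshold are made up for illustration):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    lines = pipeline | 'Create lines' >> beam.Create(['apache beam', 'parallel data processing'])
    (
        lines
        | 'Split into words' >> beam.FlatMap(str.split)         # one input element, many outputs
        | 'Drop short words' >> beam.Filter(lambda w: len(w) > 4)
        | 'Pair with length' >> beam.Map(lambda w: (w, len(w)))
        | 'Print pairs' >> beam.Map(print)                      # ('apache', 6), ...
    )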

Aggregation

ApproximateQuantiles - Not available. See BEAM-6694 for updates.
ApproximateUnique - Not available. See BEAM-6693 for updates.
CoGroupByKey - Takes several keyed collections of elements and produces a collection where each element consists of a key and all values associated with that key.
CombineGlobally - Transforms to combine elements.
CombinePerKey - Transforms to combine elements for each key.
CombineValues - Transforms to combine keyed iterables.
CombineWithContext - Not available.
Count - Counts the number of elements within each aggregation.
Distinct - Produces a collection containing distinct elements from the input collection.
GroupByKey - Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.
GroupBy - Takes a collection of elements and produces a collection grouped by properties of those elements. Unlike GroupByKey, the key is dynamically created from the elements themselves.
GroupIntoBatches - Batches the input into a desired batch size.
Latest - Gets the element with the latest timestamp.
Max - Gets the element with the maximum value within each aggregation.
Mean - Computes the average within each aggregation.
Min - Gets the element with the minimum value within each aggregation.
Sample - Randomly selects some number of elements from each aggregation.
Sum - Sums all the elements within each aggregation.
Top - Computes the largest element(s) in each aggregation.
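
A small sketch of keyed aggregation using transforms from this table (the sales data is invented):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    sales = pipeline | 'Create sales' >> beam.Create([
        ('apples', 3), ('bananas', 5), ('apples', 2),
    ])
    # GroupByKey would yield ('apples', [3, 2]); CombinePerKey folds the values instead.
    totals = sales | 'Sum per key' >> beam.CombinePerKey(sum)
    totals | 'Print totals' >> beam.Map(print)  # ('apples', 5), ('bananas', 5)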

Other

Create - Creates a collection from an in-memory list.
Flatten - Given multiple input collections, produces a single output collection containing all elements from all of the input collections.
PAssert - Not available.
Reshuffle - Given an input collection, redistributes the elements between workers. This is most useful for adjusting parallelism or preventing coupled failures.
View - Not available.
WindowInto - Logically divides up or groups the elements of a collection into finite windows according to a function.
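
For instance, Create and Flatten can be combined like this (a sketch; the element values are illustrative):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    first = pipeline | 'Create first' >> beam.Create([1, 2])
    second = pipeline | 'Create second' >> beam.Create([3, 4])
    # Flatten merges multiple PCollections into a single one.
    merged = (first, second) | 'Flatten them' >> beam.Flatten()
    merged | 'Print merged' >> beam.Map(print)  # 1, 2, 3, 4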

Refer to the official documentation for an up-to-date list.