Apache Beam

What is Apache Beam?

Apache Beam is a library for parallel data processing. Beam is commonly used for Extract, Transform, and Load (ETL), batch, and stream processing. It is particularly well suited to large datasets, since it can spread the work across multiple machines and process everything at the same time.

Apache Beam comprises four basic features:

  1. Pipeline - A Pipeline encapsulates the entire data processing task, from reading the input, through transforming it, to writing the output. Every Beam program starts by creating a Pipeline.
  2. PCollection - The Beam equivalent of an RDD or DataFrame in Spark. The pipeline creates an initial PCollection by reading data from a data source; further PCollections are produced as PTransforms are applied to it.
  3. PTransform - A processing step applied to a PCollection. Each PTransform produces a new PCollection, because PCollections are immutable: once constructed, you cannot modify individual elements. The elements of a PCollection can be of any type, but they must all be of the same type. To support distributed processing, Beam encodes each element as a byte string so that elements can be passed around to distributed workers.
  4. Runner - Determines where the pipeline will run, for example locally with the DirectRunner or on a distributed backend such as Flink, Spark, or Google Cloud Dataflow. A minimal example tying these four pieces together follows this list.
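
To make these four pieces concrete, here is a minimal sketch using the Beam Python SDK. It assumes the apache-beam package is installed; the element values and the print step are made up for illustration.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The Runner is chosen through pipeline options; DirectRunner executes locally.
options = PipelineOptions(runner='DirectRunner')

# The Pipeline owns the whole read-process-write cycle.
with beam.Pipeline(options=options) as pipeline:
    greetings = pipeline | beam.Create(['hello', 'world'])  # creates a PCollection
    shouted = greetings | beam.Map(str.upper)               # a PTransform yields a new PCollection
    shouted | beam.Map(print)                               # HELLO, WORLD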

In Beam, your data lives in a PCollection, which stands for Parallel Collection. A PCollection is like a list of elements, but without any order guarantees. This allows Beam to easily parallelize and distribute the PCollection's elements.

Once you have your data, the next step is to transform it. In Beam, you transform data using PTransforms, which stands for Parallel Transform. A PTransform is like a function: it takes some input, transforms it, and produces some output.
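
For example, beam.Map wraps an ordinary Python function as a PTransform. This is a small, self-contained sketch; the values and the conversion are illustrative only.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    miles = pipeline | beam.Create([1.0, 5.0, 10.0])
    # Map applies the function to each element and emits the result.
    kilometers = miles | beam.Map(lambda m: m * 1.60934)
    kilometers | beam.Map(print)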

We pass the elements through step1, step2, and step3, and save the results into outputs.

outputs = pipeline | step1 | step2 | step3

This is equivalent to the example above.

outputs = (
    pipeline
    | step1
    | step2
    | step3
)

Also, Beam expects each transform, or step, to have a unique label, or description. This makes debugging much easier, and it's good practice in general. You can use the right shift operator >> to add a label to your transforms, like 'My description' >> MyTransform.

Try to give short but descriptive labels.

outputs = (
    pipeline
    | 'First step' >> step1
    | 'Second step' >> step2
    | 'Third step' >> step3
)
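
Here is the same shape with real transforms filled in for step1 through step3. This is just an illustrative sketch; the numbers and the filtering logic are invented.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    outputs = (
        pipeline
        | 'Create numbers' >> beam.Create([1, 2, 3, 4])
        | 'Keep evens' >> beam.Filter(lambda n: n % 2 == 0)
        | 'Square them' >> beam.Map(lambda n: n * n)
    )
    outputs | 'Print results' >> beam.Map(print)  # 4, 16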

Python transform catalog overview

Element-wise

Filter - Given a predicate, filter out all elements that don't satisfy the predicate.
FlatMap - Applies a function that returns a collection to every element in the input and outputs all resulting elements.
Keys - Extracts the key from each element in a collection of key-value pairs.
KvSwap - Swaps the key and value of each element in a collection of key-value pairs.
Map - Applies a function to every element in the input and outputs the result.
ParDo - The most general mechanism for applying a user-defined DoFn to every element in the input collection.
Partition - Routes each input element to a specific output collection based on some partition function.
Regex - Filters input string elements based on a regex. May also transform them based on the matching groups.
Reify - Transforms for converting between explicit and implicit forms of various Beam values.
RunInference - Uses machine learning (ML) models to do local and remote inference.
ToString - Transforms every element in an input collection to a string.
WithTimestamps - Applies a function to determine a timestamp for each element in the output collection, and updates the implicit timestamp associated with each input. Note that it is only safe to adjust timestamps forwards.
Values - Extracts the value from each element in a collection of key-value pairs.
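
As a sketch of how a few of these element-wise transforms compose (the input strings and the length threshold are made up for illustration):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    lines = pipeline | 'Create lines' >> beam.Create(['apache beam', 'parallel data processing'])
    (
        lines
        | 'Split into words' >> beam.FlatMap(str.split)         # one input element, many outputs
        | 'Drop short words' >> beam.Filter(lambda w: len(w) > 4)
        | 'Pair with length' >> beam.Map(lambda w: (w, len(w)))
        | 'Print pairs' >> beam.Map(print)                      # ('apache', 6), ...
    )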

Aggregation

ApproximateQuantiles - Not available. See BEAM-6694 for updates.
ApproximateUnique - Not available. See BEAM-6693 for updates.
CoGroupByKey - Takes several keyed collections of elements and produces a collection where each element consists of a key and all values associated with that key.
CombineGlobally - Transforms to combine elements.
CombinePerKey - Transforms to combine elements for each key.
CombineValues - Transforms to combine keyed iterables.
CombineWithContext - Not available.
Count - Counts the number of elements within each aggregation.
Distinct - Produces a collection containing distinct elements from the input collection.
GroupByKey - Takes a keyed collection of elements and produces a collection where each element consists of a key and all values associated with that key.
GroupBy - Takes a collection of elements and produces a collection grouped by properties of those elements. Unlike GroupByKey, the key is dynamically created from the elements themselves.
GroupIntoBatches - Batches the input into a desired batch size.
Latest - Gets the element with the latest timestamp.
Max - Gets the element with the maximum value within each aggregation.
Mean - Computes the average within each aggregation.
Min - Gets the element with the minimum value within each aggregation.
Sample - Randomly selects some number of elements from each aggregation.
Sum - Sums all the elements within each aggregation.
Top - Computes the largest element(s) in each aggregation.
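
A small sketch of keyed aggregation using transforms from this table (the sales data is invented):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    sales = pipeline | 'Create sales' >> beam.Create([
        ('apples', 3), ('bananas', 5), ('apples', 2),
    ])
    # GroupByKey would yield ('apples', [3, 2]); CombinePerKey folds the values instead.
    totals = sales | 'Sum per key' >> beam.CombinePerKey(sum)
    totals | 'Print totals' >> beam.Map(print)  # ('apples', 5), ('bananas', 5)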

Other

Create - Creates a collection from an in-memory list.
Flatten - Given multiple input collections, produces a single output collection containing all elements from all of the input collections.
PAssert - Not available.
Reshuffle - Given an input collection, redistributes the elements between workers. This is most useful for adjusting parallelism or preventing coupled failures.
View - Not available.
WindowInto - Logically divides up or groups the elements of a collection into finite windows according to a function.
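
For instance, Create and Flatten can be combined like this (a sketch; the element values are illustrative):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    first = pipeline | 'Create first' >> beam.Create([1, 2])
    second = pipeline | 'Create second' >> beam.Create([3, 4])
    # Flatten merges multiple PCollections into a single one.
    merged = (first, second) | 'Flatten them' >> beam.Flatten()
    merged | 'Print merged' >> beam.Map(print)  # 1, 2, 3, 4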

Refer to the official documentation for an up-to-date list.