Beam - Future

Created 2024-10-05 | Big Data, Beam


Technology Evolution

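2006: Apache Hadoop is released, built on the MapReduce computation model
2009: the Spark computing framework is created at UC Berkeley; it is open-sourced in 2010 and becomes an Apache top-level project in 2014
- Spark processes data far more efficiently than Hadoop MapReduce, largely because it keeps intermediate results in memory instead of writing them to disk between stages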
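2014: Flink appears, unifying stream and batch processing; data Artisans, the company founded by Flink's creators, is acquired by Alibaba in early 2019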

Apache Beam

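Apache Beam is implemented on the Dataflow Model and is fully capable of unified batch and stream processing
Apache Beam provides an intermediate abstraction layer, so engineers do not need to learn the API and syntax of each new Runner, which cuts the time cost of adopting new technology
Runners can concentrate on optimizing performance and iterating on features without worrying about migration, since pipelines written against Beam do not have to be rewritten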
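
The idea of an intermediate abstraction layer can be sketched in plain Java. This is a toy model, not Beam's API: the `Runner` interface, `SequentialRunner`, `ParallelRunner`, and `lineLengths` are all hypothetical names made up for illustration. The point is only that the pipeline logic is written once against the abstraction, and the engine behind it can be swapped without touching that logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class PortablePipelineSketch {

    /** Hypothetical stand-in for an execution engine; NOT Beam's real Runner API. */
    interface Runner {
        <I, O> List<O> execute(List<I> input, Function<I, O> step);
    }

    /** Hypothetical engine that applies the step to one element at a time. */
    static class SequentialRunner implements Runner {
        @Override
        public <I, O> List<O> execute(List<I> input, Function<I, O> step) {
            List<O> output = new ArrayList<>();
            for (I element : input) {
                output.add(step.apply(element));
            }
            return output;
        }
    }

    /** Hypothetical engine that applies the step in parallel, keeping input order. */
    static class ParallelRunner implements Runner {
        @Override
        public <I, O> List<O> execute(List<I> input, Function<I, O> step) {
            return input.parallelStream().map(step).collect(Collectors.toList());
        }
    }

    /** Pipeline logic, written once against the abstraction and unaware of the engine. */
    static List<Integer> lineLengths(Runner runner, List<String> lines) {
        return runner.execute(lines, String::length);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be", "or not to be");
        // Swapping the engine does not require any change to lineLengths.
        System.out.println(lineLengths(new SequentialRunner(), lines));
        System.out.println(lineLengths(new ParallelRunner(), lines));
    }
}
```

In this sketch, each engine is free to change how it executes a step (for example, going parallel) as long as it honors the `Runner` contract, which mirrors the claim that Runners can evolve independently of user pipelines.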

Beam Runner

Runners iterate very quickly - Flink, for example

Author: zhongmingmao

Link: https://blog.zhongmingmao.top/2024/10/05/bigdata-beam-future/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
