Advancements in machine learning, data analytics, and IoT, together with the strategic shift of businesses towards real-time, data-driven decision making, are increasing the demand for stream processing. Apache Kafka and Kafka Streams are rising in popularity as a solution for building streaming data processing platforms.
Natural language processing (NLP) can retrieve valuable information from texts and is a typical task tackled on such platforms.
TL;DR We implemented common base functionality for building production NLP pipelines. We use Avro, support large messages, handle errors, add modern monitoring, and combine powerful Python libraries with Java. Read on to learn about the technical foundation we created and the libraries we share.
Because Kafka Streams, the most popular client library for Kafka, is developed for Java, many applications in Kafka pipelines are written in Java. However, several popular NLP libraries, such as spaCy, NLTK, and Gensim, are developed in Python and have large communities behind them. Thus, it is desirable to use Python and such NLP libraries as part of streaming data pipelines to achieve excellent results. Consequently, streaming applications written in Java and Python must work together seamlessly in a modern streaming data pipeline.
Combining Python and Java streaming applications implies that the applications are decoupled, which makes deployment, error handling, and monitoring more complex. Moreover, the consistency and (de-)serialization of the data consumed and produced by the applications has to be language-agnostic.
In this blog post, we illustrate how we laid down the foundation to develop NLP pipelines with Python, Java, and Apache Kafka. We discuss all the above-mentioned challenges and showcase common utility functions and base classes to develop Python and Java streaming applications, which work together smoothly. We cover the following topics:
- Developing, configuring, and deploying Kafka applications written in Python and Java on Kubernetes
- Using Avro for serialization in Java and Python streaming applications
- Managing errors in a common way using standardized dead letters
- Handling large text messages in Kafka using a custom s3-backed SerDe
- Monitoring streaming data processing platform using Grafana and Prometheus
(Example NLP Pipeline)
Example NLP Pipeline with Java and Python, and Apache Kafka
As an example, for this blog post, we set up a streaming data pipeline in Apache Kafka: We assume that we receive text messages in the input topic serialized with Avro.
If you want to reproduce the example, you can find the demo code for the Producer and the Loader, which write messages to the input topic, in the GitHub repository.
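To get a feeling for what lands in the input topic, the following minimal producer sketch writes one such Avro message. It assumes the confluent-kafka[avro] Python package, a local broker and Schema Registry, and the text-topic name used in the deployment below; the actual Producer in the demo repository may look different.

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Avro schema of the input topic (see the schema section below)
value_schema = avro.loads("""
{
  "type": "record",
  "name": "Text",
  "namespace": "com.bakdata.kafka",
  "fields": [{"name": "content", "type": "string"}]
}
""")
key_schema = avro.loads('"string"')

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",           # assumption: local broker
        "schema.registry.url": "http://localhost:8081",  # assumption: local Schema Registry
    },
    default_key_schema=key_schema,
    default_value_schema=value_schema,
)

producer.produce(topic="text-topic", key="doc-1", value={"content": "Text to be lemmatized."})
producer.flush()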
The pipeline consists of two steps that are implemented in two separate streaming applications:
- spaCy Lemmatizer: A Python streaming application that extracts non-stop-word lemmas using spaCy.
- TFIDF-Application: A Java streaming application that receives documents as a list of lemmas and calculates the corresponding TFIDF scores.
To develop the Lemmatizer in Python, we use Faust, a library that aims to port Kafka Streams’ ideas to Python. In this example, spaCy extracts the lemmas. For the Java application, we use the well-known Kafka Streams library.
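We will not go into the NLP details here, but for context, extracting non-stop-word lemmas with spaCy boils down to something like the following sketch. The model name is an assumption, and the demo’s actual implementation may differ.

import spacy

# assumption: the small English model was installed via `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")


def extract_lemmas(text: str) -> list:
    # keep the lemma of every token that is neither a stop word nor punctuation
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]


print(extract_lemmas("Apache Kafka and Kafka Streams experience rising popularity."))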
To deploy Kafka on Kubernetes, you can use the Confluent Platform Helm Charts. For this example, we need the following services: ZooKeeper, Kafka brokers, and the Confluent Schema Registry.
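A minimal sketch of such an installation, assuming the public cp-helm-charts repository, Helm 3 syntax, and the release name my-confluent (which matches the service names in the values.yaml shown later), could look like this:

$ helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts/
$ helm repo update
$ helm install my-confluent confluentinc/cp-helm-charts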
In the following sections, we focus on the most important parts of the development, configuration, and deployment of the Python and the Java application to build the example. We will not discuss the lemma extraction using spaCy and the TFIDF calculation in Java. You can find the entire code for this example NLP pipeline in the demo repository.
(Using Common Configuration Options to Develop Streaming Applications in Python and Java)
Kafka streaming applications require a minimum set of configurations to launch. A typical streaming application needs information about the input topics, the output topic, the brokers, the schema registry URL, etc. These configurations are mandatory for stream processing applications using Kafka Streams in Java or Faust in Python.
To align Kafka Streams and Faust deployment configurations, we built utility functions and base classes for both application frameworks.
- The Java library for Kafka Streams can be found here: common-kafka-streams
- The Python package for Faust is here: faust-bootstrap
They introduce a common way:
- to configure Kafka streaming applications,
- to deploy applications into a Kafka cluster on Kubernetes via our common Helm Chart using a standardized values.yaml,
- and to reprocess data.
The common configuration from the Helm chart is passed as environment variables and set as matching command-line arguments. This configuration includes:
- brokers: List of Kafka brokers
- input-topics: List of input topics
- output-topic: The output topic
- error-topic: A topic to write errors to
- schema-registry-url: The URL of the schema registry
- clean-up: Whether the state store and the Kafka offsets for the consumer group should be cleared
- delete-output: Whether the output topic should be deleted during the cleanup
We now demonstrate how we can easily spin up applications using those libraries.
(Python — Faust-Bootstrap Application)
To use faust-bootstrap with our spaCy Lemmatizer application, we create a LemmatizerApp class that inherits from FaustApplication:
from faust import TopicT
from faust.serializers.codecs import codecs
from faust_bootstrap.core.app import FaustApplication


class LemmatizerApp(FaustApplication):
    input_topics: TopicT
    output_topic: TopicT

    def __init__(self):
        super(LemmatizerApp, self).__init__()

    def get_unique_app_id(self):
        return f'spacy-lemmatizer-{self.output_topic_name}'

    def _register_parameters(self):
        pass

    def setup_topics(self):
        schema_input = self.create_schema_from(codecs["raw"], codecs["raw"], bytes, bytes)
        schema_output = self.create_schema_from(codecs["raw"], codecs["raw"], bytes, bytes)
        self.input_topics = self.get_topic_from_schema(self.input_topic_names, schema_input)
        self.output_topic = self.get_topic_from_schema(self.output_topic_name, schema_output)

    def build_topology(self):
        agent = create_spacy_agent(self.output_topic)
        self.create_agent(agent, self.input_topics)
Every faust-bootstrap application has to implement the abstract methods get_unique_app_id(), setup_topics(), build_topology(), and _register_parameters().
We set the Id of the application in get_unique_app_id(). In _register_parameters(), we can add additional parameters that are exposed as configuration options in the deployment. In setup_topics(), we configure the input and output topics; message keys and values are defined to be bytes for now.
Finally, we provide the topology of our streaming application. Faust uses agents to process streams. We can either use Faust’s app.agent decorator or register our agents in the build_topology() method. Here, we register our agent in build_topology(). To do so, we define the basis of our agent as an inner function of another function that expects the topics our agent requires:
import faust


def create_spacy_agent(output_topic):
    async def spacy_agent(stream: faust.Stream[Text]):
        async for key, text in stream.items():
            # lemmatization with spaCy will be implemented here
            yield await output_topic.send(key=key, value=text)
    return spacy_agent
To start the application, we run:
app = LemmatizerApp()

if __name__ == '__main__':
    app.start_app()
The base for our streaming application is now ready to start and can be configured either by using environment variables or command-line arguments.
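For a local test run, the configuration could be passed as command-line arguments roughly as follows; the module name is hypothetical and faust-bootstrap may spell the flags slightly differently, so check its documentation:

$ python lemmatizer_app.py --brokers kafka://localhost:9092 \
    --schema-registry-url http://localhost:8081 \
    --input-topics text-topic \
    --output-topic lemmatized-text-topic \
    --error-topic error-topic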
(Java — Common-Kafka-Streams Application)
Creating the basis for our TFIDF application with our common-kafka-streams is easy:
package com.bakdata.kafka;

import com.bakdata.common_kafka_streams.KafkaStreamsApplication;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class TFIDFApplication extends KafkaStreamsApplication {

    public static void main(final String[] args) {
        startApplication(new TFIDFApplication(), args);
    }

    @Override
    public void buildTopology(final StreamsBuilder builder) {
        final KStream<String, String> input =
                builder.<String, String>stream(this.getInputTopics());
        // TFIDF calculation will be implemented here
        input.to(this.getOutputTopic());
    }

    @Override
    public String getUniqueAppId() {
        return "tf-idf";
    }
}
We create a subclass of KafkaStreamsApplication and implement the abstract methods buildTopology() and getUniqueAppId(). We define the topology of our application in buildTopology() and set the application Id by overriding getUniqueAppId(). By calling startApplication(new TFIDFApplication(), args); in the main method, the command-line arguments are populated with the values from matching environment variables. Like our Faust-Bootstrap application, the application is now ready to start and can be configured either by using environment variables or command-line arguments.
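Assuming the application is packaged as a runnable jar (the jar and output topic names below are placeholders), a local run with the common command-line options could look like this:

$ java -jar tfidf-application.jar \
    --brokers localhost:9092 \
    --schema-registry-url http://localhost:8081 \
    --input-topics lemmatized-text-topic \
    --output-topic tfidf-topic \
    --error-topic error-topic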
(Deployment of our Applications)
To deploy the applications, we use our common-kafka-streams Helm Charts. You need to add our Helm Charts repo and update the repo information by running:
$ helm repo add bakdata-common https://raw.githubusercontent.com/\
bakdata/common-kafka-streams/master/charts/
$ helm repo update
Now, you have to build and push Docker images for the applications written in Java or Python to your preferred Docker registry.
For the Java application, we build the Docker image with Jib. For Faust applications, we use a custom Dockerfile.
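Independent of how the image is built, pushing it to the registry is the usual Docker workflow; registry and tag are placeholders, and the image name matches the image value used in the values.yaml below:

$ docker build -t <registry>/spacy-lemmatizer:<tag> .
$ docker push <registry>/spacy-lemmatizer:<tag>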
Then, we configure the streaming application deployment with Helm. In values.yaml files, you can set the configuration options. For example, the file for the spaCy Lemmatizer in the demo pipeline looks like this:
nameOverride: nlp-spacy-lemmatizer
replicaCount: 5
image: spacy-lemmatizer

streams:
  brokers: "kafka://my-confluent-cp-kafka-headless:9092"
  schemaRegistryUrl: "http://my-confluent-cp-schema-registry:8081"
  inputTopics:
    - text-topic
  outputTopic: lemmatized-text-topic
  errorTopic: error-topic
Common configurations appear in the streams section. You can find the default values and some example parameters here.
The Helm Chart populates the corresponding environment variables to match those defined by the common configuration options of common-kafka-streams and faust-bootstrap. Therefore, the deployment of both applications, Faust and Kafka Streams, to Kafka on Kubernetes is as simple as this:
$ helm upgrade --debug --install --force --values values.yaml <release_name> bakdata-common/streams-app
The deployed pod launches our streaming application, which is configured just as defined in the values.yaml.
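You can verify that the pods are running and inspect the application logs with kubectl; the pod name is a placeholder:

$ kubectl get pods
$ kubectl logs <pod_name>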
(Reprocessing)
At some point during the development, you might want to process all input data again. Essentially, you want to reset the application state. Common-kafka-streams and faust-bootstrap applications can reset all offsets for the consumer group, as well as the application state. We provide a Helm Chart to run such cleanup: bakdata-common/streams-app-cleanup-job. You can start it with:
$ helm upgrade --debug --install --force --values values.yaml <release_name> bakdata-common/streams-app-cleanup-job
If you want to delete all involved topics (output and internal) and reset the schema registry during the cleanup, you can set streams.deleteOutput=true:
$ helm upgrade --debug --install --force --values values.yaml <release_name> bakdata-common/streams-app-cleanup-job --set streams.deleteOutput=true
That is important if the schema becomes incompatible. However, remember that deleting the output is risky if you do not also control the downstream applications.
(Using Avro for Serialization in Faust and Kafka Streams)
Avro schemas allow you to validate, serialize, and deserialize the data passed between different streaming applications. In our example with Python and Java, Avro is particularly important because we require standardized serialization. Such common ground speeds up and simplifies error-resistant development.
For our example, we define the following Avro Schema for the input topic:
{
  "fields": [
    {
      "name": "content",
      "type": "string"
    }
  ],
  "name": "Text",
  "namespace": "com.bakdata.kafka",
  "type": "record"
}
The Avro Schema for the topic the spaCy Lemmatizer application writes to and the TFIDF application reads from is:
{
  "fields": [
    {
      "name": "lemmas",
      "type": {
        "items": "string",
        "type": "array"
      }
    }
  ],
  "name": "LemmaText",
  "namespace": "com.bakdata.kafka",
  "type": "record"
}
Finally, the output topic receives messages using the following Avro Schema:
{
  "type": "record",
  "name": "TfIdfScore",
  "namespace": "com.bakdata.kafka.tfidf",
  "fields": [
    {
      "name": "term",
      "type": "string"
    },
    {
      "name": "tf_idf",
      "type": "double"
    }
  ]
}
(Avro in Faust-Bootstrap Applications)
To use Avro for serialization in Faust applications, we introduce faust_avro_serializer. Faust-bootstrap uses the faust_avro_serializer for message values by default. Faust uses models to describe the data structures of keys and values. The following defines the models for the input and output topics of our spaCy Lemmatizer application:
from typing import List

from faust import Record


class Text(Record):
    _schema = {
        "fields": [
            {
                "name": "content",
                "type": {
                    "avro.java.string": "String",
                    "type": "string"
                }
            }
        ],
        "name": "Text",
        "namespace": "com.bakdata.kafka",
        "type": "record"
    }
    content: str


class LemmaText(Record):
    _schema = {
        "fields": [
            {
                "name": "lemmas",
                "type": {
                    "items": "string",
                    "type": "array"
                }
            }
        ],
        "name": "LemmaText",
        "namespace": "com.bakdata.kafka",
        "type": "record"
    }
    lemmas: List[str] = []
The faust_avro_serializer uses the Avro schema definition in the _schema attribute of the Faust models to serialize messages. Here, we set the schema definitions shown before. Finally, we configure our application to use these schemata for (de-)serialization. We can easily do this via setup_topics() in the LemmatizerApp class:
...
def setup_topics(self):
    value_serializer_input = self.create_avro_serde(self.input_topic_names[0], False)
    value_serializer_output = self.create_avro_serde(self.output_topic_name, False)

    schema_input = self.create_schema_from(codecs["raw"], value_serializer_input, bytes, Text)
    schema_output = self.create_schema_from(codecs["raw"], value_serializer_output, bytes, LemmaText)

    self.input_topics = self.get_topic_from_schema(self.input_topic_names, schema_input)
    self.output_topic = self.get_topic_from_schema(self.output_topic_name, schema_output)
...
(Avro in CommonKafkaStreams Applications)
Avro is commonly used in Kafka Streams. Our Java application expects LemmaText objects. Keys are simple Strings. To deserialize the input, we add additional configuration to the KafkaProperties:
...
@Override
public void buildTopology(final StreamsBuilder builder) {
    final KStream<String, List<String>> lemmaTexts =
            builder.<String, LemmaText>stream(this.getInputTopics()).mapValues(LemmaText::getLemmas);
    ...
}

@Override
protected Properties createKafkaProperties() {
    final Properties kafkaProperties = super.createKafkaProperties();
    kafkaProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, StringSerde.class);
    final Class<?> valueSerde = SpecificAvroSerde.class;
    kafkaProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, valueSerde);
    return kafkaProperties;
}
...
Now the Kafka Streams application uses Avro to deserialize the messages from the input topic. Avro ensures compatible messages in the topic that the Python Faust application writes to and the Java Kafka Streams application reads from.
(Error Handling)
Sooner or later, you will experience exceptions in your streaming applications written in Python or Java. For example, errors often occur at the beginning of a pipeline, where unparsable raw data arrives. Kubernetes then restarts the crashed container, and the application retries the processing. If the error was temporary, the processing continues. If the exception relates to a specific corrupt message, however, your streaming application will repeatedly fail and block the pipeline. Thus, you may want to process other incoming data instead of stopping the processing completely.
To tackle this, we developed common error handling for Kafka Streams and Faust applications. For Kafka Streams, we published a separate library: kafka-error-handling. For Faust, the error handling functions are included in faust-bootstrap.
For both libraries, the error handling works analogously. Successfully processed messages are sent to the output topic. If an error occurs, it is sent to a given error topic and does not cause the application to crash. The error topic contains dead letters that follow a common Avro schema. A dead letter describes the input value that caused the error, the respective error message, and the exception stack trace. This simplifies the investigation of the issue, and the processing is not interrupted while the error is being resolved.
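To illustrate, a dead letter record carries roughly the following fields. This is only a sketch derived from the description above; the exact schema shipped with kafka-error-handling and faust-bootstrap may differ:

{
  "type": "record",
  "name": "DeadLetter",
  "namespace": "com.bakdata.kafka",
  "fields": [
    {"name": "input_value", "type": ["null", "string"], "doc": "the value that caused the error"},
    {"name": "description", "type": "string", "doc": "a human-readable description of the processing step"},
    {"name": "error_message", "type": "string", "doc": "the exception message"},
    {"name": "stack_trace", "type": "string", "doc": "the exception stack trace"}
  ]
}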
As mentioned above, both common-kafka-streams and faust-bootstrap include an error-topic configuration option, which refers to the error topic used for error handling.
(Error Handling in Faust-Bootstrap Applications)
To configure the error topic to use the dead letter schema, we add the following to setup_topics() in our LemmatizerApp class:
...
def setup_topics(self):
    ...
    value_serializer_error = self.create_avro_serde(self.error_topic_name, False)
    schema_error = self.create_schema_from(codecs["raw"], value_serializer_error, bytes, DeadLetter)
    if self.error_topic_name:
        self.error_topic = self.get_topic_from_schema(self.error_topic_name, schema_error)
...
To handle errors, we extend our agent as follows:
...
from faust_bootstrap.core.streams.agent_utils import capture_errors, forward_error, \
    create_deadletter_message


def create_spacy_agent(output_topic, error_topic):
    async def spacy_agent(stream: faust.Stream[Text]):
        async for key, text in stream.items():
            processed = await capture_errors(text, process_text)
            if isinstance(processed, Exception):
                await forward_error(error_topic)
                yield await error_topic.send(key=key, value=create_deadletter_message(processed, text, 'Could not process text'))
            elif isinstance(processed, Record):
                yield await output_topic.send(key=key, value=processed)
    return spacy_agent
create_spacy_agent now additionally expects the error topic as an argument. process_text is a method that handles the processing of the text using spaCy; you can find its definition in the demo repository. To capture whether process_text threw an error, we use capture_errors. If the returned value is an exception, we check whether the error should cause the application to crash (forward_error) or whether a dead letter should be sent to the error topic.
(Error Handling in Common Kafka Streams Applications)
We also integrated error handling into the Kafka Streams DSL. Consider that we want to map input data using a KeyValueMapper called mapper. Using kafka-error-handling, we can easily capture all errors that may occur when mapping the data:
@Override
public void buildTopology(final StreamsBuilder builder) {
    final KeyValueMapper<String, String, KeyValue<Double, Long>> mapper = …
    final KStream<String, String> input =
            builder.<String, String>stream(this.getInputTopic());

    final KStream<Double, ProcessedKeyValue<String, String, Long>> mappedWithErrors =
            input.map(ErrorCapturingFlatKeyValueMapper.captureErrors(mapper));
    mappedWithErrors.flatMap(ProcessedKeyValue::getErrors)
            .mapValues(error -> error.createDeadLetter("Description for the pipeline error"))
            .to(this.getErrorTopic());

    final KStream<Double, Long> mapped = mappedWithErrors.flatMapValues(ProcessedKeyValue::getValues);
    mapped.to(this.getOutputTopic());
}
Like in the Python example, we capture the mapper errors. For each error, we create a dead letter and send it to the error topic. Successfully processed values are sent to the output topic.
(Processing Large Text Files in Kafka)
Text messages tend to be very large. However, Apache Kafka limits the maximum size of messages sent to a topic. Although the limit is configurable, some messages can eventually exceed it. We recently discussed this problem and how we solved it with our custom s3-backed SerDe, which stores large messages on Amazon S3.
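For reference, the relevant topic setting is max.message.bytes (roughly 1 MB by default). It can be raised, for example with the Kafka CLI as sketched below (older Kafka versions require --zookeeper instead of --bootstrap-server), but large text documents can still exceed any fixed bound:

$ kafka-configs --bootstrap-server <broker> --alter \
    --entity-type topics --entity-name text-topic \
    --add-config max.message.bytes=5242880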
For our example pipeline, we now assume that the spaCy lemmatizer receives s3-backed messages. Clearly, the same applies to the lemmatizer output topic. Therefore, our Java and Python applications must support s3-backed SerDe. In the following, we introduce s3-backed SerDe support in our NLP pipeline to support large texts in Apache Kafka.
(s3-backed SerDe in Faust-Boostrap Applications)
To use s3-backed SerDe in our Faust application, we need to register a new serializer that uses the faust-s3-backed-serializer. To determine whether our application should use s3-backed SerDe, we add an additional parameter via the above-mentioned _register_parameters():
class LemmatizerApp(FaustApplication):
    ...
    s3_serde: bool
    ...

    def _register_parameters(self):
        self.register_parameter("--s3-serde", False, "Activate s3-backed SerDe as serializer", bool, default=False)

    def create_s3_serde(self, topic: Union[str, List[str]]):
        value_s3_serializer = self.create_s3_backed_serde(topic, self._generate_streams_config())
        value_avro = self.create_avro_serde(topic, False)
        return value_avro | value_s3_serializer

    def create_serde(self, topic: Union[str, List[str]]):
        if self.s3_serde:
            return self.create_s3_serde(topic)
        else:
            return self.create_avro_serde(topic, False)

    def setup_topics(self):
        value_serializer_input = self.create_serde(self.input_topic_names[0])
        value_serializer_output = self.create_serde(self.output_topic_name)
        ...

    @staticmethod
    def create_s3_backed_serde(topic: str, s3_config: Dict[str, str], is_key: bool = False):
        base_path = s3_config.get("s3backed.base.path")
        max_size = int(s3_config.get("s3backed.max.byte.size"))
        region_name = s3_config.get("s3backed.region")
        faust_s3_serializer = S3BackedSerializer(topic, base_path, region_name, None, max_size,
                                                 is_key)
        return faust_s3_serializer
We configure s3-backed SerDe in create_s3_backed_serde(...) and create the serializer in create_s3_serde(...). The returned combination ensures that faust_avro_serializer is used as the first serializer, followed by the s3_backed_serializer to handle messages exceeding the configured maximum message size.
(s3-backed SerDe in Common Kafka Streams Applications)
To add s3-backed SerDe to the Java application, we use kafka-s3-backed-serde. To deal with s3-backed messages in the input, we add the following to the TFIDFApplication properties:
...
@Override
protected Properties createKafkaProperties() {
    final Properties kafkaProperties = super.createKafkaProperties();
    kafkaProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, StringSerde.class);
    final Class<?> valueSerde = SpecificAvroSerde.class;
    if (this.useS3) {
        kafkaProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, S3BackedSerde.class);
        kafkaProperties.put(S3BackedSerdeConfig.VALUE_SERDE_CLASS_CONFIG, valueSerde);
    } else {
        kafkaProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, valueSerde);
    }
    return kafkaProperties;
}
...
Like in the Faust application, we use Avro to serialize values. We set the S3BackedSerde as the default value SerDe and configure Avro as its underlying value SerDe.
(Deployment with s3-backed SerDe)
To configure the s3-backed SerDe, we add the following value to the values.yaml for both applications:
...
streams:
  ...
  config:
    s3backed.base.path: "s3://my-bucket"
streams.config.s3backed.base.path configures the base path in Amazon S3 for storing the s3-backed messages. Additionally, we need to configure our Amazon S3 credentials as environment variables for both deployments. We can either do this in the values.yaml or when running the Helm upgrade command:
helm upgrade --debug --install --force --values values.yaml <release_name> bakdata-common/streams-app \
  --set env.AWS_ACCESS_KEY_ID=<access_key_id> \
  --set env.AWS_SECRET_ACCESS_KEY=<access_key> \
  --set env.AWS_REGION=<region>
(Monitoring)
Monitoring the data flow in sophisticated streaming data platforms is crucial. Great monitoring solutions exist for Kubernetes. Besides Kubernetes, the Cloud Native Computing Foundation also hosts Prometheus as its second project. Prometheus was built explicitly to monitor systems by recording real-time metrics. For visualization, Prometheus recommends Grafana. Prometheus can be set up as a data source for Grafana, and we can then query its metrics to build comprehensive dashboards. Moreover, several dashboards for Apache Kafka already exist and can be imported to start monitoring data pipelines quickly.
Apache Kafka uses JMX as the default reporter for metrics. The Confluent Platform Helm charts install the Prometheus JMX Exporter as a sidecar in all affected pods, and JMX metrics are enabled by default for all components. Our Helm Charts also allow deploying the Prometheus JMX Exporter alongside our streaming applications and thus exposing their JMX metrics.
Setting up the monitoring for our NLP pipeline in the Kubernetes cluster is straightforward. We first deploy Prometheus and Grafana. You can add them using Helm:
helm install stable/prometheus
helm install stable/grafana
After setting up Prometheus as a data source in Grafana, you can import your desired dashboard into Grafana. For example, the Confluent Platform Helm Charts provide a Grafana dashboard you can use.
Additionally, we use the Kafka Lag Exporter to view consumer group metrics. The consumer group lag is the difference between the latest message produced to a partition and the last message committed by the consumer group; it tells us how far the processing in our NLP application lags behind data ingestion. This is crucial for real-time data processing.
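For a quick ad-hoc check of the same numbers, Kafka's consumer group tool describes current offsets, log end offsets, and the resulting lag per partition; the group name below follows the Id pattern from get_unique_app_id() above:

$ kafka-consumer-groups --bootstrap-server <broker> --describe \
    --group spacy-lemmatizer-lemmatized-text-topic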
Finally, we added custom dashboards to monitor the data flow. As an example, we monitor the number of messages, and the number of bytes transferred:
Monitoring the Number of Messages and Bytes Transferred In and Out of Topics in the Overall Pipeline or Specific Topics
(Conclusion)
Apache Kafka is a state-of-the-art stream processing platform. NLP is a common task in streaming data pipelines that often requires combining popular Python packages with Java to accomplish excellent results. We built tools and libraries for NLP pipelines at scale, where Java and Python interoperate seamlessly in Apache Kafka.
Now that you have made it to the conclusion, you have learned that it takes several connecting pieces, including configuration, deployment, serialization, and error handling, to run language-agnostic streaming data pipelines in production.
Thank you for reading.
Thanks to Christoph Böhm, Alexander Albrecht, Philipp Schirmer, and Benjamin Feldmann
Source: https://medium.com/bakdata/continuous-nlp-pipelines-with-python-java-and-apache-kafka-f6903e7e429d