Elasticsearch Connector

This connector provides sinks that can request document actions to an Elasticsearch index. To use this connector, add one of the following dependencies to your project, depending on the version of your Elasticsearch installation:

Maven Dependency                      Supported since    Elasticsearch version
flink-connector-elasticsearch_2.11    1.0.0              1.x
flink-connector-elasticsearch2_2.11   1.0.0              2.x
flink-connector-elasticsearch5_2.11   1.3.0              5.x
flink-connector-elasticsearch6_2.11   1.6.0              6 and later versions

Note that the streaming connectors are currently not part of the binary distribution. See here for information about how to package the program with these libraries for cluster execution.

Installing Elasticsearch

Instructions for setting up an Elasticsearch cluster can be found here. Make sure to set and remember a cluster name. This must be set when creating an ElasticsearchSink for requesting document actions against your cluster.

Elasticsearch Sink

The ElasticsearchSink uses a TransportClient (before 6.x) or RestHighLevelClient (starting with 6.x) to communicate with an Elasticsearch cluster.

The examples below show how to configure and create a sink (shown for Elasticsearch 1.x, 5.x, and 6.x, first in Java and then in Scala):

Elasticsearch 1.x (Java):

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.transport.TransportAddress;

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

DataStream<String> input = ...;

Map<String, String> config = new HashMap<>();
config.put("cluster.name", "my-cluster-name");
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1");

List<TransportAddress> transportAddresses = new ArrayList<>();
transportAddresses.add(new InetSocketTransportAddress("127.0.0.1", 9300));
transportAddresses.add(new InetSocketTransportAddress("10.2.3.1", 9300));

input.addSink(new ElasticsearchSink<>(config, transportAddresses, new ElasticsearchSinkFunction<String>() {
    public IndexRequest createIndexRequest(String element) {
        Map<String, String> json = new HashMap<>();
        json.put("data", element);

        return Requests.indexRequest()
                .index("my-index")
                .type("my-type")
                .source(json);
    }

    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}));
Elasticsearch 5.x (Java):

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

DataStream<String> input = ...;

Map<String, String> config = new HashMap<>();
config.put("cluster.name", "my-cluster-name");
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1");

List<InetSocketAddress> transportAddresses = new ArrayList<>();
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300));
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("10.2.3.1"), 9300));

input.addSink(new ElasticsearchSink<>(config, transportAddresses, new ElasticsearchSinkFunction<String>() {
    public IndexRequest createIndexRequest(String element) {
        Map<String, String> json = new HashMap<>();
        json.put("data", element);

        return Requests.indexRequest()
                .index("my-index")
                .type("my-type")
                .source(json);
    }

    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}));
Elasticsearch 6.x and above (Java):

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

DataStream<String> input = ...;

List<HttpHost> httpHosts = new ArrayList<>();
httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"));
httpHosts.add(new HttpHost("10.2.3.1", 9200, "http"));

// use a ElasticsearchSink.Builder to create an ElasticsearchSink
ElasticsearchSink.Builder<String> esSinkBuilder = new ElasticsearchSink.Builder<>(
    httpHosts,
    new ElasticsearchSinkFunction<String>() {
        public IndexRequest createIndexRequest(String element) {
            Map<String, String> json = new HashMap<>();
            json.put("data", element);

            return Requests.indexRequest()
                    .index("my-index")
                    .type("my-type")
                    .source(json);
        }

        @Override
        public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
            indexer.add(createIndexRequest(element));
        }
    }
);

// configuration for the bulk requests; this instructs the sink to emit after every element, otherwise they would be buffered
esSinkBuilder.setBulkFlushMaxActions(1);

// provide a RestClientFactory for custom configuration on the internally created REST client
esSinkBuilder.setRestClientFactory(
  restClientBuilder -> {
    restClientBuilder.setDefaultHeaders(...)
    restClientBuilder.setMaxRetryTimeoutMillis(...)
    restClientBuilder.setPathPrefix(...)
    restClientBuilder.setHttpClientConfigCallback(...)
  }
);

// finally, build and add the sink to the job's pipeline
input.addSink(esSinkBuilder.build());
Elasticsearch 1.x (Scala):

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer

import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.elasticsearch.common.transport.InetSocketTransportAddress
import org.elasticsearch.common.transport.TransportAddress

import java.net.InetAddress
import java.util.ArrayList
import java.util.HashMap
import java.util.List
import java.util.Map

val input: DataStream[String] = ...

val config = new java.util.HashMap[String, String]
config.put("cluster.name", "my-cluster-name")
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1")

val transportAddresses = new java.util.ArrayList[TransportAddress]
transportAddresses.add(new InetSocketTransportAddress("127.0.0.1", 9300))
transportAddresses.add(new InetSocketTransportAddress("10.2.3.1", 9300))

input.addSink(new ElasticsearchSink(config, transportAddresses, new ElasticsearchSinkFunction[String] {
  def createIndexRequest(element: String): IndexRequest = {
    val json = new java.util.HashMap[String, String]
    json.put("data", element)

    return Requests.indexRequest()
            .index("my-index")
            .`type`("my-type")
            .source(json)
  }

  override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
    indexer.add(createIndexRequest(element))
  }
}))
Elasticsearch 5.x (Scala):

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer
import org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink

import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests

import java.net.InetAddress
import java.net.InetSocketAddress
import java.util.ArrayList
import java.util.HashMap
import java.util.List
import java.util.Map

val input: DataStream[String] = ...

val config = new java.util.HashMap[String, String]
config.put("cluster.name", "my-cluster-name")
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1")

val transportAddresses = new java.util.ArrayList[InetSocketAddress]
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300))
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("10.2.3.1"), 9300))

input.addSink(new ElasticsearchSink(config, transportAddresses, new ElasticsearchSinkFunction[String] {
  def createIndexRequest(element: String): IndexRequest = {
    val json = new java.util.HashMap[String, String]
    json.put("data", element)

    return Requests.indexRequest()
            .index("my-index")
            .`type`("my-type")
            .source(json)
  }

  override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
    indexer.add(createIndexRequest(element))
  }
}))
Elasticsearch 6.x and above (Scala):

import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink

import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests

import java.util.ArrayList
import java.util.List

val input: DataStream[String] = ...

val httpHosts = new java.util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("127.0.0.1", 9300, "http"))
httpHosts.add(new HttpHost("10.2.3.1", 9300, "http"))

val esSinkBuilder = new ElasticsearchSink.Builder[String](
  httpHosts,
  new ElasticsearchSinkFunction[String] {
    def createIndexRequest(element: String): IndexRequest = {
      val json = new java.util.HashMap[String, String]
      json.put("data", element)

      return Requests.indexRequest()
              .index("my-index")
              .`type`("my-type")
              .source(json)
    }

    override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
      indexer.add(createIndexRequest(element))
    }
  }
)

// configuration for the bulk requests; this instructs the sink to emit after every element, otherwise they would be buffered
esSinkBuilder.setBulkFlushMaxActions(1)

// provide a RestClientFactory for custom configuration on the internally created REST client
esSinkBuilder.setRestClientFactory(
  restClientBuilder => {
    restClientBuilder.setDefaultHeaders(...)
    restClientBuilder.setMaxRetryTimeoutMillis(...)
    restClientBuilder.setPathPrefix(...)
    restClientBuilder.setHttpClientConfigCallback(...)
  }
)

// finally, build and add the sink to the job's pipeline
input.addSink(esSinkBuilder.build)

For Elasticsearch versions that still use the now deprecated TransportClient to communicate with the Elasticsearch cluster (i.e., versions equal to or below 5.x), note how a Map of Strings is used to configure the ElasticsearchSink. This config map will be forwarded directly when creating the internally used TransportClient. The configuration keys are documented in the Elasticsearch documentation here. Especially important is the cluster.name parameter, which must correspond to the name of your cluster.

For Elasticsearch 6.x and above, the RestHighLevelClient is used internally for cluster communication. By default, the connector uses the default configuration of the REST client. To apply a custom configuration to the REST client, users can provide a RestClientFactory implementation when setting up the ElasticsearchSink.Builder that builds the sink.

Also note that the examples only demonstrate performing a single index request for each incoming element. Generally, the ElasticsearchSinkFunction can be used to perform multiple requests of different types (e.g., DeleteRequest, UpdateRequest, etc.).
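
As an illustrative sketch (not part of the original example), an ElasticsearchSinkFunction that issues UpdateRequests with an upsert might look as follows; the index name, type, and the use of the element itself as the document id are assumptions made for the example:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;

import org.elasticsearch.action.update.UpdateRequest;

// Sketch only: issue an UpdateRequest with an upsert instead of a plain IndexRequest.
// "my-index", "my-type" and using the element as the document id are illustrative assumptions.
ElasticsearchSinkFunction<String> upsertFunction = new ElasticsearchSinkFunction<String>() {
    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);

        UpdateRequest request = new UpdateRequest("my-index", "my-type", element)
                .doc(json)      // partial update if the document already exists
                .upsert(json);  // insert it otherwise
        indexer.add(request);
    }
};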

Internally, each parallel instance of the Flink Elasticsearch Sink uses a BulkProcessor to send action requests to the cluster. This buffers elements before sending them in bulk to the cluster. The BulkProcessor executes bulk requests one at a time, i.e. there will be no two concurrent flushes of the buffered actions in progress.

Elasticsearch Sinks and Fault Tolerance

With Flink's checkpointing enabled, the Flink Elasticsearch Sink guarantees at-least-once delivery of action requests to Elasticsearch clusters. It does so by waiting for all pending action requests in the BulkProcessor at the time of a checkpoint. This effectively ensures that all requests before the checkpoint was triggered have been successfully acknowledged by Elasticsearch before processing any more records sent to the sink.

More details on checkpoints and fault tolerance can be found in the fault tolerance docs.

To use fault-tolerant Elasticsearch Sinks, checkpointing of the topology needs to be enabled in the execution environment:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(5000) // checkpoint every 5000 msecs

NOTE: Users can disable flushing by calling disableFlushOnCheckpoint() on the created ElasticsearchSink if they wish to do so. Be aware that this essentially means the sink will no longer provide any strong delivery guarantees, even with checkpointing of the topology enabled.
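
As a minimal sketch, assuming the esSinkBuilder and input from the 6.x example above, disabling flushing on checkpoints could look like this:

// Sketch: after this call the sink no longer waits for pending requests at checkpoint time,
// so at-least-once delivery is no longer guaranteed even with checkpointing enabled.
ElasticsearchSink<String> esSink = esSinkBuilder.build();
esSink.disableFlushOnCheckpoint();
input.addSink(esSink);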

Communicating using an Embedded Node (only for Elasticsearch 1.x)

For Elasticsearch versions 1.x, communicating with the cluster through an embedded node is also supported. See here for information about the differences between communicating with Elasticsearch via an embedded node versus a TransportClient.

Below is an example of how to create an ElasticsearchSink that uses an embedded node instead of a TransportClient:

DataStream<String> input = ...;

Map<String, String> config = new HashMap<>();
// This instructs the sink to emit after every element, otherwise they would be buffered
config.put("bulk.flush.max.actions", "1");
config.put("cluster.name", "my-cluster-name");

input.addSink(new ElasticsearchSink<>(config, new ElasticsearchSinkFunction<String>() {
    public IndexRequest createIndexRequest(String element) {
        Map<String, String> json = new HashMap<>();
        json.put("data", element);

        return Requests.indexRequest()
                .index("my-index")
                .type("my-type")
                .source(json);
    }

    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}));
val input: DataStream[String] = ...

val config = new java.util.HashMap[String, String]
config.put("bulk.flush.max.actions", "1")
config.put("cluster.name", "my-cluster-name")

input.addSink(new ElasticsearchSink(config, new ElasticsearchSinkFunction[String] {
  def createIndexRequest(element: String): IndexRequest = {
    val json = new java.util.HashMap[String, String]
    json.put("data", element)

    return Requests.indexRequest()
            .index("my-index")
            .`type`("my-type")
            .source(json)
  }

  override def process(element: String, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
    indexer.add(createIndexRequest(element))
  }
}))

The difference is that now we do not need to provide a list of addresses of Elasticsearch nodes.

Handling Failing Elasticsearch Requests

Elasticsearch action requests may fail for a variety of reasons, including temporarily saturated node queue capacity or malformed documents to be indexed. The Flink Elasticsearch Sink allows the user to specify how request failures are handled by simply implementing an ActionRequestFailureHandler and providing it to the constructor.

Below is an example:

DataStream<String> input = ...;

input.addSink(new ElasticsearchSink<>(
    config, transportAddresses,
    new ElasticsearchSinkFunction<String>() {...},
    new ActionRequestFailureHandler() {
        @Override
        public void onFailure(ActionRequest action,
                Throwable failure,
                int restStatusCode,
                RequestIndexer indexer) throws Throwable {

            if (ExceptionUtils.containsThrowable(failure, EsRejectedExecutionException.class)) {
                // full queue; re-add document for indexing
                indexer.add(action);
            } else if (ExceptionUtils.containsThrowable(failure, ElasticsearchParseException.class)) {
                // malformed document; simply drop request without failing sink
            } else {
                // for all other failures, fail the sink
                // here the failure is simply rethrown, but users can also choose to throw custom exceptions
                throw failure;
            }
        }
}));
val input: DataStream[String] = ...

input.addSink(new ElasticsearchSink(
    config, transportAddresses,
    new ElasticsearchSinkFunction[String] {...},
    new ActionRequestFailureHandler {
        @throws(classOf[Throwable])
        override def onFailure(action: ActionRequest,
                failure: Throwable,
                restStatusCode: Int,
                indexer: RequestIndexer): Unit = {

            if (ExceptionUtils.containsThrowable(failure, classOf[EsRejectedExecutionException])) {
                // full queue; re-add document for indexing
                indexer.add(action)
            } else if (ExceptionUtils.containsThrowable(failure, classOf[ElasticsearchParseException])) {
                // malformed document; simply drop request without failing sink
            } else {
                // for all other failures, fail the sink
                // here the failure is simply rethrown, but users can also choose to throw custom exceptions
                throw failure
            }
        }
}))

The above example lets the sink re-add requests that failed due to queue capacity saturation and drop requests with malformed documents, without failing the sink. For all other failures, the sink fails. If an ActionRequestFailureHandler is not provided to the constructor, the sink will fail for any kind of error.

Note that onFailure is called only after the internal BulkProcessor has finished all of its backoff retry attempts. By default, the BulkProcessor retries up to 8 attempts with exponential backoff. For more information on the behaviour of the internal BulkProcessor and how to configure it, please see the following section.

By default, if no failure handler is provided, the sink uses a NoOpFailureHandler, which simply fails for all kinds of exceptions. The connector also provides a RetryRejectedExecutionFailureHandler implementation that always re-adds requests that have failed due to queue capacity saturation.
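
For example, a sketch of plugging in the provided handler instead of a custom one (reusing the config, transportAddresses, and sink function from the earlier examples) could look like this:

import org.apache.flink.streaming.connectors.elasticsearch.util.RetryRejectedExecutionFailureHandler;

// Sketch: rejected requests caused by a saturated queue are re-added; any other error fails the sink.
input.addSink(new ElasticsearchSink<>(
    config, transportAddresses,
    new ElasticsearchSinkFunction<String>() {...},
    new RetryRejectedExecutionFailureHandler()));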

IMPORTANT: Re-adding requests back to the internal BulkProcessor on failures leads to longer checkpoints, as the sink will also need to wait for the re-added requests to be flushed when checkpointing. For example, when using the RetryRejectedExecutionFailureHandler, checkpoints will need to wait until Elasticsearch node queues have enough capacity for all pending requests. This also means that if re-added requests never succeed, the checkpoint will never finish.

Failure handling for Elasticsearch 1.x: For Elasticsearch 1.x, it is not feasible to match on the type of the failure, because the exact type cannot be retrieved through the older version's Java client APIs (the types will be general Exceptions that differ only in the failure message). In this case, it is recommended to match on the provided REST status code.
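
A hedged sketch of what matching on the REST status code could look like for 1.x; the specific codes (429 for a saturated queue, 400 for a malformed document) are illustrative assumptions, not values prescribed by the connector:

new ActionRequestFailureHandler() {
    @Override
    public void onFailure(ActionRequest action,
            Throwable failure,
            int restStatusCode,
            RequestIndexer indexer) throws Throwable {

        if (restStatusCode == 429) {
            // assumed: queue capacity saturation (HTTP 429 Too Many Requests); re-add for indexing
            indexer.add(action);
        } else if (restStatusCode == 400) {
            // assumed: malformed document (HTTP 400 Bad Request); drop without failing the sink
        } else {
            // all other failures: fail the sink
            throw failure;
        }
    }
};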

Configuring the Internal Bulk Processor

The internal BulkProcessor can be further configured for its behaviour on how buffered action requests are flushed, by setting the following values in the provided Map<String, String>:

  • bulk.flush.max.actions: Maximum amount of actions to buffer before flushing.
  • bulk.flush.max.size.mb: Maximum size of data (in megabytes) to buffer before flushing.
  • bulk.flush.interval.ms: Interval at which to flush regardless of the amount or size of buffered actions.

For versions 2.x and above, configuring how temporary request errors are retried is also supported; a combined configuration sketch follows the list below:

  • bulk.flush.backoff.enable: Whether or not to perform retries with backoff delay for a flush if one or more of its actions failed due to a temporary EsRejectedExecutionException.
  • bulk.flush.backoff.type: The type of backoff delay, either CONSTANT or EXPONENTIAL.
  • bulk.flush.backoff.delay: The amount of delay for backoff. For constant backoff, this is simply the delay between each retry. For exponential backoff, this is the initial base delay.
  • bulk.flush.backoff.retries: The number of backoff retries to attempt.
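
For example, the flushing and backoff behaviour could be configured like this in the config map passed to the sink constructor; the concrete values are only illustrative:

Map<String, String> config = new HashMap<>();
config.put("cluster.name", "my-cluster-name");

// flush once 1000 actions or 5 MB of data are buffered, or at the latest every 60 seconds
config.put("bulk.flush.max.actions", "1000");
config.put("bulk.flush.max.size.mb", "5");
config.put("bulk.flush.interval.ms", "60000");

// 2.x and above: retry temporarily rejected flushes with exponential backoff
config.put("bulk.flush.backoff.enable", "true");
config.put("bulk.flush.backoff.type", "EXPONENTIAL");
config.put("bulk.flush.backoff.delay", "50");
config.put("bulk.flush.backoff.retries", "8");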

More information about Elasticsearch can be found here.

Packaging the Elasticsearch Connector into an Uber-Jar

For the execution of your Flink program, it is recommended to build a so-called uber-jar (executable jar) containing all of its dependencies (see here for further information).

Alternatively, you can put the connector's jar file into Flink's lib/ folder to make it available system-wide, i.e. for all running jobs.

