Auto retry on disconnection: 

Two cases to consider:
1. The connect attempt fails: we need to schedule a reconnection attempt after some delay.
2. The connect succeeds but the channel is later closed for some reason: we need to register a listener on the channel's close future and schedule the reconnection from there.

// Two cases are handled here:
// 1. the connect attempt failed -> schedule a reconnect
// 2. the connect succeeded -> keep the channel and register the
//    reconnect listener on its close future
final var b = getBootstrap();
final var connectFuture = b.connect();
connectFuture.addListener(future -> {
  if (!connectFuture.isSuccess() && reconnectListener != null) {
    reconnectListener.scheduleReconnect();
    return;
  }
  final var channel = connectFuture.channel();
  logger.debug("Client is connected to {}", channel.remoteAddress());
  setChannel(channel);
  if (reconnectListener != null) {
    channel.closeFuture().addListener(reconnectListener);
  }
});


public class ReconnectOnCloseListener implements ChannelFutureListener {
   private static final Logger logger = LoggerFactory.getLogger(ReconnectOnCloseListener.class);

   private final NettyClient<?> client;
   private final int reconnectInterval;
   private final AtomicBoolean disconnectRequested = new AtomicBoolean(false);
   private final ScheduledExecutorService executorService;

   public ReconnectOnCloseListener(final NettyClient<?> client, final int reconnectInterval, final ScheduledExecutorService executorService) {
      this.client = client;
      this.reconnectInterval = reconnectInterval;
      this.executorService = executorService;
   }

   public void requestReconnect() {
      disconnectRequested.set(false);
   }

   public void requestDisconnect() {
      disconnectRequested.set(true);
   }

   @Override
   public void operationComplete(final ChannelFuture future) {
      // invoked when the channel's close future completes
      final var channel = future.channel();
      logger.debug("Client connection to {} was closed", channel.remoteAddress());
      channel.disconnect();
      scheduleReconnect();
   }

   public void scheduleReconnect() {
      if (!disconnectRequested.get()) {
         logger.trace("Scheduling reconnect in {} ms", reconnectInterval);
         //noinspection Convert2MethodRef
         executorService.schedule(
                () -> client.connectAsync(),
                reconnectInterval, TimeUnit.MILLISECONDS);
      }
   }
}
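
Wiring this together could look like the sketch below; the client value, its connectAsync() method, and the 5-second interval are assumptions based on the snippets in this section rather than a fixed API.

// Sketch: wiring ReconnectOnCloseListener into a client (client/connectAsync are assumed)
val executor = Executors.newSingleThreadScheduledExecutor()
val reconnectListener = ReconnectOnCloseListener(client, 5_000, executor)

// allow reconnects, then start the first connection attempt
reconnectListener.requestReconnect()
client.connectAsync()

// on shutdown: stop the reconnect loop first, otherwise closing the channel
// would immediately schedule a new connection attempt
reconnectListener.requestDisconnect()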


Handling connection refused:

Netty lets you set a connect timeout via the CONNECT_TIMEOUT_MILLIS channel option:

val b = Bootstrap()
b.group(bossEventLoopGroup)
    .channel(NioSocketChannel::class.java)
    .option(ChannelOption.SO_KEEPALIVE, true)
    .option(ChannelOption.TCP_NODELAY, tcpNoDelay)
    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, iso8583ClientProperties.socketConnectTimeout)
    .remoteAddress(socketAddress)
    .handler(channelInitHandler)


When the connection is refused (e.g. nothing is listening on the target port), Netty will throw:

io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: /127.0.0.1:6000
Caused by: java.net.ConnectException: Connection refused: no further information

Key class trace for the exception: NioEventLoop.processSelectedKey -> NioSocketChannel.doFinishConnect -> sun.nio.ch.SocketChannelImpl.finishConnect; see the sequence diagram below for more details.

[Sequence diagram: propagation of the connection refused exception]
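
In code, a refused connection simply surfaces as a failed connect future whose cause() is the ConnectException shown above, so the reconnect logic from the first section can handle it. A minimal sketch, reusing the reconnectListener and socketAddress names from the other snippets:

// Sketch: a refused connection makes the connect future fail with ConnectException
connectFuture.addListener(ChannelFutureListener { future ->
    if (!future.isSuccess && future.cause() is ConnectException) {
        logger.warn("Connection refused by {}", socketAddress, future.cause())
        reconnectListener?.scheduleReconnect()
    }
})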

Registering an error listener:

Normally an exception propagates to the end of the Netty pipeline. The error handler is a ChannelInboundHandlerAdapter that intercepts the exception and passes it to a custom ErrorListener. The error listener can apply its own error-handling logic and decide whether to propagate the exception to the next handler in the pipeline.

class MyChannelInitializer(
   private val socketAddress: SocketAddress,
   private val errorListenerOptional: Optional<NettyErrorListener>
) : ChannelInitializer<SocketChannel>() {

   override fun initChannel(ch: SocketChannel) {
      val errorHandler = object : ChannelInboundHandlerAdapter() {
         override fun exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
            // ask the listener whether to keep propagating; default is to propagate
            val propagateError = errorListenerOptional
               .map { it.onError(cause, socketAddress) }
               .orElse(true)
            if (propagateError) {
               ctx.fireExceptionCaught(cause)
            }
         }
      }
      ch.pipeline().addFirst("errorHandler", errorHandler)
   }

   override fun exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
      // errors raised during channel initialization also go to the listener
      errorListenerOptional.map { it.onError(cause, socketAddress) }
      super.exceptionCaught(ctx, cause)
   }
}
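
The NettyErrorListener referenced above is not shown in the original snippet; its contract, as implied by the initializer (returning true means "propagate to the next handler"), might look like this sketch:

// Sketch of the listener contract implied by MyChannelInitializer above.
// Return value: true = propagate the error to the next handler in the pipeline.
interface NettyErrorListener {
    fun onError(cause: Throwable, remoteAddress: SocketAddress): Boolean
}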


Handling read timeout:

The remote server may not respond to a request in a timely manner. Interestingly, Netty does not support the SO_TIMEOUT option for its non-blocking sockets. The alternative it provides is io.netty.handler.timeout.ReadTimeoutHandler, which is a subclass of IdleStateHandler. An idle-state event is triggered when nothing has happened on the channel for a configured period, so ReadTimeoutHandler is effectively expecting some data to be read within that period after the previous read event (assert evt.state() == IdleState.READER_IDLE). Its side effect is that it fires a ReadTimeoutException and closes the context and its underlying socket channel. It is worth mentioning that besides READER_IDLE events, IdleStateHandler can also trigger WRITER_IDLE or ALL_IDLE events by configuring writerIdleTimeSeconds or allIdleTimeSeconds. When constructing the channel pipeline, the ReadTimeoutHandler needs to be placed before the ErrorHandler from the section above, so that the ReadTimeoutException fired by the ReadTimeoutHandler is caught and handled by the ErrorHandler.

ch.pipeline().addFirst(
    "timeoutHandler",
    ReadTimeoutHandler(readIdleTimeout.toLong(), TimeUnit.MILLISECONDS)
)
ch.pipeline().addAfter(
    "timeoutHandler", "errorHandler", errorHandler
)
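
For illustration, an error listener could treat ReadTimeoutException separately from other errors, for example by swallowing it and relying on the reconnect listener once the channel has been closed. A sketch, not part of the original snippets:

// Sketch: handle ReadTimeoutException without propagating it further
val errorListener = object : NettyErrorListener {
    override fun onError(cause: Throwable, remoteAddress: SocketAddress): Boolean {
        if (cause is ReadTimeoutException) {
            logger.warn("No response from {} within the read timeout", remoteAddress)
            // ReadTimeoutHandler has already closed the channel; the close listener
            // will schedule a reconnect, so do not propagate the exception
            return false
        }
        return true
    }
}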

[Sequence diagram: read timeout handling]

Handling IOException:

The remote server may unexpectedly terminate the connection. An IOException will be thrown from NioEventLoop.processSelectedKeys -> AbstractNioByteChannel$NioByteUnsafe.read -> SocketChannel.read.

Once this happens, channel.isWritable() will return false.
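
One possible way to handle this (a sketch; this handler is not part of the original code) is to close the channel on IOException inside the error handler, so that the close future fires and the ReconnectOnCloseListener from the first section schedules a new connection attempt:

// Sketch: close the channel on IOException so closeFuture() fires and the
// reconnect listener schedules a new connection attempt
val ioErrorHandler = object : ChannelInboundHandlerAdapter() {
    override fun exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
        if (cause is IOException) {
            logger.warn("Connection to {} broke", ctx.channel().remoteAddress(), cause)
            ctx.close()
        } else {
            ctx.fireExceptionCaught(cause)
        }
    }
}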

Retaining correlation trace id:

Since Netty is asynchronous, we need to ensure that the trace id used for logging is not lost between the calling thread and the Netty pipeline threads. This is achieved by customizing the thread factory used by the Netty boss and worker event loop groups.

fun createBossEventLoopGroup(): EventLoopGroup {
    val threadFactory = ThreadFactory { runnable ->
        val traceId = ThreadLocalContext.getTraceId()
        Thread(TraceIdAwareRunnable(traceId, runnable))
    }
    return NioEventLoopGroup(0, threadFactory)
}

fun createWorkerEventLoopGroup(): EventLoopGroup {
    val threadFactory = ThreadFactory { runnable ->
        val traceId = ThreadLocalContext.getTraceId()
        Thread(TraceIdAwareRunnable(traceId, runnable))
    }
    val group =
        NioEventLoopGroup(configuration.workerThreadsCount, threadFactory)
    logger.debug("Created worker EventLoopGroup with {} executor threads", group.executorCount())
    return group
}

class TraceIdAwareRunnable(val traceId: String?, val runnable: Runnable) : Runnable {
    override fun run() {
        ThreadLocalContext.setTraceId(traceId)
        try {
            runnable.run()
        } finally {
            ThreadLocalContext.setTraceId(null)
        }
    }
}

In the code snippet above, we use a special Runnable that accepts the traceId of the creating thread (read from that thread's ThreadLocal when the event loop thread is created), stores it in the ThreadLocal before the wrapped runnable is executed by the event loop thread, and clears it afterwards. The traceId can then be hooked into the logger's MDC, as sketched below.
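
For example, the MDC hookup could live directly in the runnable wrapper. This is a sketch assuming SLF4J's MDC; the "traceId" key name is an assumption:

// Sketch: variant of TraceIdAwareRunnable that also populates SLF4J's MDC,
// so log statements on the event loop thread carry the correlation id
class MdcAwareRunnable(private val traceId: String?, private val runnable: Runnable) : Runnable {
    override fun run() {
        ThreadLocalContext.setTraceId(traceId)
        traceId?.let { MDC.put("traceId", it) } // key name is an assumption
        try {
            runnable.run()
        } finally {
            MDC.remove("traceId")
            ThreadLocalContext.setTraceId(null)
        }
    }
}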