Apache Hadoop YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is the resource management and job scheduling layer of Apache Hadoop. It allocates cluster resources such as memory and CPU to applications and schedules their work across the nodes of a cluster, which lets several distributed processing frameworks share the same hardware.

Introduction to Apache Hadoop YARN

Hadoop YARN was introduced in Hadoop 2.0 to replace the resource management that was built into the original Hadoop MapReduce framework (MapReduce 1). While that framework focused on batch processing, YARN is a general-purpose platform on which different processing engines can run, covering batch processing, interactive queries, real-time streaming, and graph processing.

In the original MapReduce framework, a single JobTracker daemon handled both cluster resource management and job scheduling and monitoring. YARN splits these functions into separate daemons: a global ResourceManager and a per-application ApplicationMaster. This separation improves flexibility and scalability, and it allows multiple applications to run concurrently on a shared Hadoop cluster.

Understanding YARN Architecture

YARN consists of three main components: the ResourceManager, the NodeManager, and the ApplicationMaster.

ResourceManager

The ResourceManager is the cluster-wide authority for resources. It tracks the available resources reported by the NodeManagers and allocates them to applications in the form of containers, based on each application's resource requests and the scheduling policies (such as queue capacities) defined by the cluster administrator. It does not schedule the individual tasks inside an application; that is the job of the application's ApplicationMaster.
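The bookkeeping described above can be illustrated with a small, self-contained sketch. This is a toy model in plain Java, not the YARN scheduler or any YARN API: the class ToyResourceManager, its methods, and the node names are all hypothetical, and it tracks only memory with a simple first-fit policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of cluster-wide resource bookkeeping; NOT the real YARN scheduler. */
public class ToyResourceManager {
    // Free memory (in MB) per node, keyed by a hypothetical node name.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    /** Registers a node and the memory it offers to the cluster. */
    public void registerNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    /** First-fit allocation: returns the chosen node, or null if no node has room. */
    public String allocate(int memoryMb) {
        for (Map.Entry<String, Integer> entry : freeMemoryMb.entrySet()) {
            if (entry.getValue() >= memoryMb) {
                entry.setValue(entry.getValue() - memoryMb);
                return entry.getKey();
            }
        }
        return null;
    }

    /** Returns memory to a node once a container has finished. */
    public void release(String node, int memoryMb) {
        freeMemoryMb.merge(node, memoryMb, Integer::sum);
    }

    /** Free memory currently recorded for a node (0 if the node is unknown). */
    public int freeOn(String node) {
        return freeMemoryMb.getOrDefault(node, 0);
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.registerNode("node-1", 2048);
        rm.registerNode("node-2", 4096);
        System.out.println(rm.allocate(1024)); // node-1
        System.out.println(rm.allocate(3072)); // node-2
        System.out.println(rm.allocate(8192)); // null: no node has 8192 MB free
    }
}
```

A real scheduler also accounts for CPU, queues, priorities, and data locality; the sketch only shows the idea of tracking free capacity and granting requests against it.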

NodeManager

The NodeManager runs on each node in the cluster and is responsible for managing the resources on that node. It monitors the resource usage and health of the node, and reports this information to the ResourceManager. The NodeManager also manages the execution of tasks allocated to that node, launching and monitoring containers (a container represents a set of resources allocated to an application).
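As a rough illustration of the per-node role, the following toy model launches containers only while they fit in the node's memory and produces a status line of the kind a node might send in a periodic report. It is a simplified sketch in plain Java; ToyNodeManager, its methods, and the status format are hypothetical and do not correspond to YARN classes or its wire protocol.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-node container management; NOT the real YARN NodeManager. */
public class ToyNodeManager {
    private final int totalMemoryMb;
    // Memory (in MB) held by each container currently running on this node.
    private final List<Integer> runningContainersMb = new ArrayList<>();

    public ToyNodeManager(int totalMemoryMb) {
        this.totalMemoryMb = totalMemoryMb;
    }

    /** Launches a container if its memory fits on the node; returns whether it launched. */
    public boolean launchContainer(int memoryMb) {
        if (usedMemoryMb() + memoryMb > totalMemoryMb) {
            return false;
        }
        runningContainersMb.add(memoryMb);
        return true;
    }

    /** Total memory held by running containers. */
    public int usedMemoryMb() {
        int used = 0;
        for (int mb : runningContainersMb) {
            used += mb;
        }
        return used;
    }

    /** Status summary for a periodic report (hypothetical format). */
    public String heartbeat() {
        int used = usedMemoryMb();
        return "containers=" + runningContainersMb.size()
                + " usedMb=" + used
                + " freeMb=" + (totalMemoryMb - used);
    }

    public static void main(String[] args) {
        ToyNodeManager nm = new ToyNodeManager(4096);
        nm.launchContainer(1024);
        nm.launchContainer(2048);
        System.out.println(nm.launchContainer(2048)); // false: only 1024 MB left
        System.out.println(nm.heartbeat()); // containers=2 usedMb=3072 freeMb=1024
    }
}
```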

ApplicationMaster

Each application has its own ApplicationMaster, which manages that application's resources and coordinates with the ResourceManager and the NodeManagers. It negotiates containers from the ResourceManager, works with the NodeManagers to launch and monitor those containers, and reports the application's progress and status back to the ResourceManager.
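The negotiate, launch, and report cycle can be sketched as a small state tracker. This is a toy model in plain Java under simplifying assumptions (one container per task, no failures); ToyApplicationMaster and its methods are hypothetical names, not YARN API.

```java
/** Toy model of an ApplicationMaster's bookkeeping; NOT the real YARN API. */
public class ToyApplicationMaster {
    private final int totalTasks;
    private int containersGranted;
    private int tasksCompleted;

    public ToyApplicationMaster(int totalTasks) {
        this.totalTasks = totalTasks;
    }

    /** Containers still to be requested: one per task that has no container yet. */
    public int outstandingRequests() {
        return totalTasks - containersGranted;
    }

    /** Records containers granted by the (hypothetical) resource manager. */
    public void onContainersGranted(int count) {
        containersGranted = Math.min(totalTasks, containersGranted + count);
    }

    /** Records one finished task; ignored if no granted container is still running. */
    public void onTaskCompleted() {
        if (tasksCompleted < containersGranted) {
            tasksCompleted++;
        }
    }

    /** Progress in the range [0, 1], the kind of value reported back upstream. */
    public float progress() {
        return totalTasks == 0 ? 1.0f : (float) tasksCompleted / totalTasks;
    }

    public boolean isDone() {
        return tasksCompleted == totalTasks;
    }

    public static void main(String[] args) {
        ToyApplicationMaster am = new ToyApplicationMaster(4);
        am.onContainersGranted(2);
        am.onTaskCompleted();
        System.out.println(am.outstandingRequests()); // 2
        System.out.println(am.progress());            // 0.25
    }
}
```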

YARN Example: Word Count

To better understand how YARN works, let's consider a simple example: the classic Word Count program. The goal of this program is to count the occurrences of each word in a given text document.
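Before looking at the YARN side, it may help to see the counting logic on its own. The following is a minimal single-machine sketch in plain Java; the class name WordCountLogic and the tokenization rule (lowercase, split on anything that is not a letter a to z) are assumptions made for illustration, not part of any YARN or Hadoop API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Single-machine word count; the work a distributed version would split across containers. */
public class WordCountLogic {
    /** Counts words case-insensitively, splitting on runs of non-letter characters. */
    public static Map<String, Integer> countWords(String text) {
        // TreeMap keeps the words sorted, which makes the output predictable.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {and=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(countWords("the quick fox and the lazy dog"));
    }
}
```

In a distributed run, each container would apply logic like this to one slice of the input, and the partial counts would then be merged.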

Step 1: Create the Word Count Application

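We first need a client program that submits the Word Count application to YARN. The client asks the ResourceManager for a new application, describes the container in which the ApplicationMaster should run (its resources and launch command), submits the application, and then polls until it reaches a terminal state.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class WordCountClient {
  public static void main(String[] args) {
    // Load the YARN configuration (yarn-site.xml and related files on the classpath)
    Configuration conf = new YarnConfiguration();

    // Create and start the client used to talk to the ResourceManager
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    try {
      // Ask the ResourceManager for a new application and its submission context
      YarnClientApplication app = yarnClient.createApplication();
      ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();

      // Set the application name and type
      appContext.setApplicationName("Word Count Application");
      appContext.setApplicationType("YARN");

      // Resources for the ApplicationMaster container: 1024 MB of memory, 1 virtual core
      Resource resource = Resource.newInstance(1024, 1);
      appContext.setResource(resource);

      // Command that launches the ApplicationMaster inside its container
      List<String> commands = new ArrayList<>();
      commands.add("/path/to/wordcountappmaster.sh");
      // Arguments: localResources, environment, commands, serviceData, tokens, acls
      appContext.setAMContainerSpec(ContainerLaunchContext.newInstance(
          null, null, commands, null, null, null));

      // Submit the application to the ResourceManager
      ApplicationId appId = yarnClient.submitApplication(appContext);

      // Poll until the application reaches a terminal state
      while (true) {
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        YarnApplicationState state = report.getYarnApplicationState();

        // Stop on failure or kill as well, otherwise the loop would never end
        if (state == YarnApplicationState.FINISHED
            || state == YarnApplicationState.FAILED
            || state == YarnApplicationState.KILLED) {
          System.out.println("Application " + appId + " ended in state " + state);
          break;
        }

        Thread.sleep(1000);
      }
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Always stop the client, even if submission or polling fails
      yarnClient.stop();
    }
  }
}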

Step 2: Define the Word Count ApplicationMaster

In this step, we need to define the Word Count ApplicationMaster, which will be responsible for coordinating the execution of the Word Count application. It will communicate with the ResourceManager to allocate resources and with the NodeManagers to launch and monitor containers.

public class WordCountApplicationMaster {
  public static void main(String[] args) {
    // Initialize the YARN configuration
    Configuration conf = new YarnConfiguration();

    try {
      // Create a new YARN client
      YarnClient yarnClient = YarnClient.createYarnClient();
      yarnClient.init(conf);
      yarnClient.start();

      // Create a new application submission context
      YarnClientApplication app = yarnClient.createApplication();
      ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();

      // Set the application name and type
      appContext.setApplicationName("Word Count Application");
      appContext.setApplicationType("YARN");

      // Set the resource requirements for the application
      Resource resource = Resource.newInstance(1024, 1);
      appContext.setResource(resource);

      // Set the command to execute the ApplicationMaster
      List<String> commands = new ArrayList<>();
      commands.add("/path/to/wordcountapp