DEVELOPMENT OF A MODEL OF AN IMPROVED SCALABLE RESOURCE MANAGEMENT SYSTEM FOR HADOOP- YET ANOTHER RESOURCE NEGOTIATOR (YARN)



ABSTRACT

With the growing popularity of cloud computing and advances in ICT, the continuous increase in the volume of data and the computational capacity required to process it have generated an overwhelming flow of data now referred to as big data. This demand has exceeded the capabilities of conventional processing tools such as Relational Database Management Systems. Big data has ushered in an era of data exploration and utilization, with the MapReduce computational paradigm as its major enabler. Hadoop MapReduce, as a data-driven programming model, has gained popularity in large-scale data processing due to its simplicity, scalability, fault tolerance, ease of programming and flexibility. Although great efforts through the implementation of this framework have made Hadoop scale to tens of thousands of commodity cluster processors, the centralized architecture of the resource manager has adversely affected Hadoop's scalability to tomorrow's extreme-scale datacenters. With over 4,000 nodes in a cluster, resource requests from all the nodes to a single Resource Manager lead to a scalability bottleneck. Decentralizing the responsibilities of the resource manager to address the scalability issues of Hadoop, for better response, processing and turnaround times, and to eliminate the single point of failure in this framework, is therefore the concern of this work. The existing framework was analyzed and its deficiencies served as a basis for improvement in the new system. The new framework decoupled the responsibilities of the resource manager by providing another layer in which each daemon, called a Rack_Unit Resource Manager (RU_RM), carries out the responsibility of allocating resources to compute nodes within its local rack. This ensured that there was no single point of failure and allowed low latency for large files on compute nodes within the same local rack. The new framework also ensured that two of the three replicas of each data block are stored on different compute nodes within the same local rack. This sped up execution time for each job, since bandwidth between compute nodes on the same rack is higher than bandwidth across racks. Also, the introduction of a novel relaxed-ring architecture in the rack-unit resource manager layer of the framework provided a fault-tolerance mechanism for the layer. The Object Modeling Technique (OMT) methodology was used for this work. The methodology captured three major viewpoints: the static, dynamic and functional behaviour of the system. The application was developed and tested in the Java programming language, with the Hadoop benchmark workload WordCount used for analysis. Two performance evaluation metrics (efficiency and average task-delay ratio) were used for the purpose of comparison. The efficiency of the new model for file sizes of 30.5 kB and 92 kB was 3.5% and 4.7% higher, respectively, than that of the existing framework. The new model also had lower average task-delay ratios of 0.056 ns and 0.166 ns for file sizes of 30.5 kB and 92 kB, respectively, when compared to the existing framework. Results from the analysis of this new model showed better response time and lower scheduling overhead.

CHAPTER ONE

INTRODUCTION

1.1     Background of the Study

Advances in ICT today have made data more voluminous and multifarious, and it is being transferred at high speed (Sergio, 2015). Cloud applications like Yahoo Weather, the Facebook photo gallery and the Google search index are changing the IT landscape in a profound way (Stone et al., 2008; Barroso et al., 2003). Reasons for these trends include scientific organizations solving big problems related to high-performance computing workloads, diverse public services being digitized, and new resources being used. Mobile devices, global positioning systems, sensors, social media, medical imaging, financial transaction logs and many more are all sources of massive data, generating large sets of complex data (Sergio, 2015). These applications are evolving to be data-intensive, processing very large volumes of data, and hence require dynamically scalable, virtualized resources to handle them.

Large firms like Google, Amazon, IBM, Microsoft and Apple are processing vast amounts of data (Dominique, 2015). An International Data Corporation (IDC) survey in 2011 estimated the total worldwide data size, which it called the digital data universe, at 1.8 zettabytes (ZB) (Dominique, 2015). IBM observed that about 2.5 quintillion bytes of data are created each day and that about 90% of the data in the world was created in the last two years (IBM, 2012). This is obviously large. An analysis given by Douglas (2012) showed that data generated from the earliest starting point until 2003 amounted to close to 5 exabytes and rose to 2.7 zettabytes as of 2012 (Douglas, 2012). The type of data increasing most rapidly is unstructured data (Nawsher et al., 2014). This is because these data are characterized by human information such as high-definition videos, scientific simulations, financial transactions, seismic images, geospatial maps, tweets, call-centre conversations, mobile calls, climatology and weather records (Douglas, 2012). Computerworld submits that unstructured information accounts for more than 70% of all data in an organization (Holzinger et al., 2013). Most of these data are not modelled; they are random and very difficult to analyse (Nawsher et al., 2014).

A new crystal ball of the 21st century that helps put all these massive data together, classifying them according to their kinds or nature, is referred to as big data. Big data is a platform that helps in the storage, classification and analysis of massive volumes of data (Camille, 2015). Hortonworks (2016) defined big data as a collection of large datasets that cannot be processed using traditional computing techniques. These data include black box data (data from components of helicopters, airplanes and jets), social media data such as Facebook and Twitter, stock exchange data that holds information about the "buy" and "sell" decisions made on shares of different companies, power grid data such as the information consumed by a particular node with respect to a base station, and transport data, which includes the model, capacity, distance and availability of a vehicle. Big data can also be seen as an accumulation of huge and complex datasets that are too hard to process using database management tools or traditional data processing applications, with the challenges of capturing, storing, searching, sharing, transferring, analysing and visualizing them. Madden (2012) also sees big data as "too big, too fast or too hard for existing tools to process". "Too big", from Madden's explanation, has to do with the amount of data, which might be at petabyte scale and come from various sources. "Too fast" is data growth, which is fast and must be processed quickly, and "too hard" covers the difficulties of big data that do not fit neatly into an existing processing tool (Madden, 2012).

The characteristics of big data are better defined by Gartner in Beyer and Laney (2012) as the three Vs (Volume, Velocity and Variety). Volume refers to the amount of data to be processed. The volume of data could amount to hundreds of terabytes or even petabytes of information generated from everywhere (Avita et al., 2013). As an organization grows, more data sources consisting of large datasets increase the volume of data. Oracle gave the rate at which data grows, observing that data is growing at a 40% compound annual rate and will reach nearly 45 ZB by 2020 (Oracle, 2012). Velocity is the speed at which data grows. According to Sircular (2013), velocity is the most misunderstood big data characteristic. She described data velocity as the rate of change and the combining of datasets that arrive at different speeds (Sircular, 2013). Variety has to do with the type of data. Big data accommodates structured data (relational data), semi-structured data (XML data) and unstructured data (Word, PDF, text, media logs). From an analytics perspective, data variety is seen as the biggest challenge to effectively gaining insight into big data. Some researchers believe that taming data variety and volatility will be a key to big data analytics (Nawsher et al., 2014). IBM added a fourth V to the big data characteristics, "veracity" (IBM, 2012). Veracity addresses the inherent trustworthiness of data. Since data will be used for decision making, it is important to make sure that such data can be trusted (IBM, 2012). Some researchers mention "viability" and "value" as the fourth and fifth big data characteristics, leaving "veracity" out of the Vs (Biehn, 2013).

These ever-increasing data pools obviously have a profound impact not only on hardware storage requirements and user applications, but also on file system design, implementation, and the actual I/O performance and scalability behaviour of today's IT environment. To improve I/O performance and scalability, therefore, the obvious answer is to provide a means by which users can read/write from/to multiple disks (Dominique, 2015). Assume a hypothetical setup with 100 drives, each holding just 1/100 of 1 TB of data, with all of these drives accessed in parallel at 100 MB/second. It then means that 1 TB of data can be fetched in less than 2 minutes. If the same operation were performed with just one drive, it would take more than 2½ hours to accomplish the same task. Today's huge and complex semi-structured or unstructured data are difficult to manage using traditional technologies like RDBMS, hence the introduction of HDFS and the MapReduce framework in Hadoop. Hadoop is a distributed data storage/data processing framework. Data sets processed by traditional database (RDBMS) solutions are far smaller than the data pools utilized in a Hadoop environment (Dominique, 2015). While Hadoop adopts a brute-force access method, RDBMS solutions bank on optimized access routines such as indexes, read-ahead and write-behind techniques (Dominique, 2015). Hadoop excels in an environment with a massively parallel processing infrastructure where data is unstructured to the point where no RDBMS optimization techniques can be used to boost I/O performance (Dominique, 2015). Hadoop is therefore designed to process large data volumes efficiently by linking many commodity systems so that they can work as a parallel entity. The framework was designed basically to provide a reliable, shared storage and analysis infrastructure to the user community. Hadoop has two components: HDFS (Hadoop Distributed File System) and the MapReduce framework (Nagina and Sunita, 2016). The storage portion of the framework is provided by HDFS, while the analysis functionality is provided by MapReduce (Dominique, 2015). Other components also constitute the Hadoop solution suite.
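
The arithmetic behind this comparison can be made explicit. As a worked sketch, assuming each drive sustains 100 MB/s and taking 1 TB as roughly 10^6 MB:

$$
t_{\text{parallel}} = \frac{10^{6}\,\text{MB}}{100 \times 100\,\text{MB/s}} = 100\,\text{s} \approx 1.7\,\text{minutes},
\qquad
t_{\text{single}} = \frac{10^{6}\,\text{MB}}{100\,\text{MB/s}} = 10{,}000\,\text{s} \approx 2.8\,\text{hours}.
$$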

The MapReduce framework was designed as a data-driven programming model which aims at processing large-scale data-intensive applications in the cloud on commodity processors (Dean and Ghemawat, 2008). MapReduce has two components, Map and Reduce (Wang, 2015), with intermediate shuffling procedures and data formatted as unstructured (key, value) pairs (Dean and Ghemawat, 2008). HDFS replicates data onto multiple data nodes to safeguard the file system from any potential loss so that, if one data node gets fenced, there are at least two other nodes holding the same data set (Dominique, 2015).
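
To make the (key, value) flow concrete, the sketch below shows a minimal WordCount-style Mapper and Reducer written against the standard Hadoop MapReduce API. It is an illustrative sketch, not the code developed in this work.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Map phase: emit an intermediate (word, 1) pair for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // intermediate (key, value) pair
      }
    }
  }

  // Reduce phase: sum the counts shuffled to each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);          // final (word, total count) pair
    }
  }
}
```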

The first-generation Hadoop, called Hadoop_v1, was an open-source implementation of MapReduce (Bialecki et al., 2005). It has a centralised component called the JobTracker that handles both resource management and task scheduling. Another centralized component is the NameNode, the file metadata server for HDFS, which stores application data (Shvachko et al., 2010). With Hadoop_v1, scalability beyond 4,000 nodes was not possible under the centralized responsibility of the JobTracker/TaskTracker architecture. To overcome this bottleneck and to extend the framework to carry other standard programming models beyond the implementation of MapReduce, the Apache Hadoop community developed the next-generation Hadoop called YARN (Yet Another Resource Negotiator). This newer version of Hadoop decouples the resource management infrastructure from the JobTracker of Hadoop_v1. Hadoop YARN introduced a centralized Resource Manager (RM) that monitors and allocates resources. Each application also delegates a centralized per-application master (AM) to schedule tasks to resource containers managed by the Node Manager (NM) on each compute node (Wang, 2015). HDFS and its centralized metadata management remain the same in this newer programming model (Wang, 2015). The improvement made on Hadoop_v1 (decoupling the resource management infrastructure) enables YARN to run many application frameworks such as MapReduce, Message Passing Interface (MPI), interactive applications and scientific workflows. This eases resource sharing in the Hadoop cluster.
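
As a minimal sketch of this centralized design, the following snippet uses the stock YarnClient API to ask the single, cluster-wide Resource Manager for reports on every running compute node. The configuration is assumed to come from the cluster's yarn-site.xml, and the snippet is illustrative only, not part of the proposed model.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodesSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();     // reads yarn-site.xml from the classpath
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      // Every such request goes through the one centralized Resource Manager.
      List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.println(node.getNodeId()
            + " capability=" + node.getCapability()
            + " used=" + node.getUsed());
      }
    } finally {
      yarnClient.stop();
    }
  }
}
```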

With the scheduler separated from the RM and the implementation of a per-application master (AM), Hadoop has achieved unprecedented scalability. However, there are inevitable design issues that prevent Hadoop from scaling to extreme scales. These issues lie in the centralized paradigm used in the implementation of some components of the YARN framework. This research work therefore seeks to develop a model solution that will decentralize the responsibilities of the resource manager for scalable resource management in YARN.

1.2              Statement of the Problem

Data-driven models like Hadoop have gained tremendous popularity in big data analytics. Though great efforts have been made through the implementation of the Hadoop framework, where the decoupling of the resource management infrastructure has allowed Hadoop to scale to tens of thousands of commodity cluster processors, the centralized designs of the resource manager and of the HDFS metadata management have adversely affected Hadoop's scalability (the ability to expand the cluster) to tomorrow's extreme-scale datacentres. This challenge therefore leads to the following problem definition:

  1. How to develop a model alternative that will ensure better scalable resource management in YARN.
  2. How to address the scalability issues of Hadoop through decentralized resource management in order to improve the response and turnaround time of clients' jobs.
  3. How to provide a mechanism that will guard against the failure of resource management daemons during job execution.
  4. How to evaluate the scalability of the new model against the existing model using efficiency and average task-delay ratio as performance metrics.

1.3              Aim and Objectives of the Study

The aim of this research work is to develop a model of an improved scalable resource management system for Hadoop YARN.

The objectives of the study are to:

  1. Decentralize the global control of the Resource Manager in the YARN framework by providing another layer called the Rack Unit Resource Manager (RU_RM) layer.
  2. Configure the RU_RM layer to ensure that each RU_RM controls resource requests for compute nodes within its rack, instead of a single Resource Manager controlling all the compute nodes in the cluster.
  3. Develop a ring architecture in the RU_RM layer to ensure that all rack-unit resource managers form a peer-to-peer architecture, such that each rack-unit resource manager holds the resources for which it is directly responsible and also keeps backup copies of the resources of the RU_RMs preceding and succeeding it (an illustrative sketch follows this list).
  4. Carry out a performance evaluation test between the new model and YARN with the Hadoop benchmark workload WordCount.
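
The following is a minimal, illustrative sketch of the peer-to-peer ring intended in objective 3. The class and method names (RackUnitRing, successorOf, predecessorOf, backupTargets) are hypothetical, chosen only to show how each rack-unit resource manager could identify the neighbours that hold backup copies of its resource state; this is not the implementation developed in this work.

```java
import java.util.Arrays;
import java.util.List;

public class RackUnitRing {

  private final List<String> ruRmIds;   // rack-unit resource managers, ordered around the ring

  public RackUnitRing(List<String> ruRmIds) {
    this.ruRmIds = ruRmIds;
  }

  /** The RU_RM that succeeds position i and can take over its compute nodes on failure. */
  public String successorOf(int i) {
    return ruRmIds.get((i + 1) % ruRmIds.size());
  }

  /** The RU_RM that precedes position i and also keeps a backup copy of its resource state. */
  public String predecessorOf(int i) {
    return ruRmIds.get((i - 1 + ruRmIds.size()) % ruRmIds.size());
  }

  /** Neighbours to which RU_RM i replicates the resources it is directly responsible for. */
  public List<String> backupTargets(int i) {
    return Arrays.asList(predecessorOf(i), successorOf(i));
  }
}
```

With such a neighbour mapping, the failure of a single RU_RM leaves its compute nodes reachable through a predecessor or successor that already holds a copy of its resource state.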

1.4              Significance of the Study

The major significance of this research is to deliver an elastic, scalable and easy way to optimize and streamline operations in YARN so as to provide a better quality of service to users. The research work seeks to ensure that no single point of failure exists in the YARN framework. The global resource manager in YARN is a per-cluster resource manager controlling all data nodes in the network. Once this daemon fails, all jobs halt and have to be restarted. This, however, leads to delays in response time and in the execution of jobs in the framework. With the introduction of the per-rack resource manager layer in the new model, each rack-unit resource manager is directly responsible for its corresponding data nodes. The single point of failure experienced in the existing framework is therefore eliminated, so that jobs will have lower response and execution times.

The introduction of the novel ring approach in the rack-unit resource manager layer of the new model ensures that all rack-unit resource managers form a peer-to-peer architecture, such that each rack-unit resource manager holds the resources for which it is directly responsible and also keeps backup copies of the resources of the RU_RMs preceding and succeeding it. This will ensure that resources are available to users on demand and are efficiently utilized, providing better turnaround time for safety-critical jobs such as computer-controlled radiation machines in the health sector. The failure of a single rack-unit resource manager will no longer affect the processing of jobs, since the rack-unit resource manager preceding or succeeding it will take responsibility for all data nodes within the failed rack. The framework will enable massive data to be streamlined for any distributed processing system across clusters of computers. It will scale from single servers up to a very large number of machines, each of these machines offering local computation and storage. This will allow for rapid data transfer, facilitated by a well-laid-out distributed file system.

The introduction of a rack-aware resource manager in this system can provide a cost-effective storage solution to business analysts. Its highly scalable storage/processing capability will enable businesses to easily access data sources and tap into different types of data to produce value from such data. It is significant to note that the data management industry has expanded from software and the web into retail, healthcare/hospitals, media and entertainment, information services, finance and government. This creates a huge demand for a more scalable system that will provide excellent data analytic services. Most enterprises/organizations use and analyse only a small volume of their data, with a very large amount wasted. It is bad practice to term data as unwanted, as many parts of these data can be put to good use by an organization/enterprise. This system therefore has the capability of storing and processing large amounts of data that can help an organization improve the functionality of each and every business unit, including research, design, development, marketing, advertising, sales and customer handling.

One vital component of data analytics is machine learning. Machine learning is significantly used in the medical domain to predict cancer, and in natural language processing, search engines, recommendation engines, bio-informatics, image processing, text analytics and much more. Machine learning algorithms gain significance where data size is big, especially when it is unstructured, as this means making sense out of thousands of parameters and billions of data values. Since this system processes large datasets across clusters of cheap commodity servers, big data comes into the picture, which is also significant for running machine learning algorithms. With this rack-aware resource management system, data scientists and engineers will be able to ingest more data into machine learning tools and be sure of lower response times during job execution. An efficient library can be developed to enable running various machine learning algorithms on this system in a distributed manner.

1.5              Scope of the Study

This research work considered the resource management of the YARN framework and provided a model alternative that helps to improve the management of applications/jobs in this framework. The scope of this work, therefore, is limited to the resource manager daemon of the YARN framework. The Hadoop benchmark workload WordCount was used to compare the new model and the existing model. The map task counts the frequency of each individual word in a subset data file, after which the framework shuffles the intermediate pairs and the reduce task gathers the total frequency of each word.
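
As a minimal sketch of how such a WordCount job would be submitted for comparison, the driver below uses the stock Hadoop MapReduce job API. The input/output paths and the Mapper/Reducer classes (from the sketch in Section 1.1) are illustrative assumptions rather than the actual benchmark harness used in this work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountSketch.TokenizerMapper.class);
    job.setCombinerClass(WordCountSketch.IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(WordCountSketch.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));        // e.g. the 30.5 kB or 92 kB input file
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```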

1.6              Limitation of the Study

The simulator for Hadoop YARN, called YARNsim (Ning et al., 2015), has very limited tools for carrying out this work. The current version (developed recently) does not support fault-tolerance models (Ning et al., 2015), only the FIFO scheduling algorithm is built into the simulator, and it has no plugins that allow modifications to suit research improvements in Hadoop; hence the need to use a real test bed. However, this situation was not a drawback to achieving the desired objectives of this research work.

1.7              Definition of Terms

Datacenter (DC):- Is a centralized repository for storage, management and dissemination of data and information organized around a particular body of knowledge.

Framework:- Is a layered structure in any programming model which shows what kinds of programs can or should be built and how they would interrelate.

Hadoop:- Is an open-source, Java-based programming model which supports the processing and storage of large pools of data sets in a distributed environment.

HDFS:- The Hadoop Distributed File System, used in the Hadoop framework to store clients' data.

MapReduce:- A component of the Hadoop framework that processes clients' data.

Nodes:- Two or more computers connected within a network.

Scalable:- Able to change in size.

Scalability:- Ability of a system or application to continue to function well even when it changes in size or volume in order to meet its desired objectives.

Task:- A piece of work to be carried out by the system.

YARN:- Yet Another Resource Negotiator, the newer version of the Hadoop framework used in the storage and processing of large data sets.

Resource Manager:- Is a core component of YARN responsible for scheduling of jobs and management of compute nodes in a cluster.

Cluster:- Is a special type of computational framework designed specifically for storing and analysing huge amounts of structured/unstructured data in a distributed computing environment.


