Big Data has been one of the major current and future research frontiers. MapReduce is the most widely used programming model for data-intensive applications in big data environments, and its implementation in the Hadoop system is a well-suited, adaptive model for analytics on large data sets in heterogeneous computing environments. The scheduling of jobs plays a major role in achieving good overall system performance in data-critical Hadoop applications. The heterogeneous environment therefore needs to be improved with an effective scheduling algorithm that is interoperable with task execution and reduces the time a job spends in the system after its deadline is reached. The scheduler must be cost-effective and time-constrained, and must also consider factors such as execution time, response time, makespan and throughput.

 

Keywords: Big Data, Hadoop, MapReduce, Heterogeneous Environment, Scheduling

1. Introduction


Big Data is a phrase used to describe a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity. Big Data has the potential to help companies improve operations and make faster, more intelligent decisions.

This data, when captured, formatted, manipulated, stored, and analyzed, can help a company gain useful insight to increase revenues, get or retain customers, and improve operations. Big Data applies to information that cannot be processed or analyzed using traditional processes or tools. Increasingly, organizations today are facing more and more Big Data challenges. They have access to a wealth of information, but they do not know how to get value out of it because it is sitting in its rawest form or in a semi-structured or unstructured format; as a result, much of this Big Data remains unexploited in the computing environment.

2. Characteristics

We have all heard of the 3Vs of big data: Volume, Variety and Velocity. Other big data Vs that are gaining attention are Veracity, Validity and Volatility.

2.1 Volume

Big
Data implies enormous volumes of data. Now that data is generated by machines,
networks and human interaction on systems like social media, the volume of data
to be analyzed is massive. The main characteristic that makes data “big” is the
sheer volume. It makes no sense to focus on minimum storage units because the
total amount of information is growing exponentially every year.

2.2 Variety

Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement like date, amount, and time; these fit neatly in a relational database. Variety also refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases; now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analyzing the data.

 

2.3 Velocity

Velocity
is the frequency of incoming data that needs to be processed.  Big Data velocity deals with the pace at
which data flows in from sources like business processes, machines, networks
and human interaction with things like social media sites, mobile devices, etc.
The flow of data is massive and continuous. This real-time data can help
researchers and businesses make valuable decisions that provide strategic
competitive advantages and ROI.

 

2.4 Veracity

Big Data veracity refers to the biases, noise and
abnormality in data. Veracity in data analysis is the biggest challenge when
compared to things like volume and velocity. 

 

3. Hadoop

Hadoop is one of the most popular MapReduce implementations. Both the input and output pairs of a MapReduce application are managed by an underlying Hadoop Distributed File System (HDFS). At the heart of HDFS is a single NameNode, a master server that manages the file system namespace and regulates file access. The Hadoop runtime system establishes two kinds of processes called JobTracker and TaskTracker. The JobTracker is responsible for assigning and scheduling tasks; each TaskTracker handles the mappers or reducers assigned by the JobTracker.

4. MapReduce

MapReduce
is a popular data processing paradigm for efficient and fault tolerant workload
distribution in large clusters. A MapReduce computation has two phases, namely,
the Map phase and the Reduce phase. The Map phase splits the input data into a
large number of fragments, which are evenly distributed to Map tasks across a
cluster of nodes to process. Each Map task takes in a key-value pair and then
generates a set of intermediate key-value pairs. After the MapReduce runtime
system groups and sorts all the intermediate values associated with the same
intermediate key, the runtime system delivers the intermediate values to Reduce
tasks.  Each Reduce task takes in all
intermediate pairs associated with a particular key and emits a final set of
key-value pairs.  MapReduce applies the
main idea of moving computation towards data, scheduling map tasks to the
closest nodes where the input data is stored in order to maximize data
locality.
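
To make the data flow just described concrete, the following is a minimal word-count job sketched against Hadoop's Java MapReduce API. It is an illustration only and is not taken from any of the works surveyed here.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: each input line is split into words, emitting (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive at one reducer and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }
}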

 

 

  

5. RESEARCH ISSUES

Main research issues in big data are:

· Handling data volume
· Analysis of big data
· Privacy of data
· Storage of huge amount of data
· Data visualization
· Job scheduling
· Fault tolerance

6. BACKGROUND AND MOTIVATION

RELATED WORKS

This section surveys the related work reported in the current literature on reducing execution time and response time and improving throughput in the Big Data area.

A. Self-Adaptive Scheduling Algorithm for Reduce Start Time [1] addresses when to start the reduce tasks, since the reduce start time is one of the key problems in improving MapReduce performance. The existing implementations may leave reduce tasks blocked, and the reasons for wasted slot resources are illustrated: the reduce tasks end up waiting around. The paper proposes an optimal reduce scheduling policy called SARS (Self-Adaptive Reduce Scheduling) for reduce tasks' start times on the Hadoop platform [1]. It is also shown that the average response time decreases by 11% to 29% when the SARS algorithm is applied to the traditional job scheduling algorithms FIFO, Fair Scheduler, and Capacity Scheduler. The paper uses GridMix to submit jobs, which can analyze Hadoop performance by simulating the actual load of a Hadoop cluster. There are also some difficulties in the experiments. The first is the speed of network transmission: when multiple reduce tasks on one TaskTracker run at the same time, they may cause network I/O competition and lower the transmission speed during the copy stage.
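
A minimal sketch of the underlying idea, not the SARS policy itself: delay launching reduce tasks until the remaining map work is about to fit inside the shuffle/copy phase, so that reduce slots are not held idle while maps are still producing data. The single-wave assumption and the copy-time estimate are assumptions made purely for illustration.

public class ReduceStartHeuristic {

    /**
     * Decide whether reduce tasks should be launched yet.
     * Illustrative heuristic only: assumes the remaining maps run in roughly one wave.
     */
    static boolean shouldStartReduces(int finishedMaps, int totalMaps,
                                      double avgMapTimeSec, double estCopyTimeSec) {
        int remainingMaps = totalMaps - finishedMaps;
        double estRemainingMapTimeSec = remainingMaps * avgMapTimeSec;
        // Start reduces only when the copy phase can roughly cover the remaining map work.
        return estRemainingMapTimeSec <= estCopyTimeSec;
    }
}

For example, with 100 maps of 30 s each and an estimated 5-minute copy phase, reduces would be launched once about 90 maps have finished, rather than at the start of the job.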

 

B. Novel Approach for Partitioning in Hadoop using Round Robin Technique [2] recommends an enhanced partitioning algorithm using round-robin partitioning that improves load balancing and memory utilization. A series of experiments has shown that, given a skewed data sample, the round-robin scheme was able to reduce skew by distributing records evenly compared with the existing hash partitioning [2]. The round-robin partitioning technique distributes the data uniformly over every destination partition, and when the number of records is divisible by the number of partitions the skew is most probably zero. The data splits are applied to the Mapper and the outcome is sorted splits; these splits are then copied to the Reducer splits for merging. Further research can introduce other partitioning mechanisms that can be incorporated into Hadoop for applications with different input samples, since Hadoop does not provide any partitioning mechanism other than hash key partitioning [2].
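
A minimal sketch of how such a round-robin partitioner could be expressed against Hadoop's Partitioner API; the class name and key/value types are assumptions, and this is not the implementation from [2]. Note that ignoring the key means records with the same key can land on different reducers, so this only suits jobs that do not rely on per-key grouping.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Round-robin partitioner sketch: cycle through partitions instead of hashing the key.
public class RoundRobinPartitioner extends Partitioner<Text, IntWritable> {
    private int counter = 0;

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int partition = counter % numPartitions;  // spread records evenly regardless of key skew
        counter++;
        return partition;
    }
}

It would be plugged into a job with job.setPartitionerClass(RoundRobinPartitioner.class).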

 

C. MapReduce Task Scheduling Algorithm for Deadline Constraints [3] proposes an extended MapReduce task scheduling algorithm for deadline constraints on the Hadoop platform, MTSD [3]. It allows the user to specify a job's deadline and tries to finish the job before that deadline. By measuring each node's computing capacity, a node classification algorithm is proposed in MTSD; this algorithm classifies the nodes into several levels in heterogeneous clusters. On top of this classification, the authors first present a novel data distribution model which distributes data according to each node's capacity level. MTSD focuses on the user's deadline constraint problem, and its node classification algorithm divides the nodes into different types according to their computation ability [3]. The authors' future work is to find a good solution to the reduce task scheduling problem.
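
A small sketch of capacity-based node classification in the spirit of MTSD; the capacity score, the relative-ratio approach and the level cut-offs below are assumptions made for illustration, not the thresholds from [3].

import java.util.HashMap;
import java.util.Map;

public class NodeClassifier {
    /** Map each node to a level (1 = fastest) from its measured capacity score. */
    public static Map<String, Integer> classify(Map<String, Double> nodeCapacity) {
        double max = nodeCapacity.values().stream()
                                 .mapToDouble(Double::doubleValue).max().orElse(1.0);
        Map<String, Integer> levels = new HashMap<>();
        for (Map.Entry<String, Double> e : nodeCapacity.entrySet()) {
            double ratio = e.getValue() / max;                     // capacity relative to the fastest node
            int level = ratio >= 0.75 ? 1 : ratio >= 0.5 ? 2 : 3;  // assumed cut-offs
            levels.put(e.getKey(), level);
        }
        return levels;
    }
}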

 

 

D. Longest Approximate Time to End Scheduling Algorithm in Hadoop Environment [4]: as the number and variety of jobs to be executed across different clusters increase, so does the complexity of scheduling them efficiently to meet the required performance objectives. The LATE MapReduce scheduling algorithm [4] is proposed for this setting. The proposed Longest Approximate Time to End (LATE) algorithm is based on three principles: prioritize tasks to speculate, select fast nodes to run on, and cap speculative tasks to prevent thrashing. The progress rate of a task is estimated as ProgressRate = ProgressScore / ExecutionTime, and the time left for a task is estimated from the progress score provided by Hadoop as (1 - ProgressScore) / ProgressRate. The authors evaluated the performance of the LATE scheduling algorithm, which improves the performance of Hadoop and works better than the existing MapReduce scheduling algorithms. The results demonstrate that the algorithm is both more accurate and more efficient than other algorithms in the literature.
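
The two estimates just described can be written down directly; the Task record and the candidate-selection helper below are illustrative, not Hadoop's actual classes.

import java.util.Comparator;
import java.util.List;

public class LateEstimator {
    record Task(String id, double progressScore, double executionTimeSec) {}

    /** Time left = (1 - ProgressScore) / ProgressRate, with ProgressRate = ProgressScore / ExecutionTime. */
    static double estimatedTimeLeft(Task t) {
        double progressRate = t.progressScore() / t.executionTimeSec();
        return (1.0 - t.progressScore()) / progressRate;
    }

    /** The running task with the longest approximate time to end is the speculation candidate. */
    static Task pickSpeculationCandidate(List<Task> running) {
        return running.stream()
                      .max(Comparator.comparingDouble(LateEstimator::estimatedTimeLeft))
                      .orElse(null);
    }
}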

 

E. Matchmaking: A New MapReduce Scheduling Technique [5] develops a new MapReduce scheduling technique to enhance the data locality of map tasks. The technique is integrated into both the Hadoop default FIFO scheduler and the Hadoop Fair Scheduler and, unlike the delay algorithm, it does not require an intricate parameter-tuning process [5]. The main idea behind the technique is to give every slave node a fair chance to grab local tasks before any non-local task is assigned to any slave node. Since the algorithm tries to find a match, i.e., a slave node that contains the input data, for every unassigned map task, the authors call the new technique the matchmaking scheduling algorithm [5]. The matchmaking algorithm improves the data locality rate and the average response time of MapReduce clusters.
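
An illustrative sketch of the matchmaking idea described in [5]: a node first gets a chance to pick up a map task whose input data it stores locally, and only after a node has passed on that chance is a non-local task handed to it. The class and field names below are assumptions, not the authors' code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class MatchmakingSketch {
    static class MapTask { String id; Set<String> hostsWithInput = new HashSet<>(); }

    private final List<MapTask> unassigned = new ArrayList<>();
    private final Set<String> nodesGivenLocalityChance = new HashSet<>();

    MapTask assignTask(String requestingNode) {
        // 1. Prefer a task whose input split lives on the requesting node.
        for (Iterator<MapTask> it = unassigned.iterator(); it.hasNext(); ) {
            MapTask t = it.next();
            if (t.hostsWithInput.contains(requestingNode)) { it.remove(); return t; }
        }
        // 2. No local task: let the node pass once, so every node keeps its chance at local work.
        if (nodesGivenLocalityChance.add(requestingNode)) return null;
        // 3. The node has already passed once; fall back to any remaining (non-local) task.
        return unassigned.isEmpty() ? null : unassigned.remove(0);
    }
}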

 

 

F. HFSP: Bringing Size-Based Scheduling to Hadoop [6] highlights a novel approach to the resource allocation problem based on the idea of size-based scheduling. The effort materialized in a full-fledged scheduler called HFSP, the Hadoop Fair Sojourn Protocol, which implements a size-based discipline that simultaneously satisfies system responsiveness and fairness requirements. HFSP uses a simple and practical design: size estimation trades accuracy for speed, and starvation is largely alleviated by introducing the mechanisms of virtual time and aging [6]. This work raised many challenges, of which the most noteworthy were evaluating job sizes online without wasting resources, avoiding job starvation for both small and large jobs, and guaranteeing short response times despite estimation errors.

 

G. In Job Size-Based Scheduler for Efficient Task Assignments in Hadoop [7], the MapReduce paradigm and its open-source implementation, Hadoop, are highlighted as an emerging standard for large-scale data-intensive processing in both industry and academia [7]. A MapReduce cluster is typically shared among multiple users with different types of workloads. When a flock of jobs is concurrently submitted to a MapReduce cluster, they compete for the shared resources, and the overall system performance, in terms of job response times, may be seriously degraded. To address this issue, the authors proposed a new Hadoop scheduler which leverages knowledge of workload patterns to reduce average job response times by dynamically tuning the resource shares among users and the scheduling algorithm for each user [7].

 

 

H. SAMR [11] is a Self-Adaptive MapReduce scheduling algorithm which calculates the progress of tasks dynamically and adapts to the continuously varying environment automatically. When a job is committed, SAMR splits the job into many fine-grained map and reduce tasks, then assigns them to a series of nodes. Meanwhile, it reads historical information that is stored on every node and updated after every execution. SAMR then adjusts the time weight of each stage of the map and reduce tasks according to this historical information. Thus, it estimates the progress of each task accurately and finds which tasks need backup tasks. The algorithm decreases the execution time of MapReduce jobs, especially in heterogeneous environments: it selects slow tasks and launches backup tasks accordingly while classifying nodes correctly, saving a lot of system resources. However, it does not consider the datasets and job types, which can also affect the stage weights of map and reduce tasks [11].
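
A compact sketch of stage-weighted progress estimation in the spirit of SAMR; the default weights and the exponential-averaging update below are assumptions for illustration, not the values used in [11].

public class StageWeightedProgress {
    // Reduce-side stage weights (copy, sort, reduce); they should sum to 1.0.
    private double wCopy = 0.4, wSort = 0.2, wReduce = 0.4;

    /** Progress of a reduce task given the current stage (0 = copy, 1 = sort, 2 = reduce)
     *  and the fraction completed inside that stage. */
    double reduceProgress(int stage, double fractionInStage) {
        double[] w = {wCopy, wSort, wReduce};
        double done = 0.0;
        for (int s = 0; s < stage; s++) done += w[s];   // fully finished stages
        return done + w[stage] * fractionInStage;       // plus the partial current stage
    }

    /** Update the copy-stage weight from the observed share of time that stage actually took. */
    void updateCopyWeight(double observedShare, double historyFactor) {
        wCopy = historyFactor * wCopy + (1.0 - historyFactor) * observedShare;
    }
}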

 

I. Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters [13]: current Hadoop only allows a static slot configuration, i.e., fixed numbers of map slots and reduce slots throughout the lifetime of a cluster. However, such a static configuration may lead to low system resource utilization as well as long completion length. Motivated by this, the authors proposed simple yet effective schemes, named TuMM, which use the slot ratio between map and reduce tasks as a tunable knob for reducing the makespan of a given set of jobs. By leveraging the workload information of recently completed jobs, the schemes dynamically allocate resources (slots) to map and reduce tasks. The main objective of TuMM is to improve resource utilization and reduce the makespan of multiple jobs. The experimental results demonstrate up to a 28 percent reduction in makespan and a 20 percent increase in resource utilization. The effectiveness and robustness of the new slot management schemes are validated under both homogeneous and heterogeneous cluster environments, although the performance gain is lower in the homogeneous case [13].
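
A minimal sketch of using recent workload to split a node's slots between map and reduce tasks, in the spirit of TuMM; the workload measure (pending work in task-seconds) and the rounding policy are assumptions, not the authors' scheme from [13].

public class SlotRatioTuner {
    /** Split the total slots on a node in proportion to outstanding map vs. reduce work. */
    static int[] splitSlots(int totalSlots, double pendingMapWork, double pendingReduceWork) {
        double total = pendingMapWork + pendingReduceWork;
        int mapSlots = total == 0 ? totalSlots / 2
                                  : (int) Math.round(totalSlots * pendingMapWork / total);
        mapSlots = Math.max(1, Math.min(totalSlots - 1, mapSlots)); // keep at least one slot of each kind
        return new int[] { mapSlots, totalSlots - mapSlots };
    }
}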

 

J. Adaptive Task Scheduling Strategy Based on Dynamic Workload Adjustment (ATSDWA) [14]: TaskTrackers in ATSDWA can adapt to changes of load at runtime, obtain tasks in accordance with their own computing ability, and achieve self-regulation, while avoiding the algorithmic complexity that is the prime reason the JobTracker becomes the system performance bottleneck. ATSDWA significantly benefits both TaskTrackers and the JobTracker. On the TaskTracker side, task execution time is reduced, node performance is more stable, the task failure rate is clearly decreased, and both starvation and saturation are avoided at the same time. On the JobTracker side, failure of the JobTracker due to overloading can be avoided. ATSDWA is applicable to both homogeneous and heterogeneous clusters and can improve the overall task throughput of a cluster without bringing extra load to TaskTrackers. Fault tolerance and reliability still have to be improved [14].

 

 

K. gSched [15] is a resource-aware Hadoop scheduler that takes into account both the heterogeneity of computing resources and provisioning charges when allocating tasks in cloud computing environments. gSched is first evaluated on an experimental Hadoop cluster and demonstrates enhanced performance compared with the default Hadoop scheduler. Evaluations conducted on the Amazon EC2 cloud demonstrate the effectiveness of gSched for task allocation in heterogeneous cloud computing environments. gSched is intended to exploit heterogeneous capabilities in a resource-effective manner during task scheduling: execution cost and running time are reduced, and it uses characteristic-based allocation of jobs to nodes. However, the computation cost of gSched itself is high [15].

 

L. Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud [12] addresses resource allocation and job scheduling for a data analytics system in the cloud so as to embrace the heterogeneity of the underlying platforms and workloads. The authors propose a metric of share in a heterogeneous cluster to realize a scheduling scheme that achieves both high performance and fairness [12]. In the architecture, participating nodes are grouped into one of two pools: (1) long-living core nodes that host both data and computations, and (2) accelerator nodes that are added to the cluster temporarily when additional computing power is needed.

 

M. The Hadoop Fair Scheduler [4] takes the number of slots assigned to a job as its metric of share and provides fairness by assigning each job the same number of slots. This paper presents a system architecture to allocate resources to such a cluster in a cost-effective manner and discusses a scheduling scheme that provides good performance and fairness simultaneously in a heterogeneous cluster by adopting progress share as the share metric.

 

N. MapTask Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality [18] presents a scheduler that asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. The Fair Scheduler in Hadoop is the de facto standard, in which the delay scheduling technique is used to improve locality: when a machine requests a new task, if the job that should be scheduled next according to fairness does not have an available local task for this machine, the job is temporarily skipped and the machine checks the next job in the list. Since machines free up quickly, more tasks are served locally. The paper also considers MaxWeight scheduling for resource allocation in clouds and extends the throughput-optimality result to general service time distributions, independently establishing a similar result in that setting. The proposed algorithm is both throughput-optimal and heavy-traffic optimal; the proof technique is novel since non-preemptive task execution and random service times are involved. Simulation results evaluate the throughput performance and the delay performance for a large range of total arrival rates [18].
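
The delay scheduling behaviour described above can be sketched as follows; the Task and Job structures and the bounded skip counter are assumptions made for illustration, not the Fair Scheduler's actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class DelaySchedulingSketch {
    static class Task { String id; Set<String> hostsWithLocalData; }
    static class Job { Deque<Task> pending = new ArrayDeque<>(); int skipCount = 0; }

    /** Assign a task to the requesting host, skipping jobs without local tasks a bounded number of times. */
    static Task assign(List<Job> jobsInFairShareOrder, String host, int maxSkips) {
        for (Job job : jobsInFairShareOrder) {
            // Prefer a pending task whose input data is stored on the requesting host.
            for (Task t : job.pending) {
                if (t.hostsWithLocalData.contains(host)) {
                    job.pending.remove(t);
                    job.skipCount = 0;
                    return t;
                }
            }
            // No local task: skip this job for now, hoping a local slot frees up soon.
            if (job.skipCount++ < maxSkips) continue;
            // The job has waited long enough; give it a non-local task.
            job.skipCount = 0;
            if (!job.pending.isEmpty()) return job.pending.poll();
        }
        return null;
    }
}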

 

 

O. A speculative approach to spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud environment [17] presents an adaptive method aiming at spatial-temporal efficiency in a heterogeneous cloud environment. A prediction model based on an optimized Kernel-based Extreme Learning Machine algorithm is proposed for faster forecasting of job execution duration and space occupation, which consequently facilitates task scheduling through a multi-objective algorithm called time and space optimized NSGA-II (TS-NSGA-II). An adaptive algorithm called heterogeneity-aware partitioning (HAP) was designed for managing the distribution of tasks based on estimated work thresholds. The experimental results show that both models achieve good performance: about 47-55 s were saved during the experiments, and in terms of storage efficiency only a difference of 1.254 in hard disk occupation was observed among all scheduled reducers, a 26.6% improvement over the original scheme [17].

 

 

P. A Hybrid Chemical Reaction Optimization Scheme for Task Scheduling on Heterogeneous Computing Systems [19] proposes a hybrid chemical reaction optimization (HCRO) approach for DAG scheduling on heterogeneous computing systems. The algorithm incorporates the CRO technique to search the execution order of tasks while using a heuristic method to map tasks to computing processors. By doing so, the proposed HCRO scheduling algorithm can achieve good performance without incurring a high scheduling overhead. The heterogeneous computing system consists of a set P of m heterogeneous processors, P1, P2, ..., Pm, which are fully interconnected with a high-speed network. Each task in a DAG application can only be executed on one processor, and the communication time between two dependent tasks must be taken into account if they are assigned to different processors [19]. Both simulation and real-life experiments were conducted to verify the effectiveness of HCRO; the results show that the HCRO algorithm schedules DAG tasks much better than the existing algorithms in terms of makespan and speed of convergence [19].
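
As an illustration of the kind of heuristic processor-mapping step mentioned above, a generic earliest-finish-time rule might look as follows; the cost arrays and the mapping rule are assumptions for illustration, not the authors' exact method from [19].

public class EftMapping {
    /**
     * @param t          index of the task to place
     * @param compCost   compCost[t][p]  = computation time of task t on processor p
     * @param readyTime  readyTime[t][p] = earliest time task t's inputs are available on p
     *                                     (includes communication from already-placed predecessors)
     * @param procFree   procFree[p]     = time at which processor p becomes idle
     * @return index of the processor giving the earliest finish time for task t
     */
    static int mapTask(int t, double[][] compCost, double[][] readyTime, double[] procFree) {
        int best = 0;
        double bestFinish = Double.MAX_VALUE;
        for (int p = 0; p < procFree.length; p++) {
            double start = Math.max(readyTime[t][p], procFree[p]);
            double finish = start + compCost[t][p];
            if (finish < bestFinish) { bestFinish = finish; best = p; }
        }
        return best;
    }
}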

 

Q. An Adaptive Scheduling Algorithm for Heterogeneous Hadoop Systems [16] translates the constraint model into the Optimization Programming Language according to the jobs and resources dynamically, then uses the OPL optimizer to obtain a schedule for the jobs, and finally passes the scheduling information to the Hadoop simulator MRSG to evaluate the efficiency of Hadoop under that schedule. The Hadoop system uses Earliest Deadline First (EDF) as its scheduling algorithm, so it does not need to analyze the execution order of the jobs; CPS, on the other hand, needs an algorithm that can model the constraints and solve the optimization problem. The author proposes an adaptive CP-based scheduling algorithm named CPS, which uses system information to make scheduling decisions. The proposed scheduling schemes use the slot ratio between map and reduce tasks as a tunable knob for reducing the completion length of a given data set in both homogeneous and heterogeneous Hadoop clusters [16].
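
A minimal sketch of the Earliest Deadline First ordering referred to above; the Job record is illustrative, not a Hadoop class.

import java.util.Comparator;
import java.util.PriorityQueue;

public class EdfQueue {
    record Job(String id, long deadlineMillis) {}

    private final PriorityQueue<Job> queue =
            new PriorityQueue<>(Comparator.comparingLong(Job::deadlineMillis));

    void submit(Job j) { queue.add(j); }

    /** The job whose deadline expires soonest is always scheduled next. */
    Job next() { return queue.poll(); }
}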

 

 

7. CONCLUSION

This paper describes job scheduling in heterogeneous MapReduce Hadoop environments, taking into account both task execution times and running costs. The surveyed schedulers aim to improve the execution efficiency of heterogeneous Hadoop clusters, realize reasonable utilization of cluster resources, prevent abnormal states during task execution, and decrease the number of jobs that miss their deadlines. The various adaptive methodologies vary the execution speed, resource capability and allocation depending on the time, the state of the environment and the nature of the job, which increases efficiency. Future work is to evaluate the performance of adaptive methodologies in job scheduling.
