Fault Tolerance in Check-pointing Approach


Published: Jul 17, 2018

Today there is growing demand for a highly secure virtual grid in which any resource from any cluster can be shared even when a fault is present in the system. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it targets large-scale systems that may span organizational boundaries. In addition to the challenges of managing and scheduling applications, reliability challenges arise from the unreliable nature of grid infrastructure. A fault can occur because of a link failure, a resource failure, or some other cause, and it must be tolerated so that the system keeps working smoothly and accurately without interrupting the current job. Many techniques are used to detect and recover from such faults. An appropriate fault detector can prevent the losses caused by a system crash, and a reliable fault tolerance technique can protect the system from failure. Fault tolerance is therefore an important property for achieving reliability, availability, and QoS. The fault tolerance mechanism used here sets job checkpoints based on the resource failure rate. If a resource fails, the job is restarted on another grid resource from its last successful state, using a checkpoint file. Selecting optimal checkpointing intervals is important for minimizing an application’s runtime in the presence of system failures. In case of resource failure, a fault-index-based rescheduling algorithm moves the job from the failed resource to another available resource with the lowest fault index value and resumes it from the most recently saved checkpoint. This helps the job complete within its deadline, increases throughput, and helps make the grid environment trustworthy.

Grid computing refers to the aggregation of computer resources from multiple administrative domains to reach a common goal. A grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Although a grid can be dedicated to a specialized application, it is more common for a single grid to be used for a variety of purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware. The grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources owned by different organizations, including supercomputers, storage systems, data sources, and specialized devices. Managing these resources is an important part of the infrastructure in a grid computing environment.

Fault tolerance is fundamentally important to achieving the promising potential of computational grids because the resources are geographically distributed. Moreover, the probability of resource failure is much greater than in traditional parallel computing, and a resource failure can be fatal to job execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, and it makes the system more dependable. A fault tolerance service is essential for satisfying QoS requirements in grid computing. It deals with various types of resource failure, including process failure, processor failure, and network failure.

The checkpointing interval, that is, the period between successive saves of the application’s state, is one of the important parameters of a checkpointing system that provides fault tolerance. Smaller checkpointing intervals increase application execution overhead because checkpoints are taken more often, while larger checkpointing intervals increase recovery time in the event of a failure. Hence, in the presence of failures, the optimal checkpointing interval that minimizes application execution time has to be determined.
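One widely used way to reason about this trade-off is Young's first-order approximation, which estimates the optimal interval as the square root of twice the checkpoint cost times the mean time between failures. The sketch below is an illustration only: the source does not say which formula it uses, and the function name and the numbers are assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost: time needed to write one checkpoint (seconds).
    mtbf: mean time between failures of the resource (seconds).
    Not taken from the source essay; shown only to illustrate the trade-off.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical numbers: 30 s per checkpoint, one failure every 6 hours on average.
interval = young_interval(checkpoint_cost=30.0, mtbf=6 * 3600.0)
print(f"checkpoint roughly every {interval / 60:.1f} minutes")
```

A less reliable resource (smaller MTBF) yields a shorter interval, which matches the intuition that failure-prone resources should be checkpointed more often.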


1. If a fault occurs at a grid resource, the job is rescheduled on another resource and re-executed. Because re-execution consumes additional time, the job eventually fails to satisfy the user’s QoS requirement, i.e., the deadline.

2. Some resources in a computational grid satisfy the deadline constraint but are prone to faults. In such a scenario, the grid scheduler still selects the same resource merely because it promises to meet the user’s requirements for the job. This eventually compromises the user’s QoS parameters in completing the job.

3. Even when there is a fault in the system, a running task should finish by its deadline. A task that does not finish before its deadline has no value. Hence, meeting deadlines is the major issue in real-time systems.

4. In a real-time distributed system, end-to-end services must remain available, and the system must be able to withstand failures or systematic attacks without impacting customers or operations.

5. The system must also be able to handle a growing amount of work and to increase total throughput under increased load when resources are added.


An adaptive checkpointing fault tolerance approach is used to overcome the drawbacks mentioned above. In this approach, every resource maintains fault tolerance information. When a fault occurs, the resource updates its fault occurrence information, and this information is used when deciding which resources to allocate to a job. Checkpointing is one of the most popular techniques for providing fault tolerance on unreliable systems. A checkpoint is a recorded snapshot of the entire system state that allows the application to be restarted after a failure. Checkpoints can be stored on temporary as well as stable storage. However, the efficiency of the mechanism depends strongly on the length of the checkpointing interval. Frequent checkpointing increases overhead, while lazy checkpointing may lead to the loss of a significant amount of computation. Hence, choosing the checkpointing interval and the checkpointing technique is a complicated task and should be based on knowledge of both the system and the application.

Checkpoint-recovery depends on the system’s MTTR. The state of an application is saved periodically to stable storage, usually a hard disk. After a crash, the application is restarted from the last checkpoint rather than from the beginning. There are three checkpointing strategies: coordinated checkpointing, uncoordinated checkpointing, and communication-induced checkpointing.

1. In coordinated checkpointing, processes synchronize their checkpoints to ensure that their saved states are consistent with each other, so that the overall combined saved state is also consistent.

2. In uncoordinated checkpointing, by contrast, processes schedule their checkpoints independently at different times and do not account for messages.

3. Communication-induced checkpointing attempts to coordinate only selected critical checkpoints.
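The basic save-and-restart cycle described above can be sketched for a single process as follows. This is a minimal illustration, not the mechanism of any particular grid middleware: the function names, the toy job (summing integers), and the simulated failure are all assumptions.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename, so that a crash in the
    # middle of a write never corrupts the last good checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    # Return the saved state if a checkpoint exists, else the initial state.
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, interval, fail_at=None):
    """Toy job: sum 0..total_steps-1, checkpointing every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated resource failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt")
    try:
        run_job(ckpt, total_steps=100, interval=10, fail_at=57)
    except RuntimeError:
        pass  # the "resource" failed; the last checkpoint was taken at step 50
    # Restart: work resumes from step 50 instead of step 0.
    result = run_job(ckpt, total_steps=100, interval=10)
    print(result)
```

Only the seven steps between the last checkpoint and the failure are recomputed on restart, which is the saving that checkpointing buys at the cost of the periodic writes.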


A grid resource is a member of a grid and offers computing services to grid users. Grid users register with the Grid Information Server (GIS) of a grid by specifying QoS requirements such as the deadline for completing execution, the number of processors, the type of operating system, and so on.

The components used in the architecture are described below:

Scheduler- The scheduler is an important entity of a grid. It receives jobs from grid users, selects feasible resources for those jobs according to the information received from GIS, and then generates job-to-resource mappings. When the schedule manager receives a grid job from a user, it gets the details of the available grid resources from GIS. It then passes the list of available resources to the entities of the MTTR scheduling strategy. The Matchmaker entity matches resources against job requirements. The Response Time Estimator entity estimates the response time of a job on each matched resource based on the transfer time, queue wait time, and service time of the job. The Resource Selector selects the resource with the minimum response time. The Job Dispatcher dispatches the jobs one by one to the checkpoint manager.
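The estimate-then-select step can be sketched as below. The source says only that the response time is based on transfer time, queue wait time, and service time; treating it as their plain sum is an assumption, as are the class, field, and resource names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ResourceEstimate:
    name: str
    transfer_time: float    # seconds to move the job and its data to the resource
    queue_wait_time: float  # seconds the job is expected to wait in the queue
    service_time: float     # seconds the resource needs to execute the job

    @property
    def response_time(self) -> float:
        # Assumption: the three components are simply added together.
        return self.transfer_time + self.queue_wait_time + self.service_time


def select_resource(matched: Sequence[ResourceEstimate]) -> ResourceEstimate:
    """Return the matched resource with the minimum estimated response time."""
    if not matched:
        raise ValueError("no matched resources")
    return min(matched, key=lambda r: r.response_time)


# Hypothetical resources that already passed matchmaking.
candidates = [
    ResourceEstimate("cluster-a", transfer_time=5.0, queue_wait_time=40.0, service_time=120.0),
    ResourceEstimate("cluster-b", transfer_time=12.0, queue_wait_time=10.0, service_time=130.0),
    ResourceEstimate("cluster-c", transfer_time=3.0, queue_wait_time=90.0, service_time=100.0),
]
best = select_resource(candidates)
print(best.name, best.response_time)
```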

GIS- GIS contains information about all available grid resources. It maintains details of resources such as processor speed, memory available, load, etc. All grid resources that join and leave the grid are monitored by GIS. A scheduler consults GIS to get information about available grid resources whenever it has jobs to execute.

Checkpoint Manager- It receives the scheduled job from the scheduler and sets checkpoints based on the failure rate of the resource on which the job is scheduled. It then submits the job to the resource. The checkpoint manager receives a job completion message or a job failure message from the grid resource and responds accordingly. If a job failure occurs during execution, the job is rescheduled from the last checkpoint instead of running from scratch.

Checkpoint Server- Job status is reported to the checkpoint server at each checkpoint set by the checkpoint manager. The checkpoint server saves the job status and returns it on demand, i.e., on job or resource failure. For a particular job, the checkpoint server discards the result of the previous checkpoint when a new checkpoint result is received.
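The keep-only-the-latest behavior of the checkpoint server can be sketched as an in-memory store. The class and method names are hypothetical, and a real server would persist the status to stable storage rather than hold it in a dictionary.

```python
class CheckpointServer:
    """Minimal sketch: stores only the most recent checkpoint per job."""

    def __init__(self):
        self._latest = {}

    def report(self, job_id, status):
        # A new report overwrites, and thereby discards, the previous one.
        self._latest[job_id] = status

    def fetch(self, job_id):
        # Called on job or resource failure; None means no checkpoint exists.
        return self._latest.get(job_id)


server = CheckpointServer()
server.report("job-1", {"completed_steps": 10})
server.report("job-1", {"completed_steps": 20})
print(server.fetch("job-1"))
```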

Fault Index Manager- It maintains the fault index value of each resource, which indicates the failure rate of that resource. The fault index of a resource is incremented every time the resource fails or does not complete its assigned job within the deadline. The fault index is decremented when the resource completes its assigned job within the deadline. The fault index manager updates the fault index of a grid resource using the fault index update algorithm.
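A rough sketch of this bookkeeping follows. The source does not give the update algorithm itself, so the step size of 1 and the floor at zero are assumptions, and the class, method, and resource names are hypothetical. The selection helper picks the available resource with the lowest fault index, as rescheduling requires.

```python
class FaultIndexManager:
    """Sketch of fault index bookkeeping; step size and floor are assumptions."""

    def __init__(self):
        self._index = {}

    def fault_index(self, resource):
        return self._index.get(resource, 0)

    def record(self, resource, completed_within_deadline):
        current = self.fault_index(resource)
        if completed_within_deadline:
            # Assumption: the index never drops below zero.
            self._index[resource] = max(0, current - 1)
        else:
            # Covers both a missed deadline and an outright resource failure.
            self._index[resource] = current + 1

    def least_faulty(self, available):
        """Pick the available resource with the lowest fault index."""
        if not available:
            raise ValueError("no available resources")
        return min(available, key=self.fault_index)


manager = FaultIndexManager()
manager.record("r1", completed_within_deadline=False)
manager.record("r1", completed_within_deadline=False)
manager.record("r2", completed_within_deadline=False)
manager.record("r2", completed_within_deadline=True)
print(manager.least_faulty(["r1", "r2"]))
```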

Checkpoint Replication Server- When a new checkpoint is created, the Checkpoint Replication Server (CRS) replicates it to remote resources by applying RRSA. After replication, the details are stored in the Checkpoint Server. To obtain information about all checkpoint files, CRS queries the Checkpoint Server. CRS monitors the Checkpoint Server throughout the application runtime to detect newer checkpoint versions. Information about available resources, hardware, memory, and bandwidth is obtained from GIS, to which the required details are periodically propagated. Using RRSA, CRS selects a suitable resource on which to replicate the checkpoint file, depending on the transfer size, the available storage of the resources, and the current bandwidth.
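The source does not describe RRSA itself, so the following is not that algorithm. It is only an illustrative sketch of a selection that uses the same three inputs (transfer size, available storage, and current bandwidth): resources that cannot hold the file are filtered out, and the one with the shortest estimated transfer time is chosen. All names and numbers are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReplicaTarget:
    name: str
    free_storage_mb: float  # storage currently available on the resource
    bandwidth_mb_s: float   # current bandwidth to the resource, in MB per second


def pick_replica_target(
    targets: Sequence[ReplicaTarget], checkpoint_mb: float
) -> Optional[ReplicaTarget]:
    """Choose where to replicate a checkpoint file of size checkpoint_mb."""
    # Keep only resources that can store the file and are reachable.
    fits = [
        t for t in targets
        if t.free_storage_mb >= checkpoint_mb and t.bandwidth_mb_s > 0
    ]
    if not fits:
        return None
    # Prefer the resource with the shortest estimated transfer time.
    return min(fits, key=lambda t: checkpoint_mb / t.bandwidth_mb_s)


# Hypothetical resource details, as they might be obtained from GIS.
targets = [
    ReplicaTarget("node-1", free_storage_mb=200.0, bandwidth_mb_s=50.0),
    ReplicaTarget("node-2", free_storage_mb=5000.0, bandwidth_mb_s=20.0),
    ReplicaTarget("node-3", free_storage_mb=8000.0, bandwidth_mb_s=35.0),
]
chosen = pick_replica_target(targets, checkpoint_mb=512.0)
print(chosen.name if chosen else "no suitable resource")
```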


Throughput- Throughput is one of the most important standard metrics for measuring the performance of fault-tolerant systems. It is defined as:

Throughput (n)=n/Tn

Where n is the total number of jobs submitted and Tn is the total amount of time required to complete n jobs. Throughput measures the ability of the grid to accommodate jobs. Generally, throughput decreases as the percentage of faults injected into the grid increases, because of the extra delay incurred in completing jobs when some resources fail.
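As a quick worked example of the definition, the helper below computes throughput for two hypothetical runs of the same 50 jobs, one fault-free and one in which failures stretch the total completion time. The numbers are made up for illustration.

```python
def throughput(n_jobs: int, total_time: float) -> float:
    """Throughput(n) = n / Tn, in jobs per unit of time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return n_jobs / total_time


# Hypothetical runs: the same 50 jobs take longer when resources fail.
fault_free = throughput(50, 200.0)
with_faults = throughput(50, 250.0)
print(fault_free, with_faults)
```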

Failure tendency- Failure tendency is the tendency of the selected grid resources to fail, expressed as a percentage, and is defined as:

Failure tendency = ((Σ_{j=1}^{m} Pfj) / m) × 100%

Where m is the total number of grid resources and Pfj is the failure rate of resource j. This metric indicates how much faulty behavior can be expected from the system.
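A small sketch of this metric follows, under the assumption that failure tendency is the mean of the per-resource failure rates Pfj over the m resources, expressed as a percentage. The function name and the sample rates are hypothetical.

```python
from typing import Sequence


def failure_tendency(failure_rates: Sequence[float]) -> float:
    """Mean failure rate over m resources, as a percentage.

    Assumes each Pfj is a fraction between 0 and 1.
    """
    m = len(failure_rates)
    if m == 0:
        raise ValueError("at least one resource is required")
    return sum(failure_rates) / m * 100.0


# Hypothetical failure rates of three selected resources.
tendency = failure_tendency([0.1, 0.2, 0.3])
print(f"{tendency:.1f}%")
```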

Fault tolerance is an important problem in all distributed environments. By dynamically adapting the checkpoint frequency based on the history of failure information and job execution time, the proposed work reduces checkpoint overhead and increases throughput, thereby achieving fault tolerance. To make the grid environment more dependable and trustworthy, the following have been proposed: new fault detection methods, a client-transparent fault tolerance architecture, on-demand fault tolerance techniques, an economic fault tolerance model, an optimal failure prediction system, a multiple-fault-tolerant model, and a self-adaptive fault tolerance framework.

This essay was reviewed by
Dr. Oliver Johnson

Cite this Essay

Fault tolerance in check-pointing approach. (2018, August 05). GradesFixer. Retrieved June 14, 2024, from