Fault Tolerance


Published: Jul 17, 2018


Today there is demand for a highly secure virtual grid in which any resource can be shared from any cluster, even when a fault exists in the system. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it is aimed at large-scale systems that may span organizational boundaries. In addition to the challenges of managing and scheduling grid applications, reliability challenges arise because of the unreliable nature of grid infrastructure. A fault, whether caused by link failure, resource failure, or any other reason, must be tolerated for the system to work smoothly and accurately. Such faults can be detected and recovered from using a variety of techniques, chosen to suit the situation. An appropriate fault detector can avoid losses from a system crash, and a reliable fault tolerance technique can prevent system failure. Fault tolerance is an important property for achieving reliability, availability, and quality of service (QoS).


The fault tolerance mechanism used here sets job checkpoints based on the resource failure rate. If a resource fails, the job is restarted on another grid resource from its last successful state, using a checkpoint file. Selecting optimal checkpointing intervals is important for minimizing the runtime of the application in the presence of system failures. When a resource fails, the Fault Index based rescheduling algorithm moves the job from the failed resource to another available resource with the lowest fault index value and executes the job from the last saved checkpoint. This helps the job complete within its deadline, increases throughput, and helps make the grid environment trustworthy.
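The rescheduling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Resource` class, its fields, and the `reschedule` function are hypothetical names introduced here, and the source does not specify how ties or an empty candidate list are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    fault_index: int        # assumed: higher means the resource has failed more often
    available: bool = True


def reschedule(failed: Resource, resources: list[Resource]) -> Optional[Resource]:
    """Pick the available resource with the lowest fault index, excluding the failed one."""
    candidates = [r for r in resources if r.available and r is not failed]
    if not candidates:
        return None  # assumption: no resource left to take over the job
    return min(candidates, key=lambda r: r.fault_index)


resources = [
    Resource("R1", fault_index=4),
    Resource("R2", fault_index=1),
    Resource("R3", fault_index=0, available=False),
    Resource("R4", fault_index=2),
]
target = reschedule(resources[0], resources)
print(target.name)  # R2; the job would resume there from its last saved checkpoint
```

R3 has the lowest fault index but is unavailable, so the job moves to R2.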

Grid computing is a term referring to the combination of computer resources from multiple administrative domains to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Although a grid can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware. A grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations. Management of these resources is an important part of the infrastructure in a grid computing environment.

To achieve the promising potential of computational grids, fault tolerance is fundamentally important, since the resources are geographically distributed. Moreover, the probability of a failure is much greater than in traditional parallel computing, and the failure of a resource can be fatal to job execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, and it makes the system more dependable. Fault tolerance services are essential to satisfying QoS requirements in grid computing, and they deal with various types of resource failures, including process failures and network failures.

One of the important parameters in a checkpointing system that provides fault tolerance is the checkpointing interval, that is, the period between successive checkpoints of the application’s state. Smaller checkpointing intervals increase application execution overhead because checkpoints are taken more often, while larger checkpointing intervals increase the time needed for recovery after a failure. Hence, the optimal checkpointing interval, the one that minimizes application execution time in the presence of failures, has to be determined.
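The trade-off above can be made concrete with a small calculation. The source does not give a formula for the optimal interval; the sketch below uses Young's first-order approximation, a commonly cited choice, purely as an illustration. The cost model, the function names, and the example values (30 s per checkpoint, one failure every 6 hours) are all assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_overhead(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Rough overhead fraction under a first-order model.

    checkpoint_cost / interval is the time spent writing checkpoints;
    interval / (2 * mtbf) is the expected rework, assuming that on average
    half an interval of work is lost per failure.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


C, MTBF = 30.0, 6 * 3600.0  # assumed example values, in seconds
best = young_interval(C, MTBF)
for interval in (best / 4, best, best * 4):
    print(f"{interval:8.0f} s -> overhead {expected_overhead(interval, C, MTBF):.3f}")
```

Under this simplified model, intervals both shorter and longer than the approximate optimum give a higher overhead, which matches the qualitative argument in the text.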


1. If a fault occurs at a grid resource, the job is rescheduled on another resource, which eventually results in failing to satisfy the user’s QoS requirement, i.e. the deadline. The reason is simple: because the job is re-executed, it consumes more time.

2. In computational grid environments, there are resources that fulfill the deadline constraint but have a tendency toward faults. In such a scenario, the grid scheduler goes ahead and selects the same resource merely because it promises to meet the user requirements of the grid jobs. This eventually results in compromising the user's QoS parameters in order to complete the job.

3. A running task should finish by its deadline even if there is a fault in the system. The deadline is the major issue in a real-time system, because a task that does not finish before its deadline has no value.

4. In a real-time distributed system, availability of end-to-end services means the ability to withstand failures or systematic attacks without impacting customers or operations.

5. Scalability is the ability to handle a growing amount of work, and the capability of a system to increase total throughput under an increased load when resources are added.

An adaptive checkpointing fault tolerance approach is used in this scenario to overcome the above-mentioned drawbacks. In this approach, fault occurrence information is maintained for every resource. When a fault occurs, the fault occurrence information of that resource is updated. This information is then used when deciding which resources to allocate to a job. Checkpointing is one of the most popular techniques for providing fault tolerance on unreliable systems. A checkpoint is a recorded snapshot of the entire system state, taken so that the application can be restarted after a failure. The checkpoint can be stored on temporary as well as stable storage. However, the efficiency of the mechanism depends strongly on the length of the checkpointing interval. Frequent checkpointing increases overhead, while lazy checkpointing may lead to the loss of significant computation. Hence, deciding on the checkpointing interval and the checkpointing technique is a complicated task and should be based on knowledge of both the application and the system.

Checkpoint-recovery depends on the system’s MTTR. It periodically saves the state of the application on stable storage, usually a hard disk. After a crash, the application is restarted from the last checkpoint rather than from the beginning. There are three checkpointing strategies: coordinated checkpointing, uncoordinated checkpointing, and communication-induced checkpointing.

1. In coordinated checkpointing, processes synchronize their checkpoints to ensure that their saved states are consistent with each other, so that the overall combined saved state is also consistent.

2. In uncoordinated checkpointing, by contrast, processes schedule checkpoints independently at different times and do not account for messages.

3. Communication-induced checkpointing attempts to coordinate only selected critical checkpoints.
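The basic save-and-restart cycle can be illustrated with a single-process sketch. It does not implement any of the three multi-process strategies listed above; it only shows periodic checkpointing to disk and restart from the last checkpoint. The file layout, the state dictionary, and the `crash_at` parameter used to simulate a failure are assumptions made for this example.

```python
import os
import pickle
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temporary file first, then rename, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_step": 0, "total": 0}


def run_job(path: str, steps: int, interval: int, crash_at: Optional[int] = None) -> int:
    """Sum the step numbers, checkpointing after every `interval` completed steps."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if step == crash_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(ckpt, steps=10, interval=3, crash_at=7)
except RuntimeError:
    pass  # crashed at step 7; the last checkpoint covers steps 0..5
result = run_job(ckpt, steps=10, interval=3)
print(result)  # 45, the same as an uninterrupted run
```

Step 6 was completed before the crash but not yet checkpointed, so it is re-executed on restart. That lost work is the recovery cost that grows with the checkpointing interval.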

Comparative analysis of existing techniques:

A grid resource is a member of a grid and it offers computing services to grid users. Grid users register themselves to the Grid Information Server (GIS) of a grid by specifying QoS requirements such as the deadline to complete the execution, the number of processors, type of operating system and so on.

The components used in the architecture are described below:

Scheduler- The scheduler is an important entity of a grid. It receives jobs from grid users, selects feasible resources for those jobs according to information acquired from GIS, and then generates job-to-resource mappings. When the schedule manager receives a grid job from a user, it gets the details of available grid resources from GIS. It then passes the available resource list to the entities in the MTTR scheduling strategy. The Matchmaker entity matches the resources against the job requirements. The Response Time Estimator entity estimates the response time for the job on each matched resource based on the transfer time, queue wait time, and service time of the job. The resource selector selects the resource with the minimum response time. A job dispatcher dispatches the jobs one by one to the checkpoint manager.
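The matchmaking and selection steps can be sketched as follows. The source only says that response time is estimated "based on" transfer time, queue wait time, and service time; treating it as their plain sum is an assumption, as are the `Candidate` class, the processor-count requirement used for matchmaking, and the example values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    processors: int
    transfer_time: float    # seconds to move the job's input to the resource
    queue_wait_time: float  # seconds the job is expected to wait in the queue
    service_time: float     # seconds the job is expected to run

    def response_time(self) -> float:
        # Assumption: the estimate is the plain sum of the three components.
        return self.transfer_time + self.queue_wait_time + self.service_time


def matchmake(candidates: list[Candidate], required_processors: int) -> list[Candidate]:
    """Keep only resources that satisfy the job's (hypothetical) processor requirement."""
    return [c for c in candidates if c.processors >= required_processors]


def select_resource(candidates: list[Candidate], required_processors: int) -> Optional[Candidate]:
    """Return the matched resource with the minimum estimated response time."""
    matched = matchmake(candidates, required_processors)
    return min(matched, key=Candidate.response_time) if matched else None


pool = [
    Candidate("A", processors=8, transfer_time=5, queue_wait_time=20, service_time=100),
    Candidate("B", processors=4, transfer_time=10, queue_wait_time=2, service_time=90),
    Candidate("C", processors=2, transfer_time=1, queue_wait_time=1, service_time=50),
]
chosen = select_resource(pool, required_processors=4)
print(chosen.name, chosen.response_time())  # B 102
```

Resource C would be fastest, but it fails matchmaking, so the selector picks B over A.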

GIS- GIS contains information about all available grid resources. It maintains details of the resource such as processor speed, memory available, load and so on. All grid resources that join and leave the grid are monitored by GIS. Whenever a scheduler has jobs to execute, it consults GIS to get information about available grid resources.

Checkpoint Manager- It receives the scheduled job from the scheduler and sets checkpoints based on the failure rate of the resource on which the job is scheduled. Then it submits the job to the resource. The checkpoint manager receives a job completion message or a job failure message from the grid resource and responds accordingly. If a job fails during execution, the job is rescheduled from the last checkpoint instead of running from scratch. The checkpoint manager implements a checkpoint setter algorithm to set job checkpoints.
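The source does not spell out how the checkpoint manager turns a failure rate into checkpoints. The sketch below shows one hypothetical rule, in which the interval shrinks as the resource's fault index grows. The formula, the function names, and the default values are assumptions for illustration only and are not the algorithm used by the described system.

```python
def checkpoint_interval(base_interval: float, fault_index: int,
                        min_interval: float = 60.0) -> float:
    """Hypothetical rule: divide the base interval by (1 + fault index),
    but never go below min_interval seconds."""
    return max(min_interval, base_interval / (1 + max(fault_index, 0)))


def checkpoint_times(job_length: float, interval: float) -> list[float]:
    """Times (seconds from job start) at which a checkpoint would be taken."""
    times = []
    t = interval
    while t < job_length:
        times.append(t)
        t += interval
    return times


reliable = checkpoint_interval(1800.0, fault_index=0)   # 1800.0
unreliable = checkpoint_interval(1800.0, fault_index=2)  # 600.0
print(checkpoint_times(3600.0, reliable))    # [1800.0]
print(checkpoint_times(3600.0, unreliable))  # [600.0, 1200.0, 1800.0, 2400.0, 3000.0]
```

A job on a resource with no recorded faults is checkpointed once in an hour, while the same job on a more failure-prone resource is checkpointed five times.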

Checkpoint Server- At each checkpoint set by the checkpoint manager, the job status is reported to the checkpoint server. The checkpoint server saves the job status and returns it on demand, i.e., when a job or resource fails. For a particular job, the checkpoint server discards the result of the previous checkpoint when a new checkpoint result is received.
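The keep-only-the-latest behavior described above maps naturally onto a dictionary keyed by job. The following is a minimal in-memory sketch; the class name, method names, and the shape of the status record are assumptions, and a real server would persist the data rather than hold it in memory.

```python
from typing import Any, Optional


class CheckpointServer:
    """Keeps only the most recent checkpoint result for each job."""

    def __init__(self) -> None:
        self._latest: dict[str, Any] = {}

    def report(self, job_id: str, status: Any) -> None:
        # Overwriting the entry discards the previous checkpoint for this job.
        self._latest[job_id] = status

    def fetch(self, job_id: str) -> Optional[Any]:
        """Return the last reported status, or None if the job never checkpointed."""
        return self._latest.get(job_id)


server = CheckpointServer()
server.report("job-1", {"completed_steps": 100})
server.report("job-1", {"completed_steps": 250})
print(server.fetch("job-1"))  # {'completed_steps': 250}
print(server.fetch("job-2"))  # None
```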

Fault Index Manager- The Fault Index Manager maintains the fault index value of each resource, which indicates the failure rate of the resource. The fault index of a grid resource is incremented every time the resource fails to complete the assigned job within the deadline, and also on resource failure. The fault index of a resource is decremented whenever the resource completes the assigned job within the deadline. The fault index manager updates the fault index of a grid resource using the fault index update algorithm.
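The update rule described above can be sketched as a single function. The step size of 1 and the floor at zero are assumptions; the source states only that the index is incremented or decremented.

```python
def update_fault_index(fault_index: int, completed: bool, within_deadline: bool) -> int:
    """Return the new fault index after one job outcome on a resource.

    Decrement on a job completed within the deadline; increment on a missed
    deadline or a resource failure (completed=False).
    """
    if completed and within_deadline:
        return max(0, fault_index - 1)  # assumption: the index never goes below zero
    return fault_index + 1


index = 0
index = update_fault_index(index, completed=False, within_deadline=False)  # resource failure
index = update_fault_index(index, completed=True, within_deadline=False)   # missed deadline
index = update_fault_index(index, completed=True, within_deadline=True)    # success
print(index)  # 1
```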

Checkpoint Replication Server- When a new checkpoint is created, the Checkpoint Replication Server initiates CRS, which replicates the created checkpoints to remote resources by applying RRSA. Once the checkpoints are replicated, the details are stored in the Checkpoint Server. To obtain information about all checkpoint files, the Replication Server queries the Checkpoint Server. During the entire application runtime, CRS monitors the Checkpoint Server to detect newer checkpoint versions. Information about available resources, hardware, memory, and bandwidth is obtained from GIS. The NWS and Ganglia tools are used to determine these details, and they periodically propagate the required details to the GIS. Depending on transfer sizes, the available storage of the resources, and current bandwidths, CRS selects a suitable resource using RRSA to replicate the checkpoint file.
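The source names RRSA but does not describe its steps, so the sketch below is not RRSA. It is a hypothetical selection rule built only from the three inputs the text mentions (transfer size, available storage, and current bandwidth): discard resources that cannot hold the file, then pick the one with the shortest estimated transfer time. All names and values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageResource:
    name: str
    free_storage_mb: float
    bandwidth_mb_per_s: float


def pick_replica_target(resources: list[StorageResource],
                        checkpoint_size_mb: float) -> Optional[StorageResource]:
    """Hypothetical rule: among resources with enough free storage,
    choose the one with the lowest estimated transfer time."""
    fits = [r for r in resources
            if r.free_storage_mb >= checkpoint_size_mb and r.bandwidth_mb_per_s > 0]
    if not fits:
        return None
    return min(fits, key=lambda r: checkpoint_size_mb / r.bandwidth_mb_per_s)


remote = [
    StorageResource("S1", free_storage_mb=100, bandwidth_mb_per_s=50),
    StorageResource("S2", free_storage_mb=2000, bandwidth_mb_per_s=10),
    StorageResource("S3", free_storage_mb=5000, bandwidth_mb_per_s=25),
]
target_store = pick_replica_target(remote, checkpoint_size_mb=500)
print(target_store.name)  # S3: S1 lacks space, and S3 transfers faster than S2
```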

Results and discussion:

Throughput- Throughput is one of the most important standard metrics used to measure the performance of fault-tolerant systems. Throughput is defined as:

$\text{Throughput}(n) = n / T_n$, where $n$ is the total number of jobs submitted and $T_n$ is the total amount of time necessary to complete $n$ jobs. Throughput is used to measure the ability of the grid to accommodate jobs. Generally, the throughput of the two systems decreases as the percentage of faults injected into the grid increases. This is due to the extra delay that both of them encounter in completing jobs when some resources fail.
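As a small worked example of the throughput definition above, the following sketch computes throughput for the same batch of jobs with and without extra delay from faults. The job counts and times are made-up illustrative numbers, not measurements from the source.

```python
def throughput(jobs_completed: int, total_time: float) -> float:
    """Throughput(n) = n / T_n, in jobs per unit of time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return jobs_completed / total_time


no_faults = throughput(120, 600.0)    # 120 jobs in 600 s
with_faults = throughput(120, 800.0)  # same jobs, delayed by recovery after failures
print(no_faults, with_faults)  # 0.2 0.15
```

The same number of jobs takes longer when failures force recovery, so throughput drops.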

Failure tendency- It is the percentage of the tendency of the selected grid resources to fail and is defined as:

$\text{Fail tendency} = \dfrac{\sum_{j=1}^{m} Pf_j}{m} \times 100\%$, where $m$ is the total number of grid resources and $Pf_j$ is the failure rate of resource $j$. Through this metric, the faulty behavior of the system can be anticipated.

Conclusion:


Fault tolerance is an important problem in all distributed environments. The proposed work achieves fault tolerance by dynamically adapting the checkpoint frequency based on the history of failure information and job execution time, which reduces checkpoint overhead and also increases throughput. In addition, the following have been proposed to make the grid environment more dependable and trustworthy: new fault detection methods, a client-transparent fault-tolerant architecture, on-demand fault-tolerant techniques, an economic fault-tolerant model, an optimal failure prediction system, a multiple-fault-tolerant model, and a self-adaptive fault tolerance framework.

This essay was reviewed by Dr. Oliver Johnson

Cite this Essay

Fault Tolerance. (2018, November 06). GradesFixer. Retrieved May 20, 2024, from