Enhancing Service Availability in Cloud Systems

Published: Feb 13, 2024

A Survey on Improving Service Availability of Cloud Systems
Review
Personal Insights
A Survey on Drizzle: Fast and Adaptable
Description
Review
Personal Insights
A Survey on Soteria: Automated IoT
Description
Review
Personal Insights
Conclusion
References

A Survey on Improving Service Availability of Cloud Systems

Cloud computing is a shared networking practice that relies on a network of remote servers hosted on the Internet. As a service, it has grown in popularity because of its reduced complexity and cost: there is no need to purchase the IT infrastructure, hardware, or licenses required to run a physical computer network [9]. Instead, cloud systems use many physical disk drive units as their primary storage component.

Many tech companies, such as Amazon, Google, and Microsoft, have incorporated cloud computing into the foundations of their applications’ online services. These applications are used by millions of people around the clock, so service availability is of the utmost importance. However, despite stringent service availability targets, these services are still prone to failure, resulting in customer dissatisfaction and revenue loss [9]. These failures may result from a variety of hardware issues, but the most prominent one is disk failure. Large-scale cloud systems typically use several hundred million disk drives, 20% to 57% of which have experienced at least one error over the past 4 to 6 years [9]. This percentage underscores the significance of these errors and the importance of predicting disk failures in order to improve the availability of cloud services. To achieve this, many proposed solutions use historical disk-level sensor data (SMART data) to predict disk failures and take preemptive actions, such as replacing faulty disks.

The proposed approaches all focus on predicting complete disk failure [9]. Unfortunately, before failure-prone disks are proactively replaced, a number of disk errors can occur, which can negatively affect upper-layer services. These errors, referred to as “gray failures,” typically go unnoticed while degrading the quality of cloud software.

This paper introduces CDEF (Cloud Disk Error Forecasting), an innovative approach to proactive disk failure prediction that uses both SMART data and error-reflected system-level signals to better detect these gray failures [9]. The approach was evaluated using data from Microsoft production cloud systems and shown to be an improvement over baseline methods, reducing Microsoft Azure virtual machine downtime by sixty-three thousand minutes per month.

Review

The authors faced two major challenges when designing the CDEF prediction model for a large-scale cloud computing service. An industry cloud system such as Microsoft Azure requires many disk units, which leads to the first challenge: only about three in ten thousand disks become faulty on any given day [9]. With such a low percentage, it would be easy for any prediction model to simply classify all disks as healthy, as this would result in the lowest chance of error. Other approaches have used re-balancing techniques to address this, but these introduce false positives, which ultimately lower the accuracy of a prediction model [9]. The second challenge comes from using historical data to make predictions. Some of the information used (particularly system-level signals) is both time sensitive and environment sensitive, meaning the data for a specific disk constantly changes during its lifetime within a cloud environment. Prediction models can be accurate on test datasets but much less accurate in practice, when predicting on future data [9].
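The imbalance problem can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a hypothetical fleet of 10,000 disks with 3 faulty ones (the ratio cited above) and a degenerate model that labels every disk healthy. The function name `evaluate` and the data are assumptions, not part of CDEF.

```python
def evaluate(predictions, labels):
    """Return (accuracy, recall) for binary labels where 1 = faulty."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    faulty = sum(labels)
    accuracy = correct / len(labels)
    recall = true_pos / faulty if faulty else 0.0
    return accuracy, recall

# Hypothetical fleet: 10,000 disks, 3 of which are faulty.
labels = [1] * 3 + [0] * 9997

# A degenerate model that calls every disk healthy.
all_healthy = [0] * len(labels)

accuracy, recall = evaluate(all_healthy, labels)
print(f"accuracy={accuracy:.4f} recall={recall:.1f}")  # accuracy=0.9997 recall=0.0
```

The model is right 99.97% of the time yet finds none of the faulty disks, which is why plain accuracy is a misleading target for this problem.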

The authors achieve the goal presented in the paper by overcoming these difficulties with two novel features: an error-proneness ranking system designed for disk drives, and a selection tool that determines which features of the SMART dataset best distinguish a healthy disk from an error-prone one [9].

With its feature identification system, CDEF can filter through a multitude of SMART and system-level disk datasets and identify which features are the most useful for telling healthy and faulty disks apart [9]. Providing a filtered set of historical data that is relevant to accurate disk error detection allows prediction models to focus on the important features of a disk drive unit, so that gray failures do not go undetected.

Rather than taking the simple approach of existing systems and classifying a disk as faulty or not, CDEF ranks disks by their potential for error [9]. The previously mentioned issue of imbalanced datasets is greatly mitigated because this new perspective on prediction does not depend on the balance of the data. Because most disks would otherwise simply be classified as healthy, this approach scrutinizes every disk more effectively and helps ensure that the disks treated as healthy really are the best candidates.
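The difference between ranking and binary classification can be sketched in a few lines. This is a minimal illustration, not CDEF's ranking model: the scores, disk names, and the cutoff `r` are all hypothetical, and this review does not describe how CDEF chooses its cutoff.

```python
def top_r_error_prone(scores, r):
    """Return the ids of the r disks with the highest error-proneness score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [disk_id for disk_id, _ in ranked[:r]]

# Hypothetical scores from some ranking model (higher = more error-prone).
scores = {"disk-a": 0.02, "disk-b": 0.91, "disk-c": 0.10, "disk-d": 0.47}

flagged = top_r_error_prone(scores, r=2)
print(flagged)  # ['disk-b', 'disk-d']
```

Because only the relative order of the scores matters, the operator can flag the few most suspicious disks without ever needing a balanced set of "healthy" and "faulty" labels at decision time.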

The true novelty of this work lies in these solutions and their ability to build on one another. Not only can each solution stand alone as an improvement over other approaches, but the combination of the feature selection method and the ranking model produces more efficient and cost-effective results than existing methods. Although the cross-validation used in other methods reports better results than the CDEF approach, CDEF's results better reflect how the prediction model performs in actual use. This is because cross-validation does not take into account the time sensitivity of disk data.
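One way to respect the time sensitivity of disk data during evaluation is to split the records by date rather than at random, so the model is always tested on data from after its training period. The sketch below assumes a hypothetical record layout (`disk`, `day`, `errors`) and a hypothetical cutoff date; it is not taken from the paper.

```python
from datetime import date

def time_based_split(records, cutoff):
    """Train on records observed before `cutoff`, test on the rest."""
    train = [r for r in records if r["day"] < cutoff]
    test = [r for r in records if r["day"] >= cutoff]
    return train, test

# Hypothetical daily observations for a handful of disks.
records = [
    {"disk": "disk-a", "day": date(2024, 1, 5), "errors": 0},
    {"disk": "disk-b", "day": date(2024, 1, 20), "errors": 2},
    {"disk": "disk-a", "day": date(2024, 2, 3), "errors": 1},
    {"disk": "disk-c", "day": date(2024, 2, 9), "errors": 0},
]

train, test = time_based_split(records, cutoff=date(2024, 2, 1))
print(len(train), len(test))  # 2 2
```

A random cross-validation fold could place the February record for `disk-a` in training and its January record in testing, letting the model "see the future"; the date-based split rules that out.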

Furthermore, the CDEF approach has already been applied to the Microsoft Azure cloud service [9] and has been shown to be effective in selecting healthy disks for the service. Considering that there are many issues affecting the serviceability of cloud systems, the work done by the authors is significant in highlighting the existing issues as well as implementing a solution to one of the more severe issues.

Personal Insights

The authors of the CDEF approach do, in fact, accomplish the goal stated at the beginning of the paper: to develop online prediction software capable of distinguishing between healthy and faulty disk drive units in a cloud system in order to improve serviceability. Building this software required machine learning techniques, such as the FastTree algorithm [5] used in CDEF's ranking feature. The algorithm is particularly interesting because it is available in Microsoft’s Python library, and the prediction model was tested using a dataset provided by Microsoft Azure systems [6]. This presents a problem for a cloud system such as Apple’s iCloud, which is not capable of adopting the libraries owned by Microsoft because iCloud runs on the Apple-developed Swift programming language for most of its services [2]. As a result, the service availability of Apple’s cloud services may fall behind that of Microsoft Azure, Google Cloud, and Amazon AWS if the CDEF approach becomes more popular. The authors mention in their conclusion that there are many ways to extend this work; one direction to consider in the future would be applying the approach to Apple’s cloud computing service.

A Survey on Drizzle: Fast and Adaptable Stream Processing at Scale

Alexander Monaco

Florida International University (FIU), Miami, Florida

Description

Stream processing is a type of “Big Data” technology used to process data as it “streams” from the side that produces it to the side that receives it. It is used for stock market trading data, traffic monitoring, smart devices, and any other information that must be detected and queried within a short amount of time. Because data arrives very quickly and in variable quantities, stream processing systems must adapt to these changes while still meeting demanding performance requirements. In particular, they must maintain high throughput (the rate at which tasks are completed) and low latency (the time data takes to travel between nodes) at the same time [7]. Existing approaches mostly treat these goals as mutually exclusive, resulting either in systems with high adaptability but high latency, or in systems with low latency during normal operation but expensive adaptation.

The paper introduces Drizzle [7], a stream processing system built on the understanding that features of both kinds of system can be combined to improve adaptability and lower latency in tandem.

Review

The authors use their paper not only to introduce Drizzle, but also to present the two main approaches in existing solutions: continuous operator streaming (e.g. Naiad and Apache Flink) and bulk-synchronous processing (e.g. Spark Streaming and FlumeJava) [7]. The paper describes their strengths and weaknesses, and which of their features are combined to create a novel approach to stream processing that is both fast and adaptable.

The first analyzed approach, bulk-synchronous processing, is a popular processing framework in which parallel nodes in a system each perform a local computation and then synchronize at a barrier. In stream processing, this method is modified so that a group of processes is given a set amount of processing time, measured in seconds. As in the baseline bulk-synchronous method, the processes in the group collect data, analyze it, and then finish at a barrier that outputs the data of all the processes in the group. This approach is beneficial because the barriers allow the streaming system to take “snapshots” of each process, recording its physical or logical information, which results in high adaptability and fault tolerance [7]. However, while the approach is adaptable and safe, the time allotted to each group cannot be made short enough to achieve low latency: doing so would leave processes spending more time communicating results to the driver than actually processing.
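The barrier pattern described above can be sketched with Python's standard thread pool. This is a toy model under stated assumptions, not Spark Streaming's or Drizzle's implementation: the names `process_partition` and `run_micro_batch` are hypothetical, the "local computation" is a simple sum, and the stream is a fixed list of pre-partitioned batches.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """Local computation for one task: here, just sum the records."""
    return sum(partition)

def run_micro_batch(partitions, pool):
    """Run one micro-batch: parallel tasks, then a barrier, then output."""
    futures = [pool.submit(process_partition, p) for p in partitions]
    # Barrier: the driver blocks until every task in the batch has finished.
    partials = [f.result() for f in futures]
    return sum(partials)

# Each inner list is one micro-batch, already divided into partitions.
stream = [
    [[1, 2], [3, 4]],
    [[5], [6, 7, 8]],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    batch_totals = [run_micro_batch(batch, pool) for batch in stream]

print(batch_totals)  # [10, 26]
```

Every batch pays the cost of submitting tasks and waiting at the barrier, so shrinking the batch interval to lower latency increases the share of time spent on coordination rather than computation.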

The second approach, continuous operator streaming, removes scheduling and driver communication and implements a barrier only when needed. As data enters the system, it is handled by operators that run as long-running tasks. Unlike bulk-synchronous processing, continuous operator streaming uses checkpoints rather than barrier snapshots to recover from failures [7]. Overall, this approach favors flexibility and speed over safety and cost efficiency. Should a node in the system fail, all nodes must roll back to a checkpoint and replay the data from that point.
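Checkpoint-and-replay recovery can be illustrated with a single long-running operator. The sketch below is a simplification under stated assumptions: the class `CountingOperator`, its running-sum state, and the in-memory `log` standing in for a replayable input source are all hypothetical, and real systems such as Flink coordinate checkpoints across many operators.

```python
class CountingOperator:
    """A long-running operator that keeps a running sum as its state."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.total = 0
        self.offset = 0            # number of records processed so far
        self.checkpoint = (0, 0)   # (offset, total) at the last checkpoint

    def process(self, record):
        self.total += record
        self.offset += 1
        if self.offset % self.checkpoint_every == 0:
            self.checkpoint = (self.offset, self.total)

    def recover(self, log):
        """Roll back to the last checkpoint and replay later records."""
        self.offset, self.total = self.checkpoint
        for record in log[self.offset:]:
            self.process(record)

log = [4, 1, 7, 2, 9]
op = CountingOperator(checkpoint_every=2)
for record in log:
    op.process(record)

op.total = -1      # simulate losing in-memory state in a failure
op.recover(log)
print(op.total)    # 23
```

Only the records after the last checkpoint are reprocessed, but the operator is unavailable while it replays them, which is the recovery cost the paragraph above describes.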

My fascination with Drizzle lies in its novelty and in how it combines the features that make both approaches effective. Bulk-synchronous processing methods are used for task scheduling and fault tolerance, while high throughput and low latency are achieved with continuous operator methods.

Personal Insights

Of the two approaches combined in Drizzle, the one that required the most improvement upon implementation was bulk-synchronous processing. Bulk-synchronous processing uses barriers to simplify fault tolerance and increase adaptability. However, when attempting to lower latency in a system, frequent barriers take time away from processing in order to communicate with a centralized driver, which adds overhead. The design of Drizzle therefore makes deliberate choices to limit the cost of barriers [7]. Another work, titled “Breaking the mapreduce stage barrier”, also discusses how barriers reduce performance and introduces techniques and algorithms that operate without a barrier to maximize performance [8]. The authors plan to explore additional techniques to improve Drizzle's performance. Perhaps a good start would be finding a way to implement barrierless functionality while maintaining Drizzle’s level of adaptability and fault tolerance.

A Survey on Soteria: Automated IoT Safety and Security Analysis

Alexander Monaco

Florida International University (FIU), Miami, Florida

Description

The Internet of Things (IoT) is a concept that has become more important to individuals as the technologies categorized under it become more advanced. IoT broadly covers any devices that are digitally connected to one another, such as smartphones, computers, smart cars, smart TVs, and so on. Unfortunately, the added convenience of connected devices brings many security concerns, even though many IoT technologies have advanced considerably since the inception of IoT. Many tech companies have set guidelines that describe how to regulate security within devices [3], but there are few tools and algorithms that evaluate IoT safety and security.

This paper introduces Soteria, a static analysis system for validating the security of IoT applications and environments [3]. To begin validating an IoT application or environment, the source code must be translated into an intermediate representation (IR), which is a data structure or code used by a compiler to represent source code. From the IR, Soteria creates a model of the application’s lifecycle, entry points, event handling methods, and call graphs. The IR is then used to extract a state model of the application, which contains the states and transitions within the IoT application. Soteria then uses model checking to determine whether or not the application conforms to its properties.
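The idea of checking a state model against a property can be shown with a very small explicit-state search. This is a toy illustration, not Soteria's analysis or its model checker: the water-valve application, its events, the state encoding, and the function `find_violation` are all hypothetical.

```python
from collections import deque

# Hypothetical state model: state = (water_leak_detected, valve_open).
# Each transition maps an event to a function from state to next state.
TRANSITIONS = {
    "leak_detected": lambda s: (True, False),    # app closes the valve on a leak
    "leak_cleared": lambda s: (False, s[1]),
    "user_opens_valve": lambda s: (s[0], True),  # no guard: may open during a leak
}

def find_violation(initial, transitions, is_safe):
    """Breadth-first search over reachable states; return an unsafe one or None."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            return state
        for step in transitions.values():
            nxt = step(state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

# Safety property: the valve must never be open while a leak is detected.
violation = find_violation((False, False), TRANSITIONS,
                           lambda s: not (s[0] and s[1]))
print(violation)  # (True, True)
```

The search reports the state in which the valve is open during a leak, reached because the `user_opens_valve` transition has no guard. A model checker performs the same kind of exhaustive exploration, with far richer property languages and state representations.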

To determine Soteria’s ability to handle real-life use cases, the framework was tasked with testing 35 official market applications, 30 community-made applications, and 17 synthetic applications built to have varying levels of violations [3]. The results show that Soteria successfully identified violations in all of the market applications, whether they were running independently or in conjunction with each other. However, the test results for the synthetic applications show that, despite identifying all of the planted violations, Soteria produced a false report of a violation for one application. Furthermore, Soteria did not detect violations in applications that leak sensitive data or implement dynamic device permissions, because these were not within Soteria’s testing scope.

Review

The Soteria framework is a technology that should have been developed at the inception of IoT devices, considering how important it is nowadays to validate the security of one’s network devices. Unlike the previously surveyed papers, whose challenges came from building on approaches adopted from other works [9, 7], Soteria’s challenges arose from its novelty. Although there are many works that discuss IoT applications, their properties, and guidelines for security, there is an absence of security validation frameworks for IoT. This gap is what makes every implementation in Soteria much more significant.

Personal Insights

Although Soteria is novel in scope and demonstration, it does have its limitations [3]. For example, Soteria’s implementation and evaluation are based on the programming platform of SmartThings, Samsung’s IoT subsidiary, which uses the Groovy programming language. Thankfully, Groovy is built on the Java platform [1], which could make it easier to bring Soteria to Amazon’s IoT platform. Considering that Amazon released its Alexa Blueprints portal for developers and tech hobbyists alike [4], there is potential for someone to unknowingly create an unsafe IoT application that could leak private information, which is something outside Soteria’s scope.

In their closing sentences, the authors of Soteria state that they will “extend the kinds of analysis and provide tools to evaluate implementations and study the complex interactions between users and IoT environment devices.” [3] Because IoT is becoming so commonplace that companies are providing customers with the means to develop their own IoT network device applications, a framework like Soteria is an important step forward in maintaining security in IoT.

Conclusion

In conclusion, the research presented in this paper highlights the critical importance of improving service availability in cloud systems, particularly by predicting disk errors. With the prevalence of cloud computing in various industries and the reliance on services provided by major tech companies, ensuring high service availability is paramount to customer satisfaction and business success.

The paper introduces CDEF, an innovative approach to proactive disk failure prediction that utilizes both SMART data and error-reflected system-level signals to detect gray failures. By addressing the challenges of imbalanced datasets and the time sensitivity of disk data, CDEF provides a more effective solution compared to existing methods. The evaluation of CDEF using data from Microsoft production cloud systems demonstrates its effectiveness in reducing downtime and improving service availability.

Furthermore, the paper emphasizes the need for continuous improvement in cloud system reliability and performance. While CDEF represents a significant advancement in predicting disk errors, there are still opportunities for further optimization and refinement. The authors' personal insights suggest exploring additional techniques to enhance performance, such as implementing barrierless functionality while maintaining adaptability and fault tolerance.

Overall, the research contributes valuable insights into enhancing service availability in cloud systems and underscores the importance of proactive measures in addressing potential failures. As cloud computing continues to evolve and expand, innovative approaches like CDEF will play a crucial role in ensuring reliability, scalability, and efficiency in cloud services.

References

[1] “A Multi-Faceted Language for the Java Platform.” The Apache Groovy Programming Language, groovy-lang.org/
[2] Apple Inc. “Swift.” Apple Developer, developer.apple.com/swift/
[3] Celik, Z. Berkay, Patrick McDaniel, and Gang Tan. “Soteria: Automated IoT safety and security analysis.” In 2018 USENIX Annual Technical Conference (USENIX ATC 18)
[4] Crum, Bryan. “Blogs.” Amazon, Greenhaven Press/Gale, 13 Feb. 2019, developer.amazon.com/blogs/alexa/post/9c7792fd-271d-4eac-a850-6257704142e4/now-anyone-can-use-alexa-skill-blueprints-to-create-and-publish-an-alexa-skill-in-minutes-with-no-coding-required-and-new-blueprints-for-content-creators-bloggers-and-organizations.
[5] Microsoft. Machine learning fast tree, https://docs.microsoft.com/en-us/machine-learning-server/python-reference/microsoftml/rx-fast-trees, 2017.
[6] SmartThings Documentation, http://docs.smartthings.com. [Online; accessed 08-October-2019].
[7] Venkataraman, S., Panda, A., Ousterhout, K., Armbrust, M., Ghodsi, A., Franklin, M. J., ... & Stoica, I. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles (pp. 374-389). AC
[8] Verma, A., Cho, B., Zea, N., Gupta, I., and Campbell, R. H. Breaking the mapreduce stage barrier. Cluster Computing 16, 1 (2013), 191–206.
[9] Xu, Y., Sui, K., Yao, R., Zhang, H., Lin, Q., Dang, Y., ... & Chintalapati, M. (2018). Improving service availability of cloud systems by predicting disk error. In 2018 USENIX Annual Technical Conference (USENIX)

Cite this Essay

Enhancing Service Availability in Cloud Systems. (2024, February 13). GradesFixer. Retrieved July 26, 2025, from https://gradesfixer.com/free-essay-examples/enhancing-service-availability-in-cloud-systems/

