
Enhancing Service Availability in Cloud Systems

Published: Feb 13, 2024

Table of contents

  1. A Survey on Improving Service Availability of Cloud Systems
    Review
    Personal Insights
  2. A Survey on Drizzle: Fast and Adaptable Stream Processing at Scale
    Description
    Review
    Personal Insights
  3. A Survey on Soteria: Automated IoT Safety and Security Analysis
    Description
    Review
    Personal Insights
  4. Conclusion
  5. References

A Survey on Improving Service Availability of Cloud Systems

Cloud computing is a shared networking practice built on a network of internet-hosted remote servers. As a service, it has grown in popularity because it reduces complexity and cost: customers no longer need to purchase the IT infrastructure, hardware, or licenses required to run a physical computer network [9]. Instead, cloud systems rely on large numbers of physical disk drives as their primary storage component.

Many tech companies, such as Amazon, Google, and Microsoft, have built cloud computing into the foundations of their online services. These applications are used by millions of people around the clock, so service availability is of the utmost importance. However, despite stringent availability targets, these services are still prone to failure, resulting in customer dissatisfaction and revenue loss [9]. Failures can stem from a variety of hardware issues, but the most prominent is disk failure. Large-scale cloud systems typically use several hundred million disk drives, 20% to 57% of which have experienced at least one error over a period of 4 to 6 years [9]. These figures show how common disk errors are and why predicting disk failures matters for improving the availability of cloud services. To that end, many proposed solutions use historical disk-level sensor data (SMART data) to predict disk failures and take preemptive action, such as replacing faulty disks.

These proposed approaches all focus on predicting complete disk failure [9]. Unfortunately, before failure-prone disks are proactively replaced, a number of disk errors can occur and negatively affect upper-layer services. These errors, referred to as “gray failures,” typically go unnoticed while degrading the quality of cloud software.

This paper introduces CDEF (Cloud Disk Error Forecasting), an approach to proactive disk failure prediction that uses both SMART data and system-level signals that reflect disk errors in order to better detect these gray failures [9]. The approach was evaluated on data from Microsoft production cloud systems and shown to improve on baseline methods, reducing Microsoft Azure virtual machine downtime by sixty-three thousand minutes per month.

Review

The authors faced two major challenges when designing the CDEF prediction model for a large-scale cloud computing service. An industry cloud system such as Microsoft Azure requires a very large number of disks, which leads to the first challenge: only about three in ten thousand disks become faulty on any given day [9]. With such an extreme imbalance, a prediction model can simply classify every disk as healthy and still achieve a very low error rate while detecting no faulty disks at all. Other approaches have used re-balancing techniques to address this, but they introduce false positives, which ultimately lowers the accuracy of the prediction model [9]. The second challenge comes from using historical data to make predictions. Some of the information used, particularly system-level signals, is both time sensitive and environment sensitive, meaning the data for a specific disk changes constantly during its lifetime within a cloud environment. A prediction model can be accurate on test datasets yet much less accurate in practice when predicting on future data [9].
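The imbalance problem can be made concrete with a small sketch. It assumes a hypothetical fleet of 10,000 disks with 3 faulty ones (the ratio cited above) and a trivial "always healthy" predictor; the function name `evaluate` is illustrative and not taken from the paper.

```python
def evaluate(labels, predictions):
    """Return (accuracy, recall) for binary labels where 1 means faulty."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    faulty = sum(labels)
    accuracy = correct / len(labels)
    recall = true_pos / faulty if faulty else 0.0
    return accuracy, recall


# Hypothetical fleet: 10,000 disks, 3 of which are faulty.
labels = [1] * 3 + [0] * 9997

# A model that labels every disk as healthy.
always_healthy = [0] * len(labels)

accuracy, recall = evaluate(labels, always_healthy)
print(f"accuracy={accuracy:.4f} recall={recall:.1f}")
```

The predictor looks almost perfect by accuracy while finding none of the faulty disks, which is why accuracy alone is a poor yardstick for this task.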

The authors overcome these difficulties with two novel components: an error-proneness ranking system for disk drives and a feature selection method that determines which features of the SMART dataset best distinguish healthy disks from error-prone ones [9].

With its feature selection method, CDEF can filter through a multitude of SMART and system-level disk features and identify the ones that best separate healthy disks from faulty ones [9]. Supplying a filtered set of historical data that is relevant to disk error detection lets the prediction model focus on the important features of a disk drive, so that gray failures do not go undetected.
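As a rough illustration of feature filtering, the sketch below scores each feature by how far apart the healthy and faulty class means are, relative to the overall spread, and keeps the top few. This scoring rule is an assumption chosen for simplicity and is not CDEF's actual selection procedure; the feature names and values are made up.

```python
from statistics import mean, pstdev


def separation_score(healthy_values, faulty_values):
    """Absolute gap between class means, scaled by the overall spread."""
    spread = pstdev(healthy_values + faulty_values)
    if spread == 0:
        return 0.0
    return abs(mean(healthy_values) - mean(faulty_values)) / spread


def select_features(samples, labels, keep):
    """samples: list of {feature: value}; labels: 1 = faulty, 0 = healthy."""
    scores = {}
    for name in samples[0]:
        healthy = [s[name] for s, y in zip(samples, labels) if y == 0]
        faulty = [s[name] for s, y in zip(samples, labels) if y == 1]
        scores[name] = separation_score(healthy, faulty)
    # Highest-scoring features first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Made-up readings for four disks (two healthy, two faulty).
samples = [
    {"reallocated_sectors": 0, "power_on_hours": 100},
    {"reallocated_sectors": 1, "power_on_hours": 900},
    {"reallocated_sectors": 40, "power_on_hours": 120},
    {"reallocated_sectors": 55, "power_on_hours": 880},
]
labels = [0, 0, 1, 1]

selected = select_features(samples, labels, keep=1)
print(selected)
```

In this toy data the hours feature has the same mean in both classes, so it carries no signal and is filtered out.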

Rather than taking the simple approach of existing systems and classifying each disk as faulty or not, CDEF ranks disks by their potential for error [9]. The previously mentioned problem of imbalanced datasets is greatly mitigated because this ranking perspective does not depend on the balance between classes. Because most disks are healthy, the approach scrutinizes every disk more effectively and helps ensure that the disks kept in service are the healthiest ones.
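The difference between classifying and ranking can be sketched in a few lines. The example assumes each disk already has an error-proneness score from some model; the disk IDs, scores, and the cutoff `r` are hypothetical.

```python
def top_error_prone(scores, r):
    """Return the r disk IDs with the highest error-proneness scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [disk_id for disk_id, _ in ranked[:r]]


# Hypothetical scores: higher means more likely to produce errors.
scores = {"disk-a": 0.02, "disk-b": 0.91, "disk-c": 0.10, "disk-d": 0.47}

flagged = top_error_prone(scores, r=2)
print(flagged)
```

Because only the relative order matters, the result does not depend on a fixed probability threshold that an imbalanced training set would distort.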

The true novelty of this work lies in these solutions and how they build on one another. Each solution stands alone as an improvement over other approaches, and the combination of the feature selection method and the ranking model yields more efficient and cost-effective results than existing methods. Although cross-validation used in other methods reports better results than the CDEF approach, the CDEF results better reflect how the prediction model performs in actual deployment. This is because cross-validation does not take the time sensitivity of disk data into account.
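The point about cross-validation and time can be illustrated with two ways of splitting the same records. This is a minimal sketch with made-up daily records; `random_split` and `time_split` are illustrative helpers, not functions from the paper.

```python
import random


def random_split(records, test_fraction, seed=0):
    """Shuffle records and split them, ignoring when each was collected."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def time_split(records, cutoff_day):
    """Train only on records before cutoff_day; test on the rest."""
    train = [r for r in records if r["day"] < cutoff_day]
    test = [r for r in records if r["day"] >= cutoff_day]
    return train, test


# Made-up data: one record per day for 20 days across five disks.
records = [{"disk": f"disk-{i % 5}", "day": i} for i in range(20)]

train, test = time_split(records, cutoff_day=15)
# True if any training record is as late as the earliest test record.
leaks = max(r["day"] for r in train) >= min(r["day"] for r in test)
print(len(train), len(test), leaks)

rand_train, rand_test = random_split(records, test_fraction=0.25)
print(len(rand_train), len(rand_test))
```

With the time-ordered split, the model is always evaluated on data that is later than anything it was trained on, which mirrors deployment. A random split usually mixes later days into the training set and so tends to look better than the model will perform in practice.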

Furthermore, the CDEF approach has already been applied to the Microsoft Azure cloud service [9] and has been shown to be effective in selecting healthy disks for the service. Considering that there are many issues affecting the serviceability of cloud systems, the work done by the authors is significant in highlighting the existing issues as well as implementing a solution to one of the more severe issues.

Personal Insights

The authors of the CDEF approach do, in fact, accomplish the goal stated at the beginning of the paper: to develop online prediction software capable of distinguishing between healthy and faulty disk drives in a cloud system in order to improve serviceability. Building this software required machine learning techniques, such as the FastTree algorithm [5] used in the CDEF ranking component. The algorithm is particularly interesting because it is available in Microsoft’s Python library, and the prediction model was tested using a dataset provided by Microsoft Azure systems [6]. This presents some problems for a cloud system such as Apple’s own iCloud, which cannot adopt the libraries owned by Microsoft because iCloud runs on the Apple-developed Swift programming language for most of its services [2]. This may be problematic for Apple’s cloud services, because their service availability may fall behind that of Microsoft Azure, Google Cloud, and Amazon AWS if the CDEF approach becomes more popular. The authors mention in their conclusion that there are many ways to extend this work; one direction to consider in the future would be to apply this approach to Apple’s cloud computing service.

A Survey on Drizzle: Fast and Adaptable Stream Processing at Scale

Alexander Monaco

Florida International University (FIU), Miami, Florida

Description

Stream processing is a type of “Big Data” technology used to process data as it “streams” from the sending side, where it is produced, to the receiving side, where it is consumed. It is used for stock market trading data, traffic monitoring, smart devices, or any information that must be detected and queried within a short amount of time. Because data arrives incredibly fast and in variable quantities, stream processing systems must adapt to these changes while meeting high performance requirements. In particular, they must maintain high throughput (the rate at which tasks are completed) and low latency (the time data takes to travel between nodes) at the same time [7]. Existing approaches mainly treat these goals as mutually exclusive, resulting in systems that are highly adaptable but have high latency, or systems that have low latency during normal operation but adapt only at high cost.

The paper introduces Drizzle [7], a stream processing system developed with an understanding that both previously mentioned solutions have features that can be combined in order to improve adaptability and lower latency in tandem.

Review

The authors use their paper not only to introduce Drizzle, but also to present the two main approaches in existing solutions: continuous operator streaming (e.g., Naiad and Apache Flink) and bulk-synchronous processing (e.g., Spark Streaming and FlumeJava) [7]. The paper describes their strengths and weaknesses and explains which of their features are combined to create a novel approach to stream processing that is both fast and adaptable.

The first approach analyzed, bulk-synchronous processing, is a popular processing framework in which parallel nodes each perform a local computation and then synchronize at a barrier. In stream processing, this method is modified to group processes together and give them a set amount of processing time, measured in seconds. As in the baseline bulk-synchronous method, the processes in the group collect data, analyze it, and then finish at a barrier, where the output of all the processes in the group is produced. This approach is beneficial because the barriers allow the streaming system to take “snapshots,” recording the physical or logical state of each process, which results in high adaptability and fault tolerance [7]. However, while this makes the system adaptable and safe, the time allotted to each group cannot be made short enough to achieve low latency: doing so would leave processes spending more time communicating results to the driver than actually processing.
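The trade-off between short batches and driver communication can be shown with simple arithmetic. The sketch below assumes a fixed coordination cost per batch (10 ms here, a made-up figure) and computes how much of each batch interval remains for processing.

```python
def processing_fraction(batch_interval_ms, coordination_ms):
    """Share of each micro-batch interval left for actual processing."""
    if coordination_ms >= batch_interval_ms:
        return 0.0
    return (batch_interval_ms - coordination_ms) / batch_interval_ms


# Hypothetical fixed cost of 10 ms per barrier to talk to the driver.
for interval in (1000, 100, 20):
    share = processing_fraction(interval, coordination_ms=10)
    print(f"{interval:>5} ms batches: {share:.0%} of time spent processing")
```

Under this assumption, shrinking the interval from one second to 20 ms cuts the useful share of each interval from 99% to 50%, which is the overhead problem described above.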

The second approach, continuous operator streaming, removes scheduling and driver communication and uses a barrier only when needed. Operators run as long-running tasks that process data as it enters the system. Unlike bulk-synchronous processing, continuous operator streaming uses checkpoints rather than barrier snapshots to recover from failures [7]. Overall, this approach favors flexibility and speed over safety and cost efficiency. Should a node in this system fail, all nodes must roll back to a checkpoint and replay from there.
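Checkpoint-and-replay recovery can be sketched with a single long-running operator. This is a toy model under stated assumptions: the operator keeps a running sum, checkpoints every two records (an arbitrary choice), and the input can be re-read from a given offset. The class `CountingOperator` is hypothetical and does not correspond to any API in Flink or Naiad.

```python
class CountingOperator:
    """Long-running operator that keeps a running sum of the records it sees."""

    def __init__(self):
        self.total = 0
        self.offset = 0              # number of records processed so far
        self._checkpoint = (0, 0)    # (total, offset) at the last checkpoint

    def process(self, value):
        self.total += value
        self.offset += 1

    def checkpoint(self):
        self._checkpoint = (self.total, self.offset)

    def recover(self):
        """Roll back to the last checkpoint; return the offset to replay from."""
        self.total, self.offset = self._checkpoint
        return self.offset


stream = [5, 1, 4, 2, 8, 3]
op = CountingOperator()

for i, value in enumerate(stream):
    op.process(value)
    if (i + 1) % 2 == 0:
        op.checkpoint()              # checkpoint after every second record
    if i == 4:
        break                        # simulated failure after the fifth record

# Work done since the last checkpoint is discarded and replayed.
replay_from = op.recover()
for value in stream[replay_from:]:
    op.process(value)

print(replay_from, op.total)
```

The fifth record was processed before the failure but after the last checkpoint, so it is processed again during replay. That repeated work is the recovery cost this approach accepts in exchange for low latency during normal operation.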

What fascinates me about Drizzle is its novelty and how it combines the features that make both approaches effective. Bulk-synchronous processing is used for task scheduling and fault tolerance, while high throughput and low latency are achieved with continuous operator methods.

Personal Insights

Of the two approaches combined in Drizzle, the one that required the most improvement upon implementation was bulk-synchronous processing. Bulk-synchronous processing uses barriers to simplify fault tolerance and increase adaptability. However, when attempting to lower latency, frequent barriers take time away from processing to communicate with a centralized driver, which adds overhead. Drizzle therefore makes design decisions that reduce its reliance on barriers [7]. Another work, titled “Breaking the mapreduce stage barrier,” also discusses how barriers reduce performance and introduces techniques and algorithms that operate without a barrier to maximize performance [8]. The authors plan to explore additional techniques to improve Drizzle’s performance. A good start might be finding a way to implement barrierless functionality while maintaining Drizzle’s level of adaptability and fault tolerance.

A Survey on Soteria: Automated IoT Safety and Security Analysis

Description

The Internet of Things (IoT) is a concept that has become more important to individuals as the technologies categorized under it become more advanced. IoT broadly covers any devices that are digitally connected to one another, such as smartphones, computers, smart cars, smart TVs, and so on. Unfortunately, the added convenience of connected devices brings many security concerns, even though IoT technologies have advanced considerably since the inception of IoT. Many tech companies have set guidelines that describe how to regulate security within devices [3], but there are few tools and algorithms that evaluate IoT safety and security.

This paper introduces Soteria, a static analysis system for validating the safety and security of IoT applications and environments [3]. To validate an IoT application or environment, the source code is first translated into an intermediate representation (IR), a data structure or code used by a compiler to represent source code. From the IR, Soteria creates a model of the application’s lifecycle, entry points, event handling methods, and call graphs. The IR is then used to extract a state model of the application, which contains the states and transitions within the IoT application. Finally, Soteria uses model checking to determine whether the application conforms to its properties.
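The idea of checking a property against a state model can be sketched on a toy example. The states, events, and safety property below are hypothetical, and a plain breadth-first reachability search stands in for a real model checker; this is not Soteria's implementation.

```python
from collections import deque

# Hypothetical state model: each state is a (presence, door) pair and
# each event maps a state to its successor.
TRANSITIONS = {
    ("home", "locked"): {"leave": ("away", "locked"), "unlock": ("home", "unlocked")},
    ("home", "unlocked"): {"leave": ("away", "unlocked"), "lock": ("home", "locked")},
    ("away", "locked"): {"arrive": ("home", "locked")},
    ("away", "unlocked"): {"arrive": ("home", "unlocked")},
}


def find_violation(initial, transitions, is_unsafe):
    """Breadth-first search; return the event path to an unsafe state, or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, path = queue.popleft()
        if is_unsafe(state):
            return path
        for event, nxt in transitions.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None


def door_open_while_away(state):
    """Safety property to rule out: the door is unlocked when nobody is home."""
    return state == ("away", "unlocked")


counterexample = find_violation(("home", "locked"), TRANSITIONS, door_open_while_away)
print(counterexample)
```

A returned path is a counterexample: a sequence of events that drives the modeled application into a state that violates the property. A result of None means no reachable state violates it.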

To determine Soteria’s ability to handle real-life use cases, the framework was tested on 35 official market applications, 30 community-made applications, and 17 synthetic applications built to have varying levels of violations [3]. The results show that Soteria successfully identified violations in all of the market applications that were running either independently or in conjunction with each other. However, the results for the synthetic applications show that, despite identifying all of the planted violations, Soteria produced a false report of a violation for one application. Furthermore, Soteria did not detect violations in applications that leak sensitive data or implement dynamic device permissions, because these were outside Soteria’s testing scope.

Review

The Soteria framework is a technology that should have been developed at the inception of IoT devices, considering how important it is nowadays to validate the security of one’s network devices. Unlike the previously surveyed papers, whose challenges came from building on approaches in other works [9, 7], Soteria’s challenges arose from its novelty. Although there are many works that discuss IoT applications, their properties, and guidelines for security, there is an absence of security validation frameworks in IoT. This gap is what makes every implementation in Soteria much more significant.

Personal Insights

Although Soteria is novel in scope and demonstration, it does have its limitations [3]. For example, Soteria’s implementation and evaluation are based on the programming platform of SmartThings, Samsung’s IoT subsidiary, which uses the Groovy programming language. Thankfully, Groovy is built on the Java platform [1], which can make it easier to port Soteria to Amazon’s IoT. Considering that Amazon released its Alexa Blueprints portal for developers and tech hobbyists alike [4], there is potential for someone to unknowingly create an unsafe IoT application that leaks private information, something that is outside Soteria’s scope.

The closing sentences from the authors of Soteria state that they will “extend the kinds of analysis and provide tools to evaluate implementations and study the complex interactions between users and IoT environment devices” [3]. Because IoT is becoming so commonplace that companies are providing customers with the means to develop their own IoT network device applications, a framework like Soteria is an important step forward in maintaining security in IoT.

Conclusion

In conclusion, the research presented in this paper highlights the critical importance of improving service availability in cloud systems, particularly by predicting disk errors. With the prevalence of cloud computing in various industries and the reliance on services provided by major tech companies, ensuring high service availability is paramount to customer satisfaction and business success.

The paper introduces CDEF, an innovative approach to proactive disk failure prediction that utilizes both SMART data and error-reflected system-level signals to detect gray failures. By addressing the challenges of imbalanced datasets and the time sensitivity of disk data, CDEF provides a more effective solution compared to existing methods. The evaluation of CDEF using data from Microsoft production cloud systems demonstrates its effectiveness in reducing downtime and improving service availability.

Furthermore, the paper emphasizes the need for continuous improvement in cloud system reliability and performance. While CDEF represents a significant advancement in predicting disk errors, there are still opportunities for further optimization and refinement. The authors' personal insights suggest exploring additional techniques to enhance performance, such as implementing barrierless functionality while maintaining adaptability and fault tolerance.

Overall, the research contributes valuable insights into enhancing service availability in cloud systems and underscores the importance of proactive measures in addressing potential failures. As cloud computing continues to evolve and expand, innovative approaches like CDEF will play a crucial role in ensuring reliability, scalability, and efficiency in cloud services.

References

  1. “A Multi-Faceted Language for the Java Platform.” The Apache Groovy Programming Language, groovy-lang.org/
  2. Apple Inc. “Swift.” Apple Developer, developer.apple.com/swift/
  3. Celik, Z. Berkay, Patrick McDaniel, and Gang Tan. “Soteria: Automated IoT safety and security analysis.” In 2018 USENIX Annual Technical Conference (USENIX ATC 18)
  4. Crum, Bryan. “Blogs.” Amazon, Greenhaven Press/Gale, 13 Feb. 2019, developer.amazon.com/blogs/alexa/post/9c7792fd-271d-4eac-a850-6257704142e4/now-anyone-can-use-alexa-skill-blueprints-to-create-and-publish-an-alexa-skill-in-minutes-with-no-coding-required-and-new-blueprints-for-content-creators-bloggers-and-organizations.
  5. MICROSOFT. Machine learning fast tree, https://docs.microsoft.com/en-us/machine-learning-server/python-reference/microsoftml/rx-fast-trees, 2017.
  6. SmartThings Documentation, http://docs.smartthings.com. [Online; accessed 08-October-2019].
  7. Venkataraman, S., Panda, A., Ousterhout, K., Armbrust, M., Ghodsi, A., Franklin, M. J., ... & Stoica, I. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles (pp. 374 - 389). AC
  8. VERMA, A., CHO, B., ZEA, N., GUPTA, I., AND CAMPBELL, R. H. Breaking the mapreduce stage barrier. Cluster computing 16, 1 (2013), 191–206.
  9. Xu, Y., Sui, K., Yao, R., Zhang, H., Lin, Q., Dang, Y., ... & Chintalapati, M. (2018). Improving service availability of cloud systems by predicting disk error. In 2018 USENIX Annual Technical Conference (USENIX)