Google is recognized as the world’s largest search engine company, with users all around the world. It operates more than one million servers in data centers worldwide, integrates global information, processes hundreds of millions of search requests every day, automatically ‘browses’ each web page, and scores the pages one by one. Users only need to enter keywords on the search home page; the search engine then finds the highest-scoring relevant pages among those it has visited and displays them in less than a second, so that everyone can get the information they want.
Google has grown into a company with a dominant share of the Internet search market thanks to the effectiveness of the ranking algorithms underlying its search engine. The underlying search system handles more than 88 billion searches per month. The main search engine has never experienced an outage, and users can expect query results in about 0.2 seconds.
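To put the two figures above into perspective, here is a rough back-of-the-envelope calculation in Python. The 30-day month and the perfectly uniform load are simplifying assumptions, not figures from the text; the number of queries in flight follows from Little's law (arrival rate times time in system).

```python
# Figures quoted in the text.
SEARCHES_PER_MONTH = 88e9
LATENCY_SECONDS = 0.2

# Assumption: a 30-day month with evenly distributed load.
SECONDS_PER_MONTH = 30 * 24 * 3600


def queries_per_second(per_month=SEARCHES_PER_MONTH):
    """Average query arrival rate implied by a monthly total."""
    return per_month / SECONDS_PER_MONTH


def queries_in_flight(qps, latency=LATENCY_SECONDS):
    """Little's law: average number in the system = arrival rate * latency."""
    return qps * latency


if __name__ == "__main__":
    qps = queries_per_second()
    print(f"{qps:,.0f} queries/s, about {queries_in_flight(qps):,.0f} in flight")
```

Under these assumptions the average load is on the order of tens of thousands of queries per second, with several thousand queries being processed at any instant. Real traffic has peaks well above the average.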
Google’s search engine is implemented in C or C++ for efficiency and can run on Solaris or Linux. This section gives a high-level overview of how the whole system works. Web crawling is done by several distributed crawlers. The URL server sends lists of URLs to the crawlers, the crawlers send the pages they fetch to the store server, and the store server compresses the pages and stores them in the repository. Every web page has an associated ID number, called a docID, which is assigned whenever a new URL is parsed out of a web page.
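The crawl-and-store path described above can be sketched in a few lines of Python. This is only an illustration: the `Repository` class, the `crawl` function and the injected `fetch` callable are hypothetical names, zlib is used as the compressor, and docIDs are handed out as simple counters.

```python
import zlib


class Repository:
    """Toy repository: stores compressed pages keyed by docID."""

    def __init__(self):
        self._pages = {}    # doc_id -> compressed page bytes
        self._doc_ids = {}  # url -> doc_id

    def doc_id_for(self, url):
        # Assign the next free docID the first time a URL is seen.
        return self._doc_ids.setdefault(url, len(self._doc_ids))

    def store(self, url, html):
        doc_id = self.doc_id_for(url)
        self._pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        return zlib.decompress(self._pages[doc_id]).decode("utf-8")


def crawl(urls, fetch, repository):
    """Play the role of URL server + crawler + store server for a URL list.

    `fetch` is any callable mapping a URL to page text, so the sketch
    can be run without network access.
    """
    return [repository.store(url, fetch(url)) for url in urls]
```

For example, with `pages = {"http://a.example/": "<html>A</html>"}`, calling `crawl(list(pages), pages.get, Repository())` stores the page compressed and returns its docID.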
The indexer performs several functions: it reads the repository, extracts the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of ‘barrels’, creating a partially sorted forward index. The indexer has another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
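A minimal Python sketch of the hit and forward-index idea follows. It assumes plain-text documents, so font size is left out, and the names `Hit` and `build_forward_index` are illustrative rather than taken from Google's implementation.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    word_id: int
    position: int      # word offset within the document
    capitalized: bool  # True if this occurrence starts with an upper-case letter


def build_forward_index(docs, lexicon=None):
    """Map each docID to its list of hits.

    docs: {doc_id: text}. Returns (forward_index, lexicon), where the
    lexicon maps lower-cased words to wordIDs.
    """
    lexicon = {} if lexicon is None else lexicon
    forward = defaultdict(list)
    for doc_id in sorted(docs):
        words = re.findall(r"[A-Za-z0-9]+", docs[doc_id])
        for position, word in enumerate(words):
            # New words get the next free wordID.
            word_id = lexicon.setdefault(word.lower(), len(lexicon))
            forward[doc_id].append(Hit(word_id, position, word[0].isupper()))
    return dict(forward), lexicon
```

The forward index is keyed by docID, which is why a later step has to re-sort it by wordID before it can answer queries efficiently.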
The URL resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute the PageRanks of all documents. The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This operation requires little temporary space.
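To illustrate how ranks can be computed from a links database of docID pairs, here is a small power-iteration sketch in Python. The damping factor of 0.85, the fixed iteration count and the even redistribution of rank from pages without outgoing links are assumptions of this sketch, not details given in the text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank from (from_doc_id, to_doc_id) pairs.

    Returns {doc_id: rank}; the ranks sum to 1.
    """
    links = list(links)
    nodes = sorted({doc_id for pair in links for doc_id in pair})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in links:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same small base share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            # A page without outgoing links spreads its rank over all pages.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

A page that many others link to ends up with a higher rank than a page nobody links to, which is the property the searcher relies on when ordering results.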
The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon for the searcher. The searcher is run by a web server and answers queries using the lexicon built by DumpLexicon together with the inverted index and the PageRanks.
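The relationship between the forward index, the inverted index and the searcher can be sketched as follows. This is a deliberately simplified model: the inverted index is an in-memory dictionary instead of a file addressed by offsets, and results are ordered by PageRank alone, whereas a real ranking function would also weigh the hits. The function names are hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {doc_id: [word_id, ...]} into {word_id: sorted list of doc_ids}."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward_index.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    return {word_id: sorted(doc_ids) for word_id, doc_ids in inverted.items()}


def search(query, lexicon, inverted, pageranks):
    """Return docIDs containing every query word, highest PageRank first."""
    postings = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # a word unknown to the lexicon matches nothing
        postings.append(set(inverted.get(word_id, [])))
    if not postings:
        return []
    matches = set.intersection(*postings)
    # Ties in rank are broken by docID so the order is deterministic.
    return sorted(matches, key=lambda d: (-pageranks.get(d, 0.0), d))
```

Because the inverted index is keyed by wordID, a query only touches the posting lists of the words it contains instead of scanning every document.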
From the perspective of distributed systems, Google’s search engine is a fascinating case study: it must meet extremely demanding requirements, especially in scalability, reliability, availability and security.
Scalability means that a distributed system operates effectively and efficiently at different scales (from the intranet of a small enterprise to the Internet). If the number of resources and users surges, the system should still remain effective. There are three challenges in achieving scalability. (1) Control the cost of physical resources. When the demand for resources increases, it should be possible to expand the system at reasonable cost to meet the requirements. For example, if a search engine’s servers cannot handle all incoming requests, more servers must be added to avoid performance bottlenecks.
In this respect, Google considers scalability in three dimensions: (1) being able to process more data (x), (2) being able to process more queries (y), and (3) seeking better results (z). Judging from the figures in the introduction, Google’s search engine does very well on all three. However, to be scalable, its other functions, including indexing, ranking, and searching, require highly distributed solutions. (2) Control the loss of performance. When a distributed system deals with a large number of users or resources, it produces very large data sets.
Managing these data sets places great demands on the performance of the distributed system. In this case, hierarchical algorithms scale clearly better than linear algorithms, but some performance loss cannot be avoided entirely. Because Google’s search engine is highly interactive, it must achieve latency as low as possible. The better the performance, the more reliably a web search can be completed within 0.2 s. Only in this way can Google make more profit from the sale of advertisements.
Annual advertising revenue is as high as US $32 billion, which shows that Google outperforms other search engines in handling the underlying resources, including network, storage and computing resources. (3) Prevent the exhaustion of software resources. The search engine uses 32-bit network addresses. If too many Internet addresses are needed, the address space will be exhausted. Google does not have a good solution for this at present, because moving to 128-bit Internet addresses would undoubtedly require modifying many software components.
The availability of a distributed system mainly depends on the extent to which new resource-sharing services can be added and used by multiple clients. Because Google’s search engine must meet the most demanding requirements in the shortest possible time when crawling, indexing and sorting the web, availability is also a strong requirement. To meet these needs, Google has developed its own physical architecture. The middle layer defines a general distributed system infrastructure, which not only lets new applications and services reuse the underlying system services, but also provides integrity for Google’s huge code base.
A distributed system holds many information resources of high value to its users, so protecting these resources is very important. The security of information resources has three parts: confidentiality (preventing disclosure to unauthorized individuals), integrity (preventing alteration or damage), and availability (preventing interference with the means of accessing resources). When investigating the security of Google’s search engine, we found that Google has not been very successful in this area and has even publicly admitted to disclosing user information for its own benefit, which means that the information security of users of Google’s software cannot be guaranteed.
The Google File System (GFS) was implemented to meet Google’s rapidly growing needs for processing and managing big data. In addition to this demand, GFS faces the challenge of managing distribution and the increased risk of hardware failure. Keeping data safe while scaling to thousands of computers and managing multiple terabytes of data can thus be considered the key challenges faced by GFS. Google therefore made an important decision not to use any of the existing distributed file systems and instead developed a new file system.
The biggest difference from other file systems is that GFS is optimized for large files (from gigabytes to multiple terabytes). As a result, most files are considered immutable and can be read many times after a single write. A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients. These machines are ordinary Linux machines running user-level server processes. As long as machine resources permit, a chunk server and a client can run on the same machine.
Stored files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and a byte range. For reliability, every chunk is replicated on multiple chunk servers, three by default. The master maintains the metadata of the whole file system and periodically communicates with every chunk server through HeartBeat messages to collect its state. All data-bearing communication goes directly between clients and chunk servers, and because GFS does not provide the POSIX API, it does not need to hook into the Linux vnode layer.
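The chunk addressing described above can be sketched in Python. The 64 MB chunk size is taken from the published GFS design and is not stated in this text; the `Master` class, its round-robin replica placement, its counter-based handles and the server names are hypothetical simplifications, and the sketch assumes there are at least as many chunk servers as replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: 64 MB chunks


def chunk_ranges(offset, length, chunk_size=CHUNK_SIZE):
    """Split a file byte range into (chunk_index, start_in_chunk, n_bytes)."""
    ranges = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, chunk_size)
        n_bytes = min(chunk_size - start, end - offset)
        ranges.append((index, start, n_bytes))
        offset += n_bytes
    return ranges


class Master:
    """Toy metadata store: (path, chunk_index) -> (chunk_handle, replicas)."""

    def __init__(self, servers, replicas=3):
        self.servers = list(servers)
        self.replicas = replicas
        self.chunks = {}
        self.next_handle = 0

    def lookup(self, path, chunk_index):
        key = (path, chunk_index)
        if key not in self.chunks:
            # Real handles are 64-bit identifiers; a counter suffices here.
            handle = self.next_handle
            self.next_handle += 1
            first = handle % len(self.servers)
            locations = [self.servers[(first + i) % len(self.servers)]
                         for i in range(self.replicas)]
            self.chunks[key] = (handle, locations)
        return self.chunks[key]
```

A client would first translate a byte range into chunk indexes with `chunk_ranges`, ask the master for the handle and replica locations of each chunk, and then transfer the data directly with one of the listed chunk servers.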
Neither the client nor the chunk server caches file data. Client caches would offer little benefit because working sets are too large to be cached, and doing without them keeps the client and the whole system simple by avoiding cache consistency problems. Chunk servers do not need to cache file data either: chunks are stored as local files, so the Linux buffer cache already keeps frequently accessed data in memory, which helps the performance of GFS.
The choice of communication protocols is very important for the overall design of a system. Google adopts a simple, minimal and efficient remote procedure call protocol. Such a protocol requires a serialization component to transform the data of a procedure call. For this purpose Google developed protocol buffers, a simplified, high-performance serialization component. Google also uses a separate protocol for publish-subscribe.
Protocol buffers focus on describing data and then serializing it. They aim to provide a simple, efficient and extensible way to specify and serialize data independently of language and platform. The serialized data can be stored, transferred, or used in any other scenario that requires a serialized data format. There are three reasons why Google chose to use protocol buffers. The disadvantage of Google’s design is that it is not as expressive as XML.
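One reason the format is compact is its binary encoding of integers. The sketch below hand-codes the base-128 varint encoding and the field key layout of the protocol buffers wire format in Python, purely for illustration; real programs would use the generated classes of the protobuf library instead, and the function names here are not part of that library.

```python
WIRE_TYPE_VARINT = 0


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode a varint starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Encode one integer field: key = (field_number << 3) | wire_type."""
    key = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(key) + encode_varint(value)
```

With this encoding, field number 1 holding the value 150 occupies three bytes (`08 96 01`), with no tag names or delimiters, which is far less than the equivalent XML element.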
Because protocol buffers alone cannot fully meet Google’s communication requirements, the designers also use publish-subscribe. It ensures that distributed events can be delivered to a large number of potential subscribers reliably and in real time. Its main purpose is to support Google’s advertising system. Google’s publish-subscribe system uses a topic-based approach that emphasizes reliable and timely delivery. Communication is effective this way, but it incurs additional overhead.
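The topic-based model can be illustrated with a minimal in-process broker in Python. This is a conceptual sketch only: the `TopicBroker` class and the topic name are hypothetical, delivery is synchronous, and none of the reliability or distribution machinery of a production system is modeled.

```python
from collections import defaultdict


class TopicBroker:
    """Minimal topic-based publish-subscribe broker (single process)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver event to every subscriber of topic; return delivery count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(event)
        return len(callbacks)


if __name__ == "__main__":
    broker = TopicBroker()
    broker.subscribe("ad-clicks", lambda event: print("received", event))
    broker.publish("ad-clicks", {"ad": 7})
```

Publishers name only a topic and never address subscribers directly, which is what lets the number of subscribers grow without changing the publishing side. The extra hop through the broker is one source of the overhead mentioned above.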
Google’s search engine achieves very fast and efficient retrieval without consuming too many resources, whether one looks at its distributed system architecture, at the way it manages scalability, availability and security, or at the way it communicates. I think the core technology that lets the search engine complete a whole retrieval request in 0.2 s is Google’s unique distributed file system. When Google designed the Google File System, the goal was to provide redundancy for storing massive amounts of data on cheap but unreliable computers.
Because the distributed file system had to suit Google’s applications and workloads, its designers built the Google File System (GFS) on the assumptions of a high component failure rate, high throughput and low latency. The framework and basic operation of GFS were introduced earlier. The biggest difference between GFS and other distributed file systems is its use of a single master. In a traditional distributed file system, a single master would be a single point of failure and a throughput bottleneck.
To avoid these problems, GFS minimizes the role of the master: the master never moves file data, only metadata, and metadata is cached to reduce the load on it. Only when data changes does the master coordinate replication of the data. Although the design is simple, it is good enough. At the same time, the system has high fault tolerance.
In the event of a system error or failure, the master and the chunk servers can be restarted in a few seconds, and every chunk has at least three replicas. In addition, the master is hidden. GFS also has some inefficiencies. At present, Google has more than 450,000 machines, but only about one third of them do useful work. This costs Google a lot of extra money, energy and space. I think that since GFS can achieve high performance at low cost, the next problem to be solved is to reduce unnecessary cost.