Historical context of parallelism

‘Parallelism’ or ‘parallel computing’ is a term used to describe the practice of running or creating processes that contain operations able to be performed simultaneously. Although the practice of parallelism has become increasingly popular in recent years, the concept originated in 1842 in L. F. Menabrea’s “Sketch of the Analytical Engine Invented by Charles Babbage”. Menabrea describes a process by which the operation upon and duplication of a set of entered numbers may be pipelined so as to occur simultaneously.
This process prevents the user from having to enter the same set of numbers more than once for use in many operations, and reduces both the chance of human error and the total runtime per input.

Whilst this was a necessary optimisation at the time, the advent of digital computing temporarily offset this requirement, because the speed with which data could be entered and operations performed increased massively. Though early electronic digital computers such as ENIAC used a form of parallelism, subsequent computers most often scheduled operations using a more serial approach, the exception being input and output. Although the first commercially available parallel computer was launched in 1952, the need for widespread parallel computing was not universally acknowledged until much later, when it was realised that single processing units were likely soon to reach their maximum speed in terms of clock rate and floating-point operations per second (FLOPS).
In recognition of this, it was determined that the most efficient way to increase computational speed was to add further processing units, an approach now known as multiprocessing.

The advent of multiprocessing had a great impact on the design of both hardware and software. As the speed of a CPU vastly outstripped that of any other component, CPU-specific memory had to be increased in order to reduce the slowdown caused by storage read times.
To allow the cores to communicate without unnecessary latency, ‘bridges’ had to be created between the cores that ran at a speed comparable to that of the cores themselves. In order to facilitate the collaboration of the cores on single tasks, the availability of fast memory accessible by multiple cores became more important. This created a need for software that could accommodate the asynchronous nature of access to these memory banks, known collectively as ‘caches’, and that could split the list of tasks efficiently so that they could be assigned to multiple cores.
Caches

‘Cache’ is a term commonly used to refer to fast-access memory that is reserved solely for use by the CPU in order to speed up the operations it performs. A cache may be used as a sort of buffer, where sizeable chunks of relevant data are stored in the hope that they will be needed (a request that finds its data in the cache is a ‘cache hit’), or to hold values that are generated by the CPU whilst performing an operation. One example of the former might be reading the next N values of a list when the first item is requested, as it is likely that the rest will be needed subsequently. One example of the latter might be holding a loop counter during a mean average operation. Caches are organised into ‘levels’ of speed, with the highest (level 1) being physically connected to the CPU, often to an individual core. In modern CPUs the level 2 caches are normally connected to each core’s level 1 cache, whereas the level 3 cache is separate and shared by all cores. Cache architecture is designed in this way to allow a tiered approach to reading from it: if data is required by a core, the highest level cache is read first. If the data is not found, the lower level caches are read in succession until finally the main memory is consulted.

Bridges

A ‘bridge’ is a term commonly used to describe the connection between the CPU, its attendant RAM and the motherboard. In many architectures there are two bridges, referred to as the ‘northbridge’ and the ‘southbridge’.
The northbridge runs at a clock speed only slightly lower than that of the CPU cores themselves, and is used to allow rapid communication between the cores and the fastest components outside the CPU, principally the RAM. The southbridge runs significantly slower than the northbridge, and is used to convey data to and from the rest of the motherboard. Because of this it is often considered to be the ‘I/O relay’ for the CPU. It is worth noting, however, that Intel has more recently modified this architecture so as to include the northbridge functions within the die of the CPU, as in its ‘Sandy Bridge’ processors. This reduces the need for CPU-specific components on a motherboard.

Parallel programming paradigms

Threading

Definition

‘Threading’ is a term used to refer to the practice of separating a program into multiple distinct control flows or ‘threads’, which are largely independent of one another. These threads may then run concurrently and thus can greatly increase a process’s overall speed of execution. Threads have access to a global memory bank and thus can share data with one another, although care must be taken to ensure that this shared memory is not adversely affected by asynchronous access.
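The care needed around shared memory can be illustrated with a minimal sketch. It assumes Python's standard `threading` module, and the names `run_counter` and `worker` are hypothetical, chosen only for this example. Several threads increment one shared counter, and a lock ensures that each read-modify-write step completes before another thread may begin its own.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may execute the read-modify-write below.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish before reading the result
    return counter


if __name__ == "__main__":
    print(run_counter())  # 4 threads x 10,000 increments = 40000
```

Without the lock, the increments of different threads could overlap and some updates might be lost; with it, the result is always the number of threads multiplied by the number of increments.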
Most modern operating systems make extensive use of threading in order to streamline the user experience. A simple process such as Microsoft Notepad may contain only one thread, whereas a more complex process such as Google Chrome may contain many threads performing different functions. A thread that is managed by the operating system is known as a ‘kernel thread’ and is typically generated on boot. Threads managed by user-controlled programs are known as ‘user threads’, and are mapped to a free kernel thread when they are executed.
The process of creating and optimising threads so that they may run in tandem is often referred to as ‘multithreading’. Separate from but related to this is ‘interleaved multithreading’, where multiple virtual processors are simulated on one core and are scheduled so as to minimise the impact of latency caused by memory reads. This differs from standard multithreading in that the emphasis is on grouping read/write operations across all interleaved threads into a block, rather than on asynchronous processing.
This approach can be further broken down into ‘fine-grained’ multithreading (where threads are switched between in a round-robin fashion), ‘coarse-grained’ multithreading (where threads are switched if a particularly slow read occurs), ‘time-slice’ multithreading (where threads are switched between after a set time has elapsed) and ‘switch-on-event’ multithreading (where threads are switched between if the current thread has to wait for input).
Benefits

Threading allows the simultaneous completion of tasks without the use of specialist hardware. It also provides a conceptually unchallenging approach to parallelism, thus allowing the programmer to create more powerful solutions.
Drawbacks

All threads within a process are affected by the state of the global variables and settings within that process. If a thread performs an illegal operation and terminates, the process to which the thread belongs will also terminate.
Cluster processing

Definition

‘Cluster processing’ is a term used to refer to the practice of linking multiple computers together to form a larger ‘super-computer’. In this scenario, each networked device can be regarded as analogous to a ‘core’ in a single computer.
When designing a computer cluster, the physical layout and specification of the component machines must be carefully considered with respect to the tasks the completed system will be expected to perform. Responsibilities that involve a disparate and unconnected series of events (such as running a web server) may not necessitate homogeneity of component devices, whereas functionality with a high level of inter-process communication (such as complex modelling procedures) may demand a greater level of coupling and thus component machines of similar specification.

Computer clusters may be constructed to perform a variety of tasks, but the emphases with which they are constructed fall into two main categories: load-balancing and high-availability. A high-availability or ‘failover’ cluster is constructed to ensure that the service provided is uninterrupted regardless of circumstance. It achieves this by creating simple virtual machines to serve requests rather than serving them all from the main operating system. If one of these machines fails, a duplicate may quickly be created to resume the set task.
A load-balancing cluster attempts to ensure that all component machines within the cluster have an equal share of the workload in order to maximise the efficiency of execution. Parallelism in these systems is commonly accomplished using the Message Passing Interface, or MPI. MPI is built around the principle of sending data packets between processes, both to synchronise them and to allow them to communicate. This allows for efficiency on both a local and a global scale in homogeneous and heterogeneous clusters alike, as local scheduling can be delegated to the component machines whilst allowing supervision by an overarching management protocol.

Benefits

One of the benefits of MPI is its portability. As it relies on a simple concept, it can be implemented efficiently on a great range of hardware. MPI-2 contains support for remote-memory operations and analogues for UNIX-type file operations, thus allowing it to be implemented for different operating systems.
Furthermore, MPI allows for easy manipulation of data regardless of locality, and is able to compensate for differing hardware speeds on various networked computers. Additionally, MPI is relatively efficient: because it allows programmers to treat each machine as an individual unit rather than as a section of the whole system, code may be optimised for that unit. This division allows machine-specific peculiarities to be addressed.

Drawbacks

MPI has limited support for shared-memory operations, and thus using MPI to implement a large-scale application with shared memory may require more complexity than other approaches.
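The message-passing style on which MPI is built can be sketched in miniature. The example below does not use MPI itself: as a loudly stated assumption, Python threads stand in for processes on separate machines and `queue.Queue` objects stand in for the network, and the names `worker` and `distributed_sum` are hypothetical. Each worker shares no variables with the others; it receives a chunk of data as a message, works on it locally, and sends its partial result back.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """A stand-in for one machine: receive a chunk, sum it locally, reply."""
    chunk = inbox.get()               # blocking receive of this worker's data
    outbox.put((rank, sum(chunk)))    # send the partial result to the coordinator


def distributed_sum(values, num_workers=3):
    """Scatter values to the workers as messages, then gather the partial sums."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(num_workers)
    ]
    for w in workers:
        w.start()

    # Scatter: worker r receives every num_workers-th value, starting at index r.
    for rank in range(num_workers):
        inboxes[rank].put(values[rank::num_workers])

    # Gather: collect exactly one (rank, partial_sum) reply per worker.
    partials = dict(results.get() for _ in range(num_workers))
    for w in workers:
        w.join()
    return sum(partials.values())


if __name__ == "__main__":
    print(distributed_sum(list(range(101))))  # 0 + 1 + ... + 100 = 5050
```

Because all coordination happens through messages, each worker can be reasoned about, and optimised, as an individual unit. The cost is that any data two workers both need must be sent explicitly rather than simply read from shared memory.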
GPGPU

Definition

General-purpose computing on graphics processing units (GPGPU) is the practice of running programs using a computer’s GPU rather than its CPU. As graphics processors are purpose-built to facilitate the simultaneous processing of a great number of matrix operations, this can dramatically increase the performance of programs that operate in a compatible manner.

Description

GPGPU is now a widely used approach to parallelism, because GPUs commonly possess a great number of homogeneous cores and thus, from a conceptual standpoint, are easy to write parallel programs for.

GPGPU was first attempted when DirectX 8 became available, as there were now programmable vertex and pixel shading routines within the graphics card. Initially, the only way to perform GPGPU was through the graphics API, so algorithms had to be presented to the graphics card as if they were required for rendering.

At this point, the functionality offered by GPGPU was minimal for a number of reasons. Firstly, the locations within graphics memory were allocated and managed by the GPU, meaning that algorithms requiring access to arbitrary locations within memory could not be run.
Furthermore, there was little in the way of a standardised approach to floating-point arithmetic within a graphics processing unit, and thus scientific calculations could not be guaranteed to run on any particular machine. Finally, if the program crashed or failed, there was little to no way for the programmer to debug the erroneous code.

These problems were addressed in 2006, when Nvidia released its first graphics processor built using the CUDA architecture. This architecture was designed in part to facilitate the use of the graphics processor for general-purpose programming by allowing the reprogramming of many of the pipelines within the card.
In addition, the onboard ALUs were built to comply with the IEEE standard for floating-point arithmetic (IEEE 754), and thus were now reliably usable in scientific calculations. Finally, Nvidia worked to allow developers to use C in order to interface with the graphics card, rather than having to use a shader language through DirectX or OpenGL.

GPGPU uses a system that is something of a fusion of MPI and threading: it uses ‘blocks’, or processes, that may communicate with each other using messages, and it also allows a ‘block’ to be divided into many threads that communicate with each other using shared memory.

Benefits

GPGPU allows for an extreme increase in operational speed due to the architecture of the GPU used.
Drawbacks

As graphics processors are constructed solely to perform matrix calculations, any program intended for implementation using GPGPU must present all operations as a series of matrix calculations. This may increase complexity in situations where the data types used are not natively suited to this form of expression.

Some peculiarities of parallel computing
Race conditions

A race condition arises when a program’s output varies depending on the order in which its instructions are carried out. One example of this was found in the Therac-25 medical accelerator, which could emit a lethal dose of radiation if the user first selected ‘X-ray’ mode, then rapidly selected ‘treatment’ mode. The initial selection of X-ray mode would position the beam direction magnets so that they would not interfere with the beam, and the second selection would set the beam to be used in high-power mode rather than low-power mode. These two conditions together resulted in an undirected emission of radiation at a lethal intensity.
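The dependence of output on instruction order can be made concrete with a small, deterministic sketch. Rather than running real threads, whose timing cannot be controlled, the hypothetical functions below enumerate every order in which two simulated threads, ‘A’ and ‘B’, could each perform an unprotected ‘load’ and ‘store’ on a shared value that both intend to increment by one. The scenario and all names are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def final_values():
    """Return the set of final shared values over all possible schedules."""
    steps_a = [("A", "load"), ("A", "store")]
    steps_b = [("B", "load"), ("B", "store")]
    outcomes = set()
    for schedule in interleavings(steps_a, steps_b):
        shared = 0
        local = {}  # each thread's private copy of the shared value
        for thread, op in schedule:
            if op == "load":
                local[thread] = shared       # read the shared value
            else:
                shared = local[thread] + 1   # write back the private copy plus one
        outcomes.add(shared)
    return outcomes


if __name__ == "__main__":
    print(final_values())  # {1, 2}: one increment is lost in some schedules
```

Whenever both loads happen before either store, each thread writes back 1 and an increment is lost; only schedules in which one thread finishes before the other begins give the intended result of 2.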
Deadlocks

A deadlock is a situation in which two or more threads are each waiting for the others to complete their respective tasks. As every thread is waiting for another to signal it, none will progress and thus the system will stall. One example of this occurred in the North American version of the game ‘Bubble Bobble Revolution’. This deadlock was caused by a physical defect that prevented an enemy from spawning. The game would then be unplayable, as the player would not be able to progress until the errant enemy was defeated.

Amdahl’s law

Amdahl’s law states that parallelism may only ever increase the speed with which a program completes up to the limit imposed by the combined completion times of its serial components.

This means that when a parallel program is being created, care must be taken to minimise the quantity of serial tasks performed in order to gain the greatest increase in speed. In this case, the descriptor ‘serial task’ also applies to tasks bounded by synchronisation operations.
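The limit described by Amdahl’s law can be worked through numerically. The sketch below uses the usual formulation of the law, which is not quoted in the text above: if a fraction p of a program’s runtime can be parallelised and N processors are available, the speedup is 1 / ((1 - p) + p / N). The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def speedup_limit(parallel_fraction):
    """Upper bound on speedup as the number of processors grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # A program that is 90% parallelisable, on increasing processor counts.
    for n in (2, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
    print("limit:", round(speedup_limit(0.9), 2))
```

Even with 90% of the work parallelised, the speedup can never exceed roughly ten, because the remaining serial 10% must still run on a single processor. This is why reducing the serial portion often matters more than adding processors.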