Lattice Boltzmann Method Implementation on Multiple Devices Using Opencl

Published: Jul 10, 2019

The scientific computing community has long been closely tied to high performance computing (HPC), which was the privilege of a limited group of scientists. Recently, with the rapid development of Graphics Processing Units (GPUs), the parallel processing power of high performance computers has reached every commodity desktop computer, reducing the cost of scientific computations. In this paper, we develop a general-purpose Lattice Boltzmann code that runs on a commodity computer with multiple heterogeneous devices that support the OpenCL specification. Different approaches to implementing Lattice Boltzmann code on a commodity computer with multiple devices were explored. Simulation results for the different multi-device implementations were compared with each other, with results obtained for a single-device implementation, and with results from the literature. On commodity hardware platforms, the multi-device implementations showed a significant speed improvement over the simulation implemented on a single device.

The computer processor industry reached a turning point a few years ago, when CPU performance improvements hit a serious frequency wall. Major processor vendors started manufacturing multi-core processors, and all the major GPU vendors turned to many-core GPU designs. With the development of many-core and multi-core hardware architectures, the use of numerical computer simulation has increased in nearly every area of science and engineering.

Recently, the lattice Boltzmann method (LBM) has become an alternative method for computational fluid dynamics (CFD) and has proved its capability to simulate a large variety of fluid flows. LBM is computationally expensive and memory demanding, but because it is explicit and its governing equations are local (only nearest-neighbor information is needed), the method is well suited to parallel computation on many-core and multi-core hardware architectures.

The Graphics Processing Unit (GPU) is a massively multi-threaded architecture that is widely used for graphical and, now, non-graphical computations. The main advantage of GPUs is their ability to perform significantly more floating-point operations (FLOPs) per unit time than CPUs.

To unify software development for different hardware devices (mostly GPUs), an effort has been made to establish a standard for programming heterogeneous platforms: OpenCL.

Using the full potential of modern multi-core CPUs and many-core GPUs comes at a considerable cost: sequential code must be (re)written to explicitly expose algorithmic parallelism. Various programming models have been established, and they are often vendor specific.

The main objective of the present work is to implement the Lattice Boltzmann method according to the OpenCL specification, with the computationally most intensive parts of the algorithm running on multiple heterogeneous devices, which speeds up the simulation compared to a single-device implementation. A further objective is to show that, by using the Java programming language and OpenCL, all devices available on commodity computer hardware can be exploited to speed up scientific simulations.

Moreover, two different implementations for a commodity computer with multiple heterogeneous devices are created and their performance is compared. The implementations are developed using the Java programming language for the host (controlling program) and the OpenCL specification for the kernels (written to parallelize parts of the algorithm on two or more heterogeneous devices). Binding between the host (Java) and kernel (OpenCL) programs is done by a Java library (JOCL). The simulation has been executed on three different commodity hardware platforms. The performance of the implementations is compared, and it is concluded that the implementations that run on two or more OpenCL devices perform better than the implementation running on only one device.

Multi-GPU implementations of LBM using CUDA have been discussed extensively in the literature.

An implementation of cavity flow using the D3Q19 lattice model, the multi-relaxation-time (MRT) approximation and CUDA has been presented; the simulation was tested on one node consisting of six Tesla C1060 devices, and POSIX threads were used to implement parallelism. Another work described cavity flow for various depth–width aspect ratios using the D3Q19 model and MRT approximation; the simulation was parallelized using OpenMP and tested on a single-node multi-GPU system consisting of three nVIDIA M2070 devices or three nVIDIA GTX560 devices. A further work presented an LBM implementation for fluid flow through porous media on multiple GPUs, using CUDA and MPI; some optimization strategies based on the data structure and layout were also proposed, and the implementation was tested on a one-node cluster equipped with four Tesla C1060 devices.

Other authors adopted the message passing interface (MPI) technique for GPU management on a cluster of GPUs and explored the speed-up of a cavity flow implementation obtained by overlapping communication and computation. In that reference the D3Q19 model and MRT approximation are also used. Xian described a CUDA implementation of the flow around a sphere using the D3Q19 model and MRT approximation. Parallelism of the code is based on the MPI library. Communication time is reduced by partitioning the solution domain or by overlapping computation and communication using multiple streams. The computation was run on a supercomputer equipped with 170 nodes of Tesla S1070 (680 GPUs). Single-phase, multi-phase and multi-component LBM have also been implemented on multi-GPU clusters using CUDA and OpenMP.

So far, very few OpenCL implementations of LB codes have been described in the literature.

One study compares CUDA and OpenCL LBM implementations on one compute unit and shows that properly structured OpenCL code reaches performance levels close to those obtained with the CUDA architecture.

To the best of the authors' knowledge, no papers have been published concerning the implementation of LBM using Java and OpenCL on multiple devices of commodity computers.

A. Lattice Boltzmann equation

In the Lattice Boltzmann Method, the motion of the fluid is simulated by particle movement and collision on a uniform lattice, and the fluid is modelled by a single-particle distribution function. The evolution of the distribution function is governed by the lattice Boltzmann equation:

$$f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) + \Omega_i \qquad (1)$$

where $f_i$ is the distribution function for the particle with velocity $\mathbf{e}_i$ at position $\mathbf{x}$ and time $t$, $\Delta t$ is the time increment, and $\Omega_i$ is the collision operator. Equation (1) states that the streamed particle distribution function at the neighbour node $\mathbf{x} + \mathbf{e}_i \Delta t$ at the next time step $t + \Delta t$ is the current particle distribution $f_i(\mathbf{x}, t)$ plus the collision operator $\Omega_i$. The streaming of a particle distribution function occurs in the time $\Delta t$ over the distance $\mathbf{e}_i \Delta t$, which is the distance between lattice sites. The collision operator $\Omega_i$ models the rate of change of the distribution function due to molecular collisions.

A collision model was proposed by Bhatnagar, Gross and Krook (BGK) to simplify the analysis of the lattice Boltzmann equation. Using the LB-BGK approximation, equation (1) can be written as

$$f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t)\right] \qquad (2)$$

This equation is the well-known LBGK model, and it is consistent with the Navier-Stokes equations for fluid flow in the limit of small Mach number and incompressible flow. In equation (2), $f_i^{eq}$ is the local equilibrium distribution and $\tau$ is the single relaxation parameter associated with collisional relaxation to the local equilibrium.

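In applications, a lattice Boltzmann model must be chosen. Most research papers use the D2Q9 model, which was also used in this work. The name implies that the model is two-dimensional and that at each lattice point there are nine velocities (N=9) along which a particle can travel. The equilibrium particle distribution function for the D2Q9 model is given by

$$f_i^{eq} = w_i \rho \left[1 + \frac{3\,(\mathbf{e}_i \cdot \mathbf{u})}{c^2} + \frac{9\,(\mathbf{e}_i \cdot \mathbf{u})^2}{2c^4} - \frac{3\,(\mathbf{u} \cdot \mathbf{u})}{2c^2}\right]$$

where $\mathbf{u}$ and $\rho$ are the macroscopic velocity and density, respectively, $c$ is $\Delta x / \Delta t$, which has a magnitude of one in this model, and $w_i$ are the weights, given by $w_0 = 4/9$, $w_i = 1/9$ for $i = 1, \dots, 4$ and $w_i = 1/36$ for $i = 5, \dots, 8$. The discrete velocities for D2Q9 are given by

$$\mathbf{e}_i = \begin{cases} (0, 0), & i = 0 \\ c\left(\cos\left[(i-1)\pi/2\right],\ \sin\left[(i-1)\pi/2\right]\right), & i = 1, \dots, 4 \\ \sqrt{2}\,c\left(\cos\left[(2i-9)\pi/4\right],\ \sin\left[(2i-9)\pi/4\right]\right), & i = 5, \dots, 8 \end{cases}$$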
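As an illustration of the equilibrium distribution described above, the following is a minimal Java sketch for a single lattice node with $c = 1$. The class name `D2Q9Equilibrium`, the method name and the ordering of the nine directions (rest, four axis directions, four diagonals) are assumptions made for this example and are not taken from the original code.

```java
/** Sketch of the D2Q9 equilibrium distribution with lattice speed c = 1. */
final class D2Q9Equilibrium {
    // Assumed direction ordering: rest, then E, N, W, S, then NE, NW, SW, SE.
    static final int[] EX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] EY = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    // Standard D2Q9 weights: 4/9 (rest), 1/9 (axis), 1/36 (diagonal).
    static final double[] W = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                               1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

    /** Returns the nine equilibrium populations for density rho and velocity (ux, uy). */
    static double[] equilibrium(double rho, double ux, double uy) {
        double[] feq = new double[9];
        double uu = ux * ux + uy * uy;
        for (int i = 0; i < 9; i++) {
            double eu = EX[i] * ux + EY[i] * uy;
            feq[i] = W[i] * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * uu);
        }
        return feq;
    }

    public static void main(String[] args) {
        double[] feq = equilibrium(1.0, 0.1, 0.0);
        double sum = 0.0;
        for (double v : feq) {
            sum += v;
        }
        System.out.println("sum of equilibrium populations = " + sum);
    }
}
```

By construction, the nine populations sum to the density and their first moment equals the momentum, which gives a quick sanity check for any implementation of this formula.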

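The macroscopic quantities $\rho$ and $\mathbf{u}$ can be evaluated as

$$\rho = \sum_{i=0}^{8} f_i, \qquad \rho\,\mathbf{u} = \sum_{i=0}^{8} \mathbf{e}_i f_i$$

The macroscopic kinematic viscosity is given by

$$\nu = \frac{2\tau - 1}{6}\,\frac{(\Delta x)^2}{\Delta t}$$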
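The evaluation of the macroscopic quantities and of the viscosity can be sketched in a few lines of Java. The class and method names are hypothetical, the direction ordering is the same assumed ordering (rest, axis directions, diagonals), and lattice units with $\Delta x = \Delta t = 1$ are assumed.

```java
/** Sketch: density, velocity and viscosity in lattice units (dx = dt = 1). */
final class D2Q9Moments {
    // Assumed direction ordering: rest, then E, N, W, S, then NE, NW, SW, SE.
    static final int[] EX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] EY = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    /** Returns {rho, ux, uy} computed from the nine populations of one node. */
    static double[] moments(double[] f) {
        double rho = 0.0, mx = 0.0, my = 0.0;
        for (int i = 0; i < 9; i++) {
            rho += f[i];          // zeroth moment: density
            mx += EX[i] * f[i];   // first moment: momentum components
            my += EY[i] * f[i];
        }
        return new double[] {rho, mx / rho, my / rho};
    }

    /** Kinematic viscosity for relaxation parameter tau. */
    static double viscosity(double tau) {
        return (2.0 * tau - 1.0) / 6.0;
    }

    public static void main(String[] args) {
        double[] rest = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                         1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};
        double[] m = moments(rest);
        System.out.println("rho=" + m[0] + " ux=" + m[1] + " uy=" + m[2]);
        System.out.println("nu(tau=1)=" + viscosity(1.0));
    }
}
```

The viscosity formula also shows why $\tau$ must stay above 0.5: at that value the viscosity vanishes, and below it the viscosity would be negative.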

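Equation (2) is usually solved by assuming $\Delta t = 1$ according to the following two steps:

Collision step: $$\tilde{f}_i(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t)\right]$$

Streaming step: $$f_i(\mathbf{x} + \mathbf{e}_i,\, t + 1) = \tilde{f}_i(\mathbf{x}, t)$$

where $\tilde{f}_i$ denotes the distribution function after collision, and $f_i(\mathbf{x} + \mathbf{e}_i, t + 1)$ is the value of the distribution function after both the streaming and collision operations are finished.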
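The collision and streaming steps can be sketched as a single pass that reads from one lattice and writes into a second one. The Java example below is only an illustration under stated assumptions: it uses periodic boundaries instead of the walls and moving lid of the cavity problem, stores each direction in its own one-dimensional array indexed as `x * ny + y`, and uses hypothetical names throughout.

```java
/** Sketch of one LBGK time step (dt = 1) on a periodic grid using two lattices. */
final class CollideStream {
    // Assumed direction ordering: rest, then E, N, W, S, then NE, NW, SW, SE.
    static final int[] EX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] EY = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    static final double[] W = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                               1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

    /** src[i][x*ny+y] holds f_i before the step; dst receives f_i after collision and streaming. */
    static void step(double[][] src, double[][] dst, int nx, int ny, double tau) {
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                int n = x * ny + y;
                double rho = 0.0, mx = 0.0, my = 0.0;
                for (int i = 0; i < 9; i++) {
                    rho += src[i][n];
                    mx += EX[i] * src[i][n];
                    my += EY[i] * src[i][n];
                }
                double ux = mx / rho, uy = my / rho;
                double uu = ux * ux + uy * uy;
                for (int i = 0; i < 9; i++) {
                    double eu = EX[i] * ux + EY[i] * uy;
                    double feq = W[i] * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * uu);
                    // Collision: relax towards the local equilibrium.
                    double post = src[i][n] - (src[i][n] - feq) / tau;
                    // Streaming: move to the neighbour node (periodic wrap is an assumption).
                    int xn = (x + EX[i] + nx) % nx;
                    int yn = (y + EY[i] + ny) % ny;
                    dst[i][xn * ny + yn] = post;
                }
            }
        }
    }

    /** Sum of all populations, which the step should conserve. */
    static double totalMass(double[][] f) {
        double m = 0.0;
        for (double[] fi : f) {
            for (double v : fi) {
                m += v;
            }
        }
        return m;
    }
}
```

Writing into a separate destination lattice removes the data dependency between neighbouring nodes, so every node can be processed independently, which is the property that makes the step suitable for one work-item per node on a parallel device.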

The third step in the implementation of LBM is the determination of the boundary conditions. In the present work, the bounce-back boundary condition has been applied at the walls because it is easy to implement and gives reasonable results in a simple bounded domain. For the moving lid, the equilibrium scheme has been used.
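A minimal Java sketch of a bounce-back rule at a single wall node is given here for illustration. It assumes the simple variant in which every population is reflected into the opposite direction, and the same assumed direction ordering (rest, axis directions, diagonals); the text does not state which bounce-back variant the original code uses, and the names are hypothetical.

```java
/** Sketch of a bounce-back boundary rule for one D2Q9 wall node. */
final class BounceBack {
    // Assumed direction ordering: rest, then E, N, W, S, then NE, NW, SW, SE.
    static final int[] EX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] EY = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    // OPP[i] is the index of the direction opposite to direction i.
    static final int[] OPP = {0, 3, 4, 1, 2, 7, 8, 5, 6};

    /** Reflects all nine populations of a wall node in place. */
    static void apply(double[] f) {
        double[] tmp = f.clone();
        for (int i = 0; i < 9; i++) {
            f[i] = tmp[OPP[i]];
        }
    }

    public static void main(String[] args) {
        double[] f = {0, 1, 2, 3, 4, 5, 6, 7, 8};
        apply(f);
        System.out.println(java.util.Arrays.toString(f));
    }
}
```

Because the rule only permutes the populations of one node, it conserves mass at the wall and needs no neighbour data, which is why it fits into the same kernel as the other purely local operations.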

Lattice Boltzmann method implementations for multiple heterogeneous devices are shown in this section. The main difference between these implementations is in how data is transferred to and from the heterogeneous OpenCL devices.

Both implementations use the same OpenCL kernels. The D2Q9 model is used for data representation, and the particle distribution functions are represented by nine arrays. Since OpenCL does not support two-dimensional arrays, data is mapped from a two-dimensional to a one-dimensional array. The two-lattice algorithm is used for both implementations of the Lattice Boltzmann method. Since this algorithm handles the data dependency by storing the distribution values in duplicated lattices for the streaming phase, a ghost layer of arrays for the particle distribution functions is created.

The created arrays are divided into subdomains along the X direction, one for each (multi-core/many-core) device, so that the domain is split across the devices. The subdomain size depends on the characteristics of each device. Since border information must be exchanged after the streaming phase between iterations of the solver, one more ghost layer is created. This layer is used to exchange particle distribution function data between devices and contains only the border information that needs to be exchanged. This minimizes the amount of data copied from a device to the host and from the host to the next device, and it is employed for each subdomain. Arrays containing input parameters (such as the size along the x axis, the size along the y axis, the number of devices, u0, alpha …) are used by all devices; this data is not divided into subdomains, since it must be sent to every device. The implementation consists of five steps.

The first step in the implementation is the allocation of memory on the host: all needed arrays are allocated, and pointers to them are created using the org.jocl.Pointer class.
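The mapping from two-dimensional node coordinates to a one-dimensional array can be sketched as follows. The layout `x * ny + y` is an assumption chosen for this example so that all nodes with the same X coordinate are contiguous in memory; the text does not say which layout the original code uses, and the names are hypothetical.

```java
/** Sketch of a 2D-to-1D index mapping for the nine D2Q9 population arrays. */
final class LatticeIndex {
    /** Linear index of node (x, y); X is the slowest-varying coordinate. */
    static int index(int x, int y, int ny) {
        return x * ny + y;
    }

    /** X coordinate recovered from a linear index. */
    static int xOf(int idx, int ny) {
        return idx / ny;
    }

    /** Y coordinate recovered from a linear index. */
    static int yOf(int idx, int ny) {
        return idx % ny;
    }

    /** Allocates nine flat arrays, one per lattice direction. */
    static float[][] allocate(int nx, int ny) {
        return new float[9][nx * ny];
    }

    public static void main(String[] args) {
        int ny = 3;
        int idx = index(2, 1, ny);
        System.out.println("index=" + idx + " x=" + xOf(idx, ny) + " y=" + yOf(idx, ny));
    }
}
```

With this layout, a block of consecutive X columns occupies one contiguous range of each array, which is convenient when the domain is later divided along the X direction.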
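One way to divide the domain along X in proportion to device capability is sketched below in Java. The per-device weights are a hypothetical stand-in for whatever device characteristics the original code uses to size the subdomains, and the class and method names are assumptions.

```java
/** Sketch: split nx lattice columns among devices in proportion to given weights. */
final class DomainSplit {
    /**
     * Returns an array of length weights.length + 1 where subdomain d
     * covers columns start[d] (inclusive) to start[d + 1] (exclusive).
     */
    static int[] split(int nx, double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        int[] start = new int[weights.length + 1];
        double acc = 0.0;
        for (int d = 0; d < weights.length; d++) {
            acc += weights[d];
            start[d + 1] = (int) Math.round(nx * acc / total);
        }
        start[weights.length] = nx; // guard against rounding at the last boundary
        return start;
    }

    public static void main(String[] args) {
        int[] start = split(100, new double[] {3.0, 1.0});
        System.out.println(java.util.Arrays.toString(start));
    }
}
```

In D2Q9 only three of the nine populations cross a border between neighbouring X subdomains in each direction, so an exchange layer holding just those values is much smaller than a full column of all nine arrays.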

The second step is the creation of OpenCL objects and the division of data into subdomains. This step can be implemented in two different ways: data can be split into subdomains either before or after the OpenCL objects are created.

In the first implementation (Sub-buffer impl.), one OpenCL object is created for each previously created pointer using the clCreateBuffer function. The OpenCL objects are then split into partial objects using the clCreateSubBuffer function, and from each object a new array of partial objects is created. The method createInfo returns a pointer to a structure that defines the buffer subset for the sub-buffer. All partial objects are aliases of the matching global buffers, so no new memory is allocated. The number of partial objects created from one OpenCL object is equal to the number of OpenCL devices. Each particle distribution function is divided into subdomains with sub-buffers in this way.

In the second implementation (Pointer impl.), data is split into subdomains before the OpenCL objects are created. For each pointer that points to one global buffer, a new array of pointers is created using the method org.jocl.Pointer.withByteOffset, which returns a new partial pointer with an offset of the given number of bytes. The size of each created array of pointers is equal to the number of available OpenCL devices. For each created partial pointer, one OpenCL object is created using the clCreateBuffer function. Each particle distribution function is divided into subdomains with org.jocl.Pointer in this way. The value of flagPtr depends on the OpenCL device used: if the device is both the host and a compute device, the value is CL_MEM_USE_HOST_PTR; if the device is only a compute device, the value is CL_MEM_COPY_HOST_PTR.

The third step is accessing the available OpenCL devices on the platform. During this phase one context is created, devices are associated with the context by obtaining the device IDs, and one command queue is created per device. At run time, the program is created and built from source code. One instance of each kernel is created for each device, and the appropriate memory objects are set as arguments for each kernel instance.

The fourth step is the simulation. The first executed kernel computes the values for bounce-back, macroscopic quantities and collision. These operations are purely local: they require only local computation, and each cell can execute them independently.
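The sub-buffer approach needs, for every device, the byte origin and byte size of its part of the global buffer; these are the two fields of the OpenCL `cl_buffer_region` structure. The Java sketch below only computes those numbers and deliberately makes no JOCL calls, so that it runs without an OpenCL installation. The class name, the subdomain start offsets and the contiguous-in-X layout are assumptions for illustration and do not reproduce the original createInfo method.

```java
/** Sketch: byte regions of one global buffer, one region per device. */
final class SubBufferRegions {
    /**
     * start has one entry per device plus a final end entry (in lattice columns).
     * Returns {originBytes, sizeBytes} for each device, assuming each X column
     * holds ny contiguous values of bytesPerValue bytes.
     */
    static long[][] regions(int[] start, int ny, int bytesPerValue) {
        long[][] r = new long[start.length - 1][2];
        for (int d = 0; d < r.length; d++) {
            r[d][0] = (long) start[d] * ny * bytesPerValue;                    // origin
            r[d][1] = (long) (start[d + 1] - start[d]) * ny * bytesPerValue;   // size
        }
        return r;
    }

    public static void main(String[] args) {
        long[][] r = regions(new int[] {0, 50, 100}, 10, Float.BYTES);
        for (long[] region : r) {
            System.out.println("origin=" + region[0] + " size=" + region[1]);
        }
    }
}
```

In a real OpenCL program the origin of a sub-buffer must also respect the device's base address alignment (`CL_DEVICE_MEM_BASE_ADDR_ALIGN`), so the subdomain boundaries may need to be adjusted to satisfy it.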
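The idea behind the pointer approach, several offset views into one host array that share its storage, can be imitated with standard `java.nio` buffers. The sketch below is only an analogy to `org.jocl.Pointer.withByteOffset` and makes no JOCL calls; the class and method names are hypothetical, and the two flag constants are defined locally with the values from the OpenCL headers.

```java
import java.nio.FloatBuffer;

/** Sketch: per-device views of one host array plus the host-pointer flag choice. */
final class HostSlices {
    // Bit values as defined in the OpenCL headers.
    static final long CL_MEM_USE_HOST_PTR = 1L << 3;
    static final long CL_MEM_COPY_HOST_PTR = 1L << 5;

    /** Returns one view per device; all views share the storage of host. */
    static FloatBuffer[] slices(float[] host, int[] start, int ny) {
        FloatBuffer[] parts = new FloatBuffer[start.length - 1];
        for (int d = 0; d < parts.length; d++) {
            int offset = start[d] * ny;                  // first value of subdomain d
            int length = (start[d + 1] - start[d]) * ny; // number of values in subdomain d
            parts[d] = FloatBuffer.wrap(host, offset, length).slice();
        }
        return parts;
    }

    /** Flag choice described in the text: reuse host memory only when the device is the host. */
    static long hostPtrFlag(boolean deviceIsHost) {
        return deviceIsHost ? CL_MEM_USE_HOST_PTR : CL_MEM_COPY_HOST_PTR;
    }

    public static void main(String[] args) {
        float[] host = new float[40];
        FloatBuffer[] parts = slices(host, new int[] {0, 2, 4}, 10);
        parts[1].put(0, 7.0f); // writes through to host[20]
        System.out.println("host[20]=" + host[20] + " flag=" + hostPtrFlag(false));
    }
}
```

Because every view refers to the same underlying array, no data is duplicated on the host when the domain is split, which mirrors the role of the partial pointers in the second implementation.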

Since the two-lattice algorithm is used for the implementation, the streaming phase is divided into two parts, and one kernel is created for each of these parts.

This essay was reviewed by
Alex Wood

Cite this Essay

Lattice Boltzmann Method Implementation On Multiple Devices Using OpenCL. (2019, Jun 27). GradesFixer. Retrieved January 11, 2025, from https://gradesfixer.com/free-essay-examples/lattice-boltzmann-method-implementation-on-multiple-devices-using-opencl/