Abstract
To improve the efficiency of simulating turbulent fluid flows with large eddy dynamics at high Reynolds numbers, a CUDA-based solution of the lattice Boltzmann method for large eddy simulation (LES) on multiple graphics processing units (GPUs) is proposed. Our solution adopts the "collision after propagation" evolution scheme and performs the misaligned propagation phase during the global memory read process. Since the latest GPU platform allows a single CPU thread to control up to four GPUs running in parallel, the whole working set is evenly partitioned into sub-domains, one per GPU. We implement the Smagorinsky model and the Vreman model to verify our multi-GPU solution. The two LES models compute the relaxation time differently and therefore lead to different CUDA implementation characteristics. The Smagorinsky implementation achieves a 190-fold speedup over the sequential CPU implementation, while the Vreman implementation achieves more than a 90-fold speedup. The experimental results show that the parallel performance of our solution scales very well on multiple GPUs. Large-scale LES–LBM simulations (up to 10,240 × 10,240 lattices) thus become possible at low cost, even with double-precision floating-point arithmetic.
Acknowledgments
This project was supported by the Aeronautical Science Fund of China (Grant No. 20111453012) and the National Defense Pre-Research Foundation of China (Grant No. 9140A13040111HK0329).
Appendices
Appendix A: Table 3
Appendix B: Kernel function LBCollProp (Smagorinsky model)
__global__ static void LBCollProp ( int nx, int* geoD, double tauf, double prefix, double* fr0, ..., double* fse0, double* fr1, ..., double* fse1 )
{
// collision and propagation processes of the fluid nodes
//registers for kernel processing
double ux, uy, uv, rho, tau;
double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; //f*
double f1, f2, f3, f4, f5, f6, f7, f8; //feq
//Index K in 1D-arrays
int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;
if(geoD[k] == FLUID) //Inner fluid nodes for update
{
// "propagation" is performed via the global memory read transactions.
// Note: some of these read transactions remain misaligned.
fr = fr0[k];
fe = fe0[k-1];
fn = fn0[k-nx];
...
//get the macroscopic properties: velocity (ux, uy) and density (rho).
rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
uv = 1/rho;
ux = (fe + fne + fse - fw - fnw - fsw);
ux *= uv;
uy = (fn + fne + fnw - fs - fsw - fse);
uy *= uv;
//compute feq from the post-propagation macroscopic quantities.
tau = rho/9.0; //tau temporarily holds rho/9 (density times the D2Q9 axis weight 1/9)
uv = 1.0 - 1.5*(ux*ux + uy*uy);
f1 = tau*( 3.0*ux + 4.5*ux*ux + uv);
...
tau = tau*0.25; //now tau = rho/36 (density times the diagonal weight 1/36)
f5 = tau*( 3.0*(ux+uy) + 4.5*(ux+uy)*(ux+uy) + uv);
...
uv = 16*tau*uv; //uv now holds feq_0 = (4/9)*rho*(1 - 1.5*(ux*ux + uy*uy))
//get the single relaxation time according to the Smagorinsky model,
//which only needs local quantities such as the nonequilibrium stress tensor.
ux = (fe-f1) + (fne-f5) + (fse-f8) + (fnw-f6) + (fw-f3) + (fsw-f7);
uy = (fne-f5) + (fn-f2) + (fnw-f6) + (fsw-f7) + (fs-f4) + (fse-f8);
tau = (fne-f5) + (fsw-f7) + (f6-fnw) + (f8-fse);
tau = sqrt(2*(ux*ux+uy*uy+2*tau*tau))*prefix;
tau = (sqrt(tauf*tauf+tau/rho)+tauf)*0.5;
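// The two lines above implement tau_total = ( tauf + sqrt(tauf*tauf + prefix*|Q|/rho) ) / 2,
// where |Q| is the magnitude of the nonequilibrium stress tensor assembled above and
// prefix presumably folds the Smagorinsky constant and filter width into one
// precomputed factor (an assumption; the paper does not list its definition).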
// perform the LBM iteration according to:
// f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
// write data back to device memory, coalesced global memory accesses.
tau = 1/tau;
ux = 1-tau;
fr1[k] = ux*fr + tau*uv;
fe1[k] = ux*fe + tau*f1;
fn1[k] = ux*fn + tau*f2;
...
}
}
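For orientation, a minimal host-side launch sketch for this kernel follows. It is not shown in the paper; THREAD_NUM, the grid shape, and the ordering of the ping-pong pointer arguments are assumptions consistent with the evolution loop in Appendix D.
// Hypothetical single-GPU launch of LBCollProp for one time step.
// One thread per lattice node: blockDim.x = THREAD_NUM, grid = (nx/THREAD_NUM, ny).
// The f*0 arrays hold the pre-step distributions, the f*1 arrays receive the
// results; the two sets swap roles on the next step.
dim3 dimBlock(THREAD_NUM, 1);
dim3 dimGrid(nx / THREAD_NUM, ny);
LBCollProp<<<dimGrid, dimBlock>>>(nx, geoD, tauf, prefix,
    fr0, fe0, fn0, fw0, fs0, fne0, fnw0, fsw0, fse0,
    fr1, fe1, fn1, fw1, fs1, fne1, fnw1, fsw1, fse1);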
Appendix C: Kernel functions LBProp and LBColl (Vreman model)
__global__ static void LBProp ( const int nx, const int* geoD, double* dev_ux, double* dev_uy, double* dev_rho, double* fr0, ..., double* fse0, double* fr1, ..., double* fse1 )
{
//This kernel function is responsible for propagation of the fluid nodes
double ux, uy, uv, rho, tmp;
double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; //f*
//Index K in 1D-arrays
int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;
if(geoD[k] == FLUID) //Inner fluid nodes for update
{
// "propagation" is performed via the global memory read transactions.
// Note: some of these read transactions remain misaligned.
fr = fr0[k];
fe = fe0[k-1];
fn = fn0[k-nx];
...
// store the whole set of updated macroscopic properties to global memory;
// the kernel boundary then provides the grid-wide synchronization LBColl relies on.
rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
dev_rho[k] = rho;
tmp = 1/rho;
ux = (fe + fne + fse - fw - fnw - fsw);
dev_ux[k] = ux*tmp;
uy = (fn + fne + fnw - fs - fsw - fse);
dev_uy[k] = uy*tmp;
fr1[k] = fr;
fe1[k] = fe;
fn1[k] = fn;
...
}
}
__global__ static void LBColl ( int nx, int* geoD, double tauf, double prefix, double* dev_ux, double* dev_uy, double* dev_rho, double* fr1, ..., double* fse1 )
{
//This kernel function is responsible for collision of the fluid nodes
double ux, uy, uv, rho, tmp1, tmp2;
double delta_Uxx, delta_Uxy, delta_Uyx, delta_Uyy, up, down;
double tauf0, tauf1, tauf2, tauf3,tauf4, tauf5;
//Index K in 1D-arrays
int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;
if(geoD[k] == FLUID)
{
ux = dev_ux[k];
uy = dev_uy[k];
rho = dev_rho[k];
//get the resolved velocity gradient tensor: gij
delta_Uxx = 0.5*(dev_ux[k+1] - dev_ux[k-1]); //g11
delta_Uxy = 0.5*(dev_ux[k+nx] - dev_ux[k-nx]); //g12
delta_Uyx = 0.5*(dev_uy[k+1] - dev_uy[k-1]); //g21
delta_Uyy = 0.5*(dev_uy[k+nx] - dev_uy[k-nx]); //g22
//numerator: (g12*g21 - g11*g22)^2
up = (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy) * (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy);
//denominator: gij*gij = g11*g11 + g12*g12 + g21*g21 + g22*g22
down = delta_Uxx*delta_Uxx + delta_Uxy*delta_Uxy + delta_Uyx*delta_Uyx + delta_Uyy*delta_Uyy + 1e-15; //1e-15 guards against division by zero
//get the single relaxation time according to Eq. (6).
tauf0 = tauf + 3.0*prefix*sqrt(up/down);
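// The line above implements tau = tauf + 3*nu_t with eddy viscosity
// nu_t = prefix*sqrt(up/down); the factor 3 converts viscosity into relaxation
// time in lattice units (c_s^2 = 1/3), and prefix presumably carries the Vreman
// model constant (an assumption; the paper does not list its definition).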
// perform the LBM iteration according to:
// f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
// read f* and write f back with coalesced global memory accesses.
tauf1 = 1.0/tauf0;
tauf2 = 1.0 - tauf1;
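// tauf3, tauf4 and tauf5 below are the D2Q9 weights 4/9, 1/9 and 1/36 divided
// by the relaxation time tauf0.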
tauf3 = (4.0/9.0)*tauf1;
tauf4 = 0.25*tauf3;
tauf5 = 0.25*tauf4;
uv = 1.0 - 1.5*(ux*ux + uy*uy);
tmp1 = 4.5*ux*ux;
tmp2 = 4.5*uy*uy;
fr1[k] = tauf2*fr1[k] + tauf3*rho*uv;
fe1[k] = tauf2*fe1[k] + tauf4*rho*( 3.0*ux + tmp1 + uv);
fn1[k] = tauf2*fn1[k] + tauf4*rho*( 3.0*uy + tmp2 + uv);
...
tmp1 = 4.5*(ux + uy)*(ux + uy) + uv;
tmp2 = 4.5*(uy - ux)*(uy - ux) + uv;
fne1[k] = tauf2*fne1[k] + tauf5*rho*( 3.0*(ux + uy) + tmp1);
fnw1[k] = tauf2*fnw1[k] + tauf5*rho*( 3.0*(uy - ux) + tmp2);
...
}
}
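The two-kernel split exists because the velocity-gradient stencil in LBColl reads dev_ux, dev_uy and dev_rho at neighbouring nodes, so every node must finish LBProp before any node starts colliding; on GPUs of this generation the only grid-wide barrier is a kernel boundary. A hypothetical host-side sequence for one time step follows (launch shapes and elided argument lists assumed as in Appendix D):
// Hypothetical two-launch sequence for one Vreman time step. Launching both
// kernels on the same (default) stream serializes them, which supplies the
// grid-wide synchronization between propagation and collision.
LBProp<<<dimGrid, dimBlock>>>(nx, geoD, dev_ux, dev_uy, dev_rho,
    fr0, /* ... */ fse0, fr1, /* ... */ fse1);
LBColl<<<dimGrid, dimBlock>>>(nx, geoD, tauf, prefix,
    dev_ux, dev_uy, dev_rho, fr1, /* ... */ fse1);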
Appendix D: An excerpt of the evolution loop of the multi-GPU implementation
void evolution(int steps)
{
dim3 dimBlock( THREAD_NUM, 1 );
dim3 dimGrid1( nx / THREAD_NUM ); //.y is set to each sub-domain height below
dim3 dimGrid2( 0, 2 ); //.x is set per sub-domain below
dim3 dimGrid3( nx / THREAD_NUM, 1 );
bool isEvenStep = true; //switch between two sets of arrays.
for(int i = 0; i < steps; i++){
isEvenStep = (i % 2 == 0);
switch(gpu_count){
case QUADRUPLE_GPU:
cudaSetDevice(gpuid_tesla[3]);
dimGrid1.y = height_r[3];
dimGrid2.x = (height_r[3] + THREAD_NUM - 1) / THREAD_NUM;
launch_LBCollProp(...); //dimGrid1, inner fluid nodes
launch_LBBC_LR(...); //dimGrid2, left and right boundaries
launch_LBBC_UP(...); //dimGrid3, upper boundary
... //Note: no "break" here; execution falls through to the cases below!
case SINGLE_GPU:
cudaSetDevice(gpuid_tesla[0]);
dimGrid1.y = height_r[0];
dimGrid2.x = (height_r[0] + THREAD_NUM - 1) / THREAD_NUM;
launch_LBCollProp(...);
launch_LBBC_LR(...);
launch_LBBC_DOWN(...); //dimGrid3, lower boundary
if(gpu_count == SINGLE_GPU)
launch_LBBC_UP(...);
break;
default:
break;
}
//GPU to GPU communication between adjacent work sets.
if(gpu_count >= DOUBLE_GPU){ //GPU1 - GPU2
if(isEvenStep){
src_offset = nx*(ny/gpu_count-1);
dest_offset = 0;
//transfer the 3 north-moving distribution functions to the corresponding "ghost layer"
packedTransfer(src_offset, dest_offset, fn1, fne1, fnw1, fn3, fne3, fnw3, pitch, width, 1, cudaMemcpyDefault);
src_offset = nx;
dest_offset = nx*(ny/gpu_count);
packedTransfer(src_offset, dest_offset, fs3, fsw3, fse3, fs1, fsw1, fse1, pitch, width, 1, cudaMemcpyDefault);
}else{
src_offset = nx*(ny/gpu_count-1);
dest_offset = 0;
packedTransfer(src_offset, dest_offset, fn0, fne0, fnw0, fn2, fne2, fnw2, pitch, width, 1, cudaMemcpyDefault);
src_offset = nx;
dest_offset = nx*(ny/gpu_count);
packedTransfer(src_offset, dest_offset, fs2, fsw2, fse2, fs0, fsw0, fse0, pitch, width, 1, cudaMemcpyDefault);
}
}
if(gpu_count >= TRIPLE_GPU) { ... } //GPU2 - GPU3
if(gpu_count >= QUADRUPLE_GPU) { ... } //GPU3 - GPU4
cudaThreadSynchronize();
}
}
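The body of packedTransfer is not listed in the paper. Given the use of cudaMemcpyDefault, which with unified virtual addressing lets the runtime infer the source and destination devices and take a peer-to-peer path where available, a minimal sketch of such a helper could look as follows; the parameter interpretation (element offsets, one pitched row per array) is an assumption.
// Hypothetical sketch of packedTransfer: copy 'rows' boundary rows of each of
// three distribution arrays into the ghost layer on the neighbouring GPU.
// 'width' is the row width in bytes and 'pitch' the allocated row pitch.
void packedTransfer(size_t src_offset, size_t dest_offset,
    double* srcA, double* srcB, double* srcC,
    double* dstA, double* dstB, double* dstC,
    size_t pitch, size_t width, size_t rows, cudaMemcpyKind kind)
{
    cudaMemcpy2D(dstA + dest_offset, pitch, srcA + src_offset, pitch, width, rows, kind);
    cudaMemcpy2D(dstB + dest_offset, pitch, srcB + src_offset, pitch, width, rows, kind);
    cudaMemcpy2D(dstC + dest_offset, pitch, srcC + src_offset, pitch, width, rows, kind);
}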
Cite this article
Li, Q., Zhong, C., Li, K. et al. A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs. Computing 96, 479–501 (2014). https://doi.org/10.1007/s00607-013-0356-7