Performance analysis of a sum-table-based method for computing cross-correlation in GPU-accelerated ultrasound strain elastography
Peng Bo1, Luo Shasha1, Yang Feng1, Jiang Jinfeng2     
1. School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500, China;
2. Department of Biomedical Engineering, Michigan Technological University, Houghton, Michigan 49931, USA

Overview: In our ultrasound strain elastography system, a modified block-matching algorithm is used to estimate tissue motion. Local strains are then estimated and used as surrogates of tissue elasticity. Calculating correlation within the block-matching framework is a critical and very computationally intensive step. Because the correlation calculations are largely independent of one another, graphics processing units (GPUs) have been used to improve computational efficiency through massively parallel programming. It is known in the literature that the sum-table-based method, abbreviated below as ST-NCC, can greatly reduce the computational burden of calculating the normalized correlation coefficient (NCC) in a serial computing environment. However, the performance of ST-NCC on a parallel computing platform, particularly a GPU, has yet to be investigated. The objective of this study is therefore to investigate the performance of the ST-NCC method in the above-mentioned GPU-accelerated ultrasound strain elastography. More specifically, a published ST-NCC method by Luo et al. and the conventional NCC method were both programmed using CUDA (Version 9.0, NVIDIA Inc., CA, USA) and tested on an NVIDIA GeForce GTX TITAN X card. Two basic CUDA programming strategies were employed in all CUDA implementations to improve computational efficiency. First, to increase memory bandwidth, texture memory was used to store the 2-D RF signals prior to the calculation of cross-correlation. Second, variables that require frequent access (e.g., axial and lateral search ranges) were kept in read-only memory for rapid access. In terms of advanced CUDA programming strategies, on the one hand, a classic parallel scan method was adopted to generate the sum-table data for the ST-NCC method.
On the other hand, several different on-chip memory optimization strategies were used to implement the classic NCC method and were compared against one another. Only the most computationally efficient implementation was compared with the above-mentioned GPU-accelerated ST-NCC method. Finally, performance was assessed using simulated ultrasound data, whose simulation involved both finite element modeling and acoustic simulation. Both displacement tracking accuracy and computational efficiency were evaluated. Based on the data investigated, we found that, on the GPU platform, the implemented ST-NCC method did not further improve computational efficiency compared with the classic NCC method implemented on the same GPU platform. Both methods achieved comparable displacement tracking accuracy.
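
For readers unfamiliar with the sum-table idea, the following is a minimal serial sketch in plain Python, not the CUDA implementation evaluated in this study. It assumes 1-D signals, a single lag at a time, and the NCC form without mean subtraction; the function names (prefix_sums, ncc_direct, ncc_sum_table) are hypothetical and chosen only for illustration. It shows how running (prefix) sums let every windowed sum be obtained from two table lookups instead of a fresh summation.

```python
import math
from itertools import accumulate


def prefix_sums(values):
    """Return a table T with T[0] = 0 and T[k] = sum of the first k values."""
    return [0.0] + list(accumulate(values))


def ncc_direct(pre, post, start, width, lag):
    """Reference NCC: sum every term of one window explicitly."""
    a = pre[start:start + width]
    b = post[start + lag:start + lag + width]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def ncc_sum_table(pre, post, width, lag):
    """NCC for every valid window start at one lag, using three sum tables.

    Returns a dict mapping window start index (in `pre`) to its NCC value.
    """
    lo = max(0, -lag)                      # first i where post[i + lag] exists
    hi = min(len(pre), len(post) - lag)    # one past the last such i
    idx = range(lo, hi)
    cross = prefix_sums(pre[i] * post[i + lag] for i in idx)
    e_pre = prefix_sums(pre[i] ** 2 for i in idx)
    e_post = prefix_sums(post[i + lag] ** 2 for i in idx)

    out = {}
    for k in range(hi - lo - width + 1):
        # Each windowed sum is a difference of two table entries.
        num = cross[k + width] - cross[k]
        den = math.sqrt((e_pre[k + width] - e_pre[k]) *
                        (e_post[k + width] - e_post[k]))
        out[lo + k] = num / den if den > 0 else 0.0
    return out
```

In this serial form, the cost per window no longer grows with the window length once the tables are built. On a GPU, the tables themselves would have to be generated in parallel, which is where the parallel scan mentioned above comes in; the sketch makes no claim about how that step performs.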

Supported by the Scientific Innovation Program of Sichuan Province (Major Engineering Project: 2018RZ0093) and the Nanchong Scientific Council (Strategic Cooperation Program Between University and City: NC17SY4020)