Computer Engineering & Science, Vol. 41, No. 4, Apr. 2019    ISSN 1007-130X    CN 43-1258/TP
Article ID: 1007-130X(2019)04-0590-08
CLC number: O246    Document code: A
doi:10.3969/j.issn.1007-130X.2019.04.003

An SpMV partitioning and optimization algorithm on heterogeneous computing platforms

TAN Zhao-nian, JI Wei-xing, AKREM Benatia, GAO Jian-hua, LI An-min, WANG Yi-zhuo
(School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
Abstract: Sparse matrix-vector multiplication (SpMV) is widely used in solving scientific computing and engineering problems. The distribution of the non-zero elements of a sparse matrix greatly affects the computational efficiency of SpMV, and significant performance improvements can be achieved by using specific algorithms for different data distribution patterns. The CPU has strong controllability, which enables its application in general-purpose computing, while the GPU has a high degree of parallelism and many cores, making it suitable for data-intensive computing. Taking advantage of both can yield greater SpMV performance. We study task partitioning and optimization methods for SpMV on a CPU-GPU hybrid architecture and propose an SVR-based task rescheduling algorithm for the two main data distribution patterns, Quasi-diagonal and Tetris. These two representative sparse matrix patterns account for 66% of the data in practical scientific and engineering applications. Experimental results show that our approach achieves average speedups of 1.74x and 2.15x for Quasi-diagonal and Tetris matrices on the heterogeneous platform over GPU-only execution, respectively.
Key words: heterogeneous computing; matrix partitioning; collaborative optimization; SVR; SpMV
* Received: 2018-10-17; Revised: 2018-12-20
Address: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, P.R. China
1 Introduction

Sparse matrix-vector multiplication (SpMV) is one of the most widely used kernels in scientific and engineering computing [1]. Heterogeneous platforms such as CPU-FPGA and CPU-GPU systems have become common, and this work targets the CPU-GPU architecture [2]. The CPU offers strong controllability and suits general-purpose computation; the GPU provides a high degree of parallelism and many cores, making it well suited to data-intensive workloads. SpMV on GPUs has accordingly received extensive study [3].
Much prior work optimizes SpMV through storage formats. Maringanti et al. [4] accelerated the conjugate gradient method on CUDA using the CSR (Compressed Sparse Row) and ELL (ELLPACK-ITPACK) formats. Su et al. [5] proposed the Cocktail format, which mixes multiple formats within a single matrix. Monakov et al. [6] implemented blocked formats such as BCOO (Blocked COOrdinate) and BCSR (Blocked Compressed Sparse Row) on NVIDIA GPUs. Muralidharan et al. [7] studied architecture-adaptive code variant tuning for GPUs. Hybrid SpMV on CPU-GPU systems has also been explored [8,9], broadly along two lines: static partitioning of the matrix between the devices, and dynamic load balancing between the CPU and GPU at run time [10]. In this paper we combine pattern-specific partitioning with performance prediction: Support Vector Regression (SVR) [11] models estimate the SpMV execution time of each sub-matrix on its target device, and the partition is rescheduled accordingly.
2 Sparse matrix storage formats

Commonly used sparse storage formats include CSR, ELL, and DIA (DIAgonal); Figure 1 illustrates the three formats.
Figure 1 CSR, ELL and DIA formats
CSR represents a matrix with three arrays: row pointers, column indices, and non-zero values. ELL pads every row to the length of the longest row and stores column indices and values as dense arrays, which suits matrices whose rows have similar lengths. DIA stores the matrix by its non-zero diagonals and is efficient only when the non-zeros concentrate on a few diagonals. Together with COO (COOrdinate), CSR, ELL, and DIA form the basic formats, from which many variants have been derived, such as CSR-based improvements [12], merge-based CSR [13], Jagged-DIA [14], and BSR (Blocked Sparse Row) [15].
SpMV computes y = Ax, where A is an M x N sparse matrix, x is a dense input vector of length N, and y is a dense output vector of length M.
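To make the kernel concrete, the following minimal C sketch performs SpMV over a CSR matrix; the struct and array names are illustrative and are not taken from the paper.

#include <stddef.h>

/* CSR storage: row_ptr has m+1 entries; col_idx and val hold the
 * non-zeros row by row. Names are illustrative. */
typedef struct {
    size_t m;                 /* number of rows                 */
    const size_t *row_ptr;    /* row start offsets, length m+1  */
    const size_t *col_idx;    /* column index of each non-zero  */
    const double *val;        /* value of each non-zero         */
} csr_t;

/* Computes y = A*x for a CSR matrix A. */
static void spmv_csr(const csr_t *a, const double *x, double *y)
{
    for (size_t i = 0; i < a->m; ++i) {
        double sum = 0.0;
        for (size_t k = a->row_ptr[i]; k < a->row_ptr[i + 1]; ++k)
            sum += a->val[k] * x[a->col_idx[k]];
        y[i] = sum;
    }
}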
3 SpMV partitioning and optimization on the CPU-GPU architecture
3.1 Sparse matrix distribution patterns

We analyzed the sparse matrices in the SuiteSparse collection [16] and identified two dominant non-zero distribution patterns, Quasi-diagonal and Tetris, which account for 23.0% and 43.1% of the collection respectively, about 66% in total. The regular portion of such a matrix suits the GPU's massively parallel execution, while the irregular portion benefits more from the CPU's cache hierarchy. For each of the two patterns we therefore split the matrix into two sub-matrices, one per device, and use Support Vector Regression (SVR) models to predict the SpMV execution time of each sub-matrix on its target device so that the partition between the CPU and GPU can be rebalanced.
3.2 SpMV partitioning

(1) Quasi-diagonal pattern
Figure 2 shows two Quasi-diagonal matrices, GT01R and lock1074: most non-zeros lie in a band around the main diagonal, and the remainder are scattered off the band.

Figure 2 Examples of the Quasi-diagonal pattern
As shown in Figure 4, a Quasi-diagonal matrix is split by a diagonal band of width WIDTH into two sub-matrices, Q1 and Q2. Q1 holds the band and is stored in DIA format; Q2 holds the remaining non-zeros and is stored in CSR format. During SpMV, the regular Q1 is assigned to the GPU and the irregular Q2 to the CPU [8].
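The band split can be sketched in C as follows; the COO triplet representation and the |i - j| <= WIDTH/2 band test are our assumptions, since the recovered text only names the parameter WIDTH.

#include <stdlib.h>

typedef struct { long i, j; double v; } triplet_t;

/* Splits COO triplets into Q1 (diagonal band, destined for DIA on
 * the GPU) and Q2 (remainder, kept in CSR for the CPU). Returns
 * the number of non-zeros placed in Q1. */
size_t split_quasi_diagonal(const triplet_t *nz, size_t nnz, long width,
                            triplet_t *q1, triplet_t *q2)
{
    size_t n1 = 0, n2 = 0;
    for (size_t k = 0; k < nnz; ++k) {
        if (labs(nz[k].i - nz[k].j) <= width / 2)
            q1[n1++] = nz[k];   /* inside the band            */
        else
            q2[n2++] = nz[k];   /* scattered remainder        */
    }
    (void)n2;                   /* caller derives n2 = nnz - n1 */
    return n1;
}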
Figure 3 An example of a Tetris matrix

(2) Tetris pattern

Figure 3 shows examples of the Tetris pattern, such as the matrices bratu3d and geom.
Figure 4 Schematic diagram of Quasi-diagonal partitioning
As shown in Figure 5, a Tetris matrix is split at a boundary K into two sub-matrices, T1 and T2, assigned to the GPU and CPU respectively. K is chosen so that initially about 80% of the non-zeros fall into T1 on the GPU and 20% into T2 on the CPU. T1 is stored in ELL format and T2 in CSR format; during SpMV, T1 is computed on the GPU and T2 on the CPU.
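Choosing K amounts to a prefix-sum scan over the per-row non-zero counts. The sketch below assumes K indexes rows; the recovered text fixes only the parameter K and the 80%/20% target.

#include <stddef.h>

/* Returns the smallest K such that rows [0, K) hold at least
 * frac (e.g. 0.80) of the total non-zeros; rows [0, K) form T1
 * (ELL, GPU) and rows [K, m) form T2 (CSR, CPU). */
size_t choose_split_k(const size_t *row_nnz, size_t m,
                      size_t total_nnz, double frac)
{
    size_t acc = 0;
    for (size_t i = 0; i < m; ++i) {
        acc += row_nnz[i];
        if ((double)acc >= frac * (double)total_nnz)
            return i + 1;
    }
    return m;   /* all rows go to T1 if frac is never reached */
}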
3.3 Cooperative optimization of SpMV

(1) Task rescheduling

If the matrix is partitioned without regard to the relative speed of the two devices, the CPU and GPU workloads are unbalanced and one device idles while the other is still computing, so partitioning alone does not fully exploit the platform [17].
Figure 5 Schematic diagram of Tetris partitioning
Figure 6 shows SpMV task execution on the GPU and CPU without cooperative optimization. Starting at time t0, the GPU is assigned five C1 tasks and the CPU three C2 tasks. The GPU finishes at t2 and then idles while the CPU is still processing C2 tasks, so the overall SpMV time is t3 - t0, determined by the slower device.
Figure 6 Task execution without cooperative optimization

Figure 7 Task execution after cooperative optimization
The rescheduling procedure for SpMV is given in Algorithm 1 and proceeds in three steps:
(1) partition the matrix into Q1 and Q2 (or T1 and T2) as described in Section 3.2;
(2) predict the execution times of the two parts with the SVR models MGPU and MCPU;
(3) migrate rows between Q1 and Q2 (or T1 and T2) until the predicted times of the two devices are balanced.
Algorithm 1 SpMV task rescheduling
Input: sub-matrices Q1 and Q2; performance models MGPU and MCPU
Output: rebalanced sub-matrices Q1 and Q2
01: TGPU = 0; TCPU = 0;
02: do
03:     TGPU = MGPU(Q1);
04:     TCPU = MCPU(Q2);
05:     if |TGPU - TCPU| < epsilon then
06:         break;
07:     else if TGPU > TCPU then
08:         R <- rows selected from Q1 in proportion to nnz(Q1);
09:         Q2 <- Q2 + R;
10:         Q1 <- Q1 - R;
11:     else
12:         R <- rows selected from Q2 in proportion to nnz(Q2);
13:         Q1 <- Q1 + R;
14:         Q2 <- Q2 - R;
15:     end if
16: while(1)
17: return Q1 and Q2;
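The loop of Algorithm 1 (lines 02-16) can be sketched in C as below. The SVR models are replaced by toy linear cost models t = a*nnz + b so the sketch is self-contained and runnable; the coefficients and the amount of work moved per step are our stand-ins, not the paper's values.

#include <math.h>
#include <stdio.h>

static const double A_GPU = 0.8e-9, B_GPU = 1e-4;   /* toy M_GPU */
static const double A_CPU = 4.0e-9, B_CPU = 1e-5;   /* toy M_CPU */
static double m_gpu(double nnz) { return A_GPU * nnz + B_GPU; }
static double m_cpu(double nnz) { return A_CPU * nnz + B_CPU; }

/* Shifts non-zeros between Q1 (GPU) and Q2 (CPU) until the
 * predicted times agree within eps, mirroring lines 02-16. */
static void reschedule(double *q1_nnz, double *q2_nnz, double eps)
{
    for (;;) {
        double gap = m_gpu(*q1_nnz) - m_cpu(*q2_nnz);
        if (fabs(gap) < eps)
            break;                              /* balanced        */
        double r = fabs(gap) / (A_GPU + A_CPU); /* work to migrate */
        if (gap > 0) { *q1_nnz -= r; *q2_nnz += r; }  /* GPU slower */
        else         { *q1_nnz += r; *q2_nnz -= r; }  /* CPU slower */
    }
}

int main(void)
{
    double q1 = 8e5, q2 = 2e5;      /* initial 80%/20% split */
    reschedule(&q1, &q2, 1e-6);
    printf("Q1 = %.0f nnz, Q2 = %.0f nnz\n", q1, q2);
    return 0;
}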
Figure 7 shows the task execution after cooperative optimization. Part of the C2 workload is migrated from the CPU to the GPU, so the GPU processes its C1 tasks and then the migrated C2 tasks while the CPU processes the rest. Both devices now finish at about t2, reducing the overall SpMV time from t3 - t0 to t2 - t0 and eliminating the idle interval t3 - t2. Balancing the predicted GPU and CPU times in this way improves overall SpMV performance.
In Algorithm 1, nnz(Q1) denotes the number of non-zero elements of Q1.
(2) SVR performance model. Following [18], we use SVR to predict the SpMV execution time of a sub-matrix on the GPU and the CPU. The models are built as follows: first, collect Quasi-diagonal (Q) and Tetris (T) matrices; second, partition each matrix as described in Section 3.2 with the initial 80%/20% split; third, measure the SpMV execution times of the sub-matrices to form the training set; fourth, train the SVR models on 4 features extracted from each sub-matrix.
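As an illustration, a feature vector for one sub-matrix could be computed as follows. The recovered text does not say which four features are used, so the four below (row count, nnz, mean and maximum row length) are our assumption, in the spirit of [18].

#include <stddef.h>

typedef struct { double rows, nnz, mean_rl, max_rl; } features_t;

/* Extracts 4 candidate SVR features from a CSR row-pointer array
 * (assumed feature set; the paper's exact features were lost). */
features_t extract_features(const size_t *row_ptr, size_t m)
{
    features_t f = { (double)m, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < m; ++i) {
        double rl = (double)(row_ptr[i + 1] - row_ptr[i]);
        f.nnz += rl;
        if (rl > f.max_rl) f.max_rl = rl;
    }
    f.mean_rl = (m > 0) ? f.nnz / (double)m : 0.0;
    return f;
}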
The CPU-side SpMV used in these measurements is implemented with the Intel MKL (Math Kernel Library) [20]. Table 1 summarizes the resulting data sets.

Table 1 Data set information
Pattern          Set  Matrices  Sub-matrix sets
Quasi-diagonal   Q    629       QDIA, QCSR
Tetris           T    1 165     TELL, TCSR
4 Experimental evaluation

4.1 Experimental setup

The experiments run on a CPU-GPU heterogeneous platform whose hardware is listed in Table 2, under 64-bit Windows 7. The GPU SpMV kernels are implemented with the CUSP library [19], which is built on Thrust, and the CPU kernels with Intel MKL.

Table 2 Experimental equipment
              GPU                  CPU
Model         NVIDIA GTX 1080 Ti   Intel Xeon E5606
Memory/GB     11                   12
Frequency     1 582 MHz            2.13 GHz

For the Quasi-diagonal test set, let Q = {A1, A2, ..., AN} denote the N matrices; each Ai is split into a DIA part and a CSR part as described in Section 3.2, giving

QDIA = {ADIA1, ADIA2, ..., ADIAN}    (1)
QCSR = {ACSR1, ACSR2, ..., ACSRN}

with Ai = ADIAi + ACSRi for i in [1, N]. The Tetris matrices are split analogously into the TELL and TCSR sets. Tables 3 and 4 list the two groups of test matrices.

4.2 Evaluation method

Performance is reported in GFLOPS; each SpMV kernel is run 10 times and the average is taken.

Table 3 Information of Quasi-diagonal matrices
No.  Matrix           Size           NNZ
1    bbmat            38744 x 38744  1 771 722
2    conf5_4-8x8-05   49152 x 49152  1 916 928
3    epb2             25228 x 25228  175 227
4    ex40             7740 x 7740    456 188
5    lock1074         1074 x 1074    51 588
6    nasa4704         4704 x 4704    104 756
7    rg_n_2_15_s0     32768 x 32768  320 480
8    rim              22560 x 22560  1 014 951
9    ted_B            10605 x 10605  144 579
10   xenon1           48600 x 48600  1 181 120
Table 4 Information of Tetris matrices
No.  Matrix                         Size             NNZ
11   apache2                        715176 x 715176  4 817 870
12   ar2010                         186211 x 186211  904 310
13   ASIC_680ks                     682712 x 682712  1 693 767
14   mn2010                         259777 x 259777  1 227 102
15   msc10848                       10848 x 10848    1 229 776
16   m_t1                           97578 x 97578    9 753 570
17   ne2010                         193352 x 193352  913 854
18   sc2010                         181908 x 181908  893 160
19   t3dh                           79171 x 79171    4 352 105
20   vsp_finan512_scagr7-2c_rlfddd  139752 x 139752  1 104 040
Table 3 lists the ten Quasi-diagonal matrices and Table 4 the ten Tetris matrices used in the evaluation.

4.3 Results and analysis

Figures 8 and 9 compare the proposed heterogeneous scheme with GPU-only execution for the Quasi-diagonal and Tetris patterns. The baseline is the CSR-based SpMV on the GPU [21]. Measured in GFLOPS, the partitioned CPU-GPU execution outperforms the baseline by 1.43x on average for Quasi-diagonal matrices and by 1.82x for Tetris matrices; with the SVR-based rescheduling, the average speedups reach 1.74x and 2.15x respectively. For Tetris matrices, the ELL+CSR combination also outperforms a pure CSR implementation [17] on the GPU.
Figure 8 Experimental results of the Quasi-diagonal pattern
Figure 9 Experimental results of the Tetris pattern

Figure 10 Speedup of DIA (or ELL) relative to CSR on the GPU platform

Figure 10 reports the speedup of the DIA (or ELL) sub-matrix kernel relative to CSR on the GPU. For bbmat, rg_n_2_15_s0, and xenon1 in Figure 8, the DIA part runs slower than CSR on the GPU, which explains why these matrices gain less from the heterogeneous scheme.
5 Conclusion

We analyzed the non-zero distributions of sparse matrices in real applications and identified two dominant patterns, Quasi-diagonal and Tetris. For each pattern we proposed a partitioning method that maps the regular sub-matrix to the GPU and the irregular sub-matrix to the CPU, together with an SVR-based task rescheduling algorithm that balances the predicted execution times of the two devices. Experiments show average speedups of 1.74x for Quasi-diagonal and 2.15x for Tetris matrices over GPU-only execution.

References:
[1] Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems[J]. SIAM Journal on Imaging Sciences, 2009, 2(1): 183-202.
[2] Shen J, Varbanescu A L, Lu Y, et al. Workload partitioning for accelerating applications on heterogeneous platforms[J]. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(9): 2766-2780.
[3] Filippone S, Cardellini V, Barbieri D, et al. Sparse matrix-vector multiplication on GPGPUs[J]. ACM Transactions on Mathematical Software, 2017, 43(4): 1-49.
[4] Maringanti A, Athavale V, Patkar S B, et al. Acceleration of conjugate gradient method for circuit simulation using CUDA[C]//Proc of IEEE International Conference on High Performance Computing, Data, and Analytics, 2009: 438-444.
[5] Su B, Keutzer K. clSpMV: A cross-platform OpenCL SpMV framework on GPUs[C]//Proc of International Conference on Supercomputing, 2012: 353-364.
[6] Monakov A, Avetisyan A. Implementing blocked sparse matrix-vector multiplication on NVIDIA GPUs[C]//Proc of International Workshop on Embedded Computer Systems, 2009: 289-297.
[7] Muralidharan S, Roy A, Hall M W, et al. Architecture-adaptive code variant tuning[J]. Architectural Support for Programming Languages and Operating Systems, 2016, 51(4): 325-338.
[8] Yang W, Li K, Mo Z, et al. Performance optimization using partitioned SpMV on GPUs and multicore CPUs[J]. IEEE Transactions on Computers, 2015, 64(9): 2623-2636.
[9] Yang W D, Li K L, Li K Q, et al. A hybrid computing method of SpMV on CPU-GPU heterogeneous computing systems[J]. Journal of Parallel and Distributed Computing, 2017, 104: 49-60.
[10] Chen L, Villa O, Krishnamoorthy S, et al. Dynamic load balancing on single- and multi-GPU systems[C]//Proc of International Parallel and Distributed Processing Symposium, 2010: 1-12.
[11] Drucker H, Burges C J C, Kaufman L, et al. Support vector regression machines[C]//Proc of Neural Information Processing Systems Conference, 1997: 155-161.
[12] Saad Y. SPARSKIT: A basic tool kit for sparse matrix computations[R]. Illinois: University of Illinois, 1994.
[13] Merrill D, Garland M. Merge-based parallel sparse matrix-vector multiplication[C]//Proc of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016: Article No. 58.
[14] D'Azevedo E F, Fahey M R, Mills R T, et al. Vectorized sparse matrix multiply for compressed row storage format[C]//Proc of International Conference on Computational Science, 2005: 99-106.
[15] Remington K, Pozo R. NIST sparse BLAS: User's guide: Technical Report NISTIR 6744[R]. Gaithersburg: National Institute of Standards and Technology, 1996.
[16] Davis T A, Hu Y. The University of Florida sparse matrix collection[J]. ACM Transactions on Mathematical Software, 2011, 38(1): 1-25.
[17] Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors[C]//Proc of IEEE International Conference on High Performance Computing, Data and Analytics, 2009: 1-11.
[18] Benatia A, Ji W, Wang Y, et al. Machine learning approach for the predicting performance of SpMV on GPU[C]//Proc of International Conference on Parallel and Distributed Systems, 2016: 894-901.
[19] CUSP: The NVIDIA library of generic parallel algorithms for sparse linear algebra and graph computations on CUDA architecture GPUs[EB/OL]. [2018-05-15]. https://developer.nvidia.com/cusp.
[20] Intel. Intel Math Kernel Library[EB/OL]. [2012-12-10]. http://software.intel.com/en-us/intel-mkl.
[21] Bell N, Garland M. Efficient sparse matrix-vector multiplication on CUDA: NVIDIA Technical Report NVR-2008-004[R]. Santa Clara: NVIDIA Corporation, 2008.
Biographies:
TAN Zhao-nian, born in 1994, MS candidate, CCF member (80204G). His research interests include high performance computing and machine learning. E-mail: [email protected]
JI Wei-xing, born in 1980, PhD, associate professor, CCF member (16954M). His research interests include parallel computing and code analysis & optimization. E-mail: [email protected]
AKREM Benatia, born in 1990, PhD candidate. His research interests include high performance computing and computer architecture. E-mail: [email protected]
GAO Jian-hua, born in 1995, MS candidate, CCF member (79759G). Her research interest is high performance computing. E-mail: [email protected]
LI An-min, born in 1994, MS candidate, CCF member (80264G). His research interests include computer architecture and high performance computing. E-mail: [email protected]
WANG Yi-zhuo, born in 1979, PhD candidate, lecturer, CCF member (E200022292M). His research interests include computer architecture and parallel programming. E-mail: [email protected]