Trinity

BulldozerさんとAPUと複数GPUが試せると聞いて↓
http://www.asus.co.jp/News/O4DgwAI4xKiWT9bE/
を買ってみた。


HDDがGPTになってて、10年振りくらいにLinuxのインストールで苦労している。
なので、以下Windows7 64の結果。


とりあえず サンプルに付いてる bufferBandWidth 試したところ、

  • -d 0 = Devastator : CPUに付いてるやつ
  • -d 1 = Caicos : 独立GPU
  • -if/-of 3 : CL_MEM_USE_HOST_PTR
  • -if/-of 5 : CL_MEM_ALLOC_HOST_PTR
  • -if/-of 6 : CL_MEM_USE_PERSISTENT_MEM_AMD
 $ ./BufferBandwidth.exe -d 0 
Device  0            Devastator
//.. 略
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

Host baseline (naive):

Timer resolution     543.49  ns
Page fault           4051.46 ns
Barrier speed        694.08  ns

CPU read             2.96 GB/s
memcpy()             2.26 GB/s
memset(,1,)          4.02 GB/s
memset(,0,)          4.06 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.008390 s [     4.00 GB/s ]
                       memset():  0.008150 s       4.12 GB/s
      clEnqueueUnmapMemObject():  0.006671 s [     5.03 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.028540 s      23.51 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.049561 s      13.54 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.006312 s [     5.32 GB/s ]
                       CPU read:  0.011460 s       2.93 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000046 s [   722.19 GB/s ]


$ ./BufferBandwidth.exe -d 1
Device  1            Caicos
// ...
inputBuffer:         CL_MEM_READ_ONLY 
outputBuffer:        CL_MEM_WRITE_ONLY 

// ...
1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.012111 s [     2.77 GB/s ]
                       memset():  0.008273 s       4.06 GB/s
      clEnqueueUnmapMemObject():  0.010193 s [     3.29 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.049454 s      13.57 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.054938 s      12.22 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.010129 s [     3.31 GB/s ]
                       CPU read:  0.010677 s       3.14 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000044 s [   759.46 GB/s ]


Passed!

$ ./BufferBandwidth.exe -d 0 -if 3 -of 3
// ...
inputBuffer:         CL_MEM_READ_ONLY CL_MEM_USE_HOST_PTR 
outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_USE_HOST_PTR 

// ...
AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.000027 s [  1226.62 GB/s ]
                       memset():  0.008330 s       4.03 GB/s
      clEnqueueUnmapMemObject():  0.000059 s [   565.56 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.139363 s       4.82 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.126503 s       5.30 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.000045 s [   748.78 GB/s ]
                       CPU read:  0.010935 s       3.07 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000055 s [   613.31 GB/s ]


Passed!

$ ./BufferBandwidth.exe -d 0 -if 5 -of 5
// ...
inputBuffer:         CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR 
outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR 

// ...

AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.000027 s [  1240.22 GB/s ]
                       memset():  0.004913 s       6.83 GB/s
      clEnqueueUnmapMemObject():  0.000235 s [   142.80 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.032603 s      20.58 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.127871 s       5.25 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.000044 s [   763.08 GB/s ]
                       CPU read:  0.011176 s       3.00 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000073 s [   460.17 GB/s ]


Passed!

$ ./BufferBandwidth.exe -d 0 -if 6 -of 6
// ...
inputBuffer:         CL_MEM_READ_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD 
outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD 

// ...
AVERAGES (over loops 2 - 19, use -l for complete log)
--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.010699 s [     3.14 GB/s ]
                       memset():  0.015516 s       2.16 GB/s
      clEnqueueUnmapMemObject():  0.000077 s [   437.15 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.040472 s      16.58 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.062263 s      10.78 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.004408 s [     7.61 GB/s ]
                       CPU read:  1.008308 s       0.03 GB/s
                 verification ok
      clEnqueueUnmapMemObject():  0.000096 s [   348.82 GB/s ]


Passed!

見るべき点は、

inputBuffer:         CL_MEM_READ_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD 
outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD 

で、どのメモリ使ってるか。

      clEnqueueMapBuffer(WRITE):  0.010699 s [     3.14 GB/s ]
                       memset():  0.015516 s       2.16 GB/s
      clEnqueueUnmapMemObject():  0.000077 s [   437.15 GB/s ]

で、

  • コピーしてるかどうか。コピーしてない場合は時間0に近くなるので、数百GB/sとかになる。
  • ホストからフルスピードアクセスできてるかどうか。
CPU read             2.96 GB/s
memcpy()             2.26 GB/s
memset(,1,)          4.02 GB/s
memset(,0,)          4.06 GB/s

よりも遅くなってる場合、キャッシュ不可領域に置かれてると思われる。

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.040472 s      16.58 GB/s
                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.062263 s      10.78 GB/s

でHost <-> GPUの転送速度。


まとめると、

  • CL_MEM_USE_HOST_PTR は、フルスピード出るわけではないみたい。(多分ページフォルト考慮しないといけないから)
  • CL_MEM_ALLOC_HOST_PTR は、ホストでの読み書き + GPUでの読みでほぼフル近く出る。GPUでの書き込みは5GB/sになってるがなんで?
  • Caicosは独立GPUなのにDevastatorより転送量少ないので存在理由謎い。
  • CPUのmemcpy遅すぎじゃ…

という感じか。Host -> GPUは、うまくプログラム書けばかなりオーバーヘッド減らせるところまできてるように見える。
GPU -> Hostは今のところどうしようも無い?


あと↓も測った。
http://d.hatena.ne.jp/w_o/20110227#1298798613

==== alloc host ptr ====
memcpy 1MB: 0.002052[sec] 0.475866[GB/s]
memcpy 1MB: 0.000259[sec] 3.776131[GB/s]
memcpy 1MB: 0.000290[sec] 3.362020[GB/s]
memcpy 1MB: 0.000265[sec] 3.691360[GB/s]
map-unmap 1MB: 0.002269[sec] 0.430454[GB/s]
map-unmap 1MB: 0.000047[sec] 20.790421[GB/s]
map-unmap 1MB: 0.000135[sec] 7.235066[GB/s]
map-unmap 1MB: 0.000050[sec] 19.449103[GB/s]
map-unmap + memcpy 1MB: 0.000428[sec] 2.283796[GB/s]
map-unmap + memcpy 1MB: 0.000382[sec] 2.554755[GB/s]
map-unmap + memcpy 1MB: 0.000420[sec] 2.327885[GB/s]
map-unmap + memcpy 1MB: 0.000386[sec] 2.529743[GB/s]
empty latency 0.004159[sec]
empty latency 0.001131[sec]
empty latency 0.000096[sec]
empty latency 0.000099[sec]
read latency(1byte): 0.000021[sec]
read latency(1byte): 0.000022[sec]
read latency(1byte): 0.000021[sec]
read latency(1byte): 0.000021[sec]
gpu memcpy 1MB: 0.004840[sec] 0.201759[GB/s]
gpu memcpy 1MB: 0.003432[sec] 0.284531[GB/s]
gpu memcpy 1MB: 0.003568[sec] 0.273682[GB/s]
gpu memcpy 1MB: 0.003335[sec] 0.292823[GB/s]
==== use host ptr ====
memcpy 1MB: 0.000436[sec] 2.238573[GB/s]
memcpy 1MB: 0.000420[sec] 2.327885[GB/s]
memcpy 1MB: 0.000372[sec] 2.625206[GB/s]
memcpy 1MB: 0.000409[sec] 2.389388[GB/s]
map-unmap 1MB: 0.005652[sec] 0.172790[GB/s]
map-unmap 1MB: 0.004737[sec] 0.206151[GB/s]
map-unmap 1MB: 0.004774[sec] 0.204542[GB/s]
map-unmap 1MB: 0.004576[sec] 0.213424[GB/s]
map-unmap + memcpy 1MB: 0.005063[sec] 0.192873[GB/s]
map-unmap + memcpy 1MB: 0.005313[sec] 0.183799[GB/s]
map-unmap + memcpy 1MB: 0.005061[sec] 0.192976[GB/s]
map-unmap + memcpy 1MB: 0.005352[sec] 0.182464[GB/s]
empty latency 0.000221[sec]
empty latency 0.000120[sec]
empty latency 0.000105[sec]
empty latency 0.000096[sec]
read latency(1byte): 0.000092[sec]
read latency(1byte): 0.000083[sec]
read latency(1byte): 0.000092[sec]
read latency(1byte): 0.000121[sec]
gpu memcpy 1MB: 0.005560[sec] 0.175643[GB/s]
gpu memcpy 1MB: 0.002094[sec] 0.466297[GB/s]
gpu memcpy 1MB: 0.001976[sec] 0.494199[GB/s]
gpu memcpy 1MB: 0.002175[sec] 0.448937[GB/s]


カーネル起動は良くて100usecぐらい?