Trinity
BulldozerさんとAPUと複数GPUが試せると聞いて↓
http://www.asus.co.jp/News/O4DgwAI4xKiWT9bE/
を買ってみた。
HDDがGPTになってて、10年振りくらいにLinuxのインストールで苦労している。
なので、以下Windows7 64の結果。
とりあえず サンプルに付いてる bufferBandWidth 試したところ、
- -d 0 = Devastator : CPUに付いてるやつ
- -d 1 = Caicos : 独立GPU
- -if/-of 3 : CL_MEM_USE_HOST_PTR
- -if/-of 5 : CL_MEM_ALLOC_HOST_PTR
- -if/-of 6 : CL_MEM_USE_PERSISTENT_MEM_AMD
$ ./BufferBandwidth.exe -d 0 Device 0 Devastator //.. 略 inputBuffer: CL_MEM_READ_ONLY outputBuffer: CL_MEM_WRITE_ONLY Host baseline (naive): Timer resolution 543.49 ns Page fault 4051.46 ns Barrier speed 694.08 ns CPU read 2.96 GB/s memcpy() 2.26 GB/s memset(,1,) 4.02 GB/s memset(,0,) 4.06 GB/s AVERAGES (over loops 2 - 19, use -l for complete log) -------- 1. Host mapped write to inputBuffer clEnqueueMapBuffer(WRITE): 0.008390 s [ 4.00 GB/s ] memset(): 0.008150 s 4.12 GB/s clEnqueueUnmapMemObject(): 0.006671 s [ 5.03 GB/s ] 2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.028540 s 23.51 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.049561 s 13.54 GB/s 4. Host mapped read of outputBuffer clEnqueueMapBuffer(READ): 0.006312 s [ 5.32 GB/s ] CPU read: 0.011460 s 2.93 GB/s verification ok clEnqueueUnmapMemObject(): 0.000046 s [ 722.19 GB/s ] $ ./BufferBandwidth.exe -d 1 Device 1 Caicos // ... inputBuffer: CL_MEM_READ_ONLY outputBuffer: CL_MEM_WRITE_ONLY // ... 1. Host mapped write to inputBuffer clEnqueueMapBuffer(WRITE): 0.012111 s [ 2.77 GB/s ] memset(): 0.008273 s 4.06 GB/s clEnqueueUnmapMemObject(): 0.010193 s [ 3.29 GB/s ] 2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.049454 s 13.57 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.054938 s 12.22 GB/s 4. Host mapped read of outputBuffer clEnqueueMapBuffer(READ): 0.010129 s [ 3.31 GB/s ] CPU read: 0.010677 s 3.14 GB/s verification ok clEnqueueUnmapMemObject(): 0.000044 s [ 759.46 GB/s ] Passed! $ ./BufferBandwidth.exe -d 0 -if 3 -of 3 // ... inputBuffer: CL_MEM_READ_ONLY CL_MEM_USE_HOST_PTR outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_USE_HOST_PTR // ... AVERAGES (over loops 2 - 19, use -l for complete log) -------- 1. Host mapped write to inputBuffer clEnqueueMapBuffer(WRITE): 0.000027 s [ 1226.62 GB/s ] memset(): 0.008330 s 4.03 GB/s clEnqueueUnmapMemObject(): 0.000059 s [ 565.56 GB/s ] 2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.139363 s 4.82 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.126503 s 5.30 GB/s 4. Host mapped read of outputBuffer clEnqueueMapBuffer(READ): 0.000045 s [ 748.78 GB/s ] CPU read: 0.010935 s 3.07 GB/s verification ok clEnqueueUnmapMemObject(): 0.000055 s [ 613.31 GB/s ] Passed! $ ./BufferBandwidth.exe -d 0 -if 5 -of 5 // ... inputBuffer: CL_MEM_READ_ONLY CL_MEM_ALLOC_HOST_PTR outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_ALLOC_HOST_PTR // ... AVERAGES (over loops 2 - 19, use -l for complete log) -------- 1. Host mapped write to inputBuffer clEnqueueMapBuffer(WRITE): 0.000027 s [ 1240.22 GB/s ] memset(): 0.004913 s 6.83 GB/s clEnqueueUnmapMemObject(): 0.000235 s [ 142.80 GB/s ] 2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.032603 s 20.58 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.127871 s 5.25 GB/s 4. Host mapped read of outputBuffer clEnqueueMapBuffer(READ): 0.000044 s [ 763.08 GB/s ] CPU read: 0.011176 s 3.00 GB/s verification ok clEnqueueUnmapMemObject(): 0.000073 s [ 460.17 GB/s ] Passed! $ ./BufferBandwidth.exe -d 0 -if 6 -of 6 // ... inputBuffer: CL_MEM_READ_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD // ... AVERAGES (over loops 2 - 19, use -l for complete log) -------- 1. Host mapped write to inputBuffer clEnqueueMapBuffer(WRITE): 0.010699 s [ 3.14 GB/s ] memset(): 0.015516 s 2.16 GB/s clEnqueueUnmapMemObject(): 0.000077 s [ 437.15 GB/s ] 2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.040472 s 16.58 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.062263 s 10.78 GB/s 4. Host mapped read of outputBuffer clEnqueueMapBuffer(READ): 0.004408 s [ 7.61 GB/s ] CPU read: 1.008308 s 0.03 GB/s verification ok clEnqueueUnmapMemObject(): 0.000096 s [ 348.82 GB/s ] Passed!
見るべき点は、
inputBuffer: CL_MEM_READ_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD outputBuffer: CL_MEM_WRITE_ONLY CL_MEM_USE_PERSISTENT_MEM_AMD
で、どのメモリ使ってるか。
clEnqueueMapBuffer(WRITE): 0.010699 s [ 3.14 GB/s ] memset(): 0.015516 s 2.16 GB/s clEnqueueUnmapMemObject(): 0.000077 s [ 437.15 GB/s ]
で、
- コピーしてるかどうか。コピーしてない場合は時間0に近くなるので、数百GB/sとかになる。
- ホストからフルスピードアクセスできてるかどうか。
CPU read 2.96 GB/s memcpy() 2.26 GB/s memset(,1,) 4.02 GB/s memset(,0,) 4.06 GB/s
よりも遅くなってる場合、キャッシュ不可領域に置かれてると思われる。
2. GPU kernel read of inputBuffer clEnqueueNDRangeKernel(): 0.040472 s 16.58 GB/s verification ok 3. GPU kernel write to outputBuffer clEnqueueNDRangeKernel(): 0.062263 s 10.78 GB/s
でHost <-> GPUの転送速度。
まとめると、
- CL_MEM_USE_HOST_PTR は、フルスピード出るわけではないみたい。(多分ページフォルト考慮しないといけないから)
- CL_MEM_ALLOC_HOST_PTR は、ホストでの読み書き + GPUでの読みでほぼフル近く出る。GPUでの書き込みは5GB/sになってるがなんで?
- Caicosは独立GPUなのにDevastatorより転送量少ないので存在理由謎い。
- CPUのmemcpy遅すぎじゃ…
という感じか。Host -> GPUは、うまくプログラム書けばかなりオーバーヘッド減らせるところまできてるように見える。
GPU -> Hostは今のところどうしようも無い?
あと↓も測った。
http://d.hatena.ne.jp/w_o/20110227#1298798613
==== alloc host ptr ==== memcpy 1MB: 0.002052[sec] 0.475866[GB/s] memcpy 1MB: 0.000259[sec] 3.776131[GB/s] memcpy 1MB: 0.000290[sec] 3.362020[GB/s] memcpy 1MB: 0.000265[sec] 3.691360[GB/s] map-unmap 1MB: 0.002269[sec] 0.430454[GB/s] map-unmap 1MB: 0.000047[sec] 20.790421[GB/s] map-unmap 1MB: 0.000135[sec] 7.235066[GB/s] map-unmap 1MB: 0.000050[sec] 19.449103[GB/s] map-unmap + memcpy 1MB: 0.000428[sec] 2.283796[GB/s] map-unmap + memcpy 1MB: 0.000382[sec] 2.554755[GB/s] map-unmap + memcpy 1MB: 0.000420[sec] 2.327885[GB/s] map-unmap + memcpy 1MB: 0.000386[sec] 2.529743[GB/s] empty latency 0.004159[sec] empty latency 0.001131[sec] empty latency 0.000096[sec] empty latency 0.000099[sec] read latency(1byte): 0.000021[sec] read latency(1byte): 0.000022[sec] read latency(1byte): 0.000021[sec] read latency(1byte): 0.000021[sec] gpu memcpy 1MB: 0.004840[sec] 0.201759[GB/s] gpu memcpy 1MB: 0.003432[sec] 0.284531[GB/s] gpu memcpy 1MB: 0.003568[sec] 0.273682[GB/s] gpu memcpy 1MB: 0.003335[sec] 0.292823[GB/s] ==== use host ptr ==== memcpy 1MB: 0.000436[sec] 2.238573[GB/s] memcpy 1MB: 0.000420[sec] 2.327885[GB/s] memcpy 1MB: 0.000372[sec] 2.625206[GB/s] memcpy 1MB: 0.000409[sec] 2.389388[GB/s] map-unmap 1MB: 0.005652[sec] 0.172790[GB/s] map-unmap 1MB: 0.004737[sec] 0.206151[GB/s] map-unmap 1MB: 0.004774[sec] 0.204542[GB/s] map-unmap 1MB: 0.004576[sec] 0.213424[GB/s] map-unmap + memcpy 1MB: 0.005063[sec] 0.192873[GB/s] map-unmap + memcpy 1MB: 0.005313[sec] 0.183799[GB/s] map-unmap + memcpy 1MB: 0.005061[sec] 0.192976[GB/s] map-unmap + memcpy 1MB: 0.005352[sec] 0.182464[GB/s] empty latency 0.000221[sec] empty latency 0.000120[sec] empty latency 0.000105[sec] empty latency 0.000096[sec] read latency(1byte): 0.000092[sec] read latency(1byte): 0.000083[sec] read latency(1byte): 0.000092[sec] read latency(1byte): 0.000121[sec] gpu memcpy 1MB: 0.005560[sec] 0.175643[GB/s] gpu memcpy 1MB: 0.002094[sec] 0.466297[GB/s] gpu memcpy 1MB: 0.001976[sec] 0.494199[GB/s] gpu memcpy 1MB: 0.002175[sec] 0.448937[GB/s]
カーネル起動は良くて100usecぐらい?