CUDA プロファイラを使う

この記事はGPGPU Advent Calendar11日目の記事です。

CUDAにはNVIDIA Visual Profiler(以下nvvp)というプロファイラが付いていて、時間はかったり、プロファイルカウンタが読めたりします。
あと、ヒントも出してくれます、が、まあ、大体よくわからんな、みたいな感じで終わると思います。

まあでも、これ以外に計測する方法無いので、もうちょっと知っておくのもよいと思います。というわけで知っておきましょう。

コマンドラインから使う

(書いてる途中で気付いたのですが、日本語でも解説あるのでそっち見たほうがいいです : http://topsecret.hpc.co.jp/wiki/index.php/CUDA_5%E3%81%AE%E6%96%B0%E6%A9%9F%E8%83%BD%284%29:_nvprof%E3%83%97%E3%83%AD%E3%83%95%E3%82%A1%E3%82%A4%E3%83%A9 )

nvvp使ってると、ログが表になっててあとでスクリプトでグラフにプロットしたりとかしたくなった時に面倒です。
あとコマンドラインから実行してると、設定ダイアログ開いて引数変えるとか、結構面倒なこと多いです。

そういう時はコマンドラインプロファイラを使いましょう。

 $ /usr/local/cuda/bin/nvprof --csv ./a.out

こんな感じで実行すると、

======== NVPROF is profiling a.out...
======== Command: a.out
hello
======== Profiling result:
Time(%),Time,Calls,Avg,Min,Max,Name
,us,,us,us,us,
52.09,1.70500,1,1.70500,1.70500,1.70500,"hello(char*)"
47.91,1.56800,1,1.56800,1.56800,1.56800,"[CUDA memcpy DtoH]"

こういうのが見れます。

ハードウェアイベントもとれて、 --query-events で見られる値を --events に指定することで、見ることができます。

 $ /usr/local/cuda/bin/nvprof --query-events
 /usr/local/cuda/bin/nvprof --query-events
======== Available Events:
                            Name   Description
Device 0:
        Domain domain_a:
                 sm_cta_launched:  Number of thread blocks launched on a multiprocessor.

               l1_local_load_hit:  Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_load_miss:   Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_store_hit:  Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_local_store_miss:  Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

(... snip ...)

                  inst_issued2_1:  Number of dual instructions issued per cycle in pipeline 1.

                      atom_count:  Number of warps executing atomic reduction operations for thread-to-thread communication. Increments by one if at least one thread in a warp executes the instruction

                      gred_count:  Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction

 $ /usr/local/cuda/bin/nvprof --events inst_executed ./a.out
======== NVPROF is profiling a.out...
======== Command: a.out
hello
======== Profiling result:
                  Invocations       Avg       Min       Max  Event Name
Device 0
        Kernel: hello(char*)
                            1        15        15        15  inst_executed

カウンタ

いくつかぱっと見よくわからんけど重要なカウンタについて説明しておきます。(GT520=capability 2.1 の値です。2.0だとthread_inst_executedの名前がちょっと違ったような。3.xは真面目に調べたことない)

実行した命令数

inst_executed: ワープ単位で実行した命令数
thread_inst_executed_[0..3]: スレッド単位で実行した命令数

inst_executed は、

(出力は手で編集してます)

$ cat hello.cu
#include <stdio.h>

__global__ void
hello(char *p)
{
	p[0] = 'h';
	p[1] = 'e';
	p[2] = 'l';
	p[3] = 'l';
	p[4] = 'o';
	p[5] = '\0';

	if (threadIdx.x < 1) {
		for (int i=0; i<100; i++) {
			p[i+6] = 0;
		}
	}
}

int main(int argc, char **argv)
{
	char *buffer;
	char h_buffer[1024];
	int nth = atoi(argv[1]);

	cudaMalloc((void**)&buffer, 1024);

	hello<<<1,nth>>>(buffer);

	cudaMemcpy(h_buffer, buffer, 1024, cudaMemcpyDeviceToHost);

	puts(h_buffer);
}


$ /usr/local/cuda/bin/nvprof --events inst_executed ./a.out 1
                  Invocations       Avg       Min       Max  Event Name
======== Command: a.out 1
		            1       151       151       151  inst_executed
======== Command: a.out 2
		            1       151       151       151  inst_executed
======== Command: a.out 16
		            1       151       151       151  inst_executed
======== Command: a.out 32
		            1       151       151       151  inst_executed
======== Command: a.out 64
		            1       168       168       168  inst_executed

こんな感じの挙動をします。(divergentの数に関係無い)

thread_inst_executed_[0..3] は

(出力は手で編集してます)
======== NVPROF is profiling a.out...
======== Command: a.out 1
		            1       123       123       123  thread_inst_executed_0  # 123 + 28 = 151 命令実行してる
		            1        28        28        28  thread_inst_executed_2
		            1         0         0         0  thread_inst_executed_1
		            1         0         0         0  thread_inst_executed_3
======== Command: a.out 3
		            1       145       145       145  thread_inst_executed_0  # 145 + 40 = 185 命令/ thread 0 以外は 17命令実行してると思われる
		            1        40        40        40  thread_inst_executed_2
		            1         0         0         0  thread_inst_executed_1
		            1         0         0         0  thread_inst_executed_3
======== Command: a.out 33
		            1       464       464       464  thread_inst_executed_0  # 151 + 17*31 = 464 + 214 = 678
		            1       214       214       214  thread_inst_executed_2
		            1        11        11        11  thread_inst_executed_1  # 11 + 6 = 17
		            1         6         6         6  thread_inst_executed_3

こんな感じの挙動をします。こっちは、divergent して実行しなかったものはカウントされません。

このthread_inst_executed_[0..3]を足した値が実際に実行した命令数になります。
これをactive_cyclesで割ってやれば、クロックあたりの命令実行数が出せますし、thread_inst_executed_[0..3] を足した値 / inst_executed * 32 を計算すれば、divergentして無駄になった命令の割合がわかります。

0,2,1,3 は、あんまりまともなドキュメントが無いので、かなり推測なのですが、昔色々計測した結果によると、Compute Capability 2.1 では、issue 一回あたり 2 warp から、それぞれ2命令をissueできるようになっているみたいで、warp0 = 0,2, warp1=1,3 で、合計issueあたり4命令を実行できるというようになってるみたいでした。おぼえておきましょう。

prof_trigger_XX: デバイスコード中で、prof_tirggerという関数を呼ぶと、その回数を数えてプロファイラで見ることができます。なんかに使えるかもしれません。

CUPTI

これらのプロファイラは、CUPTIと呼ばれるAPIが用意されていて、これを使うことで、カウンタを読むことができます。
driver APIが必要なので、カーネルの起動とかが若干面倒ですが、我慢しましょう。

その他色々面倒なので、詳細は、 http://int.main.jp/files/cupti.tar.gz を見てもらうとして、やりかたは、

cuptiDeviceEnumEventDomains でデバイスからドメインを取得
cuptiEventDomainEnumEvents でドメインからイベントを取得。cuptiEventGetGetAttributeで名前を取得して、必要なCUpti_EventIDを取得
cuptiEventGroupCreate でイベントグループを作成
cuptiEventGroupAddEvent でグループにイベントを追加
cuptiEventGroupEnable で収集開始
対象カーネルを呼ぶ
cuptiEventGroupReadEvent で値を取得

となります。

手元だと上のサンプルを実行すると、

kernel.cu::f()::inst_executed = 9

と出ます。cuobjdumpの結果とも…

        /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100]; // 1
        /*0008*/     /*0x00001de440000000*/     NOP CC.T;                // 2
        /*0010*/     /*0x80009de428004000*/     MOV R2, c [0x0] [0x20];  // 3
        /*0018*/     /*0x9000dde428004000*/     MOV R3, c [0x0] [0x24];  // 4
        /*0020*/     /*0x50001de218000000*/     MOV32I R0, 0x14;         // 5
        /*0028*/     /*0x00201c8594000000*/     ST.E [R2], R0;           // 6
        /*0030*/     /*0x10201c8594000000*/     ST.E [R2+0x4], R0;       // 7
        /*0038*/     /*0x20201c8594000000*/     ST.E [R2+0x8], R0;       // 8
        /*0040*/     /*0x00001de780000000*/     EXIT;                    // 9

一致してますね。

うまく使えば、自分で必要なカウンタの値を集めてくるとかもできるようになると思います。