256-bit AVX on Bulldozer - w

In http://int.main.jp/txt/gcc-mtune.html
I wrote:

I wonder whether 256-bit AVX just isn't fast on Bulldozer... (unconfirmed).

That's what I wrote, but it turns out the answer is in section 13.9 of Agner's microarchitecture.pdf.

・The instruction decoders cannot handle two double instructions per clock cycle.

・The throughput of 256-bit store instructions is less than half the throughput of 128-bit
　store instructions.

・128-bit register-to-register moves have zero latency, while 256-bit register-to-register
　moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different
　domain (see below). Register-to-register moves can be avoided in most cases
　thanks to the non-destructive 3-operand instructions.

So that appears to be the reason.

In other words, a 256-bit operation can actually end up slower than two 128-bit operations.
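
To make the move-latency figures concrete, here is a minimal sketch that just adds up the numbers quoted from section 13.9. The constant and function names are made up for illustration, and this is plain arithmetic on the quoted figures, not a measurement.

```python
# Figures quoted from section 13.9 of Agner's microarchitecture.pdf (clock cycles).
MOVE_LATENCY_128 = 0           # 128-bit register-to-register move
MOVE_LATENCY_256 = 2           # 256-bit register-to-register move, before the penalty
DOMAIN_PENALTY_RANGE = (2, 3)  # extra clocks for using a different domain


def move_latency_256_range():
    """Return (best, worst) total latency of one 256-bit register-to-register move."""
    low, high = DOMAIN_PENALTY_RANGE
    return (MOVE_LATENCY_256 + low, MOVE_LATENCY_256 + high)


def two_moves_128_latency():
    """Two 128-bit moves covering the same 256 bits; each has zero latency."""
    return 2 * MOVE_LATENCY_128


print("one 256-bit move :", move_latency_256_range(), "clocks")
print("two 128-bit moves:", two_moves_128_latency(), "clocks")
```

So going by the quoted numbers alone, a single 256-bit register move costs 4 to 5 clocks of latency, while the 128-bit equivalent costs none.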

A few supplementary notes:

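・A "double instruction" is not a double-precision operation; it is an instruction that generates two uops. In other words, 256-bit operations are classified among the instructions that are somewhat harder to decode.
・128-bit register-to-register moves are handled purely by register renaming and apparently generate no uops (13.11).
・Register-to-register transfers can only be executed in the integer units (13.10), which is presumably where the domain penalty in the third point above comes from.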
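
The decoder restriction on double instructions can be illustrated with a rough back-of-the-envelope model. This is only a sketch under stated assumptions: the limit of one double instruction per clock comes from the quoted text, but the figure of 4 single instructions per clock and the idea that every 128-bit operation is a single (one-uop) instruction are assumptions made here for illustration. All names are hypothetical.

```python
import math

# Assumption (not stated in the post): up to 4 single instructions decode per clock.
SINGLES_PER_CLOCK = 4
# From the quoted text: two double instructions cannot be decoded in the same clock.
DOUBLES_PER_CLOCK = 1


def decode_clocks_256(n_ops):
    """Lower bound on decode clocks for n_ops 256-bit (double) instructions."""
    return math.ceil(n_ops / DOUBLES_PER_CLOCK)


def decode_clocks_128(n_ops):
    """Lower bound for the same work done as 2 * n_ops 128-bit instructions,
    assuming each of them is a single (one-uop) instruction."""
    return math.ceil(2 * n_ops / SINGLES_PER_CLOCK)


for n in (4, 8, 100):
    print(n, "ops:", decode_clocks_256(n), "clocks as 256-bit,",
          decode_clocks_128(n), "clocks as 2 x 128-bit")
```

Under these assumptions the 256-bit stream needs roughly twice as many decode clocks as the equivalent 128-bit stream, even though both produce the same number of uops. Real code mixes in other instructions, so treat this as an illustration of the bottleneck rather than a prediction.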