iOSのBLAS(Accelerate.framework)とEigenの行列の掛け算の実行速度テスト

結論：BLASの方が速い。

　この前こんな記事を書きました。
　iOSのvDSP(Accelerate.framework)とEigenの行列の掛け算の実行速度テスト - 開発のヒホ

　で、こんなコメントを貰いました。

Steve Apologies in advance for not writing in Japanese, but I assume you’ll get the gist: vDSP_mmul is a legacy routine that has never been aggressively tuned for any platform. If you want to perform a matrix multiply using Accelerate (or actually any linear algebra operations), use the BLAS and LAPACK interfaces instead. Specifically, here you should be using cblas_sgemm.

　なんと！！ vDSP_~系は最適化されていない！！

　知らなかった･･･
　BLASを使えば行列演算が高速にできる、ということは小耳に挟んでいたのですが、vDSPは内部的にBLASを使っているのだと勘違いしていました。

　ちゃんとBLASを使うには、同じくAcceralete.frameworkに入ってあるcblas_~系を使う必要があるらしいです。
　BLAS Reference

　さっそく再テストしてみました。

    const int SIZE = 150 ;
    
    MatrixXf mat1 = Matrix<float, SIZE, SIZE>::Random() ;
    MatrixXf mat2 = Matrix<float, SIZE, SIZE>::Random() ;
    
    float* mat1_ = (float *)malloc(sizeof(float)*SIZE*SIZE) ;
    float* mat2_ = (float *)malloc(sizeof(float)*SIZE*SIZE) ;
    for( int i=0 ; i<SIZE ; i++ ) {
        for( int j=0 ; j<SIZE ; j++ ) {
            mat1_[j+i*SIZE] = mat1(i, j) ;
            mat2_[j+i*SIZE] = mat2(i, j) ;
        }
    }
    
    __block NSTimeInterval time_eigen = 0 ;
    __block NSTimeInterval time_dsp = 0 ;
    __block NSTimeInterval time_cblas = 0 ;
    __block NSTimeInterval time_forloop = 0 ;
    
    // multiplication by eigen
    void (^testEigen)() = ^{
        NSTimeInterval time_start = [NSDate timeIntervalSinceReferenceDate] ;
        for( int i=0 ; i<10 ; i++ ) {
            volatile MatrixXf mat_ans = mat1 * mat2 ;
            //for( int a=SIZE-3 ; a<SIZE ; a++ ) { printf("%f\n", mat_ans(a, a)) ; }
        }
        NSTimeInterval time_stop = [NSDate timeIntervalSinceReferenceDate] ;
        time_eigen += time_stop - time_start ;
    } ;
    
    // multiplication by vDSP
    void (^testDSP)() = ^{
        NSTimeInterval time_start = [NSDate timeIntervalSinceReferenceDate] ;
        for( int i=0 ; i<10 ; i++ ) {
            float* volatile mat_ans_ = (float *)malloc(sizeof(float)*SIZE*SIZE) ;
            vDSP_mmul(mat1_, 1, mat2_, 1, mat_ans_, 1, SIZE, SIZE, SIZE) ;
            //for( int a=SIZE-3 ; a<SIZE ; a++ ) { printf("%f\n", mat_ans_[a+a*SIZE]) ; }
            free(mat_ans_) ;
        }
        NSTimeInterval time_stop = [NSDate timeIntervalSinceReferenceDate] ;
        time_dsp += time_stop - time_start ;
    } ;
    
    // multiplication by CBLAS
    void (^testCBLAS)() = ^{
        NSTimeInterval time_start = [NSDate timeIntervalSinceReferenceDate] ;
        for( int i=0 ; i<10 ; i++ ) {
            float* volatile mat_ans_ = (float *)malloc(sizeof(float)*SIZE*SIZE) ;
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        SIZE, SIZE, SIZE, 1, mat1_, SIZE, mat2_, SIZE, 0, mat_ans_, SIZE);
            //for( int a=SIZE-3 ; a<SIZE ; a++ ) { printf("%f\n", mat_ans_[a+a*SIZE]) ; }
            free(mat_ans_) ;
        }
        NSTimeInterval time_stop = [NSDate timeIntervalSinceReferenceDate] ;
        time_cblas += time_stop - time_start ;
    } ;
    
    // multiplication by my hand
    void (^testForloop)() = ^{
        NSTimeInterval time_start = [NSDate timeIntervalSinceReferenceDate] ;
        for( int i=0 ; i<10 ; i++ ) {
            float* volatile mat_ans_ = (float *)malloc(sizeof(float)*SIZE*SIZE) ;
            for( int x=0 ; x<SIZE ; x++ ) {
                for( int y=0 ; y<SIZE ; y++ ) {
                    float sum=0 ;
                    for( int xx=0 ; xx<SIZE ; xx++ ) {
                        sum += mat1_[xx+y*SIZE] * mat2_[x+xx*SIZE] ;
                    }
                    mat_ans_[x+y*SIZE] = sum ;
                }
            }
            //for( int a=SIZE-3 ; a<SIZE ; a++ ) { printf("%f\n", mat_ans_[a+a*SIZE]) ; }
            free(mat_ans_) ;
        }
        NSTimeInterval time_stop = [NSDate timeIntervalSinceReferenceDate] ;
        time_forloop += time_stop - time_start ;
    } ;
    
    for( int i=0 ; i<10 ; i++ ) {
        testEigen() ;
        testDSP() ;
        testCBLAS() ;
        testForloop() ;
    }
    
    printf("SIZE : %d\n", SIZE) ;
    printf("Eigen : %f\n", time_eigen) ;
    printf("vDSP : %f\n", time_dsp) ;
    printf("CBLAS : %f\n", time_cblas) ;
    printf("ForLoop : %f\n", time_forloop) ;

　冒頭の記事で書いたものに、新たにtestCBLASを追加しています。
　BLASを使って行列の掛け算を行っている･･･はずです。日本語資料が少なくてよくわからない。

　気になる実行結果はこちら！（第4世代iPod touchで検証）

SIZE : 150
Eigen : 0.794301
vDSP : 4.267753
CBLAS : 0.613059
ForLoop : 3.993476

　BLAS速ぇ！！
　もう少し行列の数を増やしてみます。

SIZE : 200
Eigen : 1.903115
vDSP : 14.832418
CBLAS : 1.096772
ForLoop : 14.233802

　vDSPとかと比べるとEigenも健闘していますが、BLAS使ったほうが倍ほど速いです。
　さすが端末のことも考えて最適化されているだけあります。恐るべし。

　もっと行列の数を増やしてみました。
　（vDSPやforloopを使ったものは実行時間が長くなるため省略しています）

SIZE : 1000
Eigen : 207.576759
CBLAS : 124.308358

　BLAS速いなぁ･･･
　C++でBLASを使いやすくしたライブラリも作られているらしいので見てみたいところですね。

　なんだ、使っていたものが間違っていただけで、Acceralete.frameworkはやっぱり速いんですね。
　移植のことを考えるとEigenの方が楽ですし、迷いどころです。

　Thank you, Steve!