CUDA9-cufft+blas | Echo Nie's Blog

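A. cuFFT: FFT library for Fourier transforms between real-valued and complex-valued data
    i. Algorithm characteristics:
    ○ Most efficient when the input size can be written as 2^a × 3^b × 5^c × 7^d; the smaller the prime factors, the faster the transform
    ○ Algorithmic complexity O(n log n)
    ○ Single precision is faster than double precision
    ○ Supports C2C, R2C, C2R
    ○ Supports 1D, 2D, and 3D transforms
    ○ ……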
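
The size rule above can be checked on the host before creating a plan. The following is a minimal sketch in plain C; `is_7_smooth` and `next_7_smooth` are hypothetical helper names, not part of cuFFT, and padding an input up to such a size is only one possible way to act on the rule.

```c
#include <stdbool.h>

/* Returns true when n = 2^a * 3^b * 5^c * 7^d for non-negative a, b, c, d. */
bool is_7_smooth(int n)
{
    static const int primes[] = {2, 3, 5, 7};
    if (n < 1) return false;
    for (int k = 0; k < 4; ++k)
        while (n % primes[k] == 0)
            n /= primes[k];          /* strip every factor of 2, 3, 5, 7 */
    return n == 1;                   /* anything left over is a larger prime factor */
}

/* Smallest size >= n whose only prime factors are 2, 3, 5 and 7
   (hypothetical helper for choosing a padded transform length). */
int next_7_smooth(int n)
{
    if (n < 1) n = 1;
    while (!is_7_smooth(n)) ++n;
    return n;
}
```

For example, `next_7_smooth(1021)` yields 1024, so a signal of 1021 samples could be zero-padded to 1024 if the application tolerates padding.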

    ii. Usage
    ○ Create a plan: cufftPlan1d() / cufftPlan2d() / cufftPlan3d() for simple plans
            cufftPlanMany() for batched input and strided data layouts
    ○ Execute the FFT (single / double precision): cufftExecC2C() / cufftExecZ2Z()
            cufftExecR2C() / cufftExecD2Z()
            cufftExecC2R() / cufftExecZ2D()
    ○ Destroy the plan: cufftDestroy()

    iii. Data layout: (see figure)



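    That is, to keep the data aligned, a real-to-complex transform roughly halves the number of elements (n real inputs yield n/2 + 1 complex outputs, because the redundant half of the Hermitian-symmetric spectrum is not stored), and a complex-to-real transform correspondingly roughly doubles it.
        The one or two extra units are left unused and act as padding.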
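
The element counts behind this layout can be sketched as small host-side helpers. The function names are hypothetical (they are not cuFFT calls); the n/2 + 1 rule with integer division is the documented R2C output size, and the in-place real buffer is assumed to be padded to hold that many complex values.

```c
#include <stddef.h>

/* Number of complex outputs of an R2C transform of n real samples. */
size_t r2c_complex_count(size_t n)
{
    return n / 2 + 1;                 /* integer division */
}

/* Number of real slots needed for an in-place R2C transform:
   the buffer must be able to hold n/2 + 1 complex values. */
size_t r2c_inplace_real_count(size_t n)
{
    return 2 * r2c_complex_count(n);
}

/* Unused real slots at the end of the in-place input: 2 if n is even, 1 if odd. */
size_t r2c_padding(size_t n)
{
    return r2c_inplace_real_count(n) - n;
}
```

For n = 8 this gives 5 complex outputs and 2 padding slots; for n = 7 it gives 4 complex outputs and 1 padding slot, which matches the "one or two extra units" noted above.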

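    HINT<<
        IFFT( FFT( A ) ) = n A, where n is the length of the vector, counted in samples (not floats or bytes).
        That is, in CUDA a forward transform of A followed by the inverse transform does not return A: cuFFT does not divide by the data size, so the result must be scaled by 1/n manually.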
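
The missing 1/n factor can be seen without a GPU. The sketch below is a plain C stand-in, not cuFFT: it implements an unnormalized 4-point DFT with exact integer arithmetic (for n = 4 every twiddle factor is a power of i), using the same sign convention as CUFFT_FORWARD (-1) and CUFFT_INVERSE (+1). The type `cint` and the functions `mul_i_pow` and `dft4` are hypothetical names introduced for illustration.

```c
typedef struct { int re, im; } cint;   /* hypothetical integer complex type */

/* Multiply z by i^m for m in 0..3. */
static cint mul_i_pow(cint z, int m)
{
    cint r;
    switch (m) {
    case 0:  r.re =  z.re; r.im =  z.im; break;   /* * 1  */
    case 1:  r.re = -z.im; r.im =  z.re; break;   /* * i  */
    case 2:  r.re = -z.re; r.im = -z.im; break;   /* * -1 */
    default: r.re =  z.im; r.im = -z.re; break;   /* * -i */
    }
    return r;
}

/* Unnormalized 4-point DFT: out[k] = sum_j in[j] * exp(sign * 2*pi*i*j*k / 4).
   sign = -1 is the forward transform, sign = +1 the inverse; neither divides by n. */
void dft4(const cint in[4], cint out[4], int sign)
{
    for (int k = 0; k < 4; ++k) {
        cint acc = {0, 0};
        for (int j = 0; j < 4; ++j) {
            int m = ((sign * j * k) % 4 + 4) % 4;  /* exp(sign*2*pi*i*j*k/4) = i^m */
            cint t = mul_i_pow(in[j], m);
            acc.re += t.re;
            acc.im += t.im;
        }
        out[k] = acc;
    }
}
```

Calling `dft4(a, f, -1)` and then `dft4(f, b, +1)` leaves `b[i]` equal to 4 times `a[i]`, so recovering the input requires an explicit multiplication by 1/n after the inverse transform.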


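B. cuBLAS: linear algebra routines (BLAS)
    i. First, the cuBLAS library uses column-major storage and 1-based indexing.
    Since C/C++ uses row-major storage and usually 0-based indexing, take particular care when handling two-dimensional arrays.
    The following macros map (i, j) to a linear offset in a column-major array with leading dimension ld: IDX2F for 1-based indices, IDX2C for 0-based indices.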
    #define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
    #define IDX2C(i,j,ld) (((j)*(ld))+(i))
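
As an illustration of these two macros, the sketch below copies a row-major C matrix into a column-major buffer of the kind cuBLAS expects. It runs on the host only; `row_major_to_col_major` is a hypothetical helper name, and the leading dimension is assumed to equal the number of rows (no extra padding between columns).

```c
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))  /* 1-based (Fortran-style) indices */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))          /* 0-based (C-style) indices */

/* Copy a row-major rows x cols matrix into a column-major buffer
   with leading dimension ld = rows. */
void row_major_to_col_major(const float *src, float *dst, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            dst[IDX2C(i, j, rows)] = src[i * cols + j];
}
```

For the 2 x 3 matrix {1, 2, 3; 4, 5, 6} the column-major buffer becomes {1, 4, 2, 5, 3, 6}, and both `IDX2F(2, 3, 2)` and `IDX2C(1, 2, 2)` evaluate to offset 5, the element 6.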

    ii. Usage:
    ○ Create a handle:  cublasStatus_t cublasCreate(cublasHandle_t *handle)
    ○ Call cuBLAS functions: level-1, level-2, level-3
    ○ Destroy the handle:  cublasStatus_t cublasDestroy(cublasHandle_t handle)

    iii. cublasI<t>amin()  /  cublasI<t>amax()
    Form: cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result)
    Function: the result is the first index i such that |real(x[j])| + |imag(x[j])| is maximal (minimal for amin), for j = 1 + (i−1)∗incx
    HINT<<
    Usage: checkCudaErrors(cublasIzamax(handle_Z_max, width*height, dev_p, 1, &idx));
        The "1" is the stride of dev_p, counted in elements; do not pass sizeof(double2).
        Also, the returned idx is a 1-based index, so dev_p[idx - 1] is the element whose |real| + |imag| is largest.
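
The indexing rule can be made concrete with a CPU reference. The sketch below is plain C, not a cuBLAS call: `dcomplex` is a stand-in for cuDoubleComplex, `ref_izamax` is a hypothetical name, and returning 0 for n <= 0 or incx <= 0 follows the reference BLAS convention, which is assumed here rather than confirmed for cuBLAS.

```c
typedef struct { double x, y; } dcomplex;  /* stand-in for cuDoubleComplex: x = real, y = imag */

/* |real(z)| + |imag(z)|, the magnitude measure used by amax/amin. */
static double abs1(dcomplex z)
{
    double re = z.x < 0 ? -z.x : z.x;
    double im = z.y < 0 ? -z.y : z.y;
    return re + im;
}

/* CPU reference for the semantics described above: returns the 1-based index i
   of the first element x[(i-1)*incx] that maximizes |real| + |imag|.
   incx is a stride in elements, not bytes. */
int ref_izamax(int n, const dcomplex *x, int incx)
{
    if (n <= 0 || incx <= 0) return 0;     /* assumed reference-BLAS convention */
    int best = 1;
    double best_val = abs1(x[0]);
    for (int i = 2; i <= n; ++i) {
        double v = abs1(x[(i - 1) * incx]);
        if (v > best_val) {                /* strict '>' keeps the first maximum */
            best_val = v;
            best = i;
        }
    }
    return best;
}
```

With a returned index `idx` and a stride of 1, the winning element on the host side is `x[idx - 1]`, mirroring the `dev_p[idx - 1]` note above.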

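    iv. cublas<t>asum()
    Form: cublasStatus_t cublasSasum(cublasHandle_t handle, int n, const float *x, int incx, float *result)
    Function: computes the sum of the absolute values of the real and imaginary parts of the elements (usage is similar to amin/amax)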
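
A CPU reference makes the definition explicit for both the real and the complex case. This is a plain C sketch, not a cuBLAS call; `ref_sasum`, `ref_dzasum`, and the `zcomplex` stand-in type are hypothetical names, and returning 0 for n <= 0 or incx <= 0 is an assumption borrowed from the reference BLAS.

```c
typedef struct { double x, y; } zcomplex;  /* stand-in for cuDoubleComplex: x = real, y = imag */

/* Real case: sum of |x[i*incx]| over n elements (incx is a stride in elements). */
float ref_sasum(int n, const float *x, int incx)
{
    float sum = 0.0f;
    if (n <= 0 || incx <= 0) return sum;
    for (int i = 0; i < n; ++i) {
        float v = x[i * incx];
        sum += v < 0 ? -v : v;
    }
    return sum;
}

/* Complex case: sum of |real| + |imag| over n elements; the result is real. */
double ref_dzasum(int n, const zcomplex *x, int incx)
{
    double sum = 0.0;
    if (n <= 0 || incx <= 0) return sum;
    for (int i = 0; i < n; ++i) {
        zcomplex z = x[i * incx];
        sum += (z.x < 0 ? -z.x : z.x) + (z.y < 0 ? -z.y : z.y);
    }
    return sum;
}
```

For the real vector {1, -2, 3, -4} the sum is 10 with stride 1, and 4 when only every second element is taken (n = 2, incx = 2).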