- CUBLAS performance improved 50% to 300% on Fermi architecture GPUs, for matrix multiplication of all datatypes and transpose variations
- CUFFT performance tuned for radix-3, -5, and -7 transform sizes on Fermi architecture GPUs, now 2x to 10x faster than MKL
- New CUSPARSE library of GPU-accelerated sparse matrix routines for sparse/sparse and dense/sparse operations delivers 5x to 30x faster performance than MKL
- New CURAND library of GPU-accelerated random number generation (RNG) routines, supporting Sobol quasi-random and XORWOW pseudo-random routines at 10x to 20x faster than similar routines in MKL
- H.264 encode/decode libraries now included in the CUDA Toolkit
The deviceQuery sample is, as its name says, a program that throws queries at the device to find out what its specs are and how many threads it supports.
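Under the hood it is little more than the runtime API's device-property calls; a minimal sketch (printing only a few of the fields the real sample reports) would look like this:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);               // how many CUDA devices are present
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);  // fills in the per-device spec sheet
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  Warp size:             %d\n", prop.warpSize);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}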
Anyway, the NVIDIA_CUDA_C_ProgrammingGuide.pdf document covers the relevant background, so drawing on it:
The GeForce 8800 GT (hereafter 8800GT) has 14 multiprocessors and 112 cores.
A Grid contains Blocks, and a Block contains Threads.
By simple arithmetic, each multiprocessor has 8 cores,
so 14 multiprocessors make 112 cores in total.
A grid is an array of blocks (at most 2D),
and a block is an array of threads (at most 3D, per the limits below).
A thread is the smallest unit of work.
The number of threads scheduled together at a time (the warp size) is 32,
and the maximum number of threads that can be grouped into one block is 512.
The maximum dimensions of a block are 3D, 512 x 512 x 64,
and the maximum dimensions of a grid are 2D, 65535 x 65535 x 1.
Is that about the right way to understand it?
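To make the hierarchy concrete, here is a toy sketch (whereAmI is a hypothetical kernel, not from the guide): a 2D grid of 2D blocks in which every thread derives a globally unique index from blockIdx, blockDim, and threadIdx.

#include <cuda_runtime.h>

__global__ void whereAmI(int* out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the whole grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the whole grid
    out[y * width + x] = y * width + x;              // globally unique thread id
}

int main()
{
    // A 4x4 grid of 8x4 blocks = 16 blocks x 32 threads = 512 threads,
    // so each block is exactly one warp (32 threads) on the 8800GT.
    dim3 block(8, 4);
    dim3 grid(4, 4);
    int* d_out;
    cudaMalloc((void**)&d_out, 512 * sizeof(int));
    whereAmI<<<grid, block>>>(d_out, grid.x * block.x);   // width = 4 * 8 = 32
    cudaThreadSynchronize();   // the CUDA 3.x-era call to wait for the device
    cudaFree(d_out);
    return 0;
}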
A multithreaded program is partitioned into blocks of threads that execute independently from each
other, so that a GPU with more cores will automatically execute the program in less time than a GPU
with fewer cores.
=> A multithreaded program is divided into blocks of threads that run independently of one another, and for that reason a GPU with more cores can automatically execute the program in less time than a GPU with fewer cores.
The host is the usual CPU environment, and the device is the GPU environment.
The build is split in two: at compile time nvcc handles only the device code, and an ordinary compiler handles the rest.
Below is the kernel, the part from which device code is generated; the MatAdd function must be prefixed with the __global__ qualifier to mark it as device code. The example sets the block size to 16x16 threads and splits the grid into N/16 blocks per dimension. And since the 2D form can be used just like the 1D form, plain int arguments appear to be accepted as well.
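The kernel itself didn't make it into this note, so here it is, reconstructed from the programming guide's matrix-addition example (N is assumed to be defined elsewhere and a multiple of 16):

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation: one 16x16 block of threads per 16x16 tile of the matrix
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}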
B.15 Execution Configuration
Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream (see Section 3.3.10.1 for a description of streams).
When using the driver API, the execution configuration is specified through a series of driver function calls as detailed in Section 3.3.3.
When using the runtime API (Section 3.2), the execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:
Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of the grid,
such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z must be equal to 1;
Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of each block,
such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in Section B.2.3; Ns is an optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
As an example, a function declared as
__global__ void Func(float* parameter);
must be called like this:
Func<<< Dg, Db, Ns >>>(parameter);
The arguments to the execution configuration are evaluated before the actual function arguments and, like the function arguments, are currently passed via shared memory to the device. The function call will fail if Dg or Db are greater than the maximum sizes allowed for the device as specified in Appendix G, or if Ns is greater than the maximum amount of shared memory available on the device, minus the amount of shared memory required for static allocation, function arguments (for devices of compute capability 1.x), and execution configuration.
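Putting B.15 together, here is a minimal sketch (Scale is a hypothetical kernel, not from the guide) that uses all four execution-configuration arguments, with Ns sizing the extern __shared__ array of Section B.2.3 and S supplying a non-default stream:

#include <cuda_runtime.h>

__global__ void Scale(float* data, float factor)
{
    extern __shared__ float buffer[];          // sized at launch time by Ns
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = data[i] * factor;    // stage through dynamic shared memory
    __syncthreads();
    data[i] = buffer[threadIdx.x];
}

int main()
{
    float* d_data;
    cudaMalloc((void**)&d_data, 64 * 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 Dg(64);                       // 64 blocks; Dg.z is 1 as required
    dim3 Db(256);                      // 256 threads per block (<= 512 on the 8800GT)
    size_t Ns = 256 * sizeof(float);   // dynamic shared memory per block, in bytes

    Scale<<<Dg, Db, Ns, stream>>>(d_data, 2.0f);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}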
Anyway, this was originally supposed to be an analysis of the example output below, but it feels like it keeps sinking deeper into a maze -_-
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 8800 GT"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 536543232 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.10, NumDevs = 1, Device = GeForce 8800 GT
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.1
NUMBER OF MULTIPROCESSORS:            14
NUMBER OF CUDA CORES:                 112
CL_DEVICE_REGISTERS_PER_BLOCK_NV:     8192
CL_DEVICE_WARP_SIZE_NV:               32
CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0
---------------------------------
2D Image Formats Supported (71)
---------------------------------
#  Channel Order  Channel Type
[71 format rows not captured]