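Perhaps the compiler is set up to recognize only ATI hardware; either way, it does not make proper use of the NVIDIA GPU.
Or is it that the sample programs do use OpenCL, but in a way that differs from NVIDIA's OpenCL?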
Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
  Platform Name: ATI Stream
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback cl_khr_d3d10_sharing
Platform Name: ATI Stream
Number of devices: 2
  Device Type: CL_DEVICE_TYPE_CPU
  Device ID: 4098
  Max compute units: 4
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 1024
  Max work group size: 1024
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 4
  Preferred vector width double: 0
  Max clock frequency: 2393Mhz
  Address bits: 32
  Max memory allocation: 536870912
  Image support: No
  Max size of kernel argument: 4096
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: Yes
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 32768
  Global memory size: 1073741824
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Global
  Local memory size: 32768
  Profiling timer resolution: 427
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: Yes
  Queue properties:
    Out-of-Order: No
    Profiling : Yes
  Platform ID: 00C3D40C
  Name: Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
  Vendor: GenuineIntel
  Driver version: 2.0
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
  Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_printf cl_khr_d3d10_sharing

  Device Type: CL_DEVICE_TYPE_GPU
  Device ID: 4098
  Max compute units: 2
  Max work items dimensions: 3
    Max work items[0]: 128
    Max work items[1]: 128
    Max work items[2]: 128
  Max work group size: 128
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 4
  Preferred vector width double: 0
  Max clock frequency: 720Mhz
  Address bits: 32
  Max memory allocation: 134217728
  Image support: No
  Max size of kernel argument: 1024
  Alignment (bits) of base address: 32768
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: None
  Cache line size: 0
  Cache size: 0
  Global memory size: 268435456
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Global
  Local memory size: 16384
  Profiling timer resolution: 1
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue properties:
    Out-of-Order: No
    Profiling : Yes
  Platform ID: 00C3D40C
  Name: ATI RV710
  Vendor: Advanced Micro Devices, Inc.
  Driver version: CAL 1.4.838
  Profile: FULL_PROFILE
  Version: OpenCL 1.0 ATI-Stream-v2.2 (302)
  Extensions: cl_khr_icd cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
Passed!
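The memory figures in the dump above are raw byte counts, which are hard to read at a glance. As a reading aid, here is a small sketch that converts a few of them to MiB. The helper names `to_mib` and `summarize` are my own and are not part of CLInfo; the numbers are copied from the output above.

```python
# Byte counts copied from the CLInfo dump above.
SIZES = {
    "CPU global memory": 1073741824,
    "CPU max allocation": 536870912,
    "RV710 global memory": 268435456,
    "RV710 max allocation": 134217728,
}


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


def summarize(sizes: dict) -> dict:
    """Map every label to its size in MiB."""
    return {name: to_mib(value) for name, value in sizes.items()}


if __name__ == "__main__":
    for name, mib in summarize(SIZES).items():
        print(f"{name}: {mib:.0f} MiB")
```

So the RV710 reports 256 MiB of global memory, of which a single allocation may take at most 128 MiB.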
Number of platforms: 2
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.0 CUDA 3.2.1
  Platform Name: NVIDIA CUDA
  Platform Vendor: NVIDIA Corporation
  Platform Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll

  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
  Platform Name: ATI Stream
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: NVIDIA CUDA
Number of devices: 2
  Device Type: CL_DEVICE_TYPE_GPU
  Device ID: 4318
  Max compute units: 4
  Max work items dimensions: 3
    Max work items[0]: 512
    Max work items[1]: 512
    Max work items[2]: 64
  Max work group size: 512
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 0
  Max clock frequency: 1350Mhz
  Address bits: 5347096844566560
  Max memory allocation: 134217728
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 8
  Max image 2D width: 4096
  Max image 2D height: 32768
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 4352
  Alignment (bits) of base address: 2048
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: None
  Cache line size: 0
  Cache size: 0
  Global memory size: 268107776
  Constant buffer size: 65536
  Max number of constant args: 9
  Local memory type: Scratchpad
  Local memory size: 16384
  Profiling timer resolution: 1000
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue properties:
    Out-of-Order: Yes
    Profiling : Yes
  Platform ID: 003E8750
  Name: GeForce 8600 GT
  Vendor: NVIDIA Corporation
  Driver version: 260.99
  Profile: FULL_PROFILE
  Version: OpenCL 1.0 CUDA
  Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics

  Device Type: CL_DEVICE_TYPE_GPU
  Device ID: 4318
  Max compute units: 4
  Max work items dimensions: 3
    Max work items[0]: 512
    Max work items[1]: 512
    Max work items[2]: 64
  Max work group size: 512
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 0
  Max clock frequency: 1188Mhz
  Address bits: 5347096844566560
  Max memory allocation: 134217728
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 8
  Max image 2D width: 4096
  Max image 2D height: 32768
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 4352
  Alignment (bits) of base address: 2048
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: No
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: Yes
  Cache type: None
  Cache line size: 0
  Cache size: 0
  Global memory size: 268107776
  Constant buffer size: 65536
  Max number of constant args: 9
  Local memory type: Scratchpad
  Local memory size: 16384
  Profiling timer resolution: 1000
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: No
  Queue properties:
    Out-of-Order: Yes
    Profiling : Yes
  Platform ID: 003E8750
  Name: GeForce 8600 GT
  Vendor: NVIDIA Corporation
  Driver version: 260.99
  Profile: FULL_PROFILE
  Version: OpenCL 1.0 CUDA
  Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics
Error : Bytes mismatch!
Error : glSharing mismatch!
Error : images mismatch!
Error : printf mismatch!
Error : deviceAttributeQuery mismatch!
Failed!

Platform Name: ATI Stream
Number of devices: 1
  Device Type: CL_DEVICE_TYPE_CPU
  Device ID: 4098
  Max compute units: 2
  Max work items dimensions: 3
    Max work items[0]: 1024
    Max work items[1]: 1024
    Max work items[2]: 1024
  Max work group size: 1024
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 4
  Preferred vector width double: 0
  Max clock frequency: 2211Mhz
  Address bits: 32
  Max memory allocation: 536870912
  Image support: No
  Max size of kernel argument: 4096
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
    Denorms: Yes
    Quiet NaNs: Yes
    Round to nearest even: Yes
    Round to zero: Yes
    Round to +ve and infinity: Yes
    IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 65536
  Global memory size: 1073741824
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Global
  Local memory size: 32768
  Profiling timer resolution: 279
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
    Execute OpenCL kernels: Yes
    Execute native function: Yes
  Queue properties:
    Out-of-Order: No
    Profiling : Yes
  Platform ID: 01DFD40C
  Name: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
  Vendor: AuthenticAMD
  Driver version: 2.0
  Profile: FULL_PROFILE
  Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
  Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_printf
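On this second machine the two GeForce 8600 GT cards sit under the NVIDIA CUDA platform, while the ATI Stream platform exposes only the CPU. If a sample picks its platform by vendor string, that alone could explain why the NVIDIA GPUs go unused. I have not read the sample's source, so the sketch below is only a guess at that kind of logic; `pick_platform` is a hypothetical helper working on plain dictionaries, not a real OpenCL call, and the platform values are copied from the dump above.

```python
from typing import List, Optional

# Platform records as plain dicts; values copied from the two-platform dump above.
PLATFORMS = [
    {"name": "NVIDIA CUDA", "vendor": "NVIDIA Corporation",
     "version": "OpenCL 1.0 CUDA 3.2.1"},
    {"name": "ATI Stream", "vendor": "Advanced Micro Devices, Inc.",
     "version": "OpenCL 1.1 ATI-Stream-v2.2 (302)"},
]


def pick_platform(platforms: List[dict], preferred_vendor: str) -> Optional[dict]:
    """Return the first platform from the preferred vendor.

    Falls back to the first platform in the list, or None if the list is empty.
    This is an assumed policy for illustration only.
    """
    for platform in platforms:
        if platform["vendor"] == preferred_vendor:
            return platform
    return platforms[0] if platforms else None


if __name__ == "__main__":
    chosen = pick_platform(PLATFORMS, "Advanced Micro Devices, Inc.")
    print(chosen["name"])
```

With a vendor preference hard-wired to AMD, the code lands on ATI Stream even though NVIDIA CUDA is listed first, and on that platform the only device here is the CPU.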
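How come the README for Windows is so thin while the Linux one is this thorough?
Anyway, it says that SLI can run in one of three modes.
The first renders frames in alternation across the GPUs, much like double buffering.
The second divides each frame into n pieces, one per GPU (the README describes the split as horizontal, and the split line moves so that the load stays balanced).
The third is antialiasing, that is, smoothing out jagged edges.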
25A. RENDERING MODES
In Linux, with two GPUs SLI and Multi-GPU can both operate in one of three
modes: Alternate Frame Rendering (AFR), Split Frame Rendering (SFR), and Antialiasing (AA). When AFR mode is active, one GPU draws the next frame while
the other one works on the frame after that. In SFR mode, each frame is split
horizontally into two pieces, with one GPU rendering each piece. The split
line is adjusted to balance the load between the two GPUs. AA mode splits
antialiasing work between the two GPUs. Both GPUs work on the same scene and
the result is blended together to produce the final frame. This mode is useful
for applications that spend most of their time processing with the CPU and
cannot benefit from AFR.
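The README only says that the SFR split line "is adjusted to balance the load"; it does not say how. As a toy illustration, here is a sketch that nudges the split after every frame in proportion to how unevenly the two GPUs were loaded. The function names, the `gain` value, and the scene cost model (top half cheap, bottom half expensive) are all my assumptions, not the driver's actual algorithm.

```python
def region_cost(lo: float, hi: float, cost_top: float, cost_bottom: float) -> float:
    """Render cost of the rows between lo and hi (0 = top of frame, 1 = bottom).

    Toy scene: rows in the top half cost cost_top per unit of height,
    rows in the bottom half cost cost_bottom.
    """
    top = max(0.0, min(hi, 0.5) - min(lo, 0.5))
    bottom = max(0.0, max(hi, 0.5) - max(lo, 0.5))
    return top * cost_top + bottom * cost_bottom


def adjust_split(split: float, time_gpu0: float, time_gpu1: float,
                 gain: float = 0.5) -> float:
    """Return a new split position: the fraction of the frame given to GPU 0."""
    total = time_gpu0 + time_gpu1
    if total <= 0:
        return split
    # imbalance lies in [-1, 1]; positive means GPU 0 took longer.
    imbalance = (time_gpu0 - time_gpu1) / total
    # Shrink the slower GPU's piece, and keep both pieces non-empty.
    new_split = split - gain * 0.5 * imbalance
    return min(0.95, max(0.05, new_split))


def simulate(cost_top: float, cost_bottom: float, frames: int = 50) -> float:
    """Start from an even split and adjust it once per frame."""
    split = 0.5
    for _ in range(frames):
        time_gpu0 = region_cost(0.0, split, cost_top, cost_bottom)
        time_gpu1 = region_cost(split, 1.0, cost_top, cost_bottom)
        split = adjust_split(split, time_gpu0, time_gpu1)
    return split


if __name__ == "__main__":
    print(f"split settles near {simulate(1.0, 3.0):.4f}")
```

When the bottom half of the frame is three times as expensive as the top, GPU 0 ends up with about two thirds of the frame height, which is the point where both GPUs do the same amount of work.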
With four GPUs, the same options are applicable. AFR mode cycles through all
four GPUs, each GPU rendering a frame in turn. SFR mode splits the frame
horizontally into four pieces. AA mode splits the work between the four GPUs,
allowing antialiasing up to 64x. With four GPUs SLI can also operate in an
additional mode, Alternate Frame Rendering of Antialiasing. (AFR of AA). With
AFR of AA, pairs of GPUs render alternate frames, each GPU in a pair doing
half of the antialiasing work. Note that these scenarios apply whether you
have four separate cards or you have two cards, each with two GPUs.
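To make the frame scheduling concrete, here is a minimal sketch of which GPU (or pair of GPUs) would handle a given frame under AFR and under AFR of AA. The round-robin order and the way GPUs are grouped into pairs (0 with 1, 2 with 3) are assumptions for illustration; the README does not specify them.

```python
from typing import Tuple


def afr_gpu(frame: int, num_gpus: int) -> int:
    """AFR: each GPU renders a frame in turn (assumed round-robin order)."""
    return frame % num_gpus


def afr_of_aa_gpus(frame: int, num_gpus: int = 4) -> Tuple[int, int]:
    """AFR of AA: pairs of GPUs render alternate frames.

    Both GPUs of the returned pair work on the same frame, each doing half
    of the antialiasing work. The pairing (0, 1), (2, 3) is an assumption.
    """
    num_pairs = num_gpus // 2
    pair = frame % num_pairs
    return (2 * pair, 2 * pair + 1)


if __name__ == "__main__":
    for frame in range(4):
        print(frame, afr_gpu(frame, 4), afr_of_aa_gpus(frame))
```

In plain AFR with four GPUs a given GPU comes up every fourth frame, while in AFR of AA each pair comes up every second frame.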
With some GPU configurations, there is in addition a special SLI Mosaic Mode
to extend a single X screen transparently across all of the available display
outputs on each GPU. See below for the exact set of configurations which can
be used with SLI Mosaic Mode.
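I once heard (probably on an English-language forum) that for BOINC SETI@home it is more effective to run two cards independently than to run them in SLI.
It came back to me while reading the CUDA documentation, so I searched for it.
My guess is that, because of how memory allocation works in SLI mode (I still need to look into this part),
an allocation also takes up memory on the other GPUs, memory runs short, and fewer CUDA devices than expected end up working,
so SLI pays off less than one would expect.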
4.3 Multiple Devices
In a system with multiple GPUs, all CUDA-enabled GPUs are accessible via the CUDA driver and runtime as separate devices. There are however special considerations as described below when the system is in SLI mode.
First, an allocation in one CUDA device on one GPU will consume memory on other GPUs. Because of this, allocations may fail earlier than otherwise expected.
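(Literal reading: first, a memory allocation for one CUDA device on one GPU will consume memory on the other GPUs, and because of this, allocations may fail sooner than expected. Free reading: first, when you allocate memory, the CUDA device on one GPU eats into the other GPUs' memory as well, so you may run out of memory sooner than you think.)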
Second, when a Direct3D application runs in SLI Alternate Frame Rendering mode, the Direct3D device(s) created by that application can be used for CUDA-Direct3D interoperability (i.e., passed as a parameter to cudaD3D[9|10]SetDirect3DDevice() when using the runtime API), but only one CUDA device can be created at a time from one of these Direct3D devices.
This CUDA device only executes the CUDA work on one of the GPUs in the SLI configuration.
As a consequence, real interoperability only happens with the copy of a Direct3D resource in that GPU
(note: in AFR mode Direct3D resources that must be in GPU memory are duplicated in the GPU memory of each GPU in the SLI configuration).
In some cases this is not the desired behavior and an application may need to forfeit use of the CUDA-Direct3D interoperability API and manually copy the output of its CUDA work to Direct3D resources using the existing CUDA and
Direct3D API.
[Source: excerpted from NVIDIA_CUDA_C_ProgrammingGuide.pdf]
As for the second point, I don't know what interoperability involves, so I'll skip it for now -_-
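The first point in the quoted section can be illustrated with a toy model. The guide only says that an allocation on one device "will consume memory on other GPUs"; it does not say how much. The sketch below assumes the worst case, where every allocation is mirrored in full on every GPU. `GpuPool` and `count_successful_allocs` are made-up names, and the 256 MiB per GPU and 64 MiB chunk size are arbitrary round numbers.

```python
MIB = 2**20


class GpuPool:
    """Toy model of free memory on several GPUs."""

    def __init__(self, num_gpus: int, bytes_per_gpu: int, sli: bool):
        self.free = [bytes_per_gpu] * num_gpus
        self.sli = sli

    def alloc(self, device: int, size: int) -> bool:
        """Try to allocate size bytes for one device; return True on success.

        Assumption: in SLI mode the allocation is mirrored on every GPU.
        """
        targets = range(len(self.free)) if self.sli else [device]
        if any(self.free[i] < size for i in targets):
            return False
        for i in targets:
            self.free[i] -= size
        return True


def count_successful_allocs(sli: bool, num_gpus: int = 2,
                            bytes_per_gpu: int = 256 * MIB,
                            chunk: int = 64 * MIB) -> int:
    """Alternate allocations between devices and count how many succeed."""
    pool = GpuPool(num_gpus, bytes_per_gpu, sli)
    count = 0
    for attempt in range(100):
        if pool.alloc(attempt % num_gpus, chunk):
            count += 1
    return count


if __name__ == "__main__":
    print("independent:", count_successful_allocs(sli=False))
    print("SLI (mirrored):", count_successful_allocs(sli=True))
```

Under this assumption two independent GPUs fit eight chunks between them, while the same two GPUs in SLI fit only four, which matches the guide's warning that allocations "may fail earlier than otherwise expected".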
A search turned up a post on much the same topic, with nearly the same title -_-
Posted 31 Jan 2009 19:22:14 UTC
SLI basically combines 2 (or more) matched GPU devices into 1 logical
GPU device. When in SLI mode, the system sees only 1 logical GPU and
unfortunately for CUDA this means that it only has visibility to 1
physical device (not 2, 3 or 4). Disabling SLI mode for CUDA is best
because it allows SETI to take advantage of each GPU as its own device.
The CUDA Toolkit contains only brief documentation and the development environment; it does not include any sample code.
So to study CUDA, you need to install the SDK code samples alongside the Toolkit.
CUDA Toolkit
C/C++ compiler
CUDA Visual Profiler
OpenCL Visual Profiler
GPU-accelerated BLAS library
GPU-accelerated FFT library
Additional tools and documentation
*New* Updated versions of the
CUDA C Programming Guide (Version 3.1.1) and the Fermi Tuning Guide
(Version 1.2) are available via the links to the right.
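So I hurried over to the NVIDIA website and started the download!
The driver versions differ by nearly 60 -_-
+ Installing the driver requires a reboot
There is nothing unusual about it, so I'll skip the CUDA Toolkit installer screens.
By default it installs to C:\CUDA; below are the contents of C:\CUDA\bin.
nvcc is the compiler. Admire the modest size of cudart next to the large and beautiful(!) sizes of cuBLAS and cuFFT!
So few files, yet they take up a ridiculous amount of space -_-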