
Open Computing Language

Introduction
OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming across CPUs, GPUs, and other processors.

OpenCL lets programmers write a single portable program that uses ALL the resources in a heterogeneous platform.

OpenCL consists of:

- An API for coordinating parallel computation across heterogeneous processors
- A cross-platform programming language
- Support for both data- and task-based parallel programming models
- A configuration profile for handheld and embedded devices
- A kernel language that utilizes a subset of ISO C99 with extensions for parallelism

The BIG Idea behind OpenCL


The OpenCL execution model: execute a kernel at each point in a problem domain. E.g., to process a 1024 x 1024 image, make one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions.
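As a minimal sketch of this idea (the kernel name, the single-channel image format, and the invert operation are illustrative assumptions, not from the slides), a per-pixel kernel could look like this:

__kernel void invert(__global uchar* img, const int width, const int height)
{
    // One work-item per pixel: its (x, y) coordinates come from the 2D ND-Range
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

Enqueued with a 2D global work size of {1024, 1024}, the kernel body runs once per pixel.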

To use OpenCL, you must:

- Define the platform
- Execute code on the platform
- Move data around in memory
- Write (and build) programs

OpenCL Platform Model


- One Host + one or more Compute Devices
  - Each Compute Device is composed of one or more Compute Units
  - Each Compute Unit is further divided into one or more Processing Elements
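To make this hierarchy concrete, the host can query it at run time. A small sketch, assuming <CL/cl.h> and <stdio.h> are included and a GPU device is present (error handling omitted):

cl_platform_id platform;
cl_device_id device;
cl_uint num_cu;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Number of Compute Units on this Compute Device
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(num_cu), &num_cu, NULL);
printf("Compute units: %u\n", num_cu);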

OpenCL Execution Model


- An OpenCL application runs on a host, which submits work to the compute devices
- Work-item: the basic unit of work on an OpenCL device
- Kernel: the code for a work-item; basically a C function
- Program: a collection of kernels and other functions (analogous to a dynamic library)
- Context: the environment within which work-items execute; it includes the devices, their memories, and command queues
- Applications queue kernel execution instances
  - Queued in order: one queue per device
  - Executed in order or out of order
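For example, whether a queue executes in order or out of order is chosen when it is created (a sketch using the OpenCL 1.x API; ctx and dev are assumed to be an existing context and device):

// Default: commands execute in the order they were enqueued
cl_command_queue in_order = clCreateCommandQueue(ctx, dev, 0, NULL);

// With this property the device may reorder independent commands;
// dependencies must then be expressed explicitly with events
cl_command_queue out_of_order =
    clCreateCommandQueue(ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);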

Example of the NDRange organization..


- SIMT (Single Instruction, Multiple Threads): the same code is executed in parallel by different threads, each thread operating on different data
- Work-item: equivalent to a CUDA thread
- Work-group: allows communication and cooperation between work-items and reflects how work-items are organized; equivalent to a CUDA thread block
- ND-Range: the next organization level, specifying how work-groups are organized
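A small kernel sketch (illustrative, not from the slides) of how a work-item locates itself inside this hierarchy:

__kernel void where_am_i(__global int* out)
{
    int gid = get_global_id(0);   // position within the whole ND-Range
    int lid = get_local_id(0);    // position within the work-group
    int wg  = get_group_id(0);    // index of this work-item's work-group
    int wgs = get_local_size(0);  // size of the work-group

    // The global ID can always be reconstructed from the group coordinates
    out[gid] = wg * wgs + lid;
}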


OpenCL Memory Model
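The figure for this slide is not reproduced here. As background (these are the standard OpenCL address spaces, stated as common knowledge rather than slide content), a kernel distinguishes the memory regions with qualifiers, e.g.:

__kernel void memory_spaces(__global float* g,    // global: visible to all work-items
                            __constant float* c,  // constant: read-only during kernel execution
                            __local float* l)     // local: shared within one work-group
{
    float p = g[get_global_id(0)];  // private: per work-item (default for locals)
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the work-group before reading local memory
}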

OpenCL programs
OpenCL programs are divided into two parts:

- One that executes on the device (in our case, on the GPU).
  This is where you write the kernels; the device program is usually the one you are most concerned with.

- One that executes on the host (in our case, on the CPU).
  It offers an API so that you can manage device execution. It can be programmed in C or C++, and it controls the OpenCL environment (context, command queue, ...).

Sample: a kernel that adds two vectors


This kernel should take four parameters: the two vectors to be added, a third vector to store the result, and the vectors' size. If you write a program that solves this problem on the CPU, it will look something like this:

void vector_add_cpu(const float* src_a,
                    const float* src_b,
                    float* res,
                    const int num)
{
    for (int i = 0; i < num; i++)
        res[i] = src_a[i] + src_b[i];
}

Sample: a kernel that adds two vectors


However, on the GPU the logic is slightly different. Instead of having one thread iterate through all the elements, each thread computes one element, whose index is the same as the thread's ID.

__kernel void vectorAdd(__global const float* src_a,
                        __global const float* src_b,
                        __global float* res,
                        const int num)
{
    /* get_global_id(0) returns the ID of the work-item in execution.
       As many work-items are launched at the same time, all executing the
       same kernel, each one receives a different ID and consequently
       performs a different part of the computation. */
    const int idx = get_global_id(0);

    /* Each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is yes, the work-item performs the computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
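In practice the global work size is often rounded up to a multiple of the chosen work-group size, which is exactly why the idx < num guard is needed: the extra work-items must not write out of bounds. A host-side sketch of that rounding (the work-group size of 256 is an illustrative assumption):

size_t local_size = 256;  // chosen work-group size
size_t global_size = ((num + local_size - 1) / local_size) * local_size;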

Sample..
// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

// Number of elements in the vectors to be added
#define SIZE 2048

// Main function
// *********************************************************************
int main(int argc, char **argv)
{
    // Two integer source vectors in Host memory
    int HostVector1[SIZE], HostVector2[SIZE];

    // Initialize with some interesting repeating data
    for (int c = 0; c < SIZE; c++)
    {
        HostVector1[c] = InitialData1[c % 20];
        HostVector2[c] = InitialData2[c % 20];
    }

Sample..
    // Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
    cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                                    NULL, NULL, NULL);

    // Get the list of GPU devices associated with this context
    size_t ParmDataBytes;
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
    cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes,
                     GPUDevices, NULL);

    // Create a command-queue on the first GPU device
    cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
                                                            GPUDevices[0], 0, NULL);

    // Allocate GPU memory for source vectors AND initialize from CPU memory
    cl_mem GPUVector1 = clCreateBuffer(GPUContext,
                                       CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       sizeof(int) * SIZE, HostVector1, NULL);
    cl_mem GPUVector2 = clCreateBuffer(GPUContext,
                                       CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       sizeof(int) * SIZE, HostVector2, NULL);

    // Allocate output memory on GPU
    cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                            sizeof(int) * SIZE, NULL, NULL);

Sample..

    // Create OpenCL program with source code
    // (OpenCLSource is assumed to be defined earlier, on a slide not shown
    //  here, as an array of 7 strings holding the VectorAdd kernel source)
    cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
                                                         OpenCLSource, NULL, NULL);

    // Build the program (OpenCL JIT compilation)
    clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

    // Create a handle to the compiled OpenCL function (Kernel)
    cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);

    // In the next step we associate the GPU memory with the Kernel arguments
    // (this kernel takes the output vector first, then the two inputs)
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUOutputVector);
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);

Sample..
    // Launch the Kernel on the GPU
    size_t WorkSize[1] = {SIZE};  // one-dimensional range
    clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
                           WorkSize, NULL, 0, NULL, NULL);

    // Copy the output in GPU memory back to CPU memory
    int HostOutputVector[SIZE];
    clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
                        SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);

    // Cleanup
    free(GPUDevices);
    clReleaseKernel(OpenCLVectorAdd);
    clReleaseProgram(OpenCLProgram);
    clReleaseCommandQueue(GPUCommandQueue);
    clReleaseContext(GPUContext);
    clReleaseMemObject(GPUVector1);
    clReleaseMemObject(GPUVector2);
    clReleaseMemObject(GPUOutputVector);

Sample
    // Print out the results
    for (int Rows = 0; Rows < (SIZE / 20); Rows++, printf("\n"))
    {
        for (int c = 0; c < 20; c++)
        {
            printf("%c", (char)HostOutputVector[Rows * 20 + c]);
        }
    }

    return 0;
}
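To build and run the complete sample (an assumption about the environment, not from the slides), linking against the OpenCL runtime is enough on most Linux systems:

gcc vector_add.c -o vector_add -lOpenCL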

Thanks!
