uncomplicate.clojurecuda.core

Core ClojureCUDA functions for CUDA host programming. Kernels should be provided as strings (which may be stored in and read from files) or as binaries, written in CUDA C/C++.

Many examples are available in the ClojureCUDA core tests, which show both how to write CUDA kernels and how to load them.

For more advanced examples, please read the source code of the CUDA engine of the Neanderthal linear algebra library (which mainly uses general CUDA and cuBLAS) and of the Deep Diamond tensor and linear algebra library (which makes extensive use of cuDNN).

Here’s a categorized map of core functions. Most functions throw ExceptionInfo when the CUDA driver reports an error.

Please see CUDA Driver API for details not discussed in ClojureCUDA documentation.
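
For orientation, here is a minimal end-to-end sketch that uses only functions documented below. It assumes one CUDA-capable device at ordinal 0, a trivial increment kernel written inline, and 256 floats (1024 bytes) of data; real code should also release the allocated memory (for example with uncomplicate.commons.core/with-release).

(require '[uncomplicate.clojurecuda.core :refer :all])

(init)                                           ;; initialize the CUDA driver

(with-context (context (device 0))               ;; the context is released when the body exits
  (let [kernel-src "extern \"C\" __global__ void increment(int n, float* a) {
                      int i = blockIdx.x * blockDim.x + threadIdx.x;
                      if (i < n) a[i] = a[i] + 1.0f;
                    }"
        prog (compile! (program kernel-src))     ;; compile with NVRTC
        m (module prog)                          ;; load the compiled program into a module
        increment (function m "increment")       ;; look up the kernel function
        gpu-a (mem-alloc-runtime 1024)]          ;; 256 uninitialized floats on the device
    (memcpy-host! (float-array (range 256)) gpu-a)            ;; host -> device
    (launch! increment (grid-1d 256) (parameters 256 gpu-a))  ;; one thread per element
    (synchronize!)                                            ;; wait for the kernel to finish
    (memcpy-host! gpu-a (float-array 256))))                  ;; device -> host; returns the result array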

add-host-fn!

(add-host-fn! hstream f data)(add-host-fn! hstream f)

Adds host function f to a compute stream, with optional data related to the call. If data is not provided, hstream itself is used as data.

attach-mem!

(attach-mem! hstream mem byte-size flag)(attach-mem! mem byte-size flag)

Attaches memory mem of byte-size bytes to hstream asynchronously, as specified by flag. For available flags, see internal.constants/mem-attach-flags. The default is :single. If the :global flag is specified, the memory can be accessed by any stream on any device. If the :host flag is specified, the program guarantees that it won’t access the memory on the device from any stream on a device that has no concurrent-managed-access capability. If the :single flag is specified and hstream is associated with a device that has no concurrent-managed-access capability, the program guarantees that it will only access the memory on the device from hstream. It is illegal to attach singly to the nil stream, because the nil stream is a virtual global stream and not a specific stream; an error will be returned in this case.

When memory is associated with a single stream, the Unified Memory system will allow CPU access to this memory region so long as all operations in hstream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.

See CUDA Stream Management.

can-access-peer

(can-access-peer dev peer)

Queries if a device may directly access a peer device’s memory. See CUDA Peer Access Management

compile!

(compile! prog options)(compile! prog)

Compiles the given prog using a list of string options.
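
A small sketch of the compilation pipeline, using a placeholder kernel source; the NVRTC option in the comment is only an example of what the options vector may contain.

(def prog (program "extern \"C\" __global__ void noop() {}"))
(compile! prog)              ;; or, e.g., (compile! prog ["-default-device"])
(program-log prog)           ;; compiler diagnostics from this compilation
(ptx prog)                   ;; the generated PTX, which module/load! accepts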

context

(context dev flag)(context dev)

Creates a CUDA context on the device using a keyword flag. For available flags, see internal.constants/ctx-flags. The default is none. The context must be released after use.

See CUDA Context Management.

cuda-free!

(cuda-free! dptr)

Frees the runtime device memory that has been created by cuda-malloc. See CUDA Runtime API Memory Management

cuda-malloc

(cuda-malloc byte-size)(cuda-malloc byte-size type)

Returns a Pointer to byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system. The pointer is managed by the CUDA runtime API. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). This pointer has to be manually released by cuda-free!. For a more seamless experience, use the wrapper provided by the mem-alloc-runtime function. See CUDA Runtime API Memory Management
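
A sketch of manual allocation and release through the runtime API; for automatic management, prefer mem-alloc-runtime.

(def raw (cuda-malloc 1024 :float))   ;; FloatPointer over 256 uninitialized floats
;; ... use raw as a kernel parameter or as a memcpy! source/destination ...
(cuda-free! raw)                      ;; must be released manually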

current-context

(current-context)

Returns the CUDA context bound to the calling CPU thread. See CUDA Context Management.

current-context!

(current-context! ctx)

Binds the specified CUDA context ctx to the calling CPU thread. See CUDA Context Management.

default-stream

device

(device id)(device)

Returns a device specified with its ordinal number id or string PCI Bus id. See CUDA Device Management.

device-count

(device-count)

Returns the number of CUDA devices on the system. See CUDA Device Management.
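
A sketch that enumerates the available devices (init must have been called first).

(init)
(device-count)                        ;; => the number of CUDA devices
(map device (range (device-count)))   ;; device objects, by ordinal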

disable-peer-access!

(disable-peer-access! ctx)(disable-peer-access!)

Disables direct access to memory allocations in a peer context and unregisters any registered allocations. See CUDA Peer Access Management

elapsed-time!

(elapsed-time! start-event end-event)

Computes the elapsed time in milliseconds between start-event and end-event. See CUDA Event Management
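
A sketch of event-based timing on the default stream; do-gpu-work is a hypothetical placeholder for any GPU work, such as a launch!.

(let [start (event)
      end (event)]
  (record! start)
  (do-gpu-work)               ;; hypothetical: kernel launches, copies, etc.
  (record! end)
  (synchronize!)              ;; make sure both events have completed
  (elapsed-time! start end))  ;; => milliseconds between the two events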

enable-peer-access!

(enable-peer-access! ctx)(enable-peer-access!)

Enables direct access to memory allocations in a peer context. See CUDA Peer Access Management
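
A sketch of checking and enabling peer access, assuming at least two devices; the contexts here are created ad hoc and should be released in real code.

(when (can-access-peer (device 0) (device 1))
  (let [peer-ctx (context (device 1))]
    (in-context (context (device 0))
      (enable-peer-access! peer-ctx))))   ;; grant the current context access to peer-ctx's allocations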

event

(event)(event flag & flags)

Creates an event specified by keyword flags. For available flags, see internal.constants/event-flags. See CUDA Event Management

function

(function m name)

Returns CUDA kernel function named name located in module m. See CUDA Module Management

global

(global m name)

Returns CUDA global device memory object named name from module m. Global memory is typically defined in C++ source files of CUDA kernels. See CUDA Module Management
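
A sketch of reading a module-level symbol back to the host; my-module is assumed to exist, and counter is a hypothetical __device__ int assumed to be defined in the module’s source.

(def counter (global my-module "counter"))
(memcpy-host! counter (int-array 1))   ;; copy its current value into a host array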

grid-1d

(grid-1d dim-x)(grid-1d dim-x block-x)

Creates a 1-dimensional GridDim record with grid and block dimensions x. Note: dim-x is the total number of threads globally, not the number of blocks.

grid-2d

(grid-2d dim-x dim-y)(grid-2d dim-x dim-y block-x block-y)

Creates a 2-dimensional GridDim record with grid and block dimensions x and y. Note: dim-x is the total number of threads globally, not the number of blocks.

grid-3d

(grid-3d dim-x dim-y dim-z)(grid-3d dim-x dim-y dim-z block-x block-y block-z)

Creates a 3-dimensional GridDim record with grid and block dimensions x, y, and z. Note: dim-x is the total number of threads globally, not the number of blocks.
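
A few sketches of launch configurations; note again that the dim arguments count threads, not blocks.

(grid-1d 10000)             ;; 10000 threads in 1D, using the default block size
(grid-1d 10000 256)         ;; 10000 threads in blocks of 256 (40 blocks, the last partially idle)
(grid-2d 1024 1024 16 16)   ;; a 1024x1024 thread grid in 16x16 blocks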

in-context

macro

(in-context ctx & body)

Pushes the context ctx to the top of the context stack, evaluates the body with ctx as the current context, and pops the context from the stack. Does NOT release the context, unlike with-context. See CUDA Context Management.

init

(init)

Initializes the CUDA driver. This function must be called before any other function from ClojureCUDA in the current process. See CUDA Initialization

launch!

(launch! fun grid-dim shared-mem-bytes hstream params)(launch! fun grid-dim hstream params)(launch! fun grid-dim params)

Invokes the kernel fun on a grid-dim grid of blocks, using the params PointerPointer. Optionally, you can specify the amount of shared memory that will be available to each thread block, and an hstream to use for execution. See CUDA Module Management
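
A sketch of an asynchronous launch with dynamic shared memory on a custom stream; my-kernel, n, and gpu-a are assumed to exist, and the kernel is assumed to declare extern __shared__ storage.

(def hstream (stream))
(launch! my-kernel (grid-1d n) (* 256 Float/BYTES) hstream (parameters n gpu-a))
(synchronize! hstream)      ;; block until the asynchronous launch completes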

listen!

(listen! hstream ch data)(listen! hstream ch)

Adds a host function listener to a compute stream, with optional data related to the call, and connects it to a Clojure channel ch. If data is not provided, hstream itself is used as data.

load!

(load! m data)

Loads module m’s data from a PTX string, nvrtc program, java path, or binary data. Please see the relevant examples from the test folder. See CUDA Module Management

mem-alloc-driver

(mem-alloc-driver byte-size flag)(mem-alloc-driver byte-size)

Allocates the byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system, specified by a keyword flag. For available flags, see internal.constants/mem-attach-flags. Returns a CUDA device memory object, which can NOT be extracted as a Pointer, but can be accessed directly through its address in the device memory. See CUDA Driver API Memory Management

mem-alloc-mapped

(mem-alloc-mapped byte-size)(mem-alloc-mapped byte-size type)

Allocates byte-size bytes of uninitialized host memory, ‘mapped’ to the device. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). Mapped memory is optimized for the memcpy! operation, while ‘pinned’ memory is optimized for memcpy-host!. See CUDA Driver API Memory Management

mem-alloc-pinned

(mem-alloc-pinned byte-size)(mem-alloc-pinned byte-size type-or-flags)(mem-alloc-pinned byte-size type flags)

Allocates byte-size bytes of uninitialized page-locked memory, ‘pinned’ on the host, using keyword flags. For available flags, see internal.constants/mem-host-alloc-flags; the default is :none. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). Pinned memory is optimized for the memcpy-host! function, while ‘mapped’ memory is optimized for memcpy!. See CUDA Driver API Memory Management
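
A sketch of pinned host memory used for host/device transfers; gpu-a is assumed to be a device buffer of at least 1024 bytes.

(def pinned (mem-alloc-pinned 1024 :float))
(memcpy-host! pinned gpu-a)   ;; host (pinned) -> device
(memcpy-host! gpu-a pinned)   ;; device -> host (pinned)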

mem-alloc-runtime

(mem-alloc-runtime byte-size type)(mem-alloc-runtime byte-size)

Allocates byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system. Returns a CUDA device memory object managed by the CUDA runtime API, which can be extracted as a Pointer. An equivalent unwrapped Pointer can be created by cuda-malloc. See CUDA Runtime API Memory Management

mem-register-pinned!

(mem-register-pinned! memory flags)(mem-register-pinned! memory)

Registers a previously instantiated host pointer as memory ‘pinned’ for the device, using keyword flags. For available flags, see internal.constants/mem-host-register-flags; the default is :none. Returns a pinned object equivalent to the one created by mem-alloc-pinned. Pinned memory is optimized for the memcpy-host! function, while ‘mapped’ memory is optimized for memcpy!. See CUDA Driver API Memory Management

mem-sub-region

(mem-sub-region mem origin byte-count)(mem-sub-region mem origin)

Creates a CUDA device memory object that references a sub-region of mem, starting at origin and spanning byte-count bytes (or the maximum available size).
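
A sketch that creates a view over the second half of a 1024-byte device buffer; gpu-a is assumed to exist. Writes through the sub-region go to the corresponding part of the parent buffer.

(def second-half (mem-sub-region gpu-a 512 512))
(memset! second-half (float 0))   ;; zero only the second half of gpu-a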

memcpy!

(memcpy! src dst)(memcpy! src dst byte-count-or-stream)(memcpy! src dst byte-count hstream)

Copies byte-count bytes (or the maximum available amount) of device memory from src to dst. If hstream is provided, executes asynchronously. See CUDA Memory Management
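
A sketch of device-to-device copies; gpu-a, gpu-b, and hstream are assumed to exist and to be large enough.

(memcpy! gpu-a gpu-b)               ;; copy as much as fits
(memcpy! gpu-a gpu-b 512)           ;; copy only the first 512 bytes
(memcpy! gpu-a gpu-b 512 hstream)   ;; the same, but asynchronously on hstream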

memcpy-host!

(memcpy-host! src dst byte-count hstream)(memcpy-host! src dst count-or-stream)(memcpy-host! src dst)

Copies byte-count bytes (or all available memory) from src to dst, at least one of which has to be accessible from the host. If hstream is provided, executes asynchronously. A polymorphic function that determines the appropriate copy routine automatically. Supports everything except pointers created by cuda-malloc. See CUDA Memory Management

memcpy-to-device!

(memcpy-to-device! src dst byte-count hstream)(memcpy-to-device! src dst count-or-stream)(memcpy-to-device! src dst)

Copies byte-count bytes (or all available memory) from host src to device dst. Useful when src or dst is a generic pointer for which it cannot be determined whether it manages memory on the host or on the device (see cuda-malloc). If hstream is provided, executes asynchronously. See CUDA Memory Management

memcpy-to-host!

(memcpy-to-host! src dst byte-count hstream)(memcpy-to-host! src dst count-or-stream)(memcpy-to-host! src dst)

Copies byte-count bytes (or the maximum available amount) from device src to host dst. Useful when src or dst is a generic pointer for which it cannot be determined whether it manages memory on the host or on the device (see cuda-malloc). If hstream is provided, executes asynchronously. See CUDA Memory Management

memset!

(memset! dptr value)(memset! dptr value n-or-hstream)(memset! dptr value n hstream)

Sets n elements, or the entire dptr memory, to value. Supports all Java primitive number types except double; long values larger than Integer/MAX_VALUE are not supported. If hstream is provided, executes asynchronously. See CUDA Memory Management
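
A sketch that zeroes a float device buffer; gpu-a is assumed to hold at least 256 floats, and hstream to be an existing stream.

(memset! gpu-a (float 0) 256)           ;; set the first 256 float elements to 0.0
(memset! gpu-a (float 0) 256 hstream)   ;; the same, asynchronously on hstream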

module

(module)(module data)

Creates a new CUDA module and loads a string, nvrtc program, or binary data. See CUDA Module Management

p2p-attribute

(p2p-attribute dev peer attribute)

Queries attributes of the link between two devices. See CUDA Peer Access Management

parameters

(parameters parameter & parameters)

Creates a PointerPointer of CUDA kernel parameters. Each parameter can be any object on the device (Driver API memory, Runtime API memory, JavaCPP pointers) or on the host (arrays, numbers, JavaCPP pointers) that makes sense as a kernel parameter per the CUDA specification. Use the result as the params argument in launch!.

pop-context!

(pop-context!)

Pops the current CUDA context from the current CPU thread. See CUDA Context Management.

program

(program name source-code headers)(program source-code headers)(program source-code)

Creates a CUDA program from the source-code, with an optional name and an optional hash map of headers (as strings) and their names.

program-log

(program-log prog)

Returns the log string generated by the previous compilation of prog.

ptx

(ptx prog)

Returns the PTX generated by the previous compilation of prog.

push-context!

(push-context! ctx)

Pushes a context ctx on the current CPU thread. See CUDA Context Management.

ready?

(ready? obj)

Determines status (ready or not) of a compute stream or event obj. See CUDA Stream Management and CUDA Event Management

record!

(record! stream event)(record! event)

Records event on an optional stream. See CUDA Event Management

set-parameter!

(set-parameter! pp i parameter & parameters)

Sets the i-th parameter in the parameter array pp, and places the rest of the parameters at the positions after i.
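
A sketch that reuses one parameter array across two launches, assuming 0-based indexing and that my-kernel, n, gpu-a, and gpu-b exist.

(def params (parameters n gpu-a))
(launch! my-kernel (grid-1d n) params)
(set-parameter! params 1 gpu-b)          ;; replace the buffer at index 1; n stays at index 0
(launch! my-kernel (grid-1d n) params)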

stream

(stream)(stream flag)(stream priority flag)

Creates a stream using an optional integer priority and a keyword flag. For available flags, see internal.constants/stream-flags. See CUDA Stream Management
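
A sketch of creating a stream and using it for an asynchronous copy; :non-blocking is assumed to be one of the internal.constants/stream-flags keywords, and host-data and gpu-a to be existing buffers.

(def hstream (stream :non-blocking))
(memcpy-host! host-data gpu-a 1024 hstream)   ;; enqueue the copy on hstream
(synchronize! hstream)                        ;; block until hstream drains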

synchronize!

(synchronize!)(synchronize! hstream)

Blocks the current thread until the context’s or hstream’s tasks complete.

wait-event!

(wait-event! hstream ev)

Makes a compute stream hstream wait on an event ev. See CUDA Event Management

with-context

macro

(with-context ctx & body)

Pushes the context ctx to the top of the context stack, evaluates the body, and pops the context from the stack. Releases the context, unlike in-context. See CUDA Context Management.
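
A sketch of scoped context usage; with-release comes from uncomplicate.commons.core and releases the device memory when the body exits, just as with-context releases the context.

(require '[uncomplicate.commons.core :refer [with-release]])

(init)
(with-context (context (device 0))
  (with-release [gpu-a (mem-alloc-driver 1024)]
    (memset! gpu-a (float 1) 256)   ;; set 256 float elements to 1.0
    (synchronize!)))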

with-default

macro

(with-default & body)

Initializes CUDA, creates the default context and executes the body in it. See CUDA Context Management.