uncomplicate.clojurecuda.core
Core ClojureCUDA functions for CUDA host programming. The kernels should be provided as strings (that may be stored and read from files) or binaries, written in CUDA C/C++.
Many examples are available in the ClojureCUDA core test suite, which shows both how to write CUDA kernels and how to load and run them from Clojure.
For more advanced examples, please read the source code of the CUDA engine of Neanderthal linear algebra library (mainly general CUDA and cuBLAS are used there), and the Deep Diamond tensor and linear algebra library (for extensive use of cuDNN).
Here’s a categorized map of core functions. Most functions throw ExceptionInfo in case of errors reported by the CUDA driver.
- Device management: init, device-count, device.
- Context management: context, current-context, current-context!, pop-context!, push-context!, in-context, with-context, with-default.
- Memory management: memcpy!, memcpy-host!, memcpy-to-host!, memcpy-to-device!, memset!, mem-sub-region, mem-alloc-driver, mem-alloc-runtime, cuda-malloc, cuda-free!, mem-alloc-pinned, mem-register-pinned!, mem-alloc-mapped.
- Module management: link, link-complete!, load!, module.
- Execution control: grid-1d, grid-2d, grid-3d, global, set-parameter!, parameters, function, launch!.
- Stream management: stream, default-stream, ready?, synchronize!, add-host-fn!, listen!, wait-event!, attach-mem!.
- Event management: event, elapsed-time!, record!.
- Peer access management: can-access-peer, p2p-attribute, disable-peer-access!, enable-peer-access!.
- NVRTC program JIT: program, program-log, compile!, ptx.
Please see CUDA Driver API for details not discussed in ClojureCUDA documentation.
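Before the individual entries, here is a minimal end-to-end sketch of the typical workflow (initialize, compile, load, launch, copy back). The kernel, names, and sizes are illustrative, and with-release comes from uncomplicate.commons.core:

```clojure
(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.clojurecuda.core :refer :all])

;; A trivial kernel, provided as a CUDA C string.
(def kernel-source
  "extern \"C\" __global__ void increment(int n, float *a) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) a[i] = a[i] + 1.0f;
   }")

(init)         ;; initialize the CUDA driver first
(with-default  ;; default device and context for the body
  (with-release [prog  (compile! (program kernel-source))
                 m     (module prog)
                 incr  (function m "increment")
                 gpu-a (mem-alloc-runtime (* 256 Float/BYTES))]
    (memcpy-host! (float-array (range 256)) gpu-a)   ;; host -> device
    (launch! incr (grid-1d 256) (parameters 256 gpu-a))
    (memcpy-host! gpu-a (float-array 256))))         ;; device -> host
```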
add-host-fn!
(add-host-fn! hstream f data)
(add-host-fn! hstream f)
Adds host function f to a compute stream hstream, with optional data related to the call. If data is not provided, places hstream under data.
attach-mem!
(attach-mem! hstream mem byte-size flag)
(attach-mem! mem byte-size flag)
Attaches memory mem of byte-size bytes to hstream asynchronously, as specified by flag. For available flags, see internal.constants/mem-attach-flags. The default is :single. If the :global flag is specified, the memory can be accessed by any stream on any device. If the :host flag is specified, the program guarantees that it won’t access the memory on the device from any stream on a device that has no concurrent-managed-access capability. If the :single flag is specified and hstream is associated with a device that has no concurrent-managed-access capability, the program guarantees that it will only access the memory on the device from hstream. It is illegal to attach singly to the nil stream, because the nil stream is a virtual global stream and not a specific stream; an error is returned in this case.
When memory is associated with a single stream, the Unified Memory system will allow CPU access to this memory region so long as all operations in hstream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.
can-access-peer
(can-access-peer dev peer)
Queries if a device may directly access a peer device’s memory. See CUDA Peer Access Management
compile!
(compile! prog options)
(compile! prog)
Compiles the given prog using a list of string options.
context
(context dev flag)
(context dev)
Creates a CUDA context on the device dev using a keyword flag. For available flags, see internal.constants/ctx-flags. The default is none. The context must be released after use.
cuda-free!
(cuda-free! dptr)
Frees the runtime device memory that has been created by cuda-malloc. See CUDA Runtime API Memory Management
cuda-malloc
(cuda-malloc byte-size)
(cuda-malloc byte-size type)
Returns a Pointer to byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system. The pointer is managed by the CUDA runtime API. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). This pointer has to be manually released by cuda-free!. For a more seamless experience, use the wrapper provided by the mem-alloc-runtime function. See CUDA Runtime API Memory Management.
current-context
(current-context)
Returns the CUDA context bound to the calling CPU thread. See CUDA Context Management.
current-context!
(current-context! ctx)
Binds the specified CUDA context ctx to the calling CPU thread. See CUDA Context Management.
device
(device id)
(device)
Returns a device specified with its ordinal number id or string PCI Bus id. See CUDA Device Management.
device-count
(device-count)
Returns the number of CUDA devices on the system. See CUDA Device Management.
disable-peer-access!
(disable-peer-access! ctx)
(disable-peer-access!)
Disables direct access to memory allocations in a peer context and unregisters any registered allocations. See CUDA Peer Access Management
elapsed-time!
(elapsed-time! start-event end-event)
Computes the elapsed time in milliseconds between start-event and end-event. See CUDA Event Management.
enable-peer-access!
(enable-peer-access! ctx)
(enable-peer-access!)
Enables direct access to memory allocations in a peer context. See CUDA Peer Access Management.
event
(event)
(event flag & flags)
Creates an event specified by keyword flags. For available flags, see internal.constants/event-flags. See CUDA Event Management.
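As an illustrative sketch, events can bracket a kernel launch to time it (fun, n, and gpu-x are assumed to exist already; with-release comes from uncomplicate.commons.core):

```clojure
;; Time a kernel launch with two events; elapsed-time! returns milliseconds.
(with-release [start (event)
               stop  (event)]
  (record! start)
  (launch! fun (grid-1d n) (parameters n gpu-x))
  (record! stop)
  (synchronize!)              ;; make sure stop has actually been reached
  (elapsed-time! start stop))
```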
function
(function m name)
Returns the CUDA kernel function named name located in module m. See CUDA Module Management.
global
(global m name)
Returns the CUDA global device memory object named name from module m. Global memory is typically defined in C++ source files of CUDA kernels. See CUDA Module Management.
grid-1d
(grid-1d dim-x)
(grid-1d dim-x block-x)
Creates a 1-dimensional GridDim record with grid and block dimensions x. Note: dim-x is the total number of threads globally, not the number of blocks.
grid-2d
(grid-2d dim-x dim-y)
(grid-2d dim-x dim-y block-x block-y)
Creates a 2-dimensional GridDim record with grid and block dimensions x and y. Note: dim-x is the total number of threads globally, not the number of blocks.
grid-3d
(grid-3d dim-x dim-y dim-z)
(grid-3d dim-x dim-y dim-z block-x block-y block-z)
Creates a 3-dimensional GridDim record with grid and block dimensions x, y, and z. Note: dim-x is the total number of threads globally, not the number of blocks.
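For example, to cover n global threads with an explicit block size (a sketch; without the second argument, grid-1d falls back to a library default block size):

```clojure
;; 10000 global threads in blocks of 256 => ceil(10000/256) = 40 blocks.
(def dims (grid-1d 10000 256))
;; Pass dims as the grid-dim argument of launch!, e.g.:
;; (launch! my-kernel dims (parameters 10000 gpu-a))
```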
in-context
macro
(in-context ctx & body)
Pushes the context ctx to the top of the context stack, evaluates the body with ctx as the current context, and pops the context from the stack. Does NOT release the context, unlike with-context. See CUDA Context Management.
init
(init)
Initializes the CUDA driver. This function must be called before any other function from ClojureCUDA in the current process. See CUDA Initialization
launch!
(launch! fun grid-dim shared-mem-bytes hstream params)
(launch! fun grid-dim hstream params)
(launch! fun grid-dim params)
Invokes the kernel fun on a grid-dim grid of blocks, using the params PointerPointer (see parameters). Optionally, you can specify the amount of shared memory that will be available to each thread block, and an hstream to use for execution. See CUDA Module Management.
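A sketch of the fullest arity, assuming fun was obtained with function, gpu-x holds n floats, and hstream is an existing stream:

```clojure
;; n threads in blocks of 256, 1 KiB of dynamic shared memory per block,
;; launched asynchronously on hstream.
(launch! fun (grid-1d n 256) 1024 hstream (parameters n gpu-x))
(synchronize! hstream)  ;; block until the kernel has finished
```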
link
(link data options)
(link data)
(link)
Invokes the CUDA linker on data provided as a vector [[type source <options> <name>], ...]
. Produces a cubin compiled for a particular Nvidia architecture. Please see relevant examples from the test folder. See CUDA Module Management
link-complete!
(link-complete! link-state)
Completes the linking of link-state and returns the resulting device code (cubin). See CUDA Module Management.
listen!
(listen! hstream ch data)
(listen! hstream ch)
Adds a host function listener to a compute stream hstream, with optional data related to the call, and connects it to a Clojure channel ch. If data is not provided, places hstream under data.
load!
(load! m data)
Loads module m’s data from a PTX string, nvrtc program, java path, or binary data. Please see relevant examples from the test folder. See CUDA Module Management.
mem-alloc-driver
(mem-alloc-driver byte-size flag)
(mem-alloc-driver byte-size)
Allocates byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system, as specified by a keyword flag. For available flags, see internal.constants/mem-attach-flags. Returns a CUDA device memory object, which can NOT be extracted as a Pointer, but can be accessed directly through its address in the device memory. See CUDA Driver API Memory Management.
mem-alloc-mapped
(mem-alloc-mapped byte-size)
(mem-alloc-mapped byte-size type)
Allocates byte-size bytes of uninitialized host memory, ‘mapped’ to the device. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). Mapped memory is optimized for the memcpy! operation, while ‘pinned’ memory is optimized for memcpy-host!. See CUDA Driver API Memory Management.
mem-alloc-pinned
(mem-alloc-pinned byte-size)
(mem-alloc-pinned byte-size type-or-flags)
(mem-alloc-pinned byte-size type flags)
Allocates byte-size bytes of uninitialized page-locked memory, ‘pinned’ on the host, using keyword flags. For available flags, see internal.constants/mem-host-alloc-flags; the default is :none. Optionally, accepts a type of the pointer as a keyword (:float or Float/TYPE for FloatPointer, etc.). Pinned memory is optimized for the memcpy-host! function, while ‘mapped’ memory is optimized for memcpy!. See CUDA Driver API Memory Management.
mem-alloc-runtime
(mem-alloc-runtime byte-size type)
(mem-alloc-runtime byte-size)
Allocates byte-size bytes of uninitialized memory that will be automatically managed by the Unified Memory system. Returns a CUDA device memory object managed by the CUDA runtime API, which can be extracted as a Pointer. An equivalent unwrapped Pointer can be created by cuda-malloc. See CUDA Runtime API Memory Management.
mem-register-pinned!
(mem-register-pinned! memory flags)
(mem-register-pinned! memory)
Registers a previously instantiated host pointer, ‘pinned’ from the device, using keyword flags. For available flags, see internal.constants/mem-host-register-flags; the default is :none. Returns a pinned object equivalent to the one created by mem-alloc-pinned. Pinned memory is optimized for the memcpy-host! function, while ‘mapped’ memory is optimized for memcpy!. See CUDA Driver API Memory Management.
mem-sub-region
(mem-sub-region mem origin byte-count)
(mem-sub-region mem origin)
Creates a CUDA device memory object that references a sub-region of mem, starting at origin and spanning byte-count bytes, or the maximum available byte size.
memcpy!
(memcpy! src dst)
(memcpy! src dst byte-count-or-stream)
(memcpy! src dst byte-count hstream)
Copies byte-count bytes, or the maximum available device memory, from src to dst. If hstream is provided, executes asynchronously. See CUDA Memory Management.
memcpy-host!
(memcpy-host! src dst byte-count hstream)
(memcpy-host! src dst count-or-stream)
(memcpy-host! src dst)
Copies byte-count bytes, or all possible memory, from src to dst, one of which has to be accessible from the host. If hstream is provided, executes asynchronously. A polymorphic function that figures out what needs to be done. Supports everything except pointers created by cuda-malloc. See CUDA Memory Management.
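For instance, a round trip through device memory might look like this (a sketch; the sizes are illustrative, and with-release comes from uncomplicate.commons.core):

```clojure
(with-release [gpu-a (mem-alloc-driver (* 4 Float/BYTES))]
  ;; host float array -> device memory
  (memcpy-host! (float-array [1 2 3 4]) gpu-a)
  ;; device memory -> fresh host float array
  (memcpy-host! gpu-a (float-array 4)))
```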
memcpy-to-device!
(memcpy-to-device! src dst byte-count hstream)
(memcpy-to-device! src dst count-or-stream)
(memcpy-to-device! src dst)
Copies byte-count bytes, or all possible memory, from host src to device dst. Useful when src or dst is a generic pointer for which it cannot be determined whether it manages memory on the host or on the device (see cuda-malloc). If hstream is provided, executes asynchronously. See CUDA Memory Management.
memcpy-to-host!
(memcpy-to-host! src dst byte-count hstream)
(memcpy-to-host! src dst count-or-stream)
(memcpy-to-host! src dst)
Copies byte-count bytes, or the maximum available memory, from device src to host dst. Useful when src or dst is a generic pointer for which it cannot be determined whether it manages memory on the host or on the device (see cuda-malloc). If hstream is provided, executes asynchronously. See CUDA Memory Management.
memset!
(memset! dptr value)
(memset! dptr value n-or-hstream)
(memset! dptr value n hstream)
Sets n elements, or all segments, of dptr memory to value (supports all Java primitive number types except double, and long with value larger than Integer/MAX_VALUE). If hstream is provided, executes asynchronously. See CUDA Memory Management.
module
(module)
(module data)
Creates a new CUDA module and loads a string, nvrtc program, or binary data. See CUDA Module Management.
p2p-attribute
(p2p-attribute dev peer attribute)
Queries attributes of the link between two devices. See CUDA Peer Access Management
parameters
(parameters parameter & parameters)
Creates a PointerPointer of CUDA parameters. A parameter can be any object on the device (Driver API memory, Runtime API memory, JavaCPP pointers) or the host (arrays, numbers, JavaCPP pointers) that makes sense as a kernel parameter per the CUDA specification. Use the result as the params argument in launch!.
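As a sketch, primitive numbers and device memory objects can be mixed freely (my-kernel, n, gpu-a, and gpu-b are assumed to exist):

```clojure
;; n becomes a scalar kernel argument; gpu-a and gpu-b are device memory.
(launch! my-kernel (grid-1d n) (parameters (int n) gpu-a gpu-b))
```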
pop-context!
(pop-context!)
Pops the current CUDA context from the current CPU thread. See CUDA Context Management.
program
(program name source-code headers)
(program source-code headers)
(program source-code)
Creates a CUDA program from the source-code, with an optional name and an optional hash map of headers (as strings) and their names.
program-log
(program-log prog)
Returns the log string generated by the previous compilation of prog
.
push-context!
(push-context! ctx)
Pushes a context ctx
on the current CPU thread. See CUDA Context Management.
ready?
(ready? obj)
Determines status (ready or not) of a compute stream or event obj
. See CUDA Stream Management and CUDA Event Management
record!
(record! stream event)
(record! event)
Records the event event on an optional stream. See CUDA Event Management.
set-parameter!
(set-parameter! pp i parameter & parameters)
Sets the i-th parameter in a parameter array pp, and the rest of parameters in the places after i.
stream
(stream)
(stream flag)
(stream priority flag)
Creates a stream using an optional integer priority and a keyword flag. For available flags, see internal.constants/stream-flags. See CUDA Stream Management.
synchronize!
(synchronize!)
(synchronize! hstream)
Blocks the current thread until the context’s or hstream’s tasks complete.
wait-event!
(wait-event! hstream ev)
Makes a compute stream hstream wait on an event ev. See CUDA Event Management.
with-context
macro
(with-context ctx & body)
Pushes the context ctx to the top of the context stack, evaluates the body, and pops the context from the stack. Releases the context, unlike in-context. See CUDA Context Management.
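A sketch of typical use; the context created inline is current for the body and released when the body finishes (with-release comes from uncomplicate.commons.core):

```clojure
(init)
(with-context (context (device 0))
  ;; all CUDA calls in the body run against this context
  (with-release [gpu (mem-alloc-driver 1024)]
    (memset! gpu 0)))
```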
with-default
macro
(with-default & body)
Initializes CUDA, creates the default context and executes the body in it. See CUDA Context Management.