ILGPU

Getting Started

Create a new C# or VB.Net project targeting .Net Framework 4.6 (or higher) or .Net Core 2.0 and install the required ILGPU NuGet package. We recommend disabling the "Prefer 32bit" option in the application build-settings panel. This ensures that the application typically runs in native-OS mode (e.g. 64bit on a 64bit OS). Several functions are available only in native-OS mode since they require direct interaction with the graphics-driver API.

While GPU programming can be done using only the ILGPU package, we recommend using the ILGPU.Lightning library, which provides useful functions such as scan, reduce, and sort.

Note that all available samples can be found in the GitHub repository.


// Create new C# or VB.Net project with .Net 4.6 (or higher) or .Net Core 2.0.
nuget install ILGPU

// Optional: install the ILGPU.Lightning library
nuget install ILGPU.Lightning

Kernels

Kernels are static functions that can work on value types and can invoke other functions that work on value types. Class (reference) types are currently not supported in the scope of kernels. Note that exception handling, boxing and recursive programs are also not supported and will be rejected by the ILGPU compiler. The type of the first parameter must always be a supported index type. The other parameters are uniform constants that are passed from the CPU to the GPU via constant memory. All parameter types must be value types and must not be passed by reference (e.g. via out or ref keywords in C#).

Since memory buffers are classes that are allocated and disposed on the CPU, they cannot be passed directly to kernels. However, you can pass array views to these buffers by value as kernel arguments (see Array Views).

Note that you must not pass pointers to non-accessible memory regions since these are also passed by value.


class ...
{
    static void Kernel(
        [IndexType] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Index Types

Index types implement (the often required) index computations and hide them from the user.

The pre-defined index types

Index
A simple 1D index of type int.
Index2
A simple 2D index consisting of two ints.
Index3
A simple 3D index consisting of three ints.
GroupedIndex
An index type that differentiates between global grid and group indices in 1D.
GroupedIndex2
An index type that differentiates between global grid and group indices in 2D.
GroupedIndex3
An index type that differentiates between global grid and group indices in 3D.

A Grouped Index

GridIdx
The grid index (of the corresponding index type) in the scope of the dispatched grid.
GroupIdx
The thread index (of the corresponding index type) in the scope of the current execution group.

Hint: use index.ComputeGlobalIndex() to compute a global ungrouped index for accessing global memory.

using ILGPU;

...
Index i1 = 42;
Index2 i2 = new Index2(1, 2);
Index3 i3 = new Index3(1, 2, 3);

GroupedIndex gi1 = new GroupedIndex(i1, 23);
GroupedIndex2 gi2 = new GroupedIndex2(i2, new Index2(3, 4));
GroupedIndex3 gi3 = new GroupedIndex3(new Index3(4, 5, 6), i3);

i1 = gi1.ComputeGlobalIndex();
i2 = gi2.ComputeGlobalIndex();
i3 = gi3.ComputeGlobalIndex();

var size1 = i1.Size; // i1.X;
var size2 = i2.Size; // i2.X * i2.Y;
var size3 = i3.Size; // i3.X * i3.Y * i3.Z;

Implicitly Grouped Kernels

Implicitly grouped kernels allow very convenient high-level kernel programming. They can be launched with automatically configured group sizes (that are determined by ILGPU) or manually defined group sizes.

Such kernels must not use shared memory, group or warp functionality since there is no guaranteed group size or thread participation inside a warp. The details of the kernel invocation are hidden from the programmer and managed by ILGPU. There is no way to access or manipulate the low-level peculiarities from the user's point of view. Use explicitly grouped kernels for full control over GPU-kernel dispatching.


class ...
{
    static void ImplicitlyGrouped_Kernel(
        [Index|Index2|Index3] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Explicitly Grouped Kernels

Explicitly grouped kernels offer the full kernel-programming power and behave similarly to Cuda kernels. These kernels receive grouped index types as their first parameter, which reflect the grid and group dimensions. Moreover, these kernels offer access to shared memory, Group and other Warp-specific intrinsics. However, the kernel-dispatch dimensions have to be managed manually.


class ...
{
    static void ExplicitlyGrouped_Kernel(
        [GroupedIndex|GroupedIndex2|GroupedIndex3] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Simple Kernels

The kernel MyKernel below is a simple kernel that works on float values. Note that it relies on the high-level functionality of implicitly grouped kernels to avoid custom grouping and custom bounds checks (assuming the dispatched kernel dimension is equal to the minimum of the lengths of the array views a, b and c). In contrast to this high-level kernel, MyGroupedKernel realizes the same functionality with the help of explicitly grouped kernels. Note that the bounds check is required in general, since we cannot ensure at this point that the views a, b and c have the required dimensions, i.e. a length that is a multiple of the dispatched group size. If we know that these views will always have the right dimensions, we can remove the bounds check.

Note that in Debug mode, every access to an ArrayView is bounds-checked. Hence, the kernel versions MyKernel and MyGroupedKernelAssert automatically rely on assertions in Debug mode. However, an out-of-bounds access in release mode causes undefined program behavior.


class ...
{
    static void MyKernel(
        Index idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        a[idx] = b[idx] * c[idx] + d;
    }

    static void MyGroupedKernel(
        GroupedIndex idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        var globalIdx = idx.ComputeGlobalIndex();
        if (globalIdx >= a.Length)
            return;
        a[globalIdx] = b[globalIdx] * c[globalIdx] + d;
    }

    static void MyGroupedKernelAssert(
        GroupedIndex idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        // In Debug mode, assertions perform the bounds check here
        var globalIdx = idx.ComputeGlobalIndex();
        a[globalIdx] = b[globalIdx] * c[globalIdx] + d;
    }
}

TLDR - Quick Start

Create a new ILGPU Context instance that initializes ILGPU. Create Accelerator instances that target specific hardware devices. Compile and load the desired kernels and launch them with allocated chunks of memory. Retrieve the data and you're done :)

Refer to the related ILGPU sample for additional insights.


class ...
{
    static void MyKernel(
        Index index, // The global thread index (1D in this case)
        ArrayView<int> dataView, // A view to a chunk of memory (1D in this case)
        int constant) // A sample uniform constant
    {
        dataView[index] = index + constant;
    }

    public static void Main(string[] args)
    {
        // Create the required ILGPU context
        using (var context = new Context())
        {
            using (var accelerator = new CPUAccelerator(context))
            {
                // accelerator.LoadAutoGroupedStreamKernel creates a typed launcher
                // that implicitly uses the default accelerator stream.
                // In order to create a launcher that receives a custom accelerator stream
                // use: accelerator.LoadAutoGroupedKernel<Index, ArrayView<int>, int>(...)
                var myKernel = accelerator.LoadAutoGroupedStreamKernel<Index, ArrayView<int>, int>(MyKernel);

                // Allocate some memory
                using (var buffer = accelerator.Allocate<int>(1024))
                {
                    // Launch buffer.Length many threads and pass a view to buffer
                    myKernel(buffer.Length, buffer.View, 42);

                    // Wait for the kernel to finish...
                    accelerator.Synchronize();

                    // Resolve data
                    var data = buffer.GetAsArray();
                    // ...
                }
            }
        }
    }
}


ILGPU Context

All ILGPU classes and functions rely on the global ILGPU Context. Instances of classes that require a context reference have to be disposed before disposing of the main context. Note that all operations on a context and its children are not thread-safe.

The ILGPU.Lightning library provides many useful functions to simplify GPU programming. However, they also require a valid ILGPU Context to work.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            // ILGPU functionality
            // Dispose all other classes before disposing the ILGPU context
        }
    }
}

Math Functions

The default math functions in .Net are realized with static methods from the Math class. However, many operations work on doubles by default (like Math.Sin) and there is often no float overload. This causes many floating-point operations to be performed on 64bit floats, even when this precision is not required. ILGPU offers the GPUMath class that includes 32bit-float overloads for all math functions. Invoking these functions ensures that the operations are performed on 32bit floats on the GPU hardware.

Fast math can be enabled using the CompileUnitFlags.FastMath flag and enables the use of fast (but imprecise) math functions.

Default math operations like x * y or x / y are mapped directly to the corresponding instructions. This means that, by default, fast-math flags do not apply to these operations. For instance, using the flag CompileUnitFlags.UseGPUMath forces ILGPU to treat these operations like GPUMath.Mul or GPUMath.Div, which are affected by fast-math flags.

Your kernels might rely on third-party functions that are not under your control. These functions typically depend on the default .Net Math class and thus work on 64bit floating-point operations. You can force the use of 32bit floating-point operations in all cases using the CompileUnitFlags.Force32BitMath flag. Caution: all doubles will then be treated as floats to circumvent issues with third-party code. However, this also affects the address computations of array-view elements. Avoid this flag unless you know exactly what you are doing.
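
Building on the classes and flags named above, the following sketch shows a kernel that uses a GPUMath overload together with a compile unit created with the FastMath flag. It is an illustrative assumption rather than verbatim library code; elided parts are marked with "...".


class ...
{
    static void MyMathKernel(Index index, ArrayView<float> view)
    {
        // GPUMath.Sin denotes the 32bit-float overload (an assumption);
        // Math.Sin would force the operation onto 64bit doubles
        view[index] = GPUMath.Sin(view[index]);
    }

    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (Backend b = ...)
            {
                // Enable fast (but imprecise) math functions for all kernels
                // compiled in the scope of this compile unit
                using (var unit = context.CreateCompileUnit(b, CompileUnitFlags.FastMath))
                {
                    // ...
                }
            }
        }
    }
}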



Accelerators

Accelerators represent hardware or software GPU devices. They store information about different devices and allow memory allocation and kernel loading on a particular device. A launch of a kernel on an accelerator is performed asynchronously by default. Synchronization with the accelerator or the associated stream is required in order to wait for completion and to fetch results.

Note that instances of classes that depend on an accelerator reference have to be disposed before disposing of the associated accelerator object. However, this does not apply to automatically managed kernels, which are cached inside the accelerator object.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var cpuAccelerator = new CPUAccelerator(context))
            { }

            using (var cudaAccelerator = new CudaAccelerator(context))
            { }

            foreach (var acceleratorId in Accelerator.Accelerators)
            {
                using (var accl = Accelerator.Create(context, acceleratorId))
                {
                    // Perform operations
                }
            }
        }
    }
}

Memory Buffers

MemoryBuffers represent allocated memory regions (allocated arrays) of a given value type on specific accelerators. Data can be copied to and from any accelerator using sync or async copy operations (see Streams). ILGPU supports linear, 2D and 3D buffers out of the box, while nD buffers can also be allocated and managed using custom index types.

Note that MemoryBuffers have to be disposed manually and cannot be passed to kernels; only views to memory regions can be passed to kernels.


class ...
{
    public static void MyKernel(Index index, ...)
    {
        // ...
    }

    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var accelerator = ... )
            {
                using (var buffer = accelerator.Allocate<int>(1024))
                {
                    ...
                }
            }
        }
    }
}
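
As a sketch of a typical round trip between CPU and GPU memory (GetAsArray also appears in the quick-start sample above; the exact CopyFrom overload shown here is an assumption):


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var accelerator = new CPUAccelerator(context))
            {
                using (var buffer = accelerator.Allocate<int>(1024))
                {
                    var data = new int[buffer.Length];

                    // Copy a CPU array into the GPU buffer
                    // (assumption: CopyFrom(array, sourceOffset, targetOffset, count))
                    buffer.CopyFrom(data, 0, 0, data.Length);

                    // Copy the buffer contents back into a new CPU array
                    var result = buffer.GetAsArray();
                }
            }
        }
    }
}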

Array Views

ArrayViews realize views to specific memory-buffer regions. Views comprise pointers and length information. They can be passed to kernels and simplify index computations.

Similar to memory buffers, there are specialized views for 1D, 2D and 3D scenarios. However, it is also possible to use the generic structure ArrayView<Type, IndexType> to create views to nD-regions.

Accesses on ArrayViews are bounds-checked via Debug assertions. Hence, these checks are not performed in Release mode, which improves performance.


class ...
{
    static void MyKernel(Index index, ArrayView<int> view1, ArrayView<float> view2)
    {
        ConvertToFloatSample(
            view1.GetSubView(0, view1.Length / 2),
            view2.GetSubView(0, view2.Length / 2));
    }

    static void ConvertToFloatSample(ArrayView<int> source, ArrayView<float> target)
    {
        for (Index i = 0, e = source.Extent; i < e; ++i)
            target[i] = source[i];
    }

    static void Main(string[] args)
    {
        ...
        using (var buffer = accelerator.Allocate<...>(...))
        {
            var mainView = buffer.View;
            var subView = mainView.GetSubView(0, 1024);
        }
    }
}

Variable Views

A VariableView is a specialized array view that points to exactly one element. VariableViews are used by atomics, for instance, to ensure that the target address points to a single element.


class ...
{
    static void MyKernel(Index index, ArrayView<int> view)
    {
        // Perform atomic increment on the i-th element
        var ithElementView = view.GetVariableView(index);
        Atomic.Add(ithElementView, 1);
    }

    static void Main(string[] args)
    {
        using (var buffer = accelerator.Allocate<...>(...))
        {
            var mainView = buffer.View;
            var firstElementView = mainView.GetVariableView(0);
        }
    }
}

Accelerator Streams

AcceleratorStreams represent asynchronous operation queues to which operations can be submitted. Custom accelerator streams have to be synchronized manually. Using multiple streams increases the parallelism of applications. Every accelerator encapsulates a default accelerator stream that is used for all operations by default.


class ...
{
    static void Main(string[] args)
    {
        ...

        var defaultStream = accelerator.DefaultStream;
        using (var secondStream = accelerator.CreateStream())
        {

            // Perform actions using default stream...

            // Perform actions on second stream...

            // Wait for results from the first stream.
            defaultStream.Synchronize();

            // Use results async compared to operations on the second stream...

            // Wait for results from the second stream
            secondStream.Synchronize();

            ...
        }
    }
}

Loading & Launching Kernels

Kernels have to be loaded by an accelerator before they can be executed. See the ILGPU kernel sample for details. There are two possibilities in general: the high-level (described here) and the low-level loading API. We strongly recommend the high-level API, which simplifies programming, is less error-prone and features automatic kernel caching and disposal.

An accelerator object offers different functions to load and configure kernels:

  • LoadAutoGroupedStreamKernel

    Loads an implicitly grouped kernel with an automatically determined group size (uses the default accelerator stream)

  • LoadAutoGroupedKernel

    Loads an implicitly grouped kernel with an automatically determined group size (requires an accelerator stream)

  • LoadImplicitlyGroupedStreamKernel

    Loads an implicitly grouped kernel with a custom group size (uses the default accelerator stream)

  • LoadImplicitlyGroupedKernel

    Loads an implicitly grouped kernel with a custom group size (requires an accelerator stream)

  • LoadStreamKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size (uses the default accelerator stream)

  • LoadKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size (requires an accelerator stream)

Functions following the naming pattern LoadXXXStreamKernel use the default accelerator stream for all operations. If you want to specify the associated accelerator stream, you will have to use the LoadXXXKernel functions.

Every function returns a typed delegate (a kernel launcher) that can be called in order to invoke the actual kernel execution. These launchers are specialized methods that are dynamically generated and specialized for every kernel. They avoid boxing and realize high-performance kernel dispatching. In contrast to older versions of ILGPU, all kernels loaded with these functions will be managed by their associated accelerator instances.

Note that a kernel-loading operation will trigger a kernel compilation in the case of an uncached kernel. The compilation step happens in the background and is transparent to the user. However, if you require custom control over the low-level kernel-compilation process, refer to Advanced Low-Level Functionality.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

         // Load a sample kernel MyKernel using one of the available overloads
        var kernelWithDefaultStream = accelerator.LoadAutoGroupedStreamKernel<
                     Index, ArrayView<int>, int>(MyKernel);
        kernelWithDefaultStream(buffer.Extent, buffer.View, 1);

         // Load a sample kernel MyKernel using one of the available overloads
        var kernelWithStream = accelerator.LoadAutoGroupedKernel<
                     Index, ArrayView<int>, int>(MyKernel);
        kernelWithStream(someStream, buffer.Extent, buffer.View, 1);

        ...
    }
}

Backends

A Backend represents target-specific code-generation functionality for a specific target device. It can be used to manually compile kernels for a specific platform.

Note that you do not have to create custom backend instances on your own when using the ILGPU runtime. Accelerators already carry associated and configured backends that are used for high-level kernel loading.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            // Creates a user-defined MSIL backend for .Net code generation
            using (var cpuBackend = new MSILBackend(context))
            {
                // Use custom backend
            }

            // Creates a user-defined backend for NVIDIA GPUs using compute capability 5.0
            using (var ptxBackend = new PTXBackend(context, PTXArchitecture.SM_50))
            {
                // Use custom backend
            }
        }
    }
}

Compile Units

A CompileUnit caches intermediate-representation (IR) code, which can be reused during the compilation process. It can be created using a Backend instance and CompileUnitFlags that potentially influence all types and methods in the scope of a single CompileUnit. Furthermore, it can handle custom intrinsic types and intrinsic functions, which can be used by experts to add user-defined intrinsic functionality to ILGPU. Note that a compile unit is linked to a specific backend which provides specific intrinsic functionality. Using a different backend at the same time requires using another compile unit.

Note that an accelerator already has an associated CompileUnit that is used for all high-level kernel-loading functions. Consequently, users are generally not required to manage their own compile units.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (Backend b = ...)
            {
                using (var unit = context.CreateCompileUnit(b, CompileUnitFlags.None))
                {
                    // ...
                }
            }
        }
    }
}

Compiling Kernels

Kernels can be compiled manually by requesting a code-generation operation from the backend, yielding a CompiledKernel object. The resulting kernel object can be loaded by an Accelerator instance from the runtime system. Alternatively, you can call the GetBuffer method in order to access the generated, target-specific assembly code.

Note that the MSILBackend does not provide additional insights when calling the GetBuffer method, since the MSILBackend does not require custom assembly code.

We recommend that you use the high-level kernel-loading concepts of ILGPU instead of the low-level interface.


class ...
{
    public static void MyKernel(Index index, ...)
    {
        // ...
    }

    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (Backend b = new PTXBackend(context, ...))
            {
                using (var unit = context.CreateCompileUnit(b, ...))
                {
                    var compiledKernel = b.Compile(unit, typeof(...).GetMethod(nameof(MyKernel), BindingFlags.Public | BindingFlags.Static));
                    System.IO.File.WriteAllBytes("MyKernel.ptx", compiledKernel.GetBuffer());
                }
            }
        }
    }
}

Loading Compiled Kernels

Compiled kernels have to be loaded by an accelerator before they can be executed. See the ILGPU low-level kernel sample for details. Caution: manually loaded kernels have to be disposed before the associated accelerator object is disposed.

An accelerator object offers different functions to load and configure kernels:

  • LoadAutoGroupedKernel

    Loads an implicitly grouped kernel with an automatically determined group size

  • LoadImplicitlyGroupedKernel

    Loads an implicitly grouped kernel with a custom group size

  • LoadKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size


class ...
{
    static void Main(string[] args)
    {
        ...
        var compiledKernel = backend.Compile(...);

        // Load implicitly grouped kernel with an automatically determined group size
        var k1 = accelerator.LoadAutoGroupedKernel(compiledKernel);

        // Load implicitly grouped kernel with custom group size
        var k2 = accelerator.LoadImplicitlyGroupedKernel(compiledKernel);

        // Load any kernel (explicitly and implicitly grouped kernels).
        // However, implicitly grouped kernels will be dispatched with a group size
        // that is equal to the warp size of its associated accelerator
        var k3 = accelerator.LoadKernel(compiledKernel);

        ...

        k1.Dispose();
        k2.Dispose();
        k3.Dispose();
    }
}

Direct Kernel Launching

A loaded kernel can be dispatched using the Launch method. However, since the dispatch method takes an object array as its argument, all arguments are boxed upon invocation and there is no type safety at this point. For performance reasons, we strongly recommend the use of typed kernel launchers that avoid boxing.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

        // Load a sample kernel MyKernel
        var compiledKernel = backend.Compile(...);
        using (var k = accelerator.LoadAutoGroupedKernel(compiledKernel))
        {
            k.Launch(buffer.Extent, buffer.View, 1);

            ...

            accelerator.Synchronize();
        }

        ...
    }
}

Typed Kernel Launchers

Kernel launchers are delegates that provide an alternative to direct kernel invocations. These launchers are specialized methods that are dynamically generated and specialized for every kernel. They avoid boxing and realize high-performance kernel dispatching. There are two possibilities to create a kernel launcher:

  • CreateLauncherDelegate

    Creates a specialized launcher for the associated kernel. Besides all required kernel parameters, it also receives a parameter of type AcceleratorStream as an argument. This allows attaching the kernel launch to an arbitrary accelerator stream upon invocation.

  • CreateStreamLauncherDelegate

    Behaves similarly to the previously discussed launcher. However, this launcher does not receive an accelerator stream; instead, it is linked to the default accelerator stream of the associated accelerator. This often simplifies the code required to launch a kernel.

Note that the high-level API also offers kernel-loading functionality that directly returns a launcher delegate instead of a kernel object. These loading methods work similarly to the versions discussed here; e.g. LoadAutoGroupedStreamKernel loads a kernel and returns a launcher with a custom delegate type that is linked to the default accelerator stream.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

        // Load a sample kernel MyKernel
        var compiledKernel = backend.Compile(...);
        using (var k = accelerator.LoadAutoGroupedKernel(compiledKernel))
        {
            var launcherWithCustomAcceleratorStream = k.CreateLauncherDelegate<AcceleratorStream, Index, ArrayView<int>, int>();
            launcherWithCustomAcceleratorStream(someStream, buffer.Extent, buffer.View, 1);

            var launcherWithLinkedAcceleratorStream = k.CreateStreamLauncherDelegate<Index, ArrayView<int>, int>();
            launcherWithLinkedAcceleratorStream(buffer.Extent, buffer.View, 1);

            // Note that the previous invocation of launcherWithLinkedAcceleratorStream is
            // equivalent to the following invocation: 
            launcherWithCustomAcceleratorStream(accelerator.DefaultStream, buffer.Extent, buffer.View, 1);

            ...
        }

        ...
    }
}