Writing a custom operator in TFLite — GPU

Avinash
3 min read · May 8, 2022

All that's needed to get a custom operator running on the GPU is to write its shader code and let TFLite know that a new op exists.

All of our work will be in the tensorflow/tensorflow/lite/delegates/gpu path of the tensorflow repo.

You’ll also need to keep track of all the new files you create in this process and add them to the BUILD files in the appropriate places. Also include the corresponding header files where necessary.

Registering the op

Look for custom_registry.cc . In it you’ll find an empty function, RegisterCustomOps, which is called in registry.cc after the built-in ops are registered. We need to append our custom op to the hash map of shaders passed to the function. Add this line to it:

(*shaders)["sin"].push_back(NewSinNodeShader());

Writing the custom parser

In order to let TFLite know how to read the attributes and compute the output size (for memory allocation) of the op, we need to write an op parser. This will sit in tensorflow/tensorflow/lite/delegates/gpu/common/default .

Create a header file; in it, define a struct that captures the op’s attributes and a class that inherits from TFLiteOperationParser .

struct SinAttributes {
  float frequency;
  float phase;
};

class SinOperationParser : public TFLiteOperationParser {
 public:
  // Two functions to be overridden:
  absl::Status IsSupported(/* some args */) override;
  absl::Status Parse(/* some args */) override;
};

In the corresponding .cc file, you can write IsSupported to perform any checks before proceeding. In Parse we need to read the attributes as raw data (serialized in the FlexBuffers format) and fill the struct we wrote.

// data = tflite_node->custom_initial_data and
// data_size = tflite_node->custom_initial_data_size,
// where tflite_node is a parameter of the function
const auto* cast_data = reinterpret_cast<const uint8_t*>(data);
const flexbuffers::Map map = flexbuffers::GetRoot(cast_data, data_size).AsMap();
SinAttributes attr;
// "freq" is the parameter name in the Python API of the op
attr.frequency = map["freq"].AsFloat();
attr.phase = map["phase"].AsFloat();

In custom_parsers.cc we need to tell TFLite that we have a parser for our new op.

// Inside NewCustomOperationParser()
if (op_name == "sin") {
return std::make_unique<SinOperationParser>();
}

Writing the shader

All GPU ops need to inherit from the NodeShader class and override the GenerateCode function, which is where your shader code, written in GLSL, will reside.

This will sit in tensorflow/tensorflow/lite/delegates/gpu/gl

The first parameter is the context, which provides us with essential information such as the attributes and the input and output shapes of the op. The attributes are of type std::any , so you’ll need to cast them to your custom struct before accessing the data. The input and output shapes are a vector of vectors.

The second parameter is a GeneratedCode struct passed as a pointer, which you have to fill in inside the function body. It has the following members:

  • parameters — key-value pairs for simple data, like kernel_size , that will be hard-coded into the GLSL shader
  • objects — key-value pairs for read-only data other than input tensors, such as kernel weights, bound as uniform buffer objects in the GLSL shader
  • shared_variables — key-value pairs for data that is read/written but is not the output tensor (for example, intermediate compute results), bound as SSBOs in the GLSL shader
  • workload — a 3D int vector specifying how many threads to launch
  • workgroup — a 3D int vector giving the number of threads per work group, so that workload.x / workgroup.x gives the number of work groups along x. Usually the workload matches the output shape of the operator, with one thread per output element
  • source_code — the actual GLSL source code
  • input, output — enums that specify how you want to access the input and output tensors. If you’re writing an element-wise op, where each thread only needs to read/write its own element, use value_n = op(value_n) . If you want to access arbitrary elements of the input tensor, use output_data_n = op(input_data_n[x,y], input_data_n[x+dx,y+dy]) , where x and y are obtained from the thread’s invocation ID. Here n is the index of the input/output tensor.
// Cast the std::any attributes back to our struct
const auto& attr = std::any_cast<const SinAttributes&>(ctx.op_attr);
std::vector<Variable> parameters = {
    {"frequency", attr.frequency},
    {"phase", attr.phase},
};
*generated_code = {
    /*parameters=*/std::move(parameters),
    /*objects=*/{},
    /*shared_variables=*/{},
    /*workload=*/uint3(),  // (0,0,0) means one thread per output element
    /*workgroup=*/uint3(),
    /*source_code=*/"value_0 = sin(value_0 / $frequency$ + $phase$);",  // GLSL
    /*input=*/IOStructure::AUTO,
    /*output=*/IOStructure::AUTO,
};

I have only shared the critical parts of the code. For a complete reference, look at how the built-in ops are implemented and copy their structure.

If you’re looking to implement a custom op on CPU, read this.
