Two project ideas focus on DeepRT
(see the paper),
created by Zhe Yang in Prof. Klara Nahrstedt's group.
Zhe would be the contact person and would provide access to their existing
code base.
Briefly, they have designed a soft real-time scheduler that decides
how to batch images from different video-processing requests, and
then in what order to run those batches on the GPU. The design
is a "middleware" layer between user requests and the GPU; for the
implementation they use PyTorch (other platforms, or even raw CUDA,
would also be possible, but they chose PyTorch for simplicity).
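
As a rough mental model only (this is not DeepRT's actual algorithm; the Request struct, process_batch kernel, and FIFO policy below are placeholders), a scheduling step in such a middleware might look like:

// Rough mental model -- NOT DeepRT's actual scheduler. The middleware
// (1) forms a batch from pending requests, then (2) picks when/how to
// run that batch on the GPU. process_batch is a placeholder kernel.
#include <cuda_runtime.h>
#include <deque>
#include <vector>

struct Request {
    float* d_image;      // image (assumed already resident on the GPU)
    long   deadline_us;  // soft deadline the scheduler tries to meet
};

__global__ void process_batch(float* const* imgs, int n) {
    if (blockIdx.x < n) { /* placeholder: DNN work on imgs[blockIdx.x] */ }
}

void middleware_step(std::deque<Request>& pending, int max_batch) {
    // (1) batching decision -- here trivially FIFO; DeepRT is smarter
    std::vector<float*> batch;
    while (!pending.empty() && (int)batch.size() < max_batch) {
        batch.push_back(pending.front().d_image);
        pending.pop_front();
    }
    if (batch.empty()) return;

    // (2) ordering decision -- DeepRT decides when this batch runs
    float** d_imgs = nullptr;
    cudaMalloc((void**)&d_imgs, batch.size() * sizeof(float*));
    cudaMemcpy(d_imgs, batch.data(), batch.size() * sizeof(float*),
               cudaMemcpyHostToDevice);
    process_batch<<<(unsigned)batch.size(), 256>>>(d_imgs, (int)batch.size());
    cudaDeviceSynchronize();
    cudaFree(d_imgs);
}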
-
One project would involve scheduling memory copies from CPU to
GPU. In DeepRT, all images are assumed to be resident on the GPU
already, and the analysis ignores the data transfer from CPU to GPU.
Zhe is interested in having a group extend DeepRT to
also account for the CPU-to-GPU copy time, or
design something that can fetch images while the GPU is busy with
other work, much like cache prefetching.
A starting point here would be to use streams to overlap DMA over the
PCIe bus with GPU computation, as well as the DMA of results back from
GPU to CPU.
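
As a sketch of that starting point (not DeepRT code; the process kernel, sizes, and chunking below are made up for illustration), pinned host buffers plus two streams let the H2D copy of one chunk overlap the kernel execution and D2H copy of another:

// Sketch: overlapping H2D copies, kernel work, and D2H copies using
// two CUDA streams and pinned host memory. process is a stand-in.
#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // stand-in for real image work
}

int main() {
    const int N = 1 << 20, CHUNKS = 8, CHUNK = N / CHUNKS;
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in,  N * sizeof(float));  // pinned memory
    cudaMallocHost((void**)&h_out, N * sizeof(float));  // => true async DMA
    cudaMalloc((void**)&d_in,  N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < CHUNKS; ++c) {
        cudaStream_t st = s[c % 2];
        int off = c * CHUNK;
        // the H2D copy of this chunk can overlap the kernel/D2H work
        // of the chunk issued to the other stream
        cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(CHUNK + 255) / 256, 256, 0, st>>>(d_in + off,
                                                     d_out + off, CHUNK);
        cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

The chunk count, block size, and number of streams are all tuning knobs; the two essentials are pinned (page-locked) host memory, since cudaMemcpyAsync on pageable buffers is staged and generally will not overlap compute, and issuing each chunk's copy and kernel into the same stream so the dependencies stay correct.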
Use of PyTorch may also imply many additional host-device transfers;
you can get around some of these by migrating the simpler control
logic into wrapper kernels (e.g., via dynamic parallelism).
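
As a sketch of that idea (stage_a, stage_b, and the flag test are hypothetical placeholders, not DeepRT logic), a wrapper kernel can branch on device-resident data and launch the next kernel from the GPU, so no intermediate result has to travel back to the host just to make a control decision:

// Sketch of the "wrapper kernel" idea via dynamic parallelism.
// Requires relocatable device code: compile with nvcc -rdc=true
#include <cuda_runtime.h>

__global__ void stage_a(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void stage_b(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// The branch condition lives in GPU memory, so no device->host copy
// is needed just to decide which kernel to launch next.
__global__ void wrapper(const float* d_flag, float* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int blocks = (n + 255) / 256;
        if (*d_flag > 0.0f)
            stage_a<<<blocks, 256>>>(data, n);
        else
            stage_b<<<blocks, 256>>>(data, n);
    }
}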
-
DeepRT also sacrifices some performance
in pursuit of real-time behavior (meeting deadlines).
A second possible project involves assigning each image some sort of
priority and using optimization techniques (perhaps linear
programming) to control the order in which images are processed.
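
One low-level mechanism worth knowing here (just a sketch of the mechanism; the priority-assignment/LP formulation is the actual project) is CUDA stream priorities, which bias the hardware scheduler toward blocks from higher-priority streams:

// Sketch: mapping image priorities onto CUDA stream priorities. This
// shows only the launch-side mechanism; deciding the priorities is
// the interesting part of the project.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int least, greatest;  // numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t high, low;
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);

    // Work queued on `high` is preferred by the GPU's block scheduler
    // when both streams have blocks ready -- a coarse way to let
    // near-deadline images jump the queue.
    printf("stream priority range: %d (least) .. %d (greatest)\n",
           least, greatest);

    cudaStreamDestroy(high);
    cudaStreamDestroy(low);
    return 0;
}

Stream priorities are coarse (typically only a few levels) and are hints rather than guarantees, so they would complement, not replace, a software-level ordering policy.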
As you may know, NVIDIA added the ability to do full-grid
synchronization as part of the cooperative groups API. One of the
people there claims there's an example of also synchronizing with CPU
code while a kernel runs; I don't see that part in a quick scan, but
the NVIDIA person should know.
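
For reference, the grid-wide sync itself looks like this (a minimal sketch only; two_phase is a placeholder, and this shows just the GPU-side sync, not the CPU-side variant mentioned above):

// Sketch: grid-wide synchronization with cooperative groups. Needs
// compute capability 6.0+ (nvcc -arch=sm_60 or higher), a cooperative
// launch, and a grid small enough for all blocks to be resident at
// once (size it via the occupancy API on real hardware).
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void two_phase(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // phase 1
    grid.sync();                 // every block in the grid reaches here
    if (i < n) data[i] *= 2.0f;  // phase 2 sees all of phase 1's writes
}

int main() {
    const int N = 1 << 14;  // 64 blocks of 256 threads: fits most GPUs
    float* d;
    cudaMalloc((void**)&d, N * sizeof(float));
    int n = N;
    void* args[] = { &d, &n };
    cudaLaunchCooperativeKernel((void*)two_phase, dim3((N + 255) / 256),
                                dim3(256), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}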
The group may also be able to break kernels up in
a way that enables this synchronization and processing control without
completely trashing performance.