A few weeks ago, I shared here some tests using a Raspberry Pi 3 and several Movidius Neural Compute Sticks (NCS). As a refresher, you can off-load the execution of Artificial Neural Networks (ANN) to the sticks to perform your classification and recognition tasks. Although in theory you can add as many NCSes as you like to your setup (I’ve tested up to two), the Pi quickly runs out of steam. Sure, with custom code and optimization you can delay the inevitable, but both take time and development skills, and neither is trivial. Eventually, you will need to swap out the Pi for something more powerful. So, what next? Well, if you have a beefy CPU (or a GPU), you can take your game up a notch and improve your algorithms using them.
Release the Xeon
Since my workstation has a Xeon E5-2699 v3 processor, the choice was easy (my AMD Radeon HD 5800 is not a player here). For the software, I registered for the Intel Computer Vision SDK beta program (https://software.intel.com/en-us/computer-vision-sdk/details). The SDK helps you design Computer Vision algorithms for heterogeneous hardware by letting you create and manipulate OpenVX (1.1 and 1.0.1) graphs (https://www.khronos.org/openvx). For my tests, I picked the Windows version, so I could use my Xeon and my preferred development environments. I really appreciated the simplicity of the IDE, named Vision Algorithm Designer (VAD). With it, you simply drag and drop nodes and connect them to design your data flow and the processing to be performed at each step of the algorithm. For example, you would start with an acquisition node (from a file, a stream, etc.). Its output (VX_TYPE_IMAGE_RGB) can then be sent to a color converter node, and so on, while the same output is also fed into a horizontal line detector node. Remember that your role is to leverage as much parallelism as the hardware of your platform allows! And so on you go, adding, configuring and linking your nodes. Many nodes are provided with the SDK by Intel and Khronos.
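To make the fan-out idea concrete, here is a minimal sketch in plain Python. It is not the SDK's API and not OpenVX. The node functions (`acquire`, `to_gray`, `horizontal_lines`) and the tiny synthetic frame are hypothetical stand-ins I made up for illustration. It only shows the shape of the graph: one acquisition output feeding two independent branches that can be scheduled side by side.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for graph nodes; none of these names come from the SDK.

def acquire(width=4, height=3):
    """Acquisition node: build a tiny synthetic RGB frame (rows of (r, g, b))."""
    frame = [[(10, 10, 10)] * width for _ in range(height)]
    frame[1] = [(200, 200, 200)] * width  # one bright horizontal line
    return frame

def to_gray(frame):
    """Color converter node: RGB to luma with integer BT.601 weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in frame]

def horizontal_lines(frame, threshold=128):
    """Line detector node: indices of rows whose pixels are all bright."""
    return [y for y, row in enumerate(frame)
            if all(min(px) >= threshold for px in row)]

def run_graph():
    """Feed the same acquired frame to both branches and collect the results."""
    frame = acquire()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two branches do not depend on each other, so they can be
        # submitted together; only the acquisition step must come first.
        gray_future = pool.submit(to_gray, frame)
        lines_future = pool.submit(horizontal_lines, frame)
        return gray_future.result(), lines_future.result()

if __name__ == "__main__":
    gray, lines = run_graph()
    print("bright rows:", lines)
```

Keep in mind that CPython threads will not give you real CPU parallelism for pure-Python work like this. The sketch is about the structure of the data flow, not about speed.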
Focus on the Algorithm
Once you reach a point where some meaningful processing is done, you can build your graph and fix all the errors and warnings reported. You then verify the graph and generate OpenVX C code. At this point, the code can be run frame by frame or looped. Detailed profiling information can be generated to detect possible performance issues and bottlenecks in the algorithm. Keep in mind that, at the end of the day, you want to leverage all the sources of parallelism available on your platform, which can bring its own set of scheduling and latency complexities. The SDK makes it as easy as possible to focus on the algorithm by masking the bits and bytes of the underlying steps. A really nice approach. The SDK is in beta R3, so it crashes here and there and the VAD needs some restarts, but nothing that cannot be handled until it is fixed. Nevertheless, the concept is great, and the VAD is very intuitive and allows for fine-tuning and exploration. But I still cannot fully load my 36 logical cores 😉
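For readers who want a feel for the "run once or looped, then look for the bottleneck" step, here is a rough sketch in plain Python. It is not what the VAD profiler does internally, which I don't know. `profile_graph`, `bottleneck` and the two toy nodes are hypothetical names of my own, and the slow node is artificially slowed with a sleep.

```python
import time
from collections import defaultdict

def profile_graph(nodes, frame, iterations=1):
    """Run the nodes in order `iterations` times; return total seconds per node."""
    totals = defaultdict(float)
    for _ in range(iterations):
        data = frame
        for name, fn in nodes:
            start = time.perf_counter()
            data = fn(data)  # each node consumes the previous node's output
            totals[name] += time.perf_counter() - start
    return dict(totals)

def bottleneck(totals):
    """Name of the node that accumulated the most time."""
    return max(totals, key=totals.get)

# Hypothetical toy nodes, for illustration only.
def convert(frame):
    """Average the three channels of each pixel."""
    return [sum(px) // 3 for px in frame]

def slow_filter(values):
    """Keep bright values; the sleep simulates an expensive processing step."""
    time.sleep(0.01)
    return [v for v in values if v > 100]

if __name__ == "__main__":
    nodes = [("convert", convert), ("slow_filter", slow_filter)]
    frame = [(200, 200, 200), (10, 10, 10)]
    single = profile_graph(nodes, frame)                # one frame
    looped = profile_graph(nodes, frame, iterations=5)  # looped run
    print("slowest node:", bottleneck(looped))
```

Looping over several iterations smooths out timer noise, which is why a looped run is usually more telling than a single frame.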