NPUs: The Rise of Neural Processing Units

Neural Processing Units (NPUs) have been around for a while, but they’ve recently entered the mainstream. So what exactly are NPUs, why do we need them, and when should we use them?

To understand their role, let’s first take a look at the processors we’ve relied on until now.

The Evolution of Processors

For much of computing history, the Central Processing Unit (CPU) has been the workhorse of general-purpose computing. CPUs handle everything from running operating systems to executing applications. However, they process tasks sequentially—one operation on one piece of data at a time. Even with multi-core CPUs, each core still follows a sequential approach.

Then came Graphics Processing Units (GPUs). Originally designed for rendering graphics, GPUs excel at handling the mathematical operations needed for 2D and 3D images. These operations often involve applying the same calculation to massive amounts of data—such as processing thousands or millions of pixels simultaneously. To achieve this, GPUs contain thousands of smaller cores that work in parallel.

It didn’t take long for AI and machine learning researchers to realise that GPUs were also well-suited for AI workloads. Training and running AI models require massive computational power, and GPUs’ parallel processing capabilities significantly accelerate these tasks. However, GPUs come with a major drawback: they consume a lot of power. Some modern GPUs use over 600 watts when running at full capacity—comparable to a small electric heater—making them expensive to operate and generating substantial heat.

Enter Neural Processing Units

NPUs first emerged around 2016, with some early versions branded as Tensor Processing Units (TPUs). Unlike CPUs, which are designed for general-purpose computing, and GPUs, which provide generic parallel processing, NPUs are purpose-built for AI and machine learning tasks. They specialise in executing neural network operations, such as multiply-add calculations, with exceptional speed and efficiency.

The key advantage of NPUs is their low power consumption. While GPUs deliver impressive AI performance, they are power-hungry. NPUs, on the other hand, provide high-speed AI processing with significantly lower energy usage. This makes them ideal for AI inference—running trained models in real-world applications, including on small hardware such as smartphones. However, when it comes to training AI models, GPUs still hold the edge.

As AI continues to advance, NPUs are poised to play a crucial role, offering a balance of speed and efficiency that bridges the gap between power-hungry GPUs and the slower, more general-purpose CPUs.

Real-World Applications

NPUs are now available in many hardware platforms, including some small form-factor computers such as the OrangePi 5 Pro.

I have recently completed a successful project using Rockchip NPUs (the RK3568 and RK3588). The NPU toolkits are good, but initially tricky to work with. Contact me to see if I can help with your NPU-based AI project.
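For anyone curious what working with the Rockchip toolchain looks like, here is a minimal sketch of the usual model-conversion flow using Rockchip’s RKNN-Toolkit2 in Python. The file names, pre-processing values and target platform are placeholders, not details from the project above.

# Sketch: convert an ONNX model to RKNN format for a Rockchip NPU (e.g. RK3588).
# Assumes RKNN-Toolkit2 is installed; all file names and values are placeholders.
from rknn.api import RKNN

rknn = RKNN()

# Pre-processing configuration baked into the converted model (example values).
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
            target_platform='rk3588')

# Load a trained model exported to ONNX (e.g. from PyTorch).
rknn.load_onnx(model='model.onnx')

# Build, optionally quantising against a small calibration dataset.
rknn.build(do_quantization=True, dataset='./dataset.txt')

# Save the .rknn file for deployment on the device.
rknn.export_rknn('./model.rknn')
rknn.release()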

Traditional Computer Vision vs. Deep Learning with Yolo v7

A tricky computer vision project

On a recent computer vision project, we were trying to detect objects in images. Nothing unusual there, but we had some specific problems:

  • The objects could vary in size from ‘very big’ (taking up nearly half the image) down to ‘very small’ (just a few pixels across).
  • The objects were against a very ‘noisy’ background, namely the sky: complete with clouds and the Sun, sometimes rain, and other foreground objects such as trees.
  • The detection accuracy had to be very high, and false positives had to be very low: there could be no human intervention to correct mistakes.
  • The software had to run on a credit-card sized computer (Jetson NX), alongside other software including a guidance system – the whole system was on-board a drone.
  • And finally, it had to be fast: the camera we used was running at 30 frames per second, and the guidance system we were feeding into expects inputs at that rate, so we had to keep up.

Traditional Computer Vision pipelines, and Genetic Algorithms

We had developed ‘pipelines’ of ‘traditional’ computer vision techniques, using OpenCV:  colour space conversions, blurring, thresholding, contour-finding, etc.  The system was working well in about 80% of cases – surprisingly well, given the range of object sizes we were trying to detect against such a noisy background.
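To give a flavour of what such a pipeline looks like, here is a minimal OpenCV sketch in Python. It is purely illustrative – the actual pipelines, colour ranges and thresholds were project-specific:

# Illustrative 'traditional' detection pipeline: threshold in HSV, then find contours.
# The parameter values here are placeholders, not the ones used in the project.
import cv2

img = cv2.imread('frame.png')

# Colour space conversion and blurring to suppress noise.
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
blurred = cv2.GaussianBlur(hsv, (5, 5), 0)

# Threshold on a colour range, then clean up with a morphological 'open'.
mask = cv2.inRange(blurred, (90, 50, 50), (130, 255, 255))
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

# Contour finding: keep contours above a minimum area as candidate detections.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidates = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 20]
print(candidates)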

But we got stuck chasing down that last 20% or so.  Whenever we made a change to get things working on one specific class of image, it would break another that was previously working.  We even tried automatically generating pipelines using Genetic Algorithms (not my idea, but I wish it had been!) – this generated some pipelines that worked well, but still we couldn’t achieve a system that worked well enough in all cases we might encounter.

Deep Learning: Yolo v7

The main reason for using traditional techniques was for speed – as I mentioned, we had very tight timing constraints – but also because, last time we tried deep-learning models (Yolo V1 and V2), they were very bad at detecting objects that were small in the image.

But having hit a blocker in our progress, as a ‘last throw of the dice’, we decided to review the state of the art in deep-learning detectors.

After a review of the options, we settled for various reasons on Yolo v7:  Even then (summer 2024), this wasn’t by any means the newest version of Yolo, but it gave a good combination of being fully open-source, well-documented and supported, and well-integrated with the languages we wanted to use (Python and C++).

The work itself took a while:  There were a number of problems that we had to overcome, including some very technical ‘gotchas’ that nearly caused us to give up on it a few times.  Of course, we also needed a large set of labeled training data, but we already had that.

Results

In short, the results are staggeringly good.  We are now able to detect objects down to just a few pixels across, but more to the point, against very noisy backgrounds:  In some cases ‘lens flares’ caused by the camera facing directly into the sun make the target object almost invisible to the human eye – but our system based on Yolo v7 detects the objects successfully in a very high percentage of cases.  Also, the performance is exceptional – on a Jetson NX, running on the GPU, we are doing inference in around 8ms, allowing time for pre- and post-processing steps to be added and still achieve 30FPS, which is our camera frame-rate.

Yolo V7 is not a plug-and-play solution straight out of the box:  Even just for training, we had to do some careful setup and config, ensure a well-balanced training and testing set, and then train and test until we were satisfied we had a good model that could not only detect our target object, but exclude all others.  Inference (i.e. runtime), especially from C++, was far more difficult – there were one or two fairly esoteric problems.  In particular, detections were often centered correctly, but with wildly wrong ‘rectangular’ bounding boxes – it took a while to work out the solution to that one.
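For what it’s worth, one common source of ‘right centre, wrong rectangle’ symptoms with Yolo-family models is a mismatch between the letterboxed network input and the original image when rescaling the output boxes. The sketch below illustrates that general post-processing step in Python; it is not our exact code, nor necessarily the exact problem we hit:

# Sketch: map a detection from letterboxed network coordinates back to the original image.
# Illustrates a typical Yolo post-processing step; the values below are just an example.
def unletterbox_box(box, net_size, img_w, img_h):
    """box = (cx, cy, w, h) in network-input pixels; net_size is e.g. 640."""
    # Scale factor used when the image was resized to fit the square network input.
    scale = min(net_size / img_w, net_size / img_h)
    pad_x = (net_size - img_w * scale) / 2.0
    pad_y = (net_size - img_h * scale) / 2.0

    cx, cy, w, h = box
    # Remove the padding, then undo the resize. The width and height must be
    # rescaled too, otherwise the centre is right but the rectangle is wildly wrong.
    cx = (cx - pad_x) / scale
    cy = (cy - pad_y) / scale
    return cx, cy, w / scale, h / scale

# A 640x640 network detection mapped back onto a 1920x1080 frame.
print(unletterbox_box((320, 320, 64, 32), 640, 1920, 1080))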

Summary 

There’s still a place for traditional computer vision techniques (and we still use some in this project), but Yolo and other deep-learning detectors are well worth considering.

Contact me (tom@alvervalleysoftware.com) to discuss whether I can help with your computer vision project.  If you’re thinking about using Yolo from Python and C++, I probably can…

Genetic Algorithms, Particle Swarm Optimisation, Evolutionary Computing

Genetic algorithms (GAs) are a search and optimisation technique inspired by the ideas of “survival of the fittest” in populations of individuals, as well as other genetic concepts such as crossover and mutation.

GAs can often find good solutions very quickly – including in complex, multi-dimensional, non-linear problem “spaces” that other algorithms struggle badly with.

Successfully applying a Genetic Algorithm to a problem involves steps such as:

  • Identifying whether the problem “space” is suited to a GA.
  • Encoding the problem into a “genome” that the GA can work with.
  • Writing a GA (or using a standard library).
  • Defining and writing a fitness function.
  • Avoiding pitfalls such as using a weak random number generator, using encodings with big “step” values in them which can block improvements, etc.

Unlike neural networks where I favour a pre-written open source library, with Genetic Algorithms I prefer to write my own – the algorithm itself is small and simple, and it is best to have control over some of the other aspects mentioned above.
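As an illustration of how small the core algorithm really is, here is a minimal GA sketch in Python. It maximises a toy fitness function (the classic ‘OneMax’ problem); a real application would substitute its own encoding and fitness function, and use a stronger random number generator where that matters:

# Minimal genetic algorithm sketch: evolve a bit-string to maximise a toy
# fitness function (simply the number of 1s - the classic 'OneMax' problem).
import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 50, 100, 200, 0.01

def fitness(genome):
    return sum(genome)                      # replace with a real fitness function

def select(pop):
    return max(random.sample(pop, 3), key=fitness)   # tournament selection, size 3

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)   # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome):
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print('best fitness:', fitness(best), 'of a possible', GENOME_LEN)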

I have used my own GAs as part of commercial projects mentioned elsewhere on this website, including Computer Vision, and other data analysis projects.

I have also implemented other evolutionary computing algorithms, such as variations of Particle Swarm Optimisation and Ant Colony Optimisation. Each algorithm has its own “class” of problem that it solves better than most other algorithms.
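For comparison, here is an equally bare-bones Particle Swarm Optimisation sketch in Python, minimising a toy objective (the sphere function); again, a real application would substitute its own objective and tune the coefficients:

# Bare-bones particle swarm optimisation: minimise a toy objective (the sphere function).
import random

DIM, SWARM, ITERS = 5, 30, 200
W, C1, C2 = 0.7, 1.5, 1.5            # inertia, cognitive and social coefficients

def objective(x):
    return sum(v * v for v in x)     # replace with the real objective

pos = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(SWARM)]
vel = [[0.0] * DIM for _ in range(SWARM)]
pbest = [p[:] for p in pos]          # each particle's personal best
gbest = min(pbest, key=objective)    # global best across the swarm

for _ in range(ITERS):
    for i in range(SWARM):
        for d in range(DIM):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])
                         + C2 * r2 * (gbest[d] - pos[i][d]))
            pos[i][d] += vel[i][d]
        if objective(pos[i]) < objective(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=objective)

print('best value found:', objective(gbest))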

Please email me to discuss your project and we’ll see if I can help.

ONNX files in OpenCV

I have been aware of OpenCV’s ‘dnn’ module for some time: Last time we tried to use it in a project was a number of years ago, and it didn’t seem to be ready for what we needed – or perhaps we just misunderstood it and didn’t give it a good enough look.

Aside from that, I’ve been using .ONNX (Open Neural Network eXchange) files for a while now. My standard usage of these is to transport a trained model from PyTorch – for example a ResNet classifier – onto a Jetson Nano, NX or Orin. PyTorch can export as .ONNX, and TensorRT on the Jetson can import them, so it’s been literally an ‘exchange’ file format for me.
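The export step itself is short. Here is a minimal sketch, assuming a torchvision ResNet18 – the class count, input size and file name are placeholders:

# Sketch: export a trained PyTorch ResNet18 to ONNX for use elsewhere
# (e.g. TensorRT on a Jetson, or OpenCV's dnn module). Names/sizes are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=2)   # load your trained weights here
model.eval()

dummy = torch.randn(1, 3, 224, 224)                  # one example input, NCHW layout
torch.onnx.export(model, dummy, 'resnet18.onnx',
                  input_names=['input'], output_names=['output'],
                  opset_version=11)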

However, pulling these two things together, I have recently learned that OpenCV’s ‘dnn’ module can load directly from .ONNX files, specifically including ResNet models such as the ResNet18 classifier I have recently trained for a client.

There are a few ‘tricks’ required to prepare images to be classified, and it took me a fair amount of research (including some trial-and-error, and using ChatGPT – that was a day I can never get back…), but it works now: I can classify images, using an ONNX file, in OpenCV, from either C++ or Python.
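Here is a minimal sketch of the approach in Python. The normalisation values shown are the standard ImageNet mean and standard deviation used by most PyTorch-trained ResNets – the ‘tricks’ are largely about matching whatever pre-processing the model was trained with:

# Sketch: classify an image with a ResNet ONNX file via OpenCV's dnn module.
# The mean/std values are the usual ImageNet ones - match your own training setup.
import cv2
import numpy as np

net = cv2.dnn.readNetFromONNX('resnet18.onnx')
img = cv2.imread('test.jpg')

# Resize to 224x224, scale to 0..1, swap BGR -> RGB, and produce an NCHW blob.
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0 / 255.0, size=(224, 224),
                             swapRB=True, crop=False)

# Normalise with the ImageNet mean/std (done manually here to avoid the
# subtract-then-scale ordering subtleties of blobFromImage's 'mean' argument).
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)
blob = (blob - mean) / std

net.setInput(blob)
scores = net.forward()
print('predicted class index:', int(np.argmax(scores)))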

This means that models that I originally trained for Jetson hardware can now be used on any platform with OpenCV. I will be testing this on a Raspberry Pi 5 shortly to gauge performance.

Currently it uses the CPU only (though it does use all available CPU cores), but I believe the GPU is also supported given a suitably-compiled OpenCV: I may try that next.
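If I do try the GPU route, the change should be small – assuming an OpenCV build compiled with CUDA support, it is just a backend/target hint on the loaded network:

# Assumes an OpenCV build compiled with CUDA support; without it, OpenCV
# falls back to the CPU. 'net' is the network loaded in the sketch above.
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)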

Cyber Essentials certification achieved

[Update Sept 2024: I recently successfully renewed my Cyber Essentials certification. This time round I made the internal network even more secure, over and above the base requirements of CE].

I’m pleased to say I recently achieved Cyber Essentials certification – showing that I take protection of customers’ data (including valuable machine learning training data and source code) seriously. I was happy to find I was already compliant in most areas, but I reviewed my policies and procedures, including strengthening them in a couple of specific cases. A useful process to go through.

I chose to work with CSIQ (https://www.csiq.co.uk/) as my certification partner, and strongly recommend their services.

Jetson hardware, and the ‘jetson-inference’ package

I have been involved in several projects very recently (and two ongoing) where we have used NVIDIA ‘Jetson’ hardware (Nano, Xavier / NX, and ConnectTech Rudi NX).  These machines are roughly ‘credit-card sized’ (apart from the Rudi, which has a larger but very ‘rugged’ case) and are ideal for ‘edge’ or embedded systems.

The Jetson hardware is basically a small but powerful GPU, combined with a CPU and a small ‘motherboard’ providing the usual USB ports, etc.  They run a modified version of Ubuntu Linux.

In some cases I developed software in-house using OpenCV (C++ and Python).  However, I am also making more and more use of the excellent ‘jetson-inference’ library of deep-learning tools, and have now built up quite a bit of experience in using this library and developing applications and solutions based on it.

In short, it is very good for developing solutions that need:

  • Image classification (e.g. cat vs dog, or labrador vs poodle, or beach vs park)
  • Object detection (i.e. accurate location, and classification of objects – can be trained to recognise new objects, including very small/distant)
  • Pose estimation (e.g. standing, sitting, walking, pointing, waving)

I have now developed a number of solutions that have ‘gone live’ using this hardware and toolkit.  I am also experienced in the ‘back end’ tasks of training new ‘models’ to recognise new, specific classes of objects, and porting those models to the Jetson hardware.
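To give an idea of how compact the library makes things, here is a minimal object-detection sketch in Python, based on the standard jetson-inference examples – the model name and camera/display URIs are just the common defaults, not specific to any project described here:

# Minimal jetson-inference object-detection loop, based on the library's own examples.
# Model name and camera/display URIs are the usual defaults - adjust to your setup.
from jetson_inference import detectNet
from jetson_utils import videoSource, videoOutput

net = detectNet("ssd-mobilenet-v2", threshold=0.5)   # or a custom re-trained model
camera = videoSource("csi://0")                      # e.g. a MIPI CSI camera
display = videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()
    if img is None:                                  # capture timeout
        continue
    detections = net.Detect(img)                     # runs inference and overlays boxes
    display.Render(img)
    display.SetStatus("Object Detection | {:.0f} FPS".format(net.GetNetworkFPS()))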

Please contact me to discuss whether I can help you with your Jetson-based project.  tom [at] alvervalleysoftware.com.

How to install NVIDIA drivers and CUDA on Ubuntu 18.04

First, check Settings -> Graphics:  If it shows something ‘generic’ like ‘NV136’, then the machine is not using the NVIDIA drivers.

Go to ‘Software & Updates’ -> ‘Additional Drivers’.  Select a recent/recommended one – I use ‘nvidia-driver-410’ at the time of writing (Mar 2019).

Let it update then reboot.  If it’s worked, the command ‘nvidia-smi’ should show GPU status information.

NOTE:  If it doesn’t work (i.e. the desktop comes up in low-resolution mode, and ‘Settings -> Graphics’ still shows ‘NV136’ or ‘llvm…’), the most likely fix is to DISABLE SECURE BOOT in the BIOS – secure boot stops some drivers being loaded.  This was the problem on my machine, and disabling secure boot made the driver I’d already installed work.

Follow the CUDA installation instructions on the NVIDIA site, using the Ubuntu package manager version:

sudo apt-get install cuda

The CUDA toolkit seems to be packaged separately:

sudo apt install nvidia-cuda-toolkit

As per the CUDA post-install instructions: add the NVIDIA directory to PATH, then install, compile and run the samples.

OpenCV on CUDA

I recently had the opportunity to do some work on an NVidia Jetson TK1 – a customer is hoping to use these (or other powerful devices) for some high-end embedded vision tasks.

First impressions of the TK1 were good.  After installing the host environment on an Ubuntu 14.04 64-bit box and ‘flashing’ the TK1 with the latest version of all the software (including Linux4Tegra), I ran some of the NVIDIA demos – all suitably impressive.  It has an ARM quad-core CPU, but the main point of it is the CUDA GPU, with 192 cores, giving a stated peak of 326 GFLOPS – not bad for a board that is under 13 cm square.  It’s a SIMD (Single Instruction Multiple Data) processor, also known as a vector processor – so their claim that it is a ‘minisupercomputer’ isn’t too wildly unrealistic – although just calling it a ‘graphics card with legs’ would also be fair.

I wrote some sample OpenCV programs using OpenCV4Tegra, utilising the GPU and CPU interchangeably, so we could do some performance benchmarks.  The results were OK, but not overwhelming.  Some code ran up to 4x faster on the GPU than the CPU, while other programs didn’t see that much benefit.  Of course, GPU programming is quite different from CPU programming, and not all tasks will ‘translate’ well to a vector processor.  One task we need in particular – stereo matching – might benefit more.
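The benchmarking code itself was C++ against OpenCV4Tegra, but the shape of such a CPU-vs-GPU comparison is easy to show. Here is a rough Python sketch, assuming an OpenCV build with the CUDA modules enabled – the chosen operation and sizes are purely illustrative:

# Rough CPU-vs-GPU timing sketch for a single OpenCV operation (Gaussian blur).
# Assumes an OpenCV build with the CUDA modules enabled; purely illustrative.
import time
import cv2
import numpy as np

img = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)

# CPU version.
t0 = time.time()
for _ in range(100):
    cv2.GaussianBlur(img, (31, 31), 0)
print('CPU time:', time.time() - t0)

# GPU version: upload once, then run the CUDA filter repeatedly.
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)
gauss = cv2.cuda.createGaussianFilter(cv2.CV_8UC3, cv2.CV_8UC3, (31, 31), 0)
t0 = time.time()
for _ in range(100):
    gauss.apply(gpu_img)
print('GPU time:', time.time() - t0)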

We will do more work on this in due course.  We will also be comparing the processing power to a Raspberry Pi 3, and some Odroids, as part of our evaluation of suitable hardware for this demanding embedded project.  More results will be posted here as we get them.

Why is the recent ‘Go’ victory so important for AI? (Part 2)

(Since I wrote Part 1 of this article, the ‘AlphaGo’ AI won the 5th game in the series, giving a 4:1 victory over one of the top human players, Lee Sedol).

We have already discussed how ‘Go’ is much more difficult for a computer to play than Chess – mainly because the number of possible different moves per turn is so much bigger (and so the total ‘game space’ is even more vast), and because deciding how ‘good’ a particular board position is, is so much harder with ‘Go’.

First, let’s address one of the points the mainstream press have been making:  No, the ‘artificial intelligence’ computers are not coming to get us and annihilate the human race (I’ve seen articles online that pretty much implied this was the obvious next step).  Or at least, not because of this result.  ‘Go’ is still a ‘full information, deterministic’ game, and these are things computers are good at, for all ‘Go’ is about as hard as such games get.  This is very different from forming a good understanding of a ‘real world’ situation such as politics, business or even ‘human’ actions such as finding a joke funny, or enjoying music.

But back to ‘Go’.  With Chess, the number of possible moves per turn means that looking at all possible moves beyond about 6 moves out is not a sensible approach.  So, pre-programmed approaches (‘heuristics’) are used to decide which moves can safely be ignored, and which need looking at more closely.

With ‘Go’, even this is not possible, as no simple rules can be programmed.  So, how did ‘AlphaGo’ tackle the problem?

The basic approach (searching the ‘game tree’) remained similar, but more sophisticated.  Decisions about which parts of the tree to analyse in more detail (and which to ignore) were made by neural networks (of which more later).

Similarly, the ‘evaluation function’ which tries to ‘score’ a given board position had to be more sophisticated than for Chess.  In Chess, the evaluation function is usually written (i.e. programmed into the software) by humans – indeed, in the 1997 Kasparov match won by IBM’s Deep Blue, the evaluation function was even changed between games by a human Grand Master, a cause of some controversy at the time (i.e. had the computer really won ‘alone’, or had the human operators helped out, albeit only between games).

In ‘AlphaGo’, another neural network (a ‘deep’ NN) was employed to analyse positions.  And here lies the real difference.  With AlphaGo, the software analysed a vast number of real games, and learned by itself what are features of good board positions.  Having done this, it then played against itself in millions more games, and in doing so was able to fine tune this learning even further.

It learned how to play ‘Go’ well, rather than being programmed.

This ‘deep neural network’ approach is the hallmark of many modern ‘deep learning’ systems.  ‘Deep’ is really just the latest buzzword, but the underlying concept is that the software was able to learn – and not just learn specific features, like a traditional neural network, but also to learn which features to choose in the first place, rather than having features hand-selected by a programmer.

We’ve probably got to the stage now where the perennial argument – are computers ‘really intelligent’, or just good at computing – has become fairly irrelevant.  AI systems are now able to not only learn a given set of features, but to choose those features themselves – this is how human (and other animal) brains work.   This is undoubtedly a very powerful technique, which will guide the future of AI for the next few years.


Why is the recent ‘Go’ victory so important for AI? (Part 1)

Anyone who has seen any news in the last few days will know that a computer has for the first time beaten a top human player at the ancient Chinese game of ‘Go’.  In fact, at the time of writing, the AI (let’s call it by its name:  AlphaGo) has beaten its opponent 3 times, and the human (Lee Sedol) has won one – the fifth in the series takes place shortly.  But why is this such important news for AI?

After all, AI has been beating top grandmasters at Chess for a while now – Garry Kasparov was beaten by a computer in 1997, and although the exact ‘fairness’ of those matches has been questioned by some, it’s certainly been the case since about 2006 that a ‘commercially available’ computer running standard software can beat any human player on the planet.

So why is ‘Go’ so different?  In many ways, it’s a very similar game.  It’s ‘zero-sum’ (meaning one player’s loss exactly matches the other player’s gain), deterministic (meaning there is no random element to the game), partisan (meaning each player has their own moves – you can only play your own pieces or stones), and ‘perfect-information’ (meaning both players can see the whole game state – there are no hidden elements or information).  Just like Chess.

From an AI point of view, two things make ‘Go’ vastly more difficult than Chess.

Firstly, the board is a lot bigger (19×19), meaning that the average number of legal moves per turn is around 200 (compared to an average of 37 for chess).  This means that the ‘combinatorial explosion’ (which makes chess difficult enough) is much worse for ‘Go’:  to calculate just the next 4 moves (2 for each player) would mean analysing around 1.6 billion (200⁴) board positions – and looking ahead only 2 moves each would give a pathetically weak game.

The second factor is that for Chess, analysing the ‘strength’ of a board position is fairly easy.  The material ‘pieces’ each player owns are all worth something that can be approximated with a simple scoring system, and that can be made more elaborate with some simple extra strategic rules (knights are more valuable near the centre, pawns are best arranged in diagonals, etc).  But for ‘Go’, a simple ‘piece counting’ system is nothing like a useful enough indicator of the advantage a player has in the game, and no ‘simple rules’ can be written which help.

Instead, good human players (and even relative amateurs) can assess a board position, more or less just by using their intuition, and that intuition is where a lot of the best play comes from.  Computers, of course, are not well known for their use of ‘intuition’.

I’ll write more about the approach ‘AlphaGo’ used – and why this has wider implications for AI in general – in a follow-up article in the next few days.