Computer Vision – a Developer’s Viewpoint

This article, written by me, originally appeared on the website of the Institution of Analysts and Programmers.

Computer Vision A Developer’s Viewpoint

Tom Reader
Alver Valley Software Limited

Computer Vision is an exciting branch of computer science for the programmer, combining mathematics, artificial intelligence and machine learning techniques, with more traditional programming skills and problem solving. It can require large amounts of computing horsepower, and as such the theory has often been ahead of what has been practically possible. These days however, the hardware has caught up, and real, practical applications are now hitting the mainstream. Tom Reader, a computer vision expert at Alver Valley Software Limited, gives us a look back at the history, and a glance at the future of this fascinating field.

Introduction

Computer vision can be summarised as the science of turning an image into usable information. The classic example is probably Automatic Numberplate Recognition (ANPR): Given an image of a car, extract the number plate text. Given a picture of a face, classify it as male or female, identify the emotion they are showing, or maybe even try to recognise the specific person. Given an image of a level crossing, identify whether or not there is a vehicle stranded on the railway line. Given an image of a station platform, estimate the number of people present. Given an image of a part on a production line, is it perfect or does it have a defect? Given a film taken from inside a shop window, estimate the number (and gender, age, etc) of people who stop to look at the window display, and for how long it keeps their attention. In all cases, we’re trying to distil an image (high dimensional data) into a simpler representation (maybe even binary or one-dimensional: yes/no, male/female, etc).

So what’s the problem?

The problems are immense. For a start, a typical image might contain 10MB of data, usually split into 3 separate colour channels. In almost all cases, the images presented will be ‘noisy’ – for example, they may be out of focus, include reflections and other distractions, and some items may be obscured by other items. Camera lenses almost always add their own ‘errors’ (photographers will be familiar with barrel distortion and chromatic aberrations, for example), and an image that has been stored as a JPEG may have extra ‘artefacts’ added due to the compression.

Even given a clear image, the task of explaining to a computer, in a conventional programming language, how to read a line of text or identify a person, can probably be imagined by anyone familiar with programming of any sort. As an added complication, humans see colour in a very different way to how computers process it, so anything to do with colour matching has to take that into account.

And it’s so easy for us humans…

Part of the problem as a computer vision professional is that humans (and other animals) are incredibly good at vision – we have very highly adapted visual systems and powerful brains. In the case of humans, some of this begins from a very early age – a baby will look towards a human face almost from birth. A few years later, when a bit of knowledge and common sense has developed, a small child will be capable of looking at a photo (or real-world scene) and answering questions like “what colour is the car?”, or “how many cats can you see?”. The ease with which we do this makes it hard to explain to clients that these are very difficult problems for a computer system.

Basic techniques

The basic building blocks of computer (and probably animal) vision include low-level mathematical techniques. Changes between colour and brightness levels are analysed with the aim of detecting ‘edges’ and the ‘regions’ that they separate. Even this is a difficult problem, involving significant computing power, and some of the theory was only developed surprising recently. In humans, some of this is achieved by very low level processing, including some in the retina itself before the signal even reaches the brain. In the computer, it’s pixels all the way, and everything has to be programmed from there.

Having done the simple processing, things get worse. Given the set of edges and/or regions, which are ‘real’, and which are image artefacts? For example, given a photo of a bowl of fruit (one of the computer vision staple test images), which edges are the actual edges of pieces of fruit, and which are just the lines of shadows? A yellow grapefruit may look dark grey when viewed in deep shadow, and almost white at the point where the light reflects off it. A child of 4 could point at the grapefruit, but it’s not a simple problem to solve in software.

Other basic techniques include blurring, sharpening, resizing, contour finding, and converting to different image representations.

Looking for ‘features’

Often, the points of interest in an image boil down to a set of ‘features’. For example, a letter ‘A’ has a sharp point, a ‘B’ and a ‘D’ have some rounded corners and some sharp corners, a ‘C’ has a rounded corner and open ends, an ‘O’ has rounded corners but no point or ends, a ‘4’ and an ‘X’ both have crossed lines. Those ‘features’ can be extracted, and then used as inputs for higher-level processes to work with.

Higher level techniques

Having extracted features from an image, interesting techniques can then be applied. Artificial Intelligence (AI) techniques such as neural networks and support vector machines can be used to automatically ‘learn’ from a large enough set of training data, and then to classify future, previously unseen cases. But choosing the correct features in the first place (and writing the code to extract them) is essential, and then all the normal AI problems (e.g. how to train and test a neural network) still apply.

Vision libraries

Not all the code has to be written from scratch. There are libraries (both commercial, and open source) that can help with a lot of the leg-work. I use the OpenCV library, which is open source, and contains a large number of highly optimised routines for doing some of the low-level work, and also has some higher-level tools in the box. It is still just a tool-box though – although it includes some ready-to-go solutions straight ‘out of the box’, it is mostly a programming library.

Development techniques

Computer vision projects present some unique challenges to the developer. I work mostly as part of a team, but I tend to be the main (or only) computer vision programmer, with other people doing other parts of the task – user interface, back-end integration, communications, etc. From the computer vision perspective, I have found myself concentrating on a number of areas apart from the problem-solving aspects of the vision work itself.

Firstly, keep the basics in place. For example, I have become increasingly keen on my software writing a proper log, either to file on disk, or via callbacks to the calling program in the case of an API. Either way, I need to be able to see everything my program is doing, at a configurable level of detail – this is essential when trying to identify problems which will inevitably crop up. Also, documentation is essential, especially when working on multiple projects concurrently. I use Doxygen (with standard in-code comments, which I use without thinking) to give me a good overview of each program down to the class, function and parameter level.

Secondly, the computer vision process is almost always a workflow. Images pass through many processes (not always the same pathway, as images get processed), and the information gradually gets extracted to higher levels. Bearing this in mind, I write a series of ‘debug’ images to disk at pre-determined points, to aid understanding problems that will occur with some images later on. For example, did a problem occur because the image was too blurred or distorted, or was the image fine but the neural network came up with a wrong classification?

Thirdly, computer vision usually involves some aspect of machine learning. One thing this entails is huge volumes of data for training and testing. A typical sub-project may require training with 10,000 images, half of which do contain a certain object, and half of which don’t. This data all has to be generated or extracted somehow. I spend quite a bit of time designing, writing, and using a variety of in-house ‘data wrangling’ tools to handle and manage all this data, and of course also using standard tools such as those built into Linux where possible.

At the ‘project’ level, things also need managing. Often, I will initially be given a few hours to see if a task is feasible at all, and I will produce a ‘quick and dirty’ test to try out a few concepts with some simple test images. If it goes well, things may proceed to the next stage – basically to see how far we can get with this idea – so the code develops further. Later on, the project will get more serious, the images will become ‘noisier’ and more realistic, and the expectation of correct results will get closer to 100%. Finally, things get ready to go into production, almost always as part of a larger software infrastructure, with all normal expectations of stability, scalability (maybe into the cloud), manageability, performance, error handling, etc.

During this process, it is essential to remember that the code needs to match the stage in the project it embodies. During the ‘proof of concept’ stage, a few lines of uncommented Python may be enough to try things out, and we can let the error checking slide for a while. But if the project proceeds to later stages, it is essential to make sure the code quality keeps up. At some point the language decision needs to be made (computer vision is compute-intensive, both in terms of CPU and RAM, so C++ is popular), integration decisions need to be made, and documentation needs to be kept up to date.

Future applications

Away from the ‘traditional’ applications, computer vision is now finding new niche markets. For example ‘augmented reality’, where a computer superimposes certain information onto a ‘Head Up Display’ in the user’s vision, can be further improved with computer vision. Smartphones are now powerful enough to run some computer vision tasks. Other platforms also are, or will be: although Google Glass was only a prototype, and appears to have gone back to the drawing board for now, I wrote a simple ANPR application on that platform two years ago. No doubt similar platforms will hit the market eventually, and there are many computer vision possibilities.

Deep learning

As computing power and storage continues to grow, artificial intelligence improves, and computer vision advances. Given every image on the Internet, almost infinite computing power, and a few hundred experts in the field, it should be possible to create some really good computer vision applications – keep an eye on what Google are coming up with for details. They would love to be able to classify a set of images into folders such as ‘the beach’, ‘the kids’, ‘Christmas photos’, etc, and they have made huge progress towards that recently. That kind of ‘deep learning’ is almost a higherlevel of abstraction still – if and how those techniques will ‘filter down’ to the handling of computer vision for specific tasks, remains to be seen.

For now, with ever increasing computing power and research results, I can’t think of any more rewarding branch of computer science in which to work.

Tom Reader

Alver Valley Software Limited

www.alvervalleysoftware.com