Computer Vision – a Developer’s Viewpoint

This article, written by me, originally appeared on the website of the Institution of Analysts and Programmers.

Computer Vision A Developer’s Viewpoint

Tom Reader
Alver Valley Software Limited

Computer Vision is an exciting branch of computer science for the programmer, combining mathematics, artificial intelligence and machine learning techniques, with more traditional programming skills and problem solving. It can require large amounts of computing horsepower, and as such the theory has often been ahead of what has been practically possible. These days however, the hardware has caught up, and real, practical applications are now hitting the mainstream. Tom Reader, a computer vision expert at Alver Valley Software Limited, gives us a look back at the history, and a glance at the future of this fascinating field.

Introduction

Computer vision can be summarised as the science of turning an image into usable information. The classic example is probably Automatic Numberplate Recognition (ANPR): Given an image of a car, extract the number plate text. Given a picture of a face, classify it as male or female, identify the emotion they are showing, or maybe even try to recognise the specific person. Given an image of a level crossing, identify whether or not there is a vehicle stranded on the railway line. Given an image of a station platform, estimate the number of people present. Given an image of a part on a production line, is it perfect or does it have a defect? Given a film taken from inside a shop window, estimate the number (and gender, age, etc) of people who stop to look at the window display, and for how long it keeps their attention. In all cases, we’re trying to distil an image (high dimensional data) into a simpler representation (maybe even binary or one-dimensional: yes/no, male/female, etc).

So what’s the problem?

The problems are immense. For a start, a typical image might contain 10MB of data, usually split into 3 separate colour channels. In almost all cases, the images presented will be ‘noisy’ – for example, they may be out of focus, include reflections and other distractions, and some items may be obscured by other items. Camera lenses almost always add their own ‘errors’ (photographers will be familiar with barrel distortion and chromatic aberrations, for example), and an image that has been stored as a JPEG may have extra ‘artefacts’ added due to the compression.

Even given a clear image, the task of explaining to a computer, in a conventional programming language, how to read a line of text or identify a person, can probably be imagined by anyone familiar with programming of any sort. As an added complication, humans see colour in a very different way to how computers process it, so anything to do with colour matching has to take that into account.

And it’s so easy for us humans…

Part of the problem as a computer vision professional is that humans (and other animals) are incredibly good at vision – we have very highly adapted visual systems and powerful brains. In the case of humans, some of this begins from a very early age – a baby will look towards a human face almost from birth. A few years later, when a bit of knowledge and common sense has developed, a small child will be capable of looking at a photo (or real-world scene) and answering questions like “what colour is the car?”, or “how many cats can you see?”. The ease with which we do this makes it hard to explain to clients that these are very difficult problems for a computer system.

Basic techniques

The basic building blocks of computer (and probably animal) vision include low-level mathematical techniques. Changes between colour and brightness levels are analysed with the aim of detecting ‘edges’ and the ‘regions’ that they separate. Even this is a difficult problem, involving significant computing power, and some of the theory was only developed surprising recently. In humans, some of this is achieved by very low level processing, including some in the retina itself before the signal even reaches the brain. In the computer, it’s pixels all the way, and everything has to be programmed from there.

Having done the simple processing, things get worse. Given the set of edges and/or regions, which are ‘real’, and which are image artefacts? For example, given a photo of a bowl of fruit (one of the computer vision staple test images), which edges are the actual edges of pieces of fruit, and which are just the lines of shadows? A yellow grapefruit may look dark grey when viewed in deep shadow, and almost white at the point where the light reflects off it. A child of 4 could point at the grapefruit, but it’s not a simple problem to solve in software.

Other basic techniques include blurring, sharpening, resizing, contour finding, and converting to different image representations.

Looking for ‘features’

Often, the points of interest in an image boil down to a set of ‘features’. For example, a letter ‘A’ has a sharp point, a ‘B’ and a ‘D’ have some rounded corners and some sharp corners, a ‘C’ has a rounded corner and open ends, an ‘O’ has rounded corners but no point or ends, a ‘4’ and an ‘X’ both have crossed lines. Those ‘features’ can be extracted, and then used as inputs for higher-level processes to work with.

Higher level techniques

Having extracted features from an image, interesting techniques can then be applied. Artificial Intelligence (AI) techniques such as neural networks and support vector machines can be used to automatically ‘learn’ from a large enough set of training data, and then to classify future, previously unseen cases. But choosing the correct features in the first place (and writing the code to extract them) is essential, and then all the normal AI problems (e.g. how to train and test a neural network) still apply.

Vision libraries

Not all the code has to be written from scratch. There are libraries (both commercial, and open source) that can help with a lot of the leg-work. I use the OpenCV library, which is open source, and contains a large number of highly optimised routines for doing some of the low-level work, and also has some higher-level tools in the box. It is still just a tool-box though – although it includes some ready-to-go solutions straight ‘out of the box’, it is mostly a programming library.

Development techniques

Computer vision projects present some unique challenges to the developer. I work mostly as part of a team, but I tend to be the main (or only) computer vision programmer, with other people doing other parts of the task – user interface, back-end integration, communications, etc. From the computer vision perspective, I have found myself concentrating on a number of areas apart from the problem-solving aspects of the vision work itself.

Firstly, keep the basics in place. For example, I have become increasingly keen on my software writing a proper log, either to file on disk, or via callbacks to the calling program in the case of an API. Either way, I need to be able to see everything my program is doing, at a configurable level of detail – this is essential when trying to identify problems which will inevitably crop up. Also, documentation is essential, especially when working on multiple projects concurrently. I use Doxygen (with standard in-code comments, which I use without thinking) to give me a good overview of each program down to the class, function and parameter level.

Secondly, the computer vision process is almost always a workflow. Images pass through many processes (not always the same pathway, as images get processed), and the information gradually gets extracted to higher levels. Bearing this in mind, I write a series of ‘debug’ images to disk at pre-determined points, to aid understanding problems that will occur with some images later on. For example, did a problem occur because the image was too blurred or distorted, or was the image fine but the neural network came up with a wrong classification?

Thirdly, computer vision usually involves some aspect of machine learning. One thing this entails is huge volumes of data for training and testing. A typical sub-project may require training with 10,000 images, half of which do contain a certain object, and half of which don’t. This data all has to be generated or extracted somehow. I spend quite a bit of time designing, writing, and using a variety of in-house ‘data wrangling’ tools to handle and manage all this data, and of course also using standard tools such as those built into Linux where possible.

At the ‘project’ level, things also need managing. Often, I will initially be given a few hours to see if a task is feasible at all, and I will produce a ‘quick and dirty’ test to try out a few concepts with some simple test images. If it goes well, things may proceed to the next stage – basically to see how far we can get with this idea – so the code develops further. Later on, the project will get more serious, the images will become ‘noisier’ and more realistic, and the expectation of correct results will get closer to 100%. Finally, things get ready to go into production, almost always as part of a larger software infrastructure, with all normal expectations of stability, scalability (maybe into the cloud), manageability, performance, error handling, etc.

During this process, it is essential to remember that the code needs to match the stage in the project it embodies. During the ‘proof of concept’ stage, a few lines of uncommented Python may be enough to try things out, and we can let the error checking slide for a while. But if the project proceeds to later stages, it is essential to make sure the code quality keeps up. At some point the language decision needs to be made (computer vision is compute-intensive, both in terms of CPU and RAM, so C++ is popular), integration decisions need to be made, and documentation needs to be kept up to date.

Future applications

Away from the ‘traditional’ applications, computer vision is now finding new niche markets. For example ‘augmented reality’, where a computer superimposes certain information onto a ‘Head Up Display’ in the user’s vision, can be further improved with computer vision. Smartphones are now powerful enough to run some computer vision tasks. Other platforms also are, or will be: although Google Glass was only a prototype, and appears to have gone back to the drawing board for now, I wrote a simple ANPR application on that platform two years ago. No doubt similar platforms will hit the market eventually, and there are many computer vision possibilities.

Deep learning

As computing power and storage continues to grow, artificial intelligence improves, and computer vision advances. Given every image on the Internet, almost infinite computing power, and a few hundred experts in the field, it should be possible to create some really good computer vision applications – keep an eye on what Google are coming up with for details. They would love to be able to classify a set of images into folders such as ‘the beach’, ‘the kids’, ‘Christmas photos’, etc, and they have made huge progress towards that recently. That kind of ‘deep learning’ is almost a higherlevel of abstraction still – if and how those techniques will ‘filter down’ to the handling of computer vision for specific tasks, remains to be seen.

For now, with ever increasing computing power and research results, I can’t think of any more rewarding branch of computer science in which to work.

Tom Reader

Alver Valley Software Limited

www.alvervalleysoftware.com

Distributed Computing

I have been an enthusiastic supporter of distributed computing projects since the early days of SETI@Home. The computing power in a modern PC is too useful to waste, and can now be harnessed to help research some of mankind’s most serious problems. As such, I now ‘crunch’ mostly for the World Community Grid – all Alver Valley Software machines are on 24×365, running life sciences projects (particularly cancer research). This work is done in the background, i.e. in addition to the normal usage of my computer, with no direct involvement from me at all. My output is usually somewhere around 100 Gflops – my statistics summary can be seen below. More importantly, if you have a PC that is switched on for more than a couple of hours a day, please consider joining the project – just click on the ‘World Community Grid’ graphic below to read more.

Computer Vision development

Often computer vision projects begin with a very early proof-of-concept, or at best a prototype, to see whether something is going to be feasible. Hopefully it turns out to be, so things then progress to a ‘see how much we can get working’ stage. As long as that goes well, then things eventually move towards a live production release. During this process, the images I’m expected to handle can become more and more ‘real world’: I start to get more noisy images, focus problems, reflections – the nice clean images I was originally testing with seem trivial now. In response, the code (and the computer vision workflow it embodies) gets ever more complex, often evolving from the original prototype, sometimes as a complete rewrite. This is a familiar pattern I’ve experienced across many projects.

Managing this process in a sensible way is something I’ve learned is very important. It’s essential to be aware of how the project is moving from proof-of-concept to production, and to make sure the code-base keeps up. Running multiple projects often in parallel, it’s even more important to keep documentation up to date, and to generally ‘run a tight ship’ on the development front.

Much of this is just good development practice, while some is more specific to computer vision projects. Some specific things I do include:

Write a program log, with variable-level tracing showing what is going on. I do this in a formalised way so I can identify problem areas in the workflow for a given image very quickly.
Write debug versions of the images as they move through the workflow, again in a defined and standard way.
Comment code in a standard way.
Use tools such as Doxygen to document the program automatically. The class and function dependencies diagrams alone are worth the small amount of effort.
Use Git to manage the version control, locally, and on the client’s system as well if appropriate. This one has been a learning curve for me, but is paying off now.
Maintain a document giving a brief overview of the whole system layout.

I don’t use truly ‘formal’ development methods – they don’t bring benefits to most projects of the type I’m involved with – but a little bit of good development practice goes a long way, and means the client ends up getting a documented, maintainable code-base and a production-ready solution.

Colour and pattern matching

Lots of work for a client recently in the area of colour matching, and moving on from there, the more challenging problem of matching patterns (i.e. groups of colours, arranged into shapes or regions of different sizes).

Although OpenCV provides the basic tools needed to work with colour (primarily, the ability to switch between colour spaces like RGB and HSV), matching colours is more challenging than it first seems. In this instance, we were trying to match colours as the human eye would perceive them, which is different from the simple measure of how ‘close’ two colours are in a simple RGB cube. From there, a further complication is that much of the matching was done from photographs taken in the ‘real world’, where lighting and shading varies widely – even something which is the ‘same colour’ across all its surface may appear very different at different points in the photo, or across several photos. Deciding what is a difference in colour, as opposed to a difference in lighting, is tricky, but there are techniques that can be used to help. In our case, we also had to decide which part of the image was the ‘object’, and which was the background to be ignored completely.

Moving on to matching patterns adds further complexity. Is it more important to match colours as exactly as possible, regardless of size/shape of the pattern, or vice versa? We created a test set containing around 20,000 images to experiment with the best ways of finding a pattern that most closely matches another pattern, and have found a good balance – but it’s not a simple problem.

Chess – the classic AI problem

To settle a pub argument that took place around 20 years ago (I’ve been busy, OK?!) about artificial intelligence and emergent behaviour, and because I have some work coming up in this area, I have finally got around to writing a simple chess engine.

In terms of the pub argument, it has immediately proved that (a) yes, a chess engine can be very easily capable of beating the person who wrote it, and (b) even though it has no built-in knowledge of tactics such as forks, skewers or discovered attacks, it can and will use all such techniques during a game to very good effect. In other words, that behaviour ’emerges’ from the raw computation.

Neither of those things will come as any surprise to anyone who understands anything
about AI, of course. But it was fun to write, and it’s actually quite fun to play against, but also fun to play *with*. For example, it’s interesting to see how it chooses different moves depending on how many moves ahead it’s allowed to look.

When away from the field of computer vision, the speed of a modern computer is astounding. My code is currently very unoptimised, and was written mostly for ease of writing rather than with performance in mind – and yet, on a normal desktop PC, running on one core only, it is capable of generating, and analysing, approximately one million moves (board positions) per second.

Of course, chess is a famous example of a problem with a ‘combinatorial explosion’, meaning that for even relatively small numbers of moves to look ahead, the number of possible board positions rapidly becomes fantastically large. At 5 moves ahead, it is looking at around 100 million board positions – suddenly a millions moves per second doesn’t seem so much.

I am currently playing it ‘against’ another computer chess game, which after a few games will provide an estimated ELO score – I’ll report back in a day or two as to the results of that.

The current algorithm is pure ‘Minimax’, with no modifications. It has no opening book or endgame database, so the play is a bit ragged at those stages. The first couple of moves are fairly random, but as the two sides ‘engage’ it starts making proper moves. By the end-game (at least when playing against me) it’s usually got enough of an advantage to be fairly decisive – it usually manages a check-mate during the ‘main’ part of the game, rather than waiting for a typical ‘endgame’ situation where we’ve only got 2-3 pieces each.

When time allows, I have other plans for this – after building in simple alpha-beta pruning to speed it up, I want to start to work on strategies – but from an AI point of view, I want it to be able to learn strategies itself, rather than be pre-programmed with them. I have some ideas in mind, and it should be interesting.

More OCR work, and the start of some ‘pure’ AI

Another busy month getting some OCR work up to production standard – there’s a big difference between ‘basically working’ and ‘industrial grade’. It’s taken a lot of work, but is there now.

I have written a test harness that allows bulk amounts of test data to be tested and retested, in as little as 1 second each time, and in an automated way after each round of training. A bit of ‘infrastructure’ and formal testing goes as long way at this stage.

Success rates in the test suite are now up to 100%, and the real prodution environment beckons.

In other news, I have some work to do on a ‘pure’ AI project – some software to take one player’s turn in a ‘full information’ two-player game, and associated research on scoring algorithms. The idea is to see what level of ‘intelligent behaviour’ emerges from just the raw computation, given enough time and CPU cores. It sounds like my idea of fun 🙂

2D point matching (or pose estimation) based just on relative position

For a client’s project recently I needed to be able to correlate the positions of a number of points in 2D space from one image to another. These aren’t ‘features’ as usually handled by routines such as SIFT/SURF/ORB, etc, but just 2D points with no other attributes at all. Most of the points in image A will be present in image B, but some may be missing, there may be extras, and image B may be rotated, scaled, and translated in x,y, by any amount.

It turns out to be quite an interesting problem. Luckily the number of points in the images were fairly small (<50), so a brute force approach works – and it does work well. As long as the most of the points in image A are present in image B, in something close to the same positions relative to each other, then they are found correctly in image B – and importantly the algorithm knows which points correlates with which, so the position of each point can be ‘followed’ into the new image. Finally, it returns the amount of rotation, scaling and translation applied.

If there were a large number of points, the efficiency of the algorithm would be a problem, and a different approach would be needed, but for this application it has worked very well.

Computer Vision with OpenCV on a Raspberry Pi

This week I have taken delivery of a Raspberry Pi 2, and a Pi camera module: Total cost around UKP50. The aim of the experiment is to see whether the Pi is powerful enough to be used for computer vision applications in the real world. More of that over the coming days, but the short version is: Yes it is.

I also needed several other Pi-related components (again, more details of the fun we’re having at a later date). For various reasons mostly to do with who had what in stock, I split the purchases between two UK companies – 4Tronix, who supply all sorts of superb robotics stuff for Pi and Arduino, and The Pi Hut, who as the name implies sell all things Pi-related. Both orders were handled quickly, and I recommend both companies highly.

Setting up the new Pi took 2 minutes, and attaching the camera module is easy, if slightly fiddly.

I used the ‘picamera’ module and was getting images displayed on screen, and saved to the filesystem, all within a further few minutes. The ‘picamera’ module appears to be a very well written library, and the API is certainly powerful.

It was then time to build OpenCV. This is a slightly more involved process (build it from the source code), which took a few minutes of hands-on time, followed by about 4 hours of waiting for it to compile. A quick experiment then showed OpenCV working properly from both C++ and Python.

The picamera module can process images in such a way that they can be handled by OpenCV – the interface between the two is straightforward. As such, within a few more minutes I was grabbing images live from the Pi camera module, and processing them with normal OpenCV Python calls. I don’t yet know what would be involved in getting images from the camera from C++, but with a Python interface this good, it may not be necessary to worry about it (Python can of course call C/C++ routines anyway).

Initial impressions are that it all works beautifully. On the *initial* setup, it seems to take about one second to capture a frame from the camera, but the good news is that OpenCV processing (standard pre-processing such as blurring, and Canny edge detection) are faster than I’d expect from a computer this size. After playing with a few settings, I am now able to increase the frame rate to many frames per second at capture, and around 4 FPS even including some OpenCV work (colour conversion, blur, and Canny edge detection) – bearing in mind some of those are compute-intensive tasks, I think that’s impressive.

So yes: The Raspberry Pi 2 and the Pi camera module are certainly suitable for computer vision tasks using OpenCV, and I have two contracts lined up already to work on this.

Some more OpenCV tricks

A busy month of OpenCV contracting for a number of clients, including some work in areas of OpenCV I’ve not used much, if at all, before (non-chargeable, of course – I only charge for productive time).

I am now more familiar than I ever thought I’d be with the HoughLines(P) and HoughCircles functions – the former of which is more complex than it first seems. Like many things in computer vision, it takes some coaxing to get good results, and even more coaxing to get really robust results across a range of ‘real live’ images in the problem domain.

I have also worked a lot this month with the whole ‘camera calibration’ suite of functions, and then followed that up by gaining experience with the ‘project image points into the plane’ routines, which can lead to some interesting ‘augmented reality’ applications. However, in my case, I’ve used them to simply determine exactly where (in the 2D image) a specific point in the 3D space would appear. It works very well, and I have a project lined up ready to put this into action.

I’ve revisited one of my ‘favourite’ (i.e. most used) parts of the library: contour finding, and associated pre- and post-processing, but this time all from Python.

During the last few days, I’ve started looking at 2D pose estimation: specifically in this case, trying to determine the location of a known set of 2D points in a target image, given possible translation, rotation and scale invariance. Not finished with that one, yet.

Last (but not least – this isn’t going to go away) I’ve been making an effort to learn Git. I was pleased to find this simple guide, which at least let me get on with my work while I learn the rest.

OpenCV with Python – first impressions

I’ve spent a month or so trying to make an effort to learn Python, mostly by forcing myself to do any new ‘prototype’ vision / OpenCV work in the language. This has cost me some money – I only charge for ‘productive’ time, not ‘learning’ time, and at times the temptation to go back to ‘nice familiar C++’ has been great. But I’ve made good progress with Python, and I’m glad I’ve stuck at it. Apart from anything else, the language itself isn’t hard to pick up.

The pros and cons from a computer vision perspective are roughly as expected. It can be slower to run, but depending on how the code is written, it’s not a big difference. Once ‘inside’ the OpenCV functions, the speed appears to be about the same (as you’d expect: it’s just a wrapper for the same code), but any code run actually in Python needs careful planning, and if large amounts of compution were going to be done, C++ would no doubt still be the best bet.

But anything it lacks in runtime speed, it certainly makes up for in speed of development. As a prototyping language, I think I’m already more productive in Python than C++ (and that’s after 20+ years of C++, and a month of part-time Python). There will always be more to learn, of course, but I think I’m at the point where the learning curve is beginning to get less steep.