Pixel-perfect perception: an intro to Computer Vision using Artificial Neural Networks

MPB Tech · Oct 1, 2021

By MPB’s Senior Software Engineer, Romain Beaugrand.

Cross the Thames between Dartford and Thurrock these days and you’re automatically charged, whether you know it or not. Forget to pay and you’ll receive a penalty charge notice by post. Irritating for sure, but you can’t help admiring the technology that accurately reads thousands of number plates every hour as they pass at 50mph-plus.

In the office, Photoshop’s subject-selection tools seem to just keep on improving. Fully driverless cars are around the corner, figuratively if not literally, and face recognition looks set to change society yet again.
Computer vision is very much A Thing, but how does it really work? Come to think of it, how does human vision work once you get beyond the retina? And what kinds of technology will power the automated future of labour?

In this article I’ll look at the surprisingly simple calculations that underpin some visual perception algorithms. I’ll also discuss how neural networks can break down the world of pixels into one of moving vector shapes, perhaps modelling the way humans see. And while I love experimenting with this kind of problem, I’m nobody’s idea of an expert — I’m only qualified to offer an introduction and I’ll finish with some references to the leading lights in Computer Vision. But first, let’s define just what that term means.

Making computers ‘see’

Computer Vision (CV) isn’t a single discipline. Here I’ll be talking about artificial perception but many more technologies might sit on top of that. Object detection and recognition are out of scope, as are human vision, scene reconstruction, image restoration and many more.
What I will look at is how a computer can turn an image or video sequence into vectors that can be used for most of the above.
And it’s a challenge, of course. To computers, an image is simply a pixel grid: rows and columns of numbers that, to us, represent colours, shapes and “things”.
Dealing with pixels is laborious and underwhelming. To model vision we must go beyond and interpret the image as a surface in a 3D world.
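
To make this concrete, here is an entirely illustrative look at an image the way a computer sees it, using OpenCV and NumPy (the filename is a placeholder):

```python
# An image, to a computer, is just a NumPy array of numbers.
import cv2

img = cv2.imread("photo.jpg")   # placeholder filename; returns a NumPy array
print(img.shape)                # (height, width, 3): rows and columns of pixels
print(img[100, 200])            # one pixel: three numbers (blue, green, red)
```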

First principles

A pixel’s value encodes its colour, while its position is given by where it sits in the grid. If you remember The Dress, you’ll also be familiar with the idea that any single point in an image is partly defined by its surroundings. No pixel is an island.

Computer vision works best with ‘noisier’ images

In fact, Computer Vision likes noise — it’s much easier to pick out objects from a busy background than from a (not-very-realistic) low-contrast image.
A starting point for many CV algorithms is feature-detection. Here we take an image and look for the big contrasts between areas of pixels — edges, corners, sharp lines between dark and light.

Feature-matching two similar images. Picture: Steven Calhoun

An example of this approach might be a program that compares two images to determine whether they represent the same subject. To do that, the program looks for common areas of similar pixel groupings. This can be haphazard. In the image above we have a positive match despite the Christmas tree, but there are plenty of false positives too. Usually as programmers we like things to be binary but in CV we have to accept that there will always be errors. Optical illusions happen!
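
The article doesn’t say which algorithm produced the match above, but here is a rough sketch of the same idea using OpenCV’s ORB feature detector and a brute-force matcher (the filenames are placeholders):

```python
# A minimal sketch of feature detection and matching with OpenCV.
import cv2

img1 = cv2.imread("scene_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene_b.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)           # detect up to 500 keypoints
kp1, des1 = orb.detectAndCompute(img1, None)  # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance (the right metric for ORB descriptors).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Draw the 30 strongest matches; as noted above, some will still be false positives.
vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)
cv2.imwrite("matches.jpg", vis)
```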

A world in motion

A video is really only a succession of stills, so what applies to a photo can apply equally to moving images. We can run feature-matching on each frame, then use the changing positions of these features to detect motion. Here’s an example:

Feature-detection plus motion-detection starts to look lifelike

In the video I’ve run a feature detection routine, then used colour to mark pixels moving left (yellow-green) and right (red-purple), as well as making faster-moving (ie closer) pixels brighter. We can already see that we’re in a moving car and begin to perceive a 3D landscape.
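
This isn’t the exact routine used for the video, but a sketch along the same lines: detect corner features, track them from frame to frame with Lucas-Kanade optical flow, then colour each point by its direction of travel and brighten it with speed (the filename and colour choices are illustrative):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("driving.mp4")          # placeholder filename
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect corner features in the previous frame, then track them into this one.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    if pts is not None:
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        for p0, p1, found in zip(pts.reshape(-1, 2), new_pts.reshape(-1, 2),
                                 status.ravel()):
            if not found:
                continue
            dx, dy = p1 - p0
            speed = float(np.hypot(dx, dy))
            brightness = min(255, int(50 + speed * 40))   # faster (closer) = brighter
            # Leftward motion in yellow-green, rightward in red-purple (BGR order).
            colour = (0, brightness, brightness // 2) if dx < 0 else (brightness, 0, brightness)
            cv2.circle(frame, (int(p1[0]), int(p1[1])), 2, colour, -1)

    cv2.imshow("motion", frame)
    if cv2.waitKey(1) == 27:                    # Esc to quit
        break
    prev_gray = gray

cap.release()
cv2.destroyAllWindows()
```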

Getting convoluted

Convolution lets us multiply together two arrays to produce a third. It’s simpler than it sounds, and easier to demonstrate than to describe:

Convolution lets us multiply two arrays to create a third

On the left is an 8x8 grid representing a monochrome ‘image’. We’ll multiply it by the values in the 3x3 grid, or kernel. Our kernel creates a weighting: the central pixel is ranked twice as highly as the eight surrounding it.
We ‘read’ the image using the kernel, covering nine pixels at a time then moving it one ‘pixel’ at a time, left to right and top to bottom.

For each position, the central number is multiplied by 0.2, the others by 0.1, and our output is the sum of the resulting nine values (in effect a weighted average, since the weights add up to 1).
We end up with a 6x6 table of pixel values, each of which has been adjusted using information from the surrounding pixels. It may not look like much, but we’ve actually created a simple blurring algorithm.
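
In code, the whole thing is a few lines of NumPy. Here the ‘image’ is random, just to have something to blur; the kernel is the 3x3 weighting described above:

```python
import numpy as np

image = np.random.randint(0, 256, size=(8, 8)).astype(float)   # an 8x8 'image'

kernel = np.array([[0.1, 0.1, 0.1],
                   [0.1, 0.2, 0.1],     # the centre is weighted twice as heavily
                   [0.1, 0.1, 0.1]])

def convolve2d(img, k):
    """Slide the kernel over the image, one pixel at a time, with no padding.

    (Strictly this is cross-correlation; with a symmetric kernel like ours
    it is identical to convolution.)
    """
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # Weighted sum of the neighbourhood currently under the kernel.
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * k)
    return out

blurred = convolve2d(image, kernel)
print(blurred.shape)   # (6, 6): each value is a weighted average of 9 pixels
```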

So far, so Photoshop. But things get interesting when we use convolution for edge detection. This needs two kernels, one for vertical edges and a second for horizontal. Here’s an example of the vertical edge detection algorithm.

Using convolution to detect vertical edges

On the left, our 8x8 image is half black (‘0’) and half white (‘255’). We’re looking for vertical lines, which we define using a kernel that weights pixels to the left negatively and those on the right positively.

Now we run the convolution. The resulting 6x6 grid is immediately striking — two columns of 765s book-ended by zeroes. We’ve detected an edge. We can then run the horizontal edge algorithm by rotating our kernel through 90 degrees.
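
In code, a sketch of the same experiment (scipy’s correlate2d slides the kernel over the image without flipping it, which matches the hand calculation above):

```python
import numpy as np
from scipy.signal import correlate2d

image = np.zeros((8, 8))
image[:, 4:] = 255                       # left half black (0), right half white (255)

vertical_kernel = np.array([[-1, 0, 1],
                            [-1, 0, 1],  # left column negative, right column positive
                            [-1, 0, 1]])

edges = correlate2d(image, vertical_kernel, mode="valid")
print(edges)
# Every row reads [0, 0, 765, 765, 0, 0]: two columns of 765 (255 x 3),
# book-ended by zeroes. We've detected the vertical edge.

horizontal_kernel = np.rot90(vertical_kernel)   # rotate 90 degrees for horizontal edges
```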

Here we’ve done exactly that:

Our simple algorithm can produce surprisingly recognisable outlines

So we’ve solved the vision problem! Haven’t we … ?
Not entirely. Remember the moving car video? What if, instead of first detecting features and then tracing their movement, we try to do the same with every single pixel in the movie?

Running motion-detection on an entire video is extremely processor-intensive

Here’s where we hit trouble. The calculations are now so huge that my PC delivers results at the ground-breaking speed of … one frame every 5 minutes. For real-time processing we need to go faster. Eight thousand times faster, in fact: one frame every five minutes is 300 seconds per frame, while real-time video needs a new frame roughly every 1/25th of a second. Refine this algorithm all you like, there’s no way it will reach real-time speeds.

And if it blows your mind that your … um … mind can apparently run real-time edge and motion detection, consider that it learned to do so while you were still an infant.

We’re clearly nowhere near ready to get into this particular self-driving car. But is it really possible that your still-unformed mind was so much more powerful than the best consumer-grade equipment 2021 has to offer?
Yes and no. This is where Artificial Neural Networks (ANNs) leap in to save the day.

Artificial Neural Networks

ANNs are artificial in that they mimic some characteristics of real, biological networked neurons (nerve cells).
For present purposes, the important properties of your own neurons are these:

1. They have multiple inputs
2. A minimum amount of stimulation is required to ‘fire’ the neuron (ie pass a signal to the next neuron)
3. Neurons either fire or they don’t, so their output is Boolean (true or false).
4. They exist in, and are enabled by, huge networks.

Without getting sidetracked by the biology (fun fact: birds’ “singing brain” is coded in unary) we can model this behaviour mathematically.
Our artificial network consists of an input layer (multiple weighted inputs), one or more middle ‘calculating’ layers that sum and adjust the input values, and an output layer (though its values are floating-point numbers, not Booleans).
To demonstrate how this produces perception, let’s take the surprisingly knotty problem of recognising handwritten digits.
We’ll feed our program an input, a digit image from the MNIST database, and train it to output the correct number.

Training an ANN to recognise handwritten digits. Animation: 3Blue1Brown

We start by feeding our ANN a 28px square image of a handwritten digit. That’s 784 pixels, so our input layer has 784 neurons. Each neuron is connected to every neuron in the subsequent processing layer and each connection is weighted.

The network performs its magic and outputs a digit from 0 to 9. The program compares that with what it’s told the result should be, adjusts weightings and runs the calculations again. The best-performing set of weightings gives us our recognition algorithm.
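
The article doesn’t tie this to any particular framework, but as a rough sketch, a network like the one just described takes only a few lines of Keras (the hidden-layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# MNIST: 28x28 greyscale images of handwritten digits, scaled to 0..1.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),     # 784 input values, one per pixel
    layers.Dense(16, activation="relu"),      # weighted connections to every input
    layers.Dense(16, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one score per digit, 0 to 9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training is the compare-adjust-rerun loop described above.
model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
```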

If that still sounds like magic, here’s a vastly simplified example:

In the animation we have three input neurons, two calculation layers (let’s call them Layer A and Layer B), and two output neurons.

The input values are always the same (1, -3, 2) and the expected result is known for both output neurons.

Each neuron is connected to every neuron in the subsequent layer, and a weighting factor is applied to each connection.

Each input neuron passes its value to every Layer A neuron, multiplying by the weighting factor.

Each Layer A neuron then sums its three inputs, passing its new value to each neuron in Layer B, again multiplying by the weighting factor. Finally, the output neurons return the sum of their weighted inputs from Layer B.

First time around our answer is wrong, but our program then adjusts the weighting factors and re-runs, refining until it achieves the expected result.
Obviously that’s not a very useful algorithm in daily life. But if we extend it to our character-recognition algorithm, with its 784 input neurons and copious processing layers, we have a massively adjustable network that can learn to recognise characters independently.
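
To make the toy example concrete, here it is in NumPy. The animation doesn’t specify how many neurons sit in Layers A and B, what the expected outputs are, or exactly how the weightings are adjusted, so the layer sizes, the target values and the use of plain gradient descent below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.0, -3.0, 2.0])          # the fixed input values
target = np.array([0.5, -1.0])          # hypothetical expected outputs

# One weighting factor per connection: 3 inputs -> Layer A -> Layer B -> 2 outputs.
W1 = rng.normal(scale=0.3, size=(3, 4))
W2 = rng.normal(scale=0.3, size=(4, 4))
W3 = rng.normal(scale=0.3, size=(4, 2))

lr = 0.01
for step in range(1000):
    # Forward pass: each neuron sums its weighted inputs
    # (no activation function, matching the simplified description above).
    a = x @ W1            # Layer A values
    b = a @ W2            # Layer B values
    y = b @ W3            # output values

    error = y - target
    loss = np.sum(error ** 2)

    # Backward pass: nudge every weighting factor to reduce the error.
    gW3 = np.outer(b, 2 * error)
    gb = W3 @ (2 * error)
    gW2 = np.outer(a, gb)
    ga = W2 @ gb
    gW1 = np.outer(x, ga)

    W1 -= lr * gW1
    W2 -= lr * gW2
    W3 -= lr * gW3

print(loss, y)   # the loss shrinks towards zero and y approaches the target
```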

So now that we have a shortcut to creating a reusable algorithm, let’s see how that works with Computer Vision.

Closing the loop: Convolutional Neural Networks

To process visual information, our network learns convolutional kernels rather than a single independent weight per pixel. That means each value it computes carries information about a pixel’s neighbourhood, not just the pixel itself.

A real-world convolutional ANN to recognise handwritten digits. Image: Code to Light

Using convolutional networks we’re able to create algorithms that are 99.5% accurate on MNIST character recognition, compared with 98% for traditional ANNs.
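
The diagram’s exact architecture isn’t spelled out in the article, but a small convolutional network for MNIST can be sketched in Keras like this (layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0     # add a channel axis: 28x28x1, scaled to 0..1
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    # Each Conv2D layer learns its own 3x3 kernels, descendants of the
    # hand-built blurring and edge-detection kernels earlier in this article.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # one output per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```

With a little training, this kind of network comfortably outperforms the fully connected version above on the same data.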

Final thoughts

I’m fascinated by the possibilities of these technologies but I’m an enthusiast, not an expert. If you made it this far, you really should look up some of the leaders in this field. Here are some resources to help broaden your horizons.

OpenCV-Python: understanding features
The Python port of OpenCV (Open Source Computer Vision Library) has some great in-depth tutorials
Convolutional kernels in image-processing
Wikipedia’s entry is clear and well-referenced
But what is a neural network?
YouTube maths enthusiast Grant Sanderson explains
An intuitive explanation of Convolutional Neural Networks
Ujjwal Karn puts it all together, clearly and expertly

Romain Beaugrand is Senior Developer at MPB, the UK’s leading reseller of photographic equipment with operations in Britain, Europe and North America. https://www.mpb.com
