How to learn Vision Computing from scratch??

Hi there,

Instead of looking through, analysing and learning the code of OpenCV from a top down approach, HOW ON EARTH can I learn vision computing with C++ *without libraries* from a bottom up approach? I.e., step by step, from iostream, to opening different file types (mp4, jpg, etc), manipulating pixel and binary data, writing them, etc.

I really want to learn bottom up. If you could guide me, I'd be so thankful.
> HOW ON EARTH can I learn vision computing with C++ *without libraries* from a bottom up approach

https://en.wikipedia.org/wiki/JPEG
Without library support, that should take you a year or two.

https://en.wikipedia.org/wiki/Feature_extraction
Add a few more years to get up to speed on this.

Hundreds of years of combined experience have gone into developing these libraries.
You're not going to replicate that in any meaningful way in a few months.

IMO, if CV is your thing, then start with OpenCV and start producing something new and unique. If OpenCV is deficient in some way, your energy will be best utilised by extending the library.
closed account (z05DSL3A)
I would recommend books. However, the books that cover topics in Image Processing and Computer Vision without over-using* libraries such as OpenCV do tend to be expensive. [* That is, they use the libraries only for boilerplate code that is not relevant to the topic at hand.] I suggest seeing whether you are allowed to use the library of a local university.

this stuff gets very complicated in a hurry.
my recommendation is to use the web for now.
here is my getting started guide..
1) get a paint type program that can open / edit / etc RAW files (that is, an image file of bytes in the RGB, BGR, or HSL formats, as groups of 3 (and later, 4) bytes per pixel). You will NEED this for a while.
2) just write some basic c++ code to do some simple things, using the well known simple algorithms out there like resize the image up and down, look at the interpolation and approaches used to retain the data and so on.
3) grow this. cut out a rectangle and move it around in the image, that sort of thing.
4) now that you have a handle on the absolute basics, try to read and write a couple of the common image formats.
5) now build your objects :) You know what you need now ... a size (width, height), a buffer of bytes (a vector of unsigned char, for example) or a buffer of pixels (I find this VERY clunky, but many people like to work at that level), a header (for the file types; many have headers, some full of junk like where / when the picture was taken, comments, sub-formats, etc) and so on. Start to build your own tools...
6) learn the graphics to display and such on screen, mouse select and things
7) build from here. soon you will start needing the books on the more advanced techniques. it's a HUGE field, with a fair amount of fun math.
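As a rough illustration of steps 2, 3 and 5, here is a minimal sketch of an image object over a raw byte buffer, with a nearest-neighbour resize and a rectangle cut-out. The names Image, resizeNearest and crop are made up for this example, and it assumes tightly packed 8-bit RGB with no header. Nearest-neighbour is only the simplest interpolation; it is a starting point for comparing against better ones, not a recommendation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type: tightly packed 8-bit RGB, row by row.
struct Image {
    std::size_t width = 0;
    std::size_t height = 0;
    std::vector<std::uint8_t> bytes;  // width * height * 3 bytes

    Image(std::size_t w, std::size_t h)
        : width(w), height(h), bytes(w * h * 3, 0) {}

    // Index of the first (red) byte of pixel (x, y).
    std::size_t offset(std::size_t x, std::size_t y) const {
        assert(x < width && y < height);
        return (y * width + x) * 3;
    }
};

// Nearest-neighbour resize: each destination pixel copies the source pixel
// that its coordinates scale back to.
Image resizeNearest(const Image& src, std::size_t newW, std::size_t newH) {
    assert(src.width > 0 && src.height > 0 && newW > 0 && newH > 0);
    Image dst(newW, newH);
    for (std::size_t y = 0; y < newH; ++y) {
        const std::size_t sy = y * src.height / newH;
        for (std::size_t x = 0; x < newW; ++x) {
            const std::size_t sx = x * src.width / newW;
            const std::size_t s = src.offset(sx, sy);
            const std::size_t d = dst.offset(x, y);
            for (std::size_t c = 0; c < 3; ++c) dst.bytes[d + c] = src.bytes[s + c];
        }
    }
    return dst;
}

// Cut out the rectangle with top-left corner (x0, y0) and size w x h.
Image crop(const Image& src, std::size_t x0, std::size_t y0,
           std::size_t w, std::size_t h) {
    assert(x0 + w <= src.width && y0 + h <= src.height);
    Image dst(w, h);
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const std::size_t s = src.offset(x0 + x, y0 + y);
            const std::size_t d = dst.offset(x, y);
            for (std::size_t c = 0; c < 3; ++c) dst.bytes[d + c] = src.bytes[s + c];
        }
    }
    return dst;
}
```

Writing img.bytes to a file opened in binary mode gives you a RAW file the paint program from step 1 can open, as long as you tell it the width and height yourself.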



I'm a tad unsure of exactly your goal here. The reason I'm unsure is that "Vision Computing" and OpenCV itself belong to the branch of AI concerned with recognizing image content, like identifying objects in a scene, or finding the edges of an object, or even taking several images of an object from different angles and forming a 3D model of the object.

This quote, however, suggests something quite different:

> from iostream, to opening different file types (mp4, jpg, etc), manipulating pixel and binary data, writing them, etc.


These are topics in the general domain of C++ (iostream), or file formats (mp4), or simple image convolution (manipulating pixels), or file manipulation (writing them, which is related to iostream).

The reason I'm uncertain is that these subjects are in the study of beginning and intermediate programming and basic computer science, while "Vision Computing" is a much, much higher level of computer science in the domain of AI.

It sounds like you're asking how a beginner can approach computer vision. First, you must learn the basic computer science and programming along the lines of those points, like image manipulation and the use of iostream. In fact, without those basics, even relying upon OpenCV would be impossible.

What I think you are asking, ultimately, is how to learn (I must add: after the computer programming basics) the science of AI recognizing images.

I would suggest, at that point, to start with character recognition of a simple set, and how a simple neural net can be trained to recognize the digits 0 to 9 to about 94% accuracy. This one learning step can be taken early in programming, without really understanding file formats like jpg, mp4 or image processing (like brightness and contrast). All that must be understood is the basic nature of a black and white or grayscale image, where each pixel in the image is a number representing how "white" it is (0 is black, 127 is 50% gray, and 255 is all white, for example). From there you can witness what happens when a neural net is fed images of the digits 0 to 9, carefully framed in a fixed image size, for recognition TRAINING.

To be clear, at the risk of being too long, you first need to understand neurons and connections between neurons. An artificial neuron is implemented as a piece of data representing an input (that number from 0 to 255 indicating how white a pixel is, for example), and will "fire" an output based on a function (one that may 'interpret' the input). There are so many options for that function that the statement is vague, but leave that as an "algebraic unknown" in your mind for the moment. Picture the neuron as a circle. Assume an image (for character recognition) that is 28 x 28 pixels. This image will be swapped repeatedly with various images of digits in a loop (perhaps scanned from handwritten samples). Assume each character is centered and scaled to fill this 28 x 28 pixel image.

Fashion 784 neurons in a column, with one pixel attached to each neuron. The neuron's input is the "whiteness" value of that pixel. This is an AI 'retina', receiving the image.
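A minimal sketch of that 'retina', assuming the image arrives as 784 grayscale bytes in row order. The names kSide, kInputs and toRetina are invented for the example, and scaling the whiteness to 0.0 .. 1.0 is my own choice (the text works with the raw 0 to 255 values).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSide = 28;               // image is 28 x 28 pixels
constexpr std::size_t kInputs = kSide * kSide;  // 784 'retina' neurons

// Turn 8-bit grayscale pixels (0 = black, 255 = white) into one "whiteness"
// input per retina neuron, scaled to the range 0.0 .. 1.0 (an assumption).
std::vector<double> toRetina(const std::vector<std::uint8_t>& pixels) {
    assert(pixels.size() == kInputs);
    std::vector<double> inputs;
    inputs.reserve(kInputs);
    for (std::uint8_t p : pixels) inputs.push_back(p / 255.0);
    return inputs;
}
```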

Fashion a second column of neurons to the right of the input 'retina'. The number should exceed the input volume, but is not critical. 1200 will do, but it could be 2000, or any value between. Perhaps the system will eventually be more accurate with more neurons, but not by much. Let's say 1200 for now.

Now, with programming techniques taken from intermediate skills in C++, or C#, or Java, CONNECT the output of the retina neurons to this second column (usually called a layer, more specifically named a hidden layer - a poor choice for the name, but that's what it's called). The connections follow a simple pattern. Take the first (top) neuron of the retina, and connect one "line" or "wire" from that neuron to each of the 1200 neurons in the second layer. Move to the second 'retina' neuron and repeat.

In this way each 'retina' neuron will connect to 1200 'hidden layer' neurons. Each 'hidden layer' neuron will receive 784 'retina' outputs. For each of these connections you will fashion a 'weight' as it is called. This 'weight' is like a volume control. If the value is 100%, the entire output from the connected 'retina' neuron is received. If the value is 0%, that 'retina' neuron is ignored. These settings are used later. They are the "magic" of training a neural net.

Now, fashion an output layer. For this goal, recognizing digits 0 to 9, you'll need only 10 neurons. Each represents a digit, the first one is 0, the second is 1, etc....

Connect the 'hidden layer' neurons as before. Each of the 1200 'hidden layer' neurons will have a 'wire' connected to each of the output neurons, each with their own 'weight' (another volume control). Each 'output' neuron will have a connection coming from each of the 1200 'hidden layer' neurons.
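One possible way to hold the 'wires' between two layers is a flat array with one weight per connection. This is only a sketch; the Connections type and its at() accessor are hypothetical names, not taken from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the wires between two columns of neurons.
struct Connections {
    std::size_t inputs;
    std::size_t outputs;
    std::vector<double> weights;  // one weight per wire: inputs * outputs

    Connections(std::size_t in, std::size_t out)
        : inputs(in), outputs(out), weights(in * out, 0.0) {}

    // Weight on the wire from input neuron i to output neuron j.
    double& at(std::size_t i, std::size_t j) {
        assert(i < inputs && j < outputs);
        return weights[j * inputs + i];
    }
};
```

With the sizes used here, Connections(784, 1200) holds 940,800 weights between the retina and the hidden layer, and Connections(1200, 10) holds 12,000 between the hidden layer and the outputs.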

This is a simple neural net construction. Now, the 'magic' (if that's how one views the subject) begins. All of the weights are initially randomized around 50%. That is, all weights between the 'retina' layer and the 'hidden layer' will be set to random values centered around 50%, with no more than 15% variance. These are not critical (it could be 20%). Do the same for the weights between the 'hidden layer' and the output layer.
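The randomizing step might look like the sketch below. It reads "centered around 50%, with no more than 15% variance" as a uniform spread of 0.5 plus or minus 0.15, which is my interpretation; randomWeights and its parameters are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Fill `count` weights with values drawn uniformly from
// [centre - spread, centre + spread). The fixed seed makes runs repeatable.
std::vector<double> randomWeights(std::size_t count, double centre,
                                  double spread, unsigned seed) {
    assert(spread >= 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(centre - spread, centre + spread);
    std::vector<double> weights(count);
    for (double& w : weights) w = dist(rng);
    return weights;
}
```

For the retina-to-hidden wires that would be randomWeights(784 * 1200, 0.5, 0.15, someSeed), and the same call with 1200 * 10 for the hidden-to-output wires.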

This is an untrained network.

Now, feed in one image. The output layer will "fire" - based on that vague function I mentioned (there are some standards used which you'll find in research literature), the output neurons will "turn on" (or stay off) in meaningless ways (they'll be wrong). Feed in a 4, the net may think it could be a 5 or a 7, maybe a 2. It's untrained.
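Feeding one layer into the next can be sketched as a weighted sum per neuron followed by the "fire" function. The sigmoid used here is just one common choice and an assumption on my part, since the text leaves the function open. feedLayer and guess are hypothetical names, and there are no bias terms because none are described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// One common "fire" function: squashes any sum into the range 0 .. 1.
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// weights[j * in.size() + i] is the weight on the wire from input i to
// neuron j of the next layer.
std::vector<double> feedLayer(const std::vector<double>& in,
                              const std::vector<double>& weights,
                              std::size_t outputs) {
    assert(weights.size() == in.size() * outputs);
    std::vector<double> out(outputs);
    for (std::size_t j = 0; j < outputs; ++j) {
        double sum = 0.0;
        for (std::size_t i = 0; i < in.size(); ++i) {
            sum += weights[j * in.size() + i] * in[i];
        }
        out[j] = sigmoid(sum);
    }
    return out;
}

// The net's answer is the output neuron that fired the strongest.
std::size_t guess(const std::vector<double>& outputs) {
    assert(!outputs.empty());
    return static_cast<std::size_t>(std::distance(
        outputs.begin(), std::max_element(outputs.begin(), outputs.end())));
}
```

Calling feedLayer once for retina to hidden and once more for hidden to output, then guess on the result, gives the digit the net currently "thinks" it sees.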

You then correct the output with a technique called back propagation (more study, a bit of math, a process too complex to list here). This basically is a way to work backwards from the output layer to the hidden layer, figuring out which 'weights' are "too high" or "too low" to give the correct result. This "training" step will make some correction, turning off the wrong answers, turning on the right answers in the output layer.
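For a feel of what one correction step involves, here is a small sketch of back propagation on a two-layer net. Everything about it is an assumption made for illustration: the TinyNet name, sigmoid neurons, no bias terms, a squared-error measure, and plain gradient descent with a fixed learning rate. Real training code differs in many details.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical two-layer net: inputs -> hidden -> outputs, sigmoid everywhere.
struct TinyNet {
    std::size_t nIn, nHid, nOut;
    std::vector<double> w1;   // w1[h * nIn + i]: input i -> hidden h
    std::vector<double> w2;   // w2[o * nHid + h]: hidden h -> output o
    std::vector<double> hid;  // hidden outputs from the last forward pass
    std::vector<double> out;  // final outputs from the last forward pass

    TinyNet(std::size_t in, std::size_t hidden, std::size_t outputs, unsigned seed)
        : nIn(in), nHid(hidden), nOut(outputs),
          w1(hidden * in), w2(outputs * hidden), hid(hidden), out(outputs) {
        std::mt19937 rng(seed);
        std::uniform_real_distribution<double> dist(0.35, 0.65);
        for (double& w : w1) w = dist(rng);
        for (double& w : w2) w = dist(rng);
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    void forward(const std::vector<double>& x) {
        assert(x.size() == nIn);
        for (std::size_t h = 0; h < nHid; ++h) {
            double sum = 0.0;
            for (std::size_t i = 0; i < nIn; ++i) sum += w1[h * nIn + i] * x[i];
            hid[h] = sigmoid(sum);
        }
        for (std::size_t o = 0; o < nOut; ++o) {
            double sum = 0.0;
            for (std::size_t h = 0; h < nHid; ++h) sum += w2[o * nHid + h] * hid[h];
            out[o] = sigmoid(sum);
        }
    }

    // One correction step for one example. Returns the squared error that was
    // measured before the weights were changed.
    double train(const std::vector<double>& x, const std::vector<double>& target,
                 double rate) {
        assert(target.size() == nOut);
        forward(x);
        std::vector<double> dOut(nOut), dHid(nHid);
        double error = 0.0;
        // Output layer: how wrong each output is, scaled by the sigmoid slope.
        for (std::size_t o = 0; o < nOut; ++o) {
            const double diff = out[o] - target[o];
            error += diff * diff;
            dOut[o] = diff * out[o] * (1.0 - out[o]);
        }
        // Hidden layer: blame flows backwards along the (old) output weights.
        for (std::size_t h = 0; h < nHid; ++h) {
            double blame = 0.0;
            for (std::size_t o = 0; o < nOut; ++o) blame += dOut[o] * w2[o * nHid + h];
            dHid[h] = blame * hid[h] * (1.0 - hid[h]);
        }
        // Nudge every weight a little in the direction that reduces the error.
        for (std::size_t o = 0; o < nOut; ++o)
            for (std::size_t h = 0; h < nHid; ++h)
                w2[o * nHid + h] -= rate * dOut[o] * hid[h];
        for (std::size_t h = 0; h < nHid; ++h)
            for (std::size_t i = 0; i < nIn; ++i)
                w1[h * nIn + i] -= rate * dHid[h] * x[i];
        return error;
    }
};
```

For the digit net the target would be ten values, 1.0 for the correct digit and 0.0 for the rest, and train would be called once per image in a loop over the whole training set.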

This is repeated for every character. Typical training sessions run this training step some 60,000 times, making small corrections in the weights.

Eventually, the neural net "learns" (it is adjusted by back propagation) to recognize the characters.

This is the fundamental beginning of image recognition. It is too simple to tell the face of one person from another, but it is enough to recognize handwritten digits with about 94% accuracy or more.

Your own study would be, thereafter, to learn how to improve the neural functions and the training methods to improve accuracy.

So, what makes this work?

Have you ever taken aluminum foil, placed it over a coin and then 'rubbed' the foil until it looks just like the coin?

Imagine the neural net is a stiff net cloth (sprayed with a glue). The 'concept' (in this case digits 0 to 9) is the coin. The net cloth is the aluminum foil.

Training is, then, pushing on this net repeatedly, over every detail of the 'concept', until the 'shape' of the net fits the concept.

That's what adjusting the weights does. It makes the AI net conform to the concept, so it 'resembles' the concept.

Once you get that part, you're ready to dive into more advanced network configuration and more advanced training methods (deep learning, for example).

Then, you'd learn how to connect multiple nets, each trained for specific purposes, into larger collections performing increasingly complex tasks.

You're in for a wild ride, too. It is surprising how simple nets perform as well as they do, and then just how difficult it can be for some concepts to be represented.

Then, back to the "computer science and programming" subject, you'd need to learn how to code AI processing in the GPU for high performance.

Best of luck.




maybe the term is a bit muddy... I honestly thought computer vision was stereoscopic 2 camera 3-d stuff :) (a subset of what you offered) But his post sounded like basics were needed first.

closed account (z05DSL3A)
Outline of computer vision.
https://en.wikipedia.org/wiki/Outline_of_computer_vision
@Niccolo - epic response, literally read every word twice, a good read indeed :). Thanks for that, it's definitely my next step once I can freely manipulate and analyse pixel data.

@jonnin - I'll seriously give that list a go, very clear steps. Chrz

@Grey Wolf - I'll happily $ave!! any chance you actually know of such books? Asking because references from fellow devs tend to be much better than well written synopses :)
closed account (z05DSL3A)
johnpaoletto, it has been a while since I studied computer vision but I have been having a look around.

There are a few books that look interesting but one that seems to be regarded well is
Computer Vision: Algorithms and Applications
by Richard Szeliski
http://szeliski.org/Book/

I haven't looked at it yet, but the reason I mention it without checking it out is that you can download it free (for personal use) from the book's website, so you can judge it for yourself.

Hope it helps.

Edit: After a quick skim, it looks heavy on the theory, leaving the implementation up to you.
@Grey Wolf. Alright! Thx!
@Grey Wolf, thanks for the link. :)

The PDF is a draft from 2010, so using it as a how-to learning reference does come with the caveat that the field has gone beyond what is in the book.

Which isn't a bad thing. Many of the programming books in my library are older than 2010, and I still find them useful as reference material.

The PDF being free doesn't hurt either. Using a free though outdated reference is better IMO than paying for a new book that ends up not being useful. More than a few books in my library are in that category.
Recently I’ve been reading and experimenting a lot with computer vision. Here is an introduction to what is interesting to learn and use in that domain.
Computer vision has advanced a lot in recent years. These are the topics I will mention here :

Technologies :

Face detection : Haar, HOG, MTCNN, MobileNet
Face recognition : CNN, FaceNet
Object recognition : AlexNet, Inception, ResNet
Transfer learning : re-training a big neural network with few resources on a new topic
Image segmentation : R-CNN
GAN
Hardware for computer vision : what to choose, GPU is important
UI apps integrating vision : ownphotos

https://www.simpliv.com/search?query=computing