
One of the most interesting projects I've worked on in the past couple of years was an image-processing project. The goal was to develop a system able to recognize Coca-Cola 'cans' (note that I'm stressing the word 'cans'; you'll see why in a minute). You can see a sample below, with the can recognized in the green rectangle with scale and rotation.

Some constraints on the project:

- The background could be very noisy.
- The can could have any scale or rotation, or even orientation (within reasonable limits).
- The image could have some degree of fuzziness (contours might not be entirely straight).
- There could be Coca-Cola bottles in the image, and the algorithm should only detect the can!
- The brightness of the image could vary a lot (so you can't rely "too much" on color detection).
- The can could be partly hidden on the sides or the middle, and possibly partly hidden behind a bottle.
- There could be no can at all in the image, in which case you had to find nothing and write a message saying so.

So you could end up with tricky things like this (which in this case had my algorithm totally fail).

I did this project a while ago, had a lot of fun doing it, and ended up with a decent implementation. Here are some details about it:

Language: Done in C++ using the OpenCV library.

Pre-processing: For the image pre-processing, i.e. transforming the image into a rawer form to feed the algorithm, I used 2 methods:

1. Changing the color domain from RGB to HSV and filtering based on "red" hue, saturation above a certain threshold to avoid orange-like colors, and filtering out low value to avoid dark tones. The end result was a binary black-and-white image, where all white pixels represent the pixels that match this threshold. Obviously there is still a lot of crap in the image, but this reduces the number of dimensions you have to work with.
2. Noise filtering using median filtering (taking the median pixel value of all neighbors and replacing the pixel with this value) to reduce noise.
3. Using the Canny edge detection filter to get the contours of all items after the 2 preceding steps.

Algorithm: The algorithm I chose for this task was taken from this awesome book on feature extraction and is called the Generalized Hough Transform (quite different from the regular Hough Transform). It basically says a few things:

- You can describe an object in space without knowing its analytical equation (which is the case here).
- It is resistant to image deformations such as scaling and rotation, as it will basically test your image for every combination of scale factor and rotation factor.
- It uses a base model (a template) that the algorithm will "learn".
- Each pixel remaining in the contour image will vote for another pixel which will supposedly be the center (in terms of gravity) of your object, based on what it learned from the model.

In the end, you get a heat map of the votes. For example, here all the pixels of the contour of the can will vote for its gravitational center, so you'll have a lot of votes in the same pixel corresponding to the center, and will see a peak in the heat map as below.

Once you have that, a simple threshold-based heuristic can give you the location of the center pixel, from which you can derive the scale and rotation and then plot your little rectangle around it (the final scale and rotation factors will obviously be relative to your original template). In theory at least...

Results: Now, while this approach worked in the basic cases, it was severely lacking in some areas:

1. It is extremely slow! I'm not stressing this enough. Almost a full day was needed to process the 30 test images, obviously because I had a very high scaling factor for rotation and translation, since some of the cans were very small.
2. It was completely lost when bottles were in the image, and for some reason almost always found the bottle instead of the can (perhaps because bottles were bigger, and thus had more pixels and more votes).
3. Fuzzy images were also no good, since the votes ended up in pixels at random locations around the center, resulting in a very noisy heat map.
4. Invariance in translation and rotation was achieved, but not in orientation, meaning that a can that was not directly facing the camera lens wasn't recognized.

Can you help me improve my specific algorithm, using exclusively OpenCV features, to resolve the four specific issues mentioned?

I hope some people will learn something out of it as well; after all, I think not only people who ask questions should learn. :)


It might be said that this question is more appropriate at dsp.stackexchange.com, or stats.stackexchange.com, and you certainly should consider re-asking at those sites too.


The first thing to do here is to analyze why the different failure cases are happening. E.g., isolate examples of places where bottles win, where the images are fuzzy, etc., and perform some statistical analysis to learn the difference between their Hough representations and the ones you wish it would detect. Some great places to learn about alternative approaches are here and here


stacker makes a good point. For speed you want to get cheap-to-compute features, like histograms of oriented gradients. A really naive first approach would be to manually label a bunch of can rectangles in some training images, and use these plus random negative examples to train an SVM or decision-tree classifier. The training will take longer, but the execution on novel images will be much faster. I'm planning to write this method up when I get more free time to include the right references.


How about an approach similar to reCAPTCHA? ;)


Why was this moved from dsp.stackexchange.com? It seems like that site would be an even better fit than Stack Overflow o_O


I agree with stacker - SIFT is an excellent choice. It's very robust against scale and rotation operations. It's somewhat robust against perspective deformation (this can be improved as suggested by stacker: a template database with different perspective views of the desired object). Its Achilles' heel in my experience would be strong lighting variations and very expensive computation. I don't know of any Java implementations. I'm aware of an OpenCV implementation and have used a GPU C++/Windows implementation (SiftGPU) suitable for realtime performance.


A note of warning: as much as I love SIFT/SURF and what they have done for me, they are patent-encumbered. This might be a problem, depending on a number of conditions including geographic location, AFAIK.


So try OpenCV's ORB or FREAK, which have no patent issues. ORB is much faster than SIFT. In my experience ORB is a bit weak with scale and lighting variations, but test it yourself.


How can you accept this as an answer? None of these feature descriptors can differentiate bottles from cans; they are all just view-invariant local pattern descriptors. I agree that SIFT, SURF, ORB, FREAK, etc. can help you with feature matching, but what about the other parts of your question, like occlusions, bottle vs. can, etc.? I hope this is not the complete solution; in fact, if you had googled your problem, the first result would probably be this very answer.


G453 you are absolutely right! He was probably fascinated by the performance of SIFT and forgot that feature extraction and matching were NOT THE PROBLEM...


That's a great suggestion; I especially like that this algorithm should be pretty fast, even if it will probably have many false negatives. One of my hidden goals is to use this detection in real time for robotics, so that could be a good compromise!


Yes, it is often forgotten (in a field characterized by precision) that approximation algorithms are essential for most real-time, real-world-modeling tasks. (I based my thesis on this concept.) Save your time-demanding algorithms for limited regions (to prune false positives). And remember: in robotics you're usually not limited to a single image. Assuming a mobile robot, a fast alg can search dozens of images from different angles in less time than sophisticated algs spend on one, significantly reducing false negatives.


I like the idea of using what amounts to a barcode scanner for extremely fast detection of Coca-Cola logos. +1!


The problem of looking for signatures in this case is that if we turn the can to the other side, i.e. hiding the signature, the algorithm will fail to detect the can.


karlphillip: If you hide the signature, i.e. the logo, then any method based on looking for the logo is going to fail.


Yep, I've thought about that too, but didn't have much time to do it. How would you recognize a bottle, since its main part will look like a scaled can? I was thinking of looking for the red plug as well and checking whether it's aligned with the bottle's center, but that doesn't seem very robust.


If there is a red cap (or ring) parallel to the "Coca-Cola" logo, it is most likely a bottle.


linker: How did you train your algorithm for cans? Did you have examples of cans? What about training with examples of bottles?


The strength of this algorithm is that you only need one template to train on, and it then applies all transformations to match it to other potential cans. I was using a binarized, contour-based version of this template for training, so the only difference between can and bottle would be the plug, but I'm afraid it would bring more false positives, since the gravity center would be somewhere on the edge or outside of the bottle. It's worth a try, I guess. But that would double my processing time, and I'm going to cry ;)


Essentially this is a reasonable direction. I'd phrase it slightly differently: first find all candidates, and then for each candidate determine whether it's a bottle, a can, or something else.


+1 I thought about this and was on my way to implementing this approach. However, linker should share his set of images so we can try to make more educated guesses.


Yeah, I too think it would be good if there were more images.


Updated the links.


Even if we only have the labels for bottles/cans and none of the other distinguishing factors (bottle cap, transparency, can top/bottom), the width of the bottle is still different from the width of the can.


That's an interesting approach which is at least worth a try; I really like your reasoning about the problem.


This is kind of what I was thinking: don't rule out particular kinds of false positives. Rule in more features of what makes a coke can. But I'm wondering: what do you do about a squished can? I mean, if you step on a coke can it's still a coke can. But it won't have the same shape anymore. Or is that problem AI-Complete?


I like the idea, but it seems like you'd need really, really good lighting conditions. In the example image where there are both a can and a bottle, for instance, the distinction seems a bit hard to make.


In your example, notice how the specularity for the plastic label is much more diffuse than the very bright spots on the can? That's how you can tell.


I see, which kind of color space representation would you use in this case to capture specularity in your algorithm? This seems quite tough to get in RGB or HSV


What if the light source was behind the can? I think you would not see the highlight.


Thanks for the link, that looks interesting. Regarding the training, what training-set size would be reasonable to achieve good results? If you have an implementation, even in C#, that would be very helpful as well!


While researching TLD, I found another user looking for a C# implementation; is there any reason not to put your work on GitHub? stackoverflow.com/questions/29436719/…


N.B. Years later, the link is now dead.


New link: kahlan.eps.surrey.ac.uk/featurespace/tld


Actually, I didn't explain that in the post, but for this assignment I was given a set of roughly 30 images and had to design an algorithm that would match them all in the various situations described. Of course, some images were held out to test the algorithm at the end. But I like the idea of Kinect sensors, and I'd love to read more on the topic!


Roughly what training-set size would a neural network need to give satisfying results? What's also nice about this method is that I only need one template to match almost everything.


If your set of images is predefined and limited, just hardcode perfect results in your program ;)


Yeah, if I train on the dataset I'm going to run the algorithm against, sure, I'll get perfect results :) But for this assignment, for example, the program was tested by the teacher at the end on a set of held-out images. I'd like to do something robust that doesn't overfit the training data.


The required number of training examples varies, but you have to be careful about a few things: don't overtrain, and you probably want a test set to track how your accuracy is evolving. The number of training examples will also depend on the number of layers you use.


As was discussed on DSP during the short time the question was moved there, some bottles may not have plugs ;) or the plug could be partially hidden.


Actually no: there is no constraint on size or orientation (well, there is on orientation, but I didn't really handle that), so you can have a bottle very far in the background and a can in the foreground, and the can would be way bigger than the bottle.


I've also checked that the width-to-height ratio is pretty similar for bottle and can, so that's not really an option either.


The label's aspect ratio (it being a trademark) is the same. So if the (bigger) bottle is slightly further away in the picture, its size will be exactly the same as that of the can.


To explain a bit more: suppose the can is at z=0 and the bottle at z=-100. Since the bottle is far behind, it will look smaller. But if I know that the bottle is at z=-100 and the can at z=0, then I can calculate the expected size of the can/bottle if both were translated to z=0. Now they are at the same depth, and hence I can make decisions based on size.
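The size-from-depth argument can be made concrete with the pinhole-camera model; the focal length and object sizes below are illustrative numbers, not measurements from the thread:

```python
# Pinhole projection: apparent size in pixels = f * real_size / depth.
def apparent_size(real_size_mm, depth_mm, focal_px=800.0):
    return focal_px * real_size_mm / depth_mm

def normalize_to_depth(apparent_px, depth_mm, ref_depth_mm):
    """Rescale an observed size to what it would measure at ref_depth_mm."""
    return apparent_px * depth_mm / ref_depth_mm

# A ~120 mm can at 1 m and a ~230 mm bottle at ~1.9 m project to nearly
# the same pixel height, which is exactly the ambiguity described above;
# normalizing both to a common depth (here 1 m) separates them again.
can_px = apparent_size(120, 1000)
bottle_px = apparent_size(230, 1917)
can_norm = normalize_to_depth(can_px, 1000, 1000)
bottle_norm = normalize_to_depth(bottle_px, 1917, 1000)
```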


This is just a comment, not an answer, but it is much closer to being an answer than the comment-as-an-answer above with 120 votes.


The particular shade of red is mostly subjective and strongly influenced by lighting considerations and white balance. You might be surprised by how much those can change. Consider, for example, this checkerboard illusion.


I had the same thought, but I think the silver lining on top of the can changes dramatically depending on the angle of the can on the picture. It can be a straight line or a circle. Maybe he could use both as reference?


Link not working.


OP said there were 30 high-res images, which is probably not the best scenario for training ConvNets. Not only are they too few (even augmented), the high-res part would destroy ConvNets.


Interesting project but it only applies to your very specific setup.


Hot dog or not hot dog?
