Perceptual Hashing

Perceptual Hashing

Disclaimer: This is not a post about how you can get high and see things in a new light. Or is it?

You may assume that computers know everything and they don’t need people to do anything. We have assistants on our phones that we can talk to and ask questions, so who cares about how things actually work. Well, I do care, and I think this is pretty nifty.

I was recently tasked with writing a program to find duplicate images. I thought, no big deal, I’ll use a cryptographic hash and find the duplicates that way. And that way would work, but what about images that are similar to each other? Unfortunately using a cryptographic hash, like MD5 or SHA512, wouldn’t help you. The cryptographic hash algorithms use a waterfall effect to make the result completely different when only a few bits are changed in the source file.

After a little bit of research, I decided to use perceptual hashing.

What is Perceptual Hashing?

A perceptual hash is made up of a sample of the data from the original file. If you were to examine two pictures like you would read a book (from the top down and left to right), stopping every inch and recording the color; you would be creating a perceptual hash for that image. To keep it short, let’s say you end up with:

• Blue
• Blue
• Blue
• Yellow
• Green
• Green

Then if we examined another picture that is similar to the first, we might get:

• Blue
• Blue
• Blue
• Yellow
• Orange
• Green

As an intelligent human person, you can obviously tell they are similar. Now you can also use a systematic way to tell they’re similar since you can say there is only one color different. Of course, actual pictures are made up of a lot more pixels and colors.

Hamming Distance

Just how similar are they? Are they close? Are they the same? How would you measure the similarity? Well, you can use the Hamming distance. You would first XOR the two perceptual hashes; then you would count the number of ones in that binary string. That will give you the Hamming distance. Some people may say the Levenshtein distance would work better, but I’ll disagree. I’m not going to put an example of finding the Hamming distance on here because I don’t want to get too technical in this post.

Conclusion

These explanations may not be 100% accurate, but it is how I decided to explain them. I feel as if they accurately represent the underlying foundation of the concepts. If you would like to see how to measure the Hamming distance, send me an email, and I’ll send you some code I wrote along with an explanation. I’m okay if you want to create a new email address and pretend to be someone else ;).

I went ahead and wrote a function in C that finds the Hamming distance and I put it in a Github Gist. I’m not sure it is working correctly, so don’t use it for anything in production. I have PHP version that is working if you would like to see it.