What is hashing? A look at unique identifiers in software
By Luke Mcbride
In software, the term "hash" has several meanings, but what we discuss here is loosely focused on what Wikipedia calls a "cryptographic hash function."
What is hashing?
In short, hashes are strings of letters and numbers meant to identify a set of information by a smaller, unique code. You may have seen articles here on Sonatype's blog or elsewhere referring to hashing. If you've seen a random-looking text string like the one below, it may have been a "hash."
The various hash identifier formats come with a long list of odd-sounding names like:
- MD5
- SHA1
- Whirlpool
- CRC32
… but they all do similar things. Hash identifiers are something everyone can use, from average users to cybersecurity experts.
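As a rough illustration, Python's standard library can compute several of these formats. The input text here is made up for the example, and Whirlpool is left out because it is not guaranteed to be available in Python's hashlib module.

```python
import hashlib
import zlib

# Made-up sample input; any bytes would work.
data = b"hello world"

md5_hex = hashlib.md5(data).hexdigest()
sha1_hex = hashlib.sha1(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()
# CRC32 lives in zlib and returns an integer, formatted here as 8 hex digits.
crc32_hex = format(zlib.crc32(data), "08x")

for name, value in [("MD5", md5_hex), ("SHA1", sha1_hex),
                    ("SHA256", sha256_hex), ("CRC32", crc32_hex)]:
    print(f"{name:7} {value}")
```

Each format produces a string of a different length, but the idea is the same: a short code that stands in for the input.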
Hashing software is a surprisingly simple technology
Hashing software might seem strange and complex at first, but it's actually very simple. Hashed identifiers are a bit like image thumbnails in that they are tiny compared to the files they identify.
The file can be any size, from 1 kilobyte to 100 terabytes, and the hash identifier will always be the same size. The hash value for a given file is also always the same, no matter how large the file is or what computer is used to compute it.
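A small sketch of that property, assuming SHA-256 as the format and two made-up inputs of very different sizes:

```python
import hashlib

# Two made-up inputs: one byte versus roughly 10 MB.
small = b"a"
large = b"a" * 10_000_000

small_digest = hashlib.sha256(small).hexdigest()
large_digest = hashlib.sha256(large).hexdigest()

# Both digests are 64 hex characters, regardless of input size.
print(len(small_digest), len(large_digest))

# Hashing the same bytes again gives the same value.
repeat_digest = hashlib.sha256(small).hexdigest()
print(small_digest == repeat_digest)
```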
The task of hashing focuses on one thing: Assigning a unique value.
Why are unique values so important in hashing?
I started long ago with hashes while trying to make sure my company report had no issues. I was working at a bank and using Microsoft Excel to find old data, and that started by looking for duplicate entries.
Fortunately, Excel has an easy option for highlighting duplicate values:
But finding individual cells was not useful. There were a lot of similar numbers throughout.
Instead, I needed to find duplicate rows.
There are many tricks to enable this, but at the time, I was in a rush to catch these embarrassing extras. I decided to just multiply an entire row together (as below) and check the results column for a duplicate result.
Multiplying all cells together to get a unique value.
Because the result appeared to be unique for each row, I could easily flag duplicate rows.
A row with the same inputs and the same outputs (in red).
Unfortunately, the results weren't always unique. I came across a case where two very obviously different rows happened to get the same multiplied result, or a "false positive."
An additional row with different inputs but the same output as the other two (in red).
I needed to find a way to show an absolutely unique value for every unique row in the spreadsheet.
Unfortunately, I ended up doing a lot of extra work manually checking each duplicate row. It was better than submitting a bad report, but I knew there was a better method.
Not long afterwards I learned about a trick that could deliver a unique number for each row: hashing software. And it's a technique in use throughout computing.
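Here is a minimal sketch of the difference, using made-up rows rather than my original spreadsheet. The helper names row_product and row_hash are just for illustration, and SHA-256 is one reasonable choice of format.

```python
import hashlib
from math import prod

# Made-up rows standing in for spreadsheet data.
rows = [
    (2, 6, 10),
    (2, 6, 10),  # a true duplicate of the first row
    (3, 4, 10),  # different inputs, but the same product (120)
]

def row_product(row):
    """The multiplication trick: quick, but different rows can collide."""
    return prod(row)

def row_hash(row):
    """Hash the row's cells; a separator keeps (1, 23) and (12, 3) distinct."""
    text = "|".join(str(cell) for cell in row)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

products = [row_product(r) for r in rows]
hashes = [row_hash(r) for r in rows]

print(products)                # all three products are 120
print(hashes[0] == hashes[1])  # the true duplicates match
print(hashes[0] == hashes[2])  # the look-alike row does not
```

With the product, all three rows look like duplicates. With the hash, only the first two do.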
Why would I use a hash file?
First, no matter how large the file or what computer is used to compute it, the hash value is always the same.
And these unique values carry valuable information that lets you:
- Find duplicate files, such as duplicate photos you want to delete. Files with the same hash are almost certainly duplicates, so you don't need to open and compare them.
- Identify a file. You and a coworker are updating the same file and both upload it to a server. If the server doesn't show who posted what, how do you determine which one was yours without going line-by-line for changes? Just compare your machine's hash with the remote file hash.
- Ensure the file you've downloaded is the right one. For example, if you get a software program from a website, how do you know the website or the upload wasn't hijacked or corrupted? Hashes can help detect problems.
- Assign a reputation to a file. If an older version of a program worked better than the latest, knowing the hash lets you identify which one to use.
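The first use in the list, finding duplicate files, could be sketched roughly like this. The function names and the choice of SHA-256 are my own assumptions for the example, not a reference to any particular tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(folder):
    """Group files under `folder` by hash; keep only groups of 2 or more."""
    groups = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling find_duplicates on a photo folder would return each shared hash along with the files that carry it, so you can decide which copies to delete.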
Although hashed identifiers have been around since the early days of computers, more recently they have been used as a way to quickly fingerprint files on the internet.
How are hashed identifiers used in security?
Security software and professionals use hashes primarily to determine the status of a file, whether good or bad. For example, a file whose hashed identifier shows up in a virus database should get blocked from your computer. Files whose hashes are well-known and considered safe (such as the Firefox and Chrome browsers) can be installed without issue.
Most of these tools for checking reputation are built right into the software, meaning programs check hashes as a normal part of their operations.
Firefox uses hashes in the background to know if a file is malicious
How Sonatype uses hashes
One important job that Sonatype Repository Firewall performs for our customers is keeping bad, outdated, or malicious software out of the development process.
When a new program is analyzed, it's checked against our database for problems. If it's a known-good file, it's passed along as normal. If a file is unknown or has a bad reputation (the objects in red and yellow below), it's blocked. After they're fully analyzed, any files with that same hash identifier will always get treated the same way.
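To make the idea concrete, here is a toy sketch of a hash-based reputation check. It is not how Sonatype Repository Firewall is implemented. The table, its two entries, and the decide function are all hypothetical, and a real service keeps a far larger database.

```python
import hashlib

# Hypothetical reputation table keyed by SHA-256 hash.
KNOWN_GOOD = hashlib.sha256(b"trusted installer bytes").hexdigest()
KNOWN_BAD = hashlib.sha256(b"malicious payload bytes").hexdigest()
REPUTATION = {KNOWN_GOOD: "good", KNOWN_BAD: "bad"}

def decide(file_bytes):
    """Allow known-good content; block anything bad or unknown."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    status = REPUTATION.get(digest, "unknown")
    return "allow" if status == "good" else "block"

print(decide(b"trusted installer bytes"))
print(decide(b"never seen before"))
```

Because the decision is keyed on the hash, every copy of the same file gets the same answer, wherever it came from.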
Whether that's given the green light as great software or blocked from ever being used, hashing software helps make sure it's cataloged and managed according to your policy.
You can also manually check the hash values within the software:
An example of a hash listed inside a Sonatype image.
How you can use hashes today
Although hashing tools are often built in, it's possible to manually check the results.
One way to use a hash identifier is to check the downloads from an untrusted website. Some security researchers will check hash values on files even from trusted locations, especially when saved to a critical workstation or server.
While there are dozens of tools that can do this, I use the open source PeaZip archive manager for Windows, Mac, and Linux.
To view hashes, right-click on a file, choose File Manager – File Tools – Checksum/hash file(s) and select the "Clipboard" tab.
From there, you can double-click on the SHA256 value and copy it (Ctrl+C on Windows and Linux, or Cmd+C on a Mac). This value is the standard for security analysis.
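If you'd rather script this step than use a graphical tool, a sketch like the following computes the same SHA256 value and compares it to one published by the download site. The function names are my own, and the published value is whatever the site lists.

```python
import hashlib

def file_sha256(path):
    """Return the SHA-256 of a file as lowercase hex, reading in 64 KB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(65536):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hex):
    """Compare a file's hash to a published one, ignoring case and stray spaces."""
    return file_sha256(path) == published_hex.strip().lower()
```

A mismatch doesn't say what went wrong, only that the file you have is not the file the publisher hashed.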
Using VirusTotal
Now that you have this long string of text, you can view its reputation in services like VirusTotal.com. This will show whether the file is considered good, bad, or unknown. Just click the Search tab and paste in the value.
Interpreting your score
A good reputation is a score of 0, meaning "zero threats." Choosing to use a file with a score above 0 comes with some caveats. While a score of 1 or 2 may reflect "false positives" from over-cautious anti-virus tools, files with scores higher than 3 call for additional steps. These could include researching the author, running the file in a secure sandbox, or taking other precautions.
The file may not have been evaluated if there's no reputation ("No matches found") as pictured below.
At this point, you can set the file aside and check again later, or assume it's unsafe and delete it.
Are hashes related to digital signatures, cryptography, or cryptocurrency?
Although all of these tools use hashing as part of their operations, they are separate topics.
In short:
- Cryptography and digital signatures use hashes to ensure the encrypted files are not changed between sender and receiver.
- Cryptocurrency uses a complex form of digital signatures for transactions.
Hashed identifiers are simple tools with many uses, including duplicate detection, security, and reputation. The capabilities are built into many software programs and tools, but you can also use them yourself to solve problems in computing today.
Software development teams interested in learning about how Sonatype uses AI analysis to build file reputations can schedule a demo today.
Note: One of this article's readers reached out to me and let me know that it's more accurate to call it "distinct" rather than unique. Just like there are only so many possible PIN numbers a person could choose for their bank account, there is a limit (in the millions or more) of possible hash values. As such, it's possible for the same hash value to be assigned to different files. This is known as a "hash collision."
The best way to approach values that are totally unique is to use high-quality hashing tools that use SHA-256, a format with 2^256 (that is, 256^32) possible values. Here, a collision is extremely unlikely.
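To get a feel for how large that number is, here is a quick back-of-the-envelope calculation. The figure of one trillion files is an arbitrary assumption, and the collision estimate uses the standard birthday-bound approximation for an ideal hash, not an exact probability.

```python
# SHA-256 output is 256 bits: 32 bytes with 256 possible values each.
total = 2 ** 256
same_count = (total == 256 ** 32)
digits = len(str(total))
print(same_count, digits)  # True 78

# Rough chance that any two of n distinct files share a hash,
# approximated as (number of pairs) / (number of possible values).
n = 10 ** 12  # arbitrary assumption: one trillion files
pairs = n * (n - 1) // 2
approx_collision_chance = pairs / total
print(approx_collision_chance)
```

Even with a trillion files, the estimated chance of an accidental collision is vanishingly small.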
Written by Luke Mcbride
Luke is a writer at Sonatype covering everything from open source licenses and liability to DevSecOps trends and container security.