News and Notes from the Makers of Nexus | Sonatype Blog

Efficient Interaction With the Central Maven Repository: Downloading the Nexus Index

Written by Tim O'Brien | December 10, 2008

The Central Maven Repository recently had bandwidth problems related to the Nexus index. A tool (which will remain nameless) was configured to download the Nexus index every five hours, even though this 28 MB file changes only once a week. This post isn't an exercise in blame, and we're not going to identify the project or the tool that was misconfigured. It is an attempt to illustrate the problems that something as seemingly insignificant as a regular 28 MB download can cause when it is aimed at something as universal as the Central Maven Repository. Because the repository is so widely used, small bandwidth inefficiencies can turn into unwitting denial-of-service attacks. For this reason, it is important that everyone who depends on the Central Maven Repository understands some of the challenges faced by the people maintaining this essential resource.

What is the Nexus Index?

The Central Maven repository maintains a Nexus index of all of the repository contents. The Nexus index is a Lucene index built for fast searching. Instead of grabbing all 70 GB of the repository, a tool like the Nexus repository manager can read 28 MB of index data and then search it for a particular artifact. The index is generated once a week, on Sunday night. Need to know which artifacts contain classes that match the wildcard expression "Hibernate*"? Want to see a list of versions for commons-lang? Search the Nexus index. It is a Lucene index, and the code to create or query a Nexus index is freely available under the Eclipse Public License. The Nexus index is a good thing: it cuts down on the amount of data required to locate artifacts.

It is such a good thing that it currently accounts for more than half of the bandwidth used for the Central Maven repository.


...Too Much of a Good Thing

A few weeks ago, there was a problem with a specific tool that was configured to download the 28 MB Nexus index every five hours, even though the index changes only once a week. You might wonder: "28 MB, what's the big deal? So something gets downloaded more often than it should. So what?" If you write a tool that downloads a 28 MB file once every five hours, here's the math for a single instance of that tool: every five hours works out to between four and five downloads a day, and counting four whole downloads, that is 112 MB a day, or roughly 3.4 GB over a 30-day month.
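The per-instance arithmetic can be sketched in a few lines of Python. This is an illustrative calculation, not the post's original figure: the function name `monthly_download_mb`, the 30-day month, and the decimal conversion (1 GB = 1000 MB) are all assumptions. Counting four whole downloads per day lands on the roughly 3.4 GB a month discussed here, while the exact rate of 4.8 downloads per day comes out closer to 4 GB.

```python
import math

HOURS_PER_DAY = 24
MB_PER_GB = 1000  # assumption: decimal units


def monthly_download_mb(file_mb, interval_hours, days=30, whole_per_day=False):
    """Return the MB one client transfers over `days` days."""
    downloads_per_day = HOURS_PER_DAY / interval_hours
    if whole_per_day:
        # Count only complete downloads in a day (4.8 becomes 4).
        downloads_per_day = math.floor(downloads_per_day)
    return file_mb * downloads_per_day * days


if __name__ == "__main__":
    rounded = monthly_download_mb(28, 5, whole_per_day=True)
    exact = monthly_download_mb(28, 5)
    weekly = monthly_download_mb(28, 7 * 24)
    print(f"every 5 h, 4 per day:   {rounded / MB_PER_GB:.2f} GB")
    print(f"every 5 h, 4.8 per day: {exact / MB_PER_GB:.2f} GB")
    print(f"once a week:            {weekly / MB_PER_GB:.2f} GB")
```

The last line is the interesting comparison: a client that fetched the index only when it changes, once a week, would move about 0.12 GB in the same 30 days.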

Designing a product to download 3.4 GB of data over the span of a month doesn't really seem like a big deal, right? After all, you probably don't pay for metered bandwidth or get charged extra if you transfer some extra bits over your cable modem. Other than the fact that your network is clogged up every five hours to download something that only needs to be downloaded once every seven days, there's no harm to you at all. Right?

Inefficiency Multiplied a Thousandfold

The real harm was happening on the server side. Let's assume for the sake of discussion that 1500 instances of this tool are installed worldwide. If one third of them wake up on the hour and try to download the 28 MB index file, you've got 500 clients all trying to download 28 MB at once. That's 14 GB of data you'll need to serve at the top of the hour. Carry this out for the whole month across all 1500 instances, and this misconfiguration translates to something like 4.8 TB of wasted bandwidth...
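The fleet-wide numbers follow from the same kind of arithmetic. The sketch below is a rough check rather than the author's own calculation: it takes the 1500 installations and 28 MB file from the paragraph above and assumes four whole downloads per client per day, a 30-day month, and 1024-based units, which together come out near the 4.8 TB quoted. The function names are made up for illustration.

```python
def peak_burst_gb(clients_at_once, file_mb):
    """GB requested when `clients_at_once` clients start together (1 GB = 1000 MB)."""
    return clients_at_once * file_mb / 1000


def fleet_monthly_tb(instances, file_mb, downloads_per_day, days=30, unit=1024):
    """Monthly transfer for the whole fleet in TB.

    `unit` is the number of MB per GB and of GB per TB
    (1024 for binary units, 1000 for decimal units).
    """
    total_mb = instances * file_mb * downloads_per_day * days
    return total_mb / unit / unit


if __name__ == "__main__":
    print(f"top-of-hour burst: {peak_burst_gb(500, 28):.1f} GB")
    print(f"monthly, binary:   {fleet_monthly_tb(1500, 28, 4):.2f} TB")
    print(f"monthly, decimal:  {fleet_monthly_tb(1500, 28, 4, unit=1000):.2f} TB")
```

With decimal units the same inputs give about 5 TB, so the exact figure depends on the unit convention, but the order of magnitude does not.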

Have you ever paid the bill for 4.8 TB of bandwidth? How about 30 TB? You get the idea. The problem was eventually identified, but because the tool in question didn't send an appropriate User-Agent header, members of the Maven PMC had to work overtime (on Thanksgiving night) to track down a problem that ultimately reduced the availability of a shared public resource.

At the top of every hour, bandwidth was maxed out and we were seeing dropped connections due to timeouts and over-capacity. As a result, the people responsible for the repository had to temporarily remove the Nexus Index until the source of the problem was found. Once the problem was found, the maintainers of the central repository had to find a way to differentiate between legitimate uses of the repository (a client hitting the Nexus index once a week or once every few days) and requests from misconfigured clients (clients asking to download the same index once every five hours). One subtle inefficiency multiplied a thousandfold translated to connection problems and failed downloads for a world of developers.
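For tool authors, one way to stay on the legitimate side of that line is to identify the client and to avoid re-downloading an index that has not changed. The following is a minimal sketch using only the Python standard library, not a description of how any particular tool or the Central repository works. The URL, the User-Agent string, the one-day minimum interval, and the helper names `due_for_check` and `build_index_request` are all hypothetical, and it assumes the server honors the standard HTTP `If-Modified-Since` header.

```python
import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Optional

# Hypothetical values, for illustration only.
INDEX_URL = "https://repo.example.invalid/index.zip"
USER_AGENT = "example-indexer/1.0 (contact: ops@example.invalid)"
MIN_CHECK_INTERVAL = timedelta(days=1)  # assumption: a daily check is plenty


def due_for_check(last_check: Optional[datetime], now: datetime,
                  min_interval: timedelta = MIN_CHECK_INTERVAL) -> bool:
    """True if the client never checked or the minimum interval has elapsed."""
    return last_check is None or now - last_check >= min_interval


def build_index_request(url: str, user_agent: str,
                        last_modified: Optional[datetime] = None
                        ) -> urllib.request.Request:
    """Build a GET request that identifies the client and, when a previous
    copy exists, asks only for an index newer than that copy."""
    headers = {"User-Agent": user_agent}
    if last_modified is not None:
        # HTTP dates are in GMT, so last_modified must use timezone.utc.
        headers["If-Modified-Since"] = format_datetime(last_modified, usegmt=True)
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    last_copy = datetime(2008, 12, 7, tzinfo=timezone.utc)
    five_hours_later = last_copy + timedelta(hours=5)
    # No network call is made here; the request is only constructed.
    print(due_for_check(last_copy, five_hours_later))
    print(build_index_request(INDEX_URL, USER_AGENT, last_copy).header_items())
```

A server that supports conditional requests can answer such a request with 304 Not Modified and no body, so even a client that checks too often costs a few hundred bytes instead of 28 MB, and a descriptive User-Agent lets the repository maintainers see who is asking.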

Planning for the Future

Sonatype employees who are involved with the effort to maintain the Central Maven repository are thinking about ways to push the distribution of the Nexus index onto Internet-scale distribution channels such as Amazon S3 or CloudFront. The magnitude of the challenge demands new approaches, and Sonatype is playing a part in helping to evolve this essential public resource. If every organization that uses Maven downloaded Nexus, it would go a long way toward addressing some of these issues, and it would increase the availability and speed of the Central Maven repository for everyone else. (It would also give you more stable and faster builds, but that's for another blog post...)