Open Source Supply, Demand, and Security
A monk by the name of John of Salisbury wrote a famous phrase in a 12th century manuscript, borrowed by Sir Isaac Newton and hundreds of others since:
The meaning of the passage is simple: The progress we make only happens because of the progress in learning and understanding others have made before us.
Nowhere is this more evident than in the adoption of open source. Nearly all of the software shipped today relies on previous innovation that is distributed freely on scaffolding built by the world's foremost experts, available to all developers free of charge.
In past State of the Software Supply Chain reports, we estimated that up to 90% of the code we run in production is of open source origin. Therefore, the economics of open source are good indicators of trends and challenges in the wider software market.
For the 9th consecutive year, we continue to track the growth of open source adoption across the four major open source ecosystems. These collectively account for four of the top five languages on GitHub, and a 60% share of the most popular programming languages according to the PYPL language popularity index [1]. Leveraging our continued monitoring, we present the combined statistics of each ecosystem in the table below.
FIGURE 1.1. SOFTWARE SUPPLY CHAIN STATISTICS, 2023
Ecosystem | Total Projects | Total Project Versions | 2023 Annual Request Volume Estimate | YoY Project Growth | YoY Download Growth | Average Versions Released per Project |
---|---|---|---|---|---|---|
Java (Maven) | 557K | 12.2M | 1.0T | 28% | 25% | 22 |
JavaScript (npm) | 2.5M | 37M | 2.6T [2] | 27% | 18% | 15 |
Python (PyPI) | 475K | 4.8M | 261B [3] | 28% | 31% | 10 |
.NET (NuGet Gallery) | 367K | 6M | 162B [4] | 28% | 43% | 17 |
Totals/Averages | 3.9M | 60M | 4T | 29% | 33% | 15 |
Open source supply sees a resurgence
The supply side of open source is an interesting metric to gauge the pace and scale of innovation that occurs in a given ecosystem. The more open source projects are published every year, the more innovation occurs in a given ecosystem.
New open source projects across the monitored ecosystems have been published at a relatively steady 15% average rate [5] in recent years, a significant reduction in pace from the highs seen in 2019 and before.
This two-year slump is most likely related to the COVID-19 pandemic period and associated slowdown. While some studies suggest productivity did increase during the 2020-2023 period in the U.S., a negative correlation emerges in open source production trends. This is further supported by another study that found productivity rates in information and communication technology did decline towards 2022. One other explanation could be that a lot of these projects are in fact coming from commercial activity and not people with spare time, which was abundant during the pandemic.
To date, the data in 2023 shows the innovation slowdown is now over. Each monitored ecosystem showed a remarkably consistent project growth rate, varying just 2% across all four monitored ecosystems to a total average growth rate of 29% year-over-year.
The rate of production growth is recovering across the board, and both Maven Central and NuGet are on track to exceed the rate of growth seen in 2020.
PyPI and npm, although growing, have not yet caught up to their original rate of growth but are on an upward trend. In a later section, we will see how breakthroughs and interest in AI and its related tooling are fueling the rate of growth in these ecosystems.
FIGURE 1.2. OPEN SOURCE NEW PROJECT GROWTH RATE OVER THE PAST 4 YEARS
FIGURE 1.3. OPEN SOURCE PROJECTS AND VERSIONS GROWTH, 2023
Open source consumption is decelerating
Despite this deceleration, the two largest ecosystems, Maven and npm, are each estimated to reach over a trillion requests in 2023. npm is on course for a staggering 2.6 trillion requests in total, continuing a modest rate of growth that in absolute terms still surpasses the total request volume of PyPI in 2022.
These two ecosystems account for 90% of the requests served with the remaining two growing at above average pace.
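The 90% figure can be reproduced from the request estimates in Figure 1.1. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and are not part of the report's tooling.

```python
# Estimated 2023 request volumes from Figure 1.1, in billions of requests.
REQUESTS_BILLIONS = {
    "Maven": 1000.0,
    "npm": 2600.0,
    "PyPI": 261.0,
    "NuGet": 162.0,
}


def request_share(volumes, names):
    """Return the fraction of total requests served by the named ecosystems."""
    total = sum(volumes.values())
    return sum(volumes[name] for name in names) / total


share = request_share(REQUESTS_BILLIONS, ["Maven", "npm"])
print(f"Maven + npm share of requests: {share:.1%}")
```

Because the inputs are rounded estimates, the result lands just under 90% rather than exactly on it.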
FIGURE 1.4. CUMULATIVE ESTIMATED REQUESTS PER ECOSYSTEM OVER 6 YEARS
Annual request growth rate of each ecosystem
Requests are the fundamental measure of how popular an open source ecosystem is and how lively its usage is. Other factors within an ecosystem may vary, such as the larger size and complexity of Java packages compared to JavaScript packages.
Investigating the rate of growth for requests can reveal information about the state of open source adoption, as well as the growth of the software industry at large.
Figure 1.5 charts these individual growth rates over time and displays an average across all four major ecosystems.
FIGURE 1.5. GROWTH RATE OF THE MONITORED OPEN SOURCE ECOSYSTEMS OVER 5 YEARS
We can see a clear delineation between the stabilization of large ecosystems like Maven and npm, and continued accelerated growth in PyPI and NuGet.
Figure 1.6 charts the overall aggregate request growth across all ecosystems. It illustrates that although the pace of growth is slowing, the absolute scale of growth continues to compound on previous years' rates. To put it simply, the pace of open source adoption still shows no signs of stopping.
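To see how a slowing growth rate can still add a large absolute volume, the sketch below backs out the implied prior-year volume from the Figure 1.1 estimates. It assumes the YoY Download Growth column applies directly to the annual request volume estimate, which the report does not state explicitly; the names used are illustrative.

```python
# 2023 figures from Figure 1.1: (estimated requests in billions, YoY download growth).
ECOSYSTEMS = {
    "Maven": (1000.0, 0.25),
    "npm": (2600.0, 0.18),
    "PyPI": (261.0, 0.31),
    "NuGet": (162.0, 0.43),
}


def implied_increase(current, growth):
    """Back out the prior-year volume and return the absolute increase."""
    previous = current / (1.0 + growth)
    return current - previous


for name, (current, growth) in ECOSYSTEMS.items():
    added = implied_increase(current, growth)
    print(f"{name}: +{added:.0f}B requests at {growth:.0%} growth")
```

Under that assumption, the ecosystem with the lowest growth rate still adds the most requests, because its base is by far the largest.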
FIGURE 1.6. TOTAL OPEN SOURCE REQUESTS OVER YEARS
Individual ecosystem analysis
Java continues to grow at a healthy pace, hitting an estimated 25% YoY request growth rate. If previous years are any indication, we may well see a spike towards the end of the year.
JAVA 2023 BY THE NUMBERS:
The growth of npm is the slowest of all the monitored ecosystems, estimated at 18% YoY. Nevertheless, owing to npm's substantial footprint, this translates to a staggering 400 billion requests, surpassing the combined total of requests served by PyPI and NuGet.
JAVASCRIPT 2023 BY THE NUMBERS:
PYTHON 2023 BY THE NUMBERS:
.NET 2023 BY THE NUMBERS:
Open source software security concerns see no sign of slowing
In 2022, we reported a massive increase in the growth of malicious attacks on the software supply chain. Since our last report, this method of propagating security threats using trusted developer utilities and ecosystems has continued to evolve and flourish.
Over the past few years, a troubling trend has emerged in the software supply chain: tailor-made packages designed to run a malicious payload on download, without any developer interaction. This form of intrusion relies on developers not recognizing that the build breakage caused by the fake package might indicate that something nefarious has already happened on their system. We did a deep dive into types of malicious attacks in last year’s report.
In our YoY monitoring, at the time of writing in September 2023, we have logged 245,032 malicious packages, meaning that in the last year we’ve seen the number of malicious packages triple. Looked at another way, in one year alone we’ve seen twice as many supply chain attacks as in all previous years combined.
This pace of growth is astonishing. It signals the role of the software supply chain as one of the fastest growing vectors for adversaries to execute malicious code. Furthermore, we have seen an increase in nation-state actors leveraging these vectors (see our deep dive section below).
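The "tripled" and "twice all previous years combined" framings describe the same growth. As a short derivation, let $P$ be the cumulative count before this year, $N$ the number found this year, and $T$ the new cumulative total; these symbols are introduced here for illustration only.

```latex
\begin{align*}
N &= 2P && \text{(this year's count is twice the prior cumulative count)} \\
T &= P + N = 3P && \text{(so the cumulative total triples)}
\end{align*}
```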
FIGURE 1.7. NEXT GENERATION SOFTWARE SUPPLY CHAIN ATTACKS (2019-2023)
245,000 malicious packages discovered, 2x all previous years combined
This is alarming news. Many open source ecosystems have implemented new security policies, such as mandatory MFA, but these usually only protect existing open source publishers from attack. Packages containing malicious code are often treated much like packages with new security vulnerabilities: they are taken down entirely through volunteer effort, following a vulnerability removal process that is not appropriate when the code is designed to be malicious from the start. This approach can leave malicious packages available longer than necessary, putting developers at risk.
Notable malicious packages and vulnerabilities
As we continue to document an overall rise in malicious attacks on open source ecosystems, the monitored 2022-2023 period has also seen more professional criminal campaigns emerge. The software supply chain lends itself well to the cybercriminal ecosystem, either as an initial access vector for initial access brokers or as a means of distributing initial access malware for Advanced Persistent Threat groups. Here are several examples we’ve seen this year:
Lazarus-created PyPI package 'VMConnect' imitates VMware vSphere connector
In August 2023, Sonatype discovered a malicious Python package, 'VMConnect,' which mimics a legitimate VMware module on PyPI. This is part of a wider cyber campaign called "PaperPin," and is widely thought to originate from the Lazarus Group, a North Korean state-affiliated organization. The packages aim to download further malicious payloads from attacker-controlled URLs. The focus on VMware, a widely used virtualization platform, is particularly concerning, as a successful compromise could have far-reaching implications for enterprise networks and is widely attractive to state-affiliated actors.
ChatGPT histories uncovered due to a vulnerability in Redis component used by OpenAI
In March 2023, ChatGPT users experienced a data leak where chat histories displayed other people's queries. OpenAI identified the issue as a race condition vulnerability in an open source component called Redis, which they use for caching user data. This flaw made sensitive data of about 1.2% of ChatGPT Plus subscribers accessible to others. The vulnerability was exacerbated by a recent server change that increased the probability of the race condition occurring. The issue underscores the importance of even rarely occurring vulnerabilities, especially in widely used components like Redis, given their potential to cause widespread disruption and data exposure.
PyTorch namespace confusion attack targeted utilities aimed at AI developers
In the past couple of holiday seasons, we've seen some big supply chain attacks, including one on PyTorch, a popular machine learning (ML) framework. The attackers used a tactic known as namespace confusion to specifically go after the experimental "nightly" build of PyTorch. They managed to steal sensitive data, signaling hackers are increasingly setting their sights on AI and ML tools. These tools are becoming more critical in various sectors, making them attractive targets. While only the experimental build was hit, the incident serves as a wake-up call for better security in the booming field of AI.
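Namespace confusion depends on a package name being resolvable from more than one index. One defensive check is to look for names that exist both on a private index and on a public one. The sketch below assumes both name lists have already been collected; the package names are hypothetical, and the only real-world rule it relies on is the PEP 503 name normalization.

```python
import re


def normalize(name):
    """Normalize a Python package name per PEP 503 (case and -, _, . runs)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def confusable_names(internal_names, public_names):
    """Return internal package names that also exist on the public index."""
    public = {normalize(name) for name in public_names}
    return sorted(name for name in internal_names if normalize(name) in public)


# Hypothetical inputs: names served by a private index, and names found on a public one.
internal = ["acme_utils", "Acme.Metrics", "acme-billing"]
public = ["acme-metrics", "requests", "acme-utils"]

print(confusable_names(internal, public))
```

Any name this returns is a candidate for closer review, since an installer consulting both indexes could resolve it from either one.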
A timeline of attacks
We have continued to curate a timeline of known malicious packages and software supply chain campaigns. This interactive timeline summarizes notable supply chain incidents, next-gen attacks, and other incidents propagated using the software supply chain.
FIGURE 1.8. SOFTWARE SUPPLY CHAIN ATTACKS
AUG 2023
Malicious PyPI package imitates VMware vSphere connector module
A fake PyPI package, ‘VMConnect,’ copied VMware's vSphere connector but harbored hidden malicious code. It was part of an ongoing campaign, "PaperPin," along with similar packages. These packages were removed from PyPI.
JULY 2023
A French-meme-inspired PyPI package targets Windows with an info-stealer
A PyPI package called ‘feur’ cleverly disguised a Windows Remote Access Trojan (RAT) behind a meme-related name. This RAT had surveillance features, such as clipboard access, network monitoring, webcam usage, and screenshots.
JUNE 2023
PyPI attackers unleash trojans and info-stealers
Sonatype detected malicious PyPI packages posing as the npm "colors" library, targeting Windows with trojans hosted on Discord. One package affected Windows and Unix with trojans and Python code. Others used variable obfuscation similar to crypto-miner malware.
JUNE 2023
Manifest confusion in npm
"Manifest confusion" was revealed in the npm ecosystem. A package's metadata (dependencies and scripts) is published separately from its actual contents, stored in a tarball containing package.json. This disconnect can result in issues like cache poisoning or hidden dependencies/scripts.
APR 2023
Threat actors compromise 3CX desktop app in software supply chain attack
A software supply chain attack struck 3CX's Mac and Windows client apps, impacting 600,000 users. This month-long, state-actor-led attack prompted 3CX to urge users to uninstall compromised apps and migrate to safer frameworks.
MAR 2023
W4SP copycats continue to infiltrate PyPI registry
Microsoft-helper package reveals copycat info-stealer
OpenAI data breach traced to unpatched Redis vulnerability
FEB 2023
https package attempts to sneak in through GTA 5 mods
Info-stealers distributed via Python packages on the PyPI registry
Malware campaign floods PyPI with thousands of malicious packages
JAN 2023
Malicious Python package attempts to download and install a Trojan virus
This malware validates the presence of a VM before attempting to execute. Sonatype confirmed the “minimums” package as malicious. It contains a payload in the setup.py file that attempts to download a Trojan virus from a rogue server, install it, and log the installation result using a Discord webhook.
DEC 2022
PyTorch-nightly build compromised
Malicious 'Cabo Custody Restful' attack tries to trick developers using macOS
NOV 2022
Malicious reverse shell and bind shell scripts taint packages
Sonatype discovered packages tainted with malicious reverse shell and bind shell scripts. Other packages looked for information on the target computer’s OS such as hostnames, IPs, credentials, and other configuration details with the purpose of exfiltrating such data to malicious servers.
SEPT 2022
‘JuiceLedger’ tries to catch PyPI maintainers unaware
A phishing attack attempted to distribute a .NET-based malware, dubbed 'JuiceStealer,' that steals credential, browser, and cryptocurrency vault information and feeds the ill-gotten goods to a domain purportedly controlled by JuiceLedger.
AUG 2022
Cryptomining packages flood npm, PyPI
PyPI package ‘secretslib’ drops Linux malware to mine Monero
‘Requests’ library typosquats install ransomware
JULY 2022
PyPI packages steal Telegram cache files, add Windows Remote Desktop accounts
Sonatype discovered malicious PyPI packages that set up new Remote Desktop user accounts on your Windows computer and steal encrypted Telegram data files from your Telegram Desktop client.
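The manifest confusion entry in the timeline above describes a mismatch between the metadata a registry publishes and the package.json inside the tarball. A minimal sketch of checking for that mismatch follows. It assumes both manifests have already been fetched and parsed into dictionaries; the function name, the compared fields, and the sample data are all hypothetical.

```python
def manifest_mismatches(registry_manifest, tarball_package_json,
                        fields=("name", "version", "dependencies", "scripts")):
    """Return the fields whose values differ between the two manifests."""
    return {
        field: (registry_manifest.get(field), tarball_package_json.get(field))
        for field in fields
        if registry_manifest.get(field) != tarball_package_json.get(field)
    }


# Hypothetical example: the registry metadata hides a script and a dependency.
registry = {"name": "demo-pkg", "version": "1.0.0", "dependencies": {}, "scripts": {}}
tarball = {
    "name": "demo-pkg",
    "version": "1.0.0",
    "dependencies": {"left-helper": "^2.0.0"},
    "scripts": {"install": "node setup.js"},
}

for field, (published, packed) in manifest_mismatches(registry, tarball).items():
    print(f"{field}: registry={published!r} tarball={packed!r}")
```

An empty result means the two manifests agree on the compared fields. It says nothing about whether the package contents themselves are safe.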
Differentiating software vulnerabilities and malware
Up until now, we’ve been talking about malware and malicious attacks on the software supply chain, or perhaps better stated, malware propagated using the open source supply chain. In this next section, we’re going to discuss software vulnerabilities. While the two concepts are related, they are distinct, so we’d like to quickly define the difference between a vulnerability and a piece of malware.
Software vulnerability: A flaw in the code
A software vulnerability is akin to a flaw in code, much like a faulty lock on a door. However, unlike malware, vulnerabilities are not intentional. Instead, they represent weaknesses in software components or projects.
Similar to how a faulty lock compromises the security of a building by allowing unauthorized access, a software vulnerability creates a gap in the software's security perimeter. This gap becomes an entry point for intruders to exploit, gaining unapproved access to the system, application, or component.
Malware: Malicious intent in open source
Malware, short for “malicious software,” poses a significant threat to open source software ecosystems. It encompasses a wide range of malicious programs, such as viruses, worms, trojans, ransomware, spyware, and adware, all designed to gain unauthorized access to information or systems.
With its various forms, malware’s primary purpose is to steal data, install harmful software, gain control of a network, or compromise software or hardware. Threat actors employ diverse distribution methods, such as infected email attachments, malicious websites, or compromised software downloads.
Consumption behavior contributing to security concerns
There is growing evidence that, despite today's standard practices for avoiding vulnerable components, the controls are not having the effect needed to reduce the attack surface. For example, as of September 2023, nearly a quarter of all net new downloads of Log4j are still of versions vulnerable to the infamous Log4Shell vulnerability. Almost two years after the vulnerability was first found, this pace continues every week. This is only part of the story.
As we discussed last year, the numbers for other critical vulnerabilities that have not received as much widespread media attention are even more depressing.
total Log4j downloads since Dec 15, 2021 | 29% vulnerable
According to a joint consortium of national operators including CISA, NSA, NCSC-UK and others, attackers are exploiting older well-known vulnerabilities much more frequently than new zero-day vulnerabilities. This is extremely important to understand. While we should of course worry about zero-days, we also know that 96% of vulnerable open source downloads have a non-vulnerable fix available. Those 96% need to be addressed.
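Measuring a vulnerable-download share like the Log4j figure above comes down to a version-range check applied to download counts. The sketch below is a simplified illustration, not the method used in this report. It assumes every 2.x release below a chosen cutoff counts as affected, it uses 2.15.0 as an illustrative default cutoff, and it ignores backported security releases and follow-up CVEs. The download counts are hypothetical.

```python
import re


def parse_version(version):
    """Turn '2.14.1' into (2, 14, 1); qualifiers such as '-beta9' are ignored."""
    parts = [int(p) for p in re.findall(r"\d+", version.split("-")[0])]
    return tuple(parts + [0] * (3 - len(parts)))


def is_affected(version, first_fixed="2.15.0"):
    """Assumption: every 2.x release below first_fixed counts as affected."""
    parsed = parse_version(version)
    return (2, 0, 0) <= parsed < parse_version(first_fixed)


def vulnerable_share(downloads):
    """downloads maps version -> download count; returns the affected fraction."""
    total = sum(downloads.values())
    affected = sum(n for v, n in downloads.items() if is_affected(v))
    return affected / total


# Hypothetical weekly download counts, chosen only to illustrate the calculation.
sample = {"2.14.1": 250, "2.17.1": 450, "2.20.0": 300}
print(f"{vulnerable_share(sample):.0%} of sampled downloads are affected")
```

A production check would take its affected ranges from the published advisory rather than from a single cutoff.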
Vulnerable components consumed
Let’s start off by looking at the top level. In 2022, we saw that 12% of downloads served by Maven Central [6] contained at least one known security vulnerability.
This number is important when considering that the easiest way to reduce risk of a supply chain incident caused by a vulnerability is to simply choose a better, non-vulnerable version of a component [7]. However, there is some improvement here. The number of vulnerable downloads in 2021 was 14%, and the number to date in 2023 sits around 10%.
FIGURE 1.9. PERCENTAGE OF COMPONENTS WITH KNOWN VULNERABILITIES SERVED FROM MAVEN CENTRAL
FIGURE 1.10. VULNERABLE DOWNLOADS BY SEVERITY
FIGURE 1.11. NVD KNOWN VULNERABILITY SCORE
The increase of critically vulnerable components being consumed could be due to the fact that these vulnerabilities are found and reported primarily in more popular and widely adopted open source software. Popularity begets more attention from good and bad actors, resulting in increased likelihood of a critical issue being present. It’s also worth noting that these more popular components have an official disclosure process. This means, on average, these critical vulnerabilities should be the ones that are most noticed. But, as we’ve seen with the vulnerable version of Log4j, “knowing” is only half the battle. Organizations have to care, and they have to have an automated way to address this issue.
A global view of vulnerable open source downloads
Software development has evolved into one of the most globally influential industries, shaping various sectors and regions in unique ways. However, not all regions share the same level of emphasis on software development. To gain insight into how the trends we've explored thus far manifest on a global scale, we conducted an analysis that looks at open source vulnerability consumption by country.
Our study focused on countries that collectively downloaded over 100 million open source components from Maven Central in the past year. By scrutinizing the percentage of vulnerabilities associated with the software downloaded in each region, we start to gain insights into how different parts of the world manage their software supply chains.
In Figure 1.12, we delineate those that have stronger management programs from those that don’t by plotting the percentage of vulnerabilities against the average number of vulnerable downloads (approximately 22%) and applying a ranking based on how countries compare to that average. But it’s important to consider the context, and this is one of the most important figures to come out of Sonatype’s research: 96% of known vulnerabilities downloaded from Maven Central have a non-vulnerable version available.
The countries covered in the graph below include twenty of the largest consumers of open source software in the world. Even at the low end of our criteria (around 100 million downloads), 9.5% of those downloads are vulnerable components. When you consider juggernauts of open source consumption like the United States, the European Union (collectively), and China, tens of billions of vulnerabilities have entered the supply chains that produce the software we all use and our governments run on.
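The comparison against an average described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the ranking method behind Figure 1.12: it uses an unweighted mean, a two-way label, and hypothetical shares with placeholder country names. The last line works through the low-end figure quoted in the text.

```python
def rank_against_average(shares):
    """Label each country by how its vulnerable share compares to the mean."""
    average = sum(shares.values()) / len(shares)
    labels = {
        country: ("below average" if share < average else "at or above average")
        for country, share in shares.items()
    }
    return average, labels


# Hypothetical vulnerable-download shares; the country labels are placeholders.
shares = {"Country A": 0.095, "Country B": 0.22, "Country C": 0.31}
average, labels = rank_against_average(shares)
print(f"average vulnerable share: {average:.1%}")
for country, label in sorted(labels.items()):
    print(f"{country}: {label}")

# The low-end figure quoted above: 9.5% of roughly 100 million downloads.
print(f"{round(0.095 * 100_000_000):,} vulnerable downloads")
```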
FIGURE 1.12. AVERAGE VULNERABILITIES BY COUNTRY WITH OVER 1 BILLION DOWNLOAD VOLUME
As we’re only scratching the surface with this regional view of vulnerable downloads, you can explore a deeper dive into open source consumption patterns within specific economic regions in Chapter 3 of this report, where we further unravel the intricacies of dependency management on a global scale. We also summarize the role regulations are having on the industry in Chapter 5.