Intro to malware analysis: Analyzing Python malware

Sonatype's next-generation AI behavioral analysis systems are constantly on the search for malicious packages published to open source repositories. Once a package is flagged by these systems, it is passed on to our Security Research team, where we verify whether it is truly malicious.

In this article, we are going to dive into the waters of malware analysis, starting with some basics and slowly going into the deep end as we see fit along the way.

A very popular attack vector for malicious authors is typosquatting, a technique we've mentioned quite a bit in some of our other articles. This consists of authors publishing malicious packages with names very similar to legitimate ones such that a small typo would result in the malicious package name. This way authors can prey on unsuspecting victims as they attempt to install what they believe is legitimate software, that is, right until they notice something has gone terribly wrong.
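To make the idea concrete, here is a minimal sketch of how a name that sits one typo away from a known package could be spotted. It uses plain edit distance; the function names and the one-edit threshold are our own assumptions for illustration, not how any particular detection system works.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def lookalikes(candidate: str, known_names, max_distance: int = 1):
    """Return known names within max_distance edits of candidate (exact matches excluded)."""
    candidate = candidate.lower()
    return [
        name for name in known_names
        if name.lower() != candidate
        and edit_distance(candidate, name.lower()) <= max_distance
    ]


# Hypothetical usage: compare a new upload against a list of established names.
print(lookalikes("view", ["views", "requests", "numpy"]))
```

A name that differs by a single missing letter, like the one discussed next, lands at distance 1 from its legitimate neighbor.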

`Views` is a Python package meant to make generators and sequence creation efficient. There are many similarly named packages for the same purpose. If for some reason someone tried to download this package but forgot the last 's' they could have found themselves infected with one of the most recent finds by our AI-enabled systems: 'view.'

When it comes to malware there are usually two main things we want to do: static and dynamic analysis. Static analysis focuses on the source code: what can we find out by looking at the sources, imports, strings and so on? Dynamic analysis, on the other hand, focuses on behavior and on understanding what the malware is actually doing by executing portions of the code or, in some cases, all of it. In most cases we need a mix of both techniques to get the full picture, and it is very common to run dynamic analysis while also looking through the resulting assembly code in a disassembler and/or debugger. But enough introductions already, let's begin with the actual analysis.

Source code: The low-hanging fruit

Since we are talking about open source malware that means we have access to the source code. However, open source malware is becoming more and more like traditional malware, in the sense that all you see in the open source code is a first-stage dropper whose sole purpose is to reach out to an external server and grab the second-stage payload, the true malware. Let's see if we get lucky with our malicious package 'view' and can manage to find something interesting within the code.

Image 2: malicious snippet in setup.py from ‘view’

Just as we suspected, all this is doing is reaching out to an external server and grabbing a second-stage executable. In this case, as we can see in Image 2, it's reaching out to anonfiles, which is immediately a red flag. At this point we probably know enough about this package to understand that it is malicious in nature and we probably don't want this anywhere near our systems.

As a security researcher, my eyes light up when I see this and all I'm thinking is that I hope the file hasn’t been taken down yet. So I rushed to download it. This feels like those times in TV commercials where they have to place a disclaimer at the bottom: "Don't try this at home." However, in the interest of learning all I can say is: Take all proper precautions when dealing with malware. Perhaps an article on how to set up a safe malware analysis environment could be a good addition to our blog. Let us know if this sounds interesting by submitting a comment at the bottom of the page.

The code first makes a request, parses the response and then makes a second request because the target file has a changing URL. The first request fetches the web page so that the correct download link can be parsed out of it; the second request grabs the file.
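This kind of first look can also be done without running anything. The sketch below parses a setup.py-style source with Python's `ast` module and lists string literals that look like URLs, along with imports from a short watch list. The watch list, the function name and the sample source are all assumptions made up for illustration; the sample is not the code from 'view'.

```python
import ast

# Illustrative watch list (an assumption): modules worth a second look in a setup.py.
WATCHED_MODULES = {"urllib", "requests", "http", "socket", "subprocess"}


def scan_source(source: str):
    """Parse source without executing it; return (watched imports, URL-like strings)."""
    tree = ast.parse(source)
    imports, urls = set(), []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in WATCHED_MODULES:
                    imports.add(root)
        elif isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root in WATCHED_MODULES:
                imports.add(root)
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(("http://", "https://")):
                urls.append(node.value)
    return sorted(imports), urls


# Hypothetical sample with a placeholder URL.
SAMPLE = '''
import urllib.request
from setuptools import setup
page = urllib.request.urlopen("https://files.example.invalid/abc").read()
setup(name="demo")
'''
print(scan_source(SAMPLE))
```

Because the source is only parsed, never imported or executed, this is safe to run on a suspicious file. It will miss URLs that are built up at runtime or obfuscated, so treat an empty result as "nothing obvious", not "clean".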

With our Virtual Machine all set up we can now proceed to download the executable. I like to use `curl` or `wget` to make sure the malicious file is only downloaded and never executed. For additional precautions we can write the output to a file with a non-executable file extension such as txt.
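Once the file is on disk, a quick and safe first step is to fingerprint it. The sketch below only reads bytes; it records the size and SHA-256 (handy for looking the sample up in online services) and checks for the `MZ` marker that Windows executables start with. The function name and the idea of saving the download as a `.txt` file follow the precautions above and are otherwise our own assumptions.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Read a file as raw bytes (never executing it) and summarize it."""
    data = Path(path).read_bytes()
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Windows PE files begin with the two-byte DOS signature "MZ".
        "looks_like_pe": data[:2] == b"MZ",
    }
```

For example, `fingerprint("sample.txt")` (a hypothetical file name) returns a small dictionary that can go straight into your notes.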

Now that we have the executable you could think that the source code analysis is over. But not so fast. Even though executables don't have source code we can read, they do have strings that are often full of valuable information. Something as simple as running `strings` on the exe is enough to give us tons of clues as to what this is doing. With strings, we can see some of the imports the executable is using along with lots of Python libraries and Python code. Maybe this is a Python script bundled with PyInstaller to make it an executable, which would explain all the Python code we see within the executable. To get a better idea we can move on to other tools that will help us.

Image 3: strings command output from ‘view’s remote executable
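If `strings` isn't available in your analysis environment, its basic behavior is easy to approximate. This is a rough sketch that pulls runs of printable ASCII out of raw bytes; the function name and the default minimum length of 4 are our choices, and the real tool has more options (other encodings, offsets and so on).

```python
import re


def extract_strings(data: bytes, min_length: int = 4):
    """Return runs of printable ASCII characters at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical usage on a few bytes standing in for a binary.
blob = b"\x00\x01import os\x00ab\x00PyInstaller\xff"
print(extract_strings(blob))
```

Short runs such as `ab` are dropped, which keeps most of the random noise out of the output.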

Online tools for analyzing binaries

Now that we have an executable there are some very valuable services online that will help us in our task of understanding this malware. Many options for sandboxes exist and many have free options that are very complete. My favorites are VirusTotal (VT) and any.run; these are the first I always go to. But I wanted to try something new and I came across filescan.io. Let's give it a shot.

Image 4: filescan.io summary result

This gives us tons of information and one of the most interesting parts is that it allows you to download extracted files the executable may be hiding. But it doesn't always work so you might still need to dive into the deep waters of malware analysis to get to the bottom of the malicious behavior of extracted files.

One of the tabs contains extracted strings where we can see some of the imports and functions the malware is trying to use. We can see things like `GetProcAddress` and `LoadLibrary`, which tell us that the author is likely trying to hide the true inner workings of their code by loading libraries and resolving functions at runtime. Another very interesting one, as seen in Image 5, is `IsDebuggerPresent`, which tells us that this malware is implementing some sort of Sandbox and Analysis Evasion and wants to complicate things for us. Oftentimes, as soon as malware detects it's being debugged or run in a sandbox, it goes to sleep and hides its true behavior.

Image 5: extracted strings from filescan.io
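The same triage can be scripted locally. The sketch below looks for the three API names mentioned above inside the raw bytes of a sample and attaches a short note to each hit. The dictionary and function name are assumptions for illustration, and a hit is only a hint: plenty of benign programs reference these functions too.

```python
# API names from the discussion above, with a short note on why each is interesting.
API_HINTS = {
    "LoadLibrary": "loads a library at runtime",
    "GetProcAddress": "looks up a function address at runtime",
    "IsDebuggerPresent": "checks whether a debugger is attached",
}


def flag_api_names(data: bytes):
    """Return the subset of API_HINTS whose names appear in the raw bytes."""
    found = {}
    for name, note in API_HINTS.items():
        # Substring match, so LoadLibraryA / LoadLibraryW are caught as well.
        if name.encode("ascii") in data:
            found[name] = note
    return found


# Hypothetical usage on stand-in bytes.
print(flag_api_names(b"\x00LoadLibraryA\x00IsDebuggerPresent\x00"))
```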

VirusTotal is leading the industry on these solutions because we can get so much more detail from VT reports. It's always good to have a variety of tools under your belt and ready to go, but we tend to have favorites for a reason. As we can see in Image 6, the level of detail is much greater: we can see the actual arguments that were passed to the calls and even the returned values. This goes a very long way in understanding true behavior.

Image 6: Native calls in VirusTotal report

The network information given by VT also has lots of hints at what the malware is doing, where it is going and who it is talking to. In this case, we can see plenty of indicators of malicious behavior and we can even extract a couple of Indicators of Compromise (IOC).

Malicious sites being contacted by our malware sample:

  • accv.es: a hint for attribution. European, possibly Spain?

  • URL paths in Spanish: more hints for attribution.

  • discord.com/api/webhooks: we know Discord webhooks are usually up to no good.

  • crl.dhimyotis.com: the sample reaches out to grab a root certificate from here. Odd?

  • pastebin.com: Command and Control.

Image 7: Pastebin.com: Command and Control (C2)

Of course, pastebin.com stands out and is even identified by the automated sandbox engines as the Command and Control (C2 or C&C) server. Another one that stands out is dhimyotis.com. This one is odd because when I go to check out what it is, it tells me it's a security website. They have a product meant to help verify trust and identity on the internet. So why is this malware reaching out to it, and why is it grabbing a root certificate from a page that has directory listing enabled? Seeing a page with directory listing can be an indicator that the site has been compromised, but a listing could also be there for legitimate purposes. A lot of malware analysis consists of heavy research, understanding new concepts and exploring all possibilities, so this isn't necessarily something malicious. It does, however, tell us something more about the behavior of our executable.

Image 8: Directory listing reached by malware from dhimyotis.com
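When writing up hosts like the ones above as IOCs, it helps to pull them out of report text automatically and to "defang" them so nobody clicks one by accident. Here is a minimal sketch; the regular expression is deliberately simple and the function names are our own. It will also match things that merely look like host names (file names such as `source.pyc`, for instance), so the output needs a human pass.

```python
import re

# Simplified host pattern (an assumption): dot-separated labels ending in letters.
HOST_PATTERN = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def extract_hosts(text: str):
    """Return unique host-like tokens in order of first appearance, lowercased."""
    seen = []
    for match in HOST_PATTERN.finditer(text):
        host = match.group().lower()
        if host not in seen:
            seen.append(host)
    return seen


def defang(indicator: str) -> str:
    """Make a URL or host non-clickable for reports."""
    return indicator.replace("http", "hxxp").replace(".", "[.]")


notes = "GET https://pastebin.com/raw/xyz then discord.com/api/webhooks and Crl.dhimyotis.com"
for host in extract_hosts(notes):
    print(defang(host))
```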

Manual analysis

We have now used a combination of manual inspection and automated online tools to help us understand what the malware is doing and we certainly have enough to deem this a true malicious package with nefarious intentions. But we're not really clear on what the end goal is here. This is where things start getting fun. 

Before we go full reversing mode and open IDA or Ghidra, let's follow the clues that have been telling us this is a Python script wrapped in a Windows executable. We already know the exe is filled with Python bytecode and libraries so in order to confirm this is a packed executable we can look at the PE headers.

Image 9: PE headers with pestudio

In Image 9, we can see this executable contains a section, `_RDATA`, that doesn't follow the standard naming convention for executables. pestudio is even nice enough to highlight it for us. Looking further, we see that there is also an overlay. An overlay is data appended to the exe after its last section, which screams packed executable: in this case, a Python script wrapped in an exe.
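Both observations can be reproduced with nothing but the standard library, which is a nice way to understand what the tool is showing. The sketch below walks the PE headers by hand, lists the sections, flags names outside a short list of common ones, and computes where an overlay would start (the end of the last section's raw data). The list of common names and all function names are assumptions for illustration, and the overlay calculation is simplified: a signed binary, for example, also keeps its certificate data past the last section.

```python
import struct

# Short illustrative list of common section names (an assumption, not exhaustive).
COMMON_SECTIONS = {".text", ".data", ".rdata", ".rsrc", ".reloc", ".pdata", ".idata", ".bss", ".tls"}


def parse_sections(data: bytes):
    """Return (name, raw_pointer, raw_size) for every section in a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)      # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    (count,) = struct.unpack_from("<H", data, pe_offset + 6)      # NumberOfSections
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 20)  # SizeOfOptionalHeader
    table = pe_offset + 24 + opt_size                         # section table start
    sections = []
    for index in range(count):
        entry = table + 40 * index                            # each entry is 40 bytes
        name = data[entry:entry + 8].rstrip(b"\x00").decode("ascii", "replace")
        raw_size, raw_pointer = struct.unpack_from("<II", data, entry + 16)
        sections.append((name, raw_pointer, raw_size))
    return sections


def unusual_sections(data: bytes):
    """Section names that are not in the illustrative COMMON_SECTIONS list."""
    return [name for name, _, _ in parse_sections(data) if name not in COMMON_SECTIONS]


def overlay_offset(data: bytes):
    """Offset where data past the last section begins, or None if there is none."""
    ends = [ptr + size for _, ptr, size in parse_sections(data) if size]
    end = max(ends, default=0)
    return end if 0 < end < len(data) else None
```

With the offset in hand, `data[overlay_offset(data):]` is the appended blob, ready to be written to its own file for a closer look.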

Enough playing around, we know there is Python inside, so let's crack it open and extract it. There are many ways we could extract the section we are interested in. There is even a library in the PyPI registry to help us out with this: `pefile`. We can write a quick script to read the executable, get the overlay offset and dump the file contents from that offset until the end of the file. But why stop there. There is another Python script available that can go ahead and extract all the `pyc` files, which are the Python bytecode within our exe: `pyinstxtractor`. Let's run this and get the really interesting files.

Image 10: extracting pyc bytecode from exe via pyinstxtractor.

In Image 10, we can see that this tool not only extracts our files of interest but also tells us what the probable entry point is; in this case it points to `source.pyc`. This is really useful given that these executables are bundled with absolutely everything they need to run, that is, all the libraries and functions they use. Malicious actors can also add dead code to make things a bit confusing, so knowing the entry point is very valuable.

Image 11: extracted files

Finally, we can use a Python decompiler to go from bytecode (pyc files) to source code (py files). There are plenty of options out there; don't you just love open source? For our analysis we used `uncompyle6`, which can of course be found on the PyPI registry. This is as simple as point and run.
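Decompilers don't always cooperate, so it is worth knowing a fallback. The sketch below is not what we used for this sample; it is a standard-library alternative that loads a pyc with `marshal` and collects every string constant in the code object, nested functions included. It assumes the pyc was produced by the same Python version that runs the script and that the header is 16 bytes long (true for Python 3.7 and later). Unmarshalling does not run the code, but `marshal` is not hardened against malicious input, so keep this inside the analysis VM.

```python
import importlib.util
import marshal
import types

HEADER_SIZE = 16  # pyc header length on Python 3.7+ (assumption stated above)


def load_code(pyc_bytes: bytes):
    """Turn the bytes of a pyc file into a code object without executing it."""
    if pyc_bytes[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc was built by a different Python version")
    return marshal.loads(pyc_bytes[HEADER_SIZE:])


def string_constants(code):
    """Collect string constants from a code object and everything nested in it."""
    found = []
    stack = list(code.co_consts)
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            found.append(item)
        elif isinstance(item, types.CodeType):
            stack.extend(item.co_consts)      # descend into functions and classes
        elif isinstance(item, (tuple, frozenset)):
            stack.extend(item)                # constants can be nested in containers
    return found
```

URLs, webhook paths and wallet-related names tend to show up as plain string constants, so even this crude listing often tells you where to look first.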

Ultimately, this looks like some sort of cryptominer. It reaches out to pastebin and some other sites, it contains many references to crypto wallets, specifically Exodus wallets, and even uses some Discord webhooks for exfiltration and communication.

The feeling here is the reason I love doing this: Putting it all together in an article makes it look fast and simple. And sometimes with enough experience it can certainly be that way. But there is nothing better than banging your head against the wall for a couple days, then getting an epiphany mid-day while doing something completely different and finally coming back to the problem to realize the solution. That feeling of finally understanding everything you were working on is amazing and a great part of the reason why I love to wrap it all nicely in a blog post and share my experience.

We didn't end up going fully into the deep end, just briefly tested the waters and found what we wanted. But there is so much more that can be done depending on the complexity of the malware sample. In this case, we only needed some basic malware analysis to get to the bottom of it but perhaps some reversing with IDA or Ghidra can be next.

All our research in regard to this package is now available in our products and cataloged under Sonatype-2023-0134. Users of Sonatype Repository Firewall can rest easy knowing that such malicious packages would automatically be blocked from reaching their development builds.

If you're not yet a Sonatype customer and want to find out if your code is vulnerable, you can use our free Sonatype Vulnerability Scanner to quickly find out.

Written by Juan Aguirre

Juan is a security researcher at Sonatype and part of the team who has helped Sonatype catalog more than 100 million open source components.