A Quick Look at Ghidra BSim

On December 22, 2023, the Ghidra project released version 11 of the Ghidra Software Reverse Engineering platform. While reading through the Release Notes, I noticed a “major new feature”: BSim, which promises to offer a scalable system for static software comparative analysis within Ghidra. This intersects nicely with my own graduate research at the University of Cincinnati, so I decided to give it a whirl over the break.

What is BSim?

An Extension for Ghidra

If you aren’t yet familiar with Ghidra, it is a Software Reverse Engineering tool, released by the United States National Security Agency (NSA), that assists with static and dynamic analysis of compiled software, often focusing on cases where you (the analyst) may not have access to the original source code or documentation. In addition to the link above, my malware analysis course has multiple lectures and notes covering Ghidra.

BSim is a new feature for Ghidra that is implemented as an extension module paired with a new dedicated server component (that is independent of the “Ghidra Server” functionality, if you are already familiar with that).

Capabilities of BSim

BSim analyzes each identified function within your binaries and produces a “feature vector” for it, based largely upon Ghidra’s P-Code representation of the function plus a few other observations. P-Code is the generic pseudo-assembly language into which Ghidra translates every machine language it supports; it is the “secret sauce” under the hood that makes Ghidra unique and powerful. Because BSim is built largely on this representation, it can not only search for identical or similar (slightly modified) functions across your library, but can also potentially discover matches between binaries that were compiled for different architectures. For instance, if a malware author used the same C++ code for an encrypted communication channel in malware targeting Windows on AMD64, and then reused that code in different malware for an ARM64 macOS system, in theory BSim should be able to help identify this potential relationship.
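To make the idea of comparing feature vectors a little more concrete, here is a toy sketch that scores two small vectors with cosine similarity, a common way to compare vectors of this kind. The numbers and the cosine_sim helper are made up for illustration; BSim's real vectors are derived from P-Code, and its actual scoring and weighting are not reproduced here.

```shell
#!/usr/bin/env bash
# Toy illustration only: cosine similarity between two space-separated vectors.
# cosine_sim is a hypothetical helper, not part of Ghidra or BSim.
cosine_sim() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m || n == 0) {
      print "error: vectors must have the same non-zero length" > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]   # dot product
      na  += x[i] * x[i]   # squared length of the first vector
      nb  += y[i] * y[i]   # squared length of the second vector
    }
    if (na == 0 || nb == 0) { print "0.000"; exit 0 }
    printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine_sim "1 2 0 3" "1 2 0 3"   # identical "functions"
cosine_sim "1 2 0 3" "1 2 1 2"   # a slightly modified "function"
```

Identical vectors score at the top of the range and small modifications pull the score down only slightly, which is the property that lets a similarity search surface near-duplicates rather than just exact copies.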

BSim accomplishes this by generating signatures for each function within a Ghidra project’s binaries and storing them in a database. In the case of an Elasticsearch back-end, the database needs a plugin installed that can perform the similarity comparisons across the feature vectors generated by BSim. To generate signatures for additional binaries and commit them into the database, a bsim command-line utility is provided. The Ghidra UI, in turn, can be used to perform context-aware queries against this database, using the currently open analysis project(s) and a configured connection to a running BSim server.

The Ghidra Project has a short tutorial that is worth following for exploring BSim’s various features. The tutorial has you create a small local “H2” database, which is sufficient for smaller-scale libraries and comparisons, but won’t scale as well as the other two options, PostgreSQL and Elasticsearch (nor will it be as suitable for sharing/collaboration).

Set Up BSim Elasticsearch Database

I first attempted to get this project working via the PostgreSQL instructions and bsim_ctl that are documented in the Ghidra Help system’s BSim section. Unfortunately, following the instructions there resulted in a PostgreSQL build that would crash immediately after starting, so I pivoted to the Elasticsearch back-end instead, and so far that one has been working fairly well. An important caveat, however, is that the BSim capability still requires a special plugin to be installed in Elasticsearch, named lsh.zip, which ships within the ZIP archive that Ghidra is distributed in (actually, it’s inside a ZIP file which is inside that ZIP).

To ease all of this for a demonstration, I have created a GitHub repository, ghidra-bsim-elastic, with some scripts and Docker recipes to help bootstrap a functioning Elasticsearch-based BSim database.

There are more detailed instructions in its README, but a simple start is to clone the repository, modify the default .env (if needed) and run the setup script:

git clone https://github.com/ckane/ghidra-bsim-elastic.git
cd ghidra-bsim-elastic
cp .env.sample .env

For basic usage, you can simply use the default .env. However, open it in your editor and decide whether you want to change anything, such as the listen port for BSim’s Elasticsearch back-end if it would conflict with another Elasticsearch instance you’re already running. Once satisfied, run the setup script and follow its prompts: you will be asked to provide the generated superuser password in order to create the BSim database, and, if this isn’t your first time running the script, you may be asked to confirm potentially destructive overwrite actions.

./setup_bsim_elastic.sh

Once the setup script completes without errors (it should take a few minutes), there should be a new container running Elasticsearch and listening on the port specified as ELASTIC_PORT in .env. By default, this is port 9200, which is the standard Elasticsearch port. This instance will have a new database and index named bsim, ready to receive malware signatures.
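A quick way to confirm the container is up is to read the port back out of .env and ask Elasticsearch for its cluster health. The sketch below is only that: env_value and elastic_url are hypothetical helpers, it assumes .env holds simple KEY=value lines, and it assumes the instance is served over HTTPS on localhost.

```shell
#!/usr/bin/env bash
# Read a KEY=value entry from an env file (assumes simple, unquoted values).
# env_value is a hypothetical helper, not part of the repository.
env_value() {
  local file="$1" key="$2"
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Build the base URL for the Elasticsearch container.
# HTTPS and localhost are assumptions; adjust them if your instance differs.
elastic_url() {
  local port="${1:-9200}"
  printf 'https://localhost:%s\n' "$port"
}

# Usage (run from the repository directory once setup has finished):
#   port="$(env_value .env ELASTIC_PORT)"
#   curl -k -u elastic "$(elastic_url "$port")/_cluster/health?pretty"
# curl prompts for the elastic password; -k skips certificate verification,
# which is only reasonable for a local demo with a self-signed certificate.
```

If the health request returns a JSON document rather than a connection error, the back-end is listening where Ghidra will expect it.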

Create Ghidra User

Ghidra’s UI will use authentication to access the BSim database, and this authentication layer is actually managed by Elasticsearch, not BSim or Ghidra. You will want to create a new Elasticsearch user with privileges to modify the bsim database, and give it a password. A script named add_user.sh has been provided to create these users. To create a new user with a specific password:

./add_user.sh username secretpw

The above will create a new user named username with a password secretpw in the Elasticsearch instance that contains the BSim database. This user will be granted the superuser Elasticsearch role, so that the user may read from the database as well as contribute new samples to it. More restrictive and complex administrative roles can be created within Elasticsearch, if desired, but that is beyond the scope of this post.

Once created, the above user credentials can be used within the BSim feature in Ghidra to connect to a new BSim database. The command-line scripts in my GitHub repository will continue to use the elastic user. However, Ghidra by default will assume that the username of your logged-in user is the one to use for all authentication, so make sure to create a username matching the one your Ghidra UI uses.
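Since the account name needs to line up with what Ghidra sends, one option is to derive it from your login name instead of typing it by hand. The following is a minimal sketch under that assumption: bsim_username and create_bsim_user are hypothetical wrappers around the repository's add_user.sh, and the password still ends up on add_user.sh's command line, as in the example above.

```shell
#!/usr/bin/env bash
# Pick the account name to create: an explicit argument if one is given,
# otherwise the current OS login name (assumed to be what Ghidra will use).
bsim_username() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  else
    id -un
  fi
}

# Hypothetical wrapper around add_user.sh: prompts for the password without
# echoing it, then creates the Elasticsearch user.
create_bsim_user() {
  local user pw
  user="$(bsim_username "${1:-}")"
  read -r -s -p "Password for ${user}: " pw && echo
  ./add_user.sh "$user" "$pw"
}

# Usage (from the repository directory):
#   create_bsim_user            # uses your login name
#   create_bsim_user analyst1   # uses an explicit name
```

To check the new credentials before switching to Ghidra, a request such as curl -k -u yourname "https://localhost:9200/_security/_authenticate?pretty" should return the user's details, again assuming HTTPS on the default port.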

Importing Samples

I have also created a script to ease the workload of ingesting large folders of malware samples. Like the setup script, it will display the elastic user’s password to you and ask you to manually enter it in order to proceed. VirusShare offers some great freely-available sample sets that are categorized (APT1, Loki, Mediyes, Zeus, Locker, etc.), which make great candidates for this kind of analysis. The VirusShare Torrent Tracker provides torrent links to download all of these. A great example set is the APT1 corpus, which has some corresponding malware analysis reporting from Mandiant. If the direct corpus link above fails to work, visit the earlier “tracker” link and look for it in the generated list.

For this exercise, we will assume that the VirusShare_APT1_293.zip has successfully been downloaded.

mkdir -p vxshare_apt1
cd vxshare_apt1
unzip ../VirusShare_APT1_293.zip
cd ..
./add_samples.sh apt1 ./vxshare_apt1

The above will import each of the samples, one at a time, into a new project named apt1 within the bsim_projects workspace directory. Clearly, with 293 samples, this will take some time. After all have been imported into the apt1 Ghidra project, the bsim command to generate and commit signatures into the database will be performed, and the script will ask for the elastic user’s password to be manually entered again (similar to the setup script). Each sample’s feature vector will be computed and stored in the bsim database, which will be another long-running ingest process. Once it is complete, the data set is available to use from within Ghidra.
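Because the import and ingest are both long-running, it can be worth confirming that the extraction produced the number of files you expect before starting. This is a small sketch; count_samples is a hypothetical helper, and the directory name simply matches the example above.

```shell
#!/usr/bin/env bash
# Count the regular files under a directory (one per extracted sample).
# count_samples is a hypothetical helper, not part of the repository.
count_samples() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Usage before importing:
#   count_samples ./vxshare_apt1
#
# Usage after ingest, to list the indices Elasticsearch now holds
# (assumes HTTPS on the default port; curl prompts for the password):
#   curl -k -u elastic "https://localhost:9200/_cat/indices?v"
```

If the count is far from what the corpus name suggests, it is cheaper to sort that out now than after the import has run.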

In addition to the above, the scripts ImportAllProgramsFromADirectoryScript.java and GenerateSignatures.java can be used from the GUI to accomplish these same tasks, though with a bit more manual effort. The BSim tutorial can introduce you to using these. The File->Batch Import operation within the UI can also help with importing large sets of binaries.

Enable BSim Within Ghidra

After completing the earlier steps, there should be a ./bsim_projects/apt1.gpr file created. The setup script run earlier downloads and installs a copy of Ghidra into ./ghidra_11.0_PUBLIC that can be used. Any personal installation of Ghidra stored elsewhere can also be used, so long as it is version 11.0 or later. The copy installed in the repository directory will continue to be used for all command-line helper scripts, however.

After opening up Ghidra, use File->Open Project to bring up the dialog and navigate to the apt1.gpr project file mentioned in the previous paragraph. Once opened, it should show up in the Active project listing, similar to below (note that due to choice of project name, and also some prior work, the list might not look exactly like this):

Ghidra Active Project showing apt1 project

Double-clicking on one of the VirusShare_* items in the list will bring it up in the Ghidra CodeBrowser. Within the CodeBrowser, select File->Configure and it will bring up a dialog. Check the checkbox for BSim, and then click its Configure link to bring up the BSim Plugins dialog. Check all the checkboxes available there, in order to enable the most BSim features. Once done, click OK to close the plugin selector, then Close on the Configure Tool.

Ghidra Configure Tool with BSim Selected

BSim Plugins Selector

Once this is all done, a new BSim menu should be present in the menu bar of the CodeBrowser.

BSim Menu in CodeBrowser

Choosing BSim->Perform Overview will give a high-level summary of the functions within the binary that have at least one potential match elsewhere. The BSim system will prompt for the BSim Server. If the connection to the Elasticsearch BSim server hasn’t been established yet, clicking the “cog” icon button next to the BSim server drop-down will bring up another dialog allowing it to be added by giving the hostname (localhost, for the demo), port, type of database (Elastic), and the username and password created earlier. Once added, the database will be available in all BSim drop-down lists. Setting the similarity threshold to something lower than the default of 0.7 can surface more comparison results, at the cost of more false matches.
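The effect of the threshold is easy to picture with a handful of scores. The sketch below filters a made-up list of (function, similarity) pairs at two cut-offs; the function names, the scores, and the filter_matches helper are all hypothetical and only illustrate the trade-off, not how BSim itself applies the setting.

```shell
#!/usr/bin/env bash
# Keep only the candidates whose similarity score meets the threshold.
# Input on stdin: one "name score" pair per line. Hypothetical helper.
filter_matches() {
  awk -v t="$1" '$2 + 0 >= t + 0 { print $1 }'
}

# Made-up candidate matches for illustration.
matches='FUN_00401000 0.93
FUN_00401a20 0.74
FUN_00402b10 0.61
FUN_00403c00 0.42'

echo "threshold 0.7:"
printf '%s\n' "$matches" | filter_matches 0.7
echo "threshold 0.5:"
printf '%s\n' "$matches" | filter_matches 0.5
```

The lower cut-off admits an extra candidate that the default would have hidden, which may be a genuine variant or may be noise; that is the judgment call the threshold hands back to the analyst.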

Overview Analysis

After choosing BSim->Perform Overview and providing the necessary inputs, a table will be displayed that summarizes each of the functions within the binary that appear to have a match within the bsim dataset. It displays the address of the function in the program’s virtual memory, the name/label of the function in Ghidra, and the number of hits within the data set.

BSim Program Overview

Function Match Analysis

Right-click on any of the rows, and a menu pops up that allows the user to Search Selected Functions. Choosing this option brings up another dialog that is a function-context list of all the matches (in other words, it zooms in on the highlighted row). This view is split in half, with the top pane listing the different functions that are similar, and a lower pane that lists the binaries that have similar functions, but summarizes multiple hits from the same binary into a single row. The similarity score describes how similar the feature vector is, while the confidence score is used to rank which similarities are considered higher-confidence.

BSim Function Matches

Side-by-side Function Comparison

Selecting one of the rows that isn’t a 100% similarity match, and right-clicking, provides access to the Compare Functions view (shortcut is SHIFT+C). In this view, both functions are shown side-by-side, with highlights where differences are identified. The default view is the decompiler pseudo-C output, but the view can be switched to disassembly by selecting the Listing View tab at the top of the window. This view is useful for quickly confirming such differences. In the example below, with the highlighted code in blue, it could be deduced that the code on the right might be a modified derivative of the code on the left.

BSim Function Comparison