Tracking Malware with Import Hashing
Mandiant
Written by: Mandiant
Tracking threat groups over time is an important tool to help defenders hunt for evil on networks and conduct effective incident response. Knowing how certain groups operate makes for an efficient investigation and assists in easily identifying threat actor activity.
At Mandiant, we utilize several methods to help identify and correlate threat group activity. A critical piece of our work involves tracking various operational items such as attacker infrastructure and email addresses. In addition, we track the specific backdoors each threat group utilizes - one of the key ways to follow a group's activities over time. For example, some groups may favor the SOGU backdoor, while others use HOMEUNIX.
One unique way that Mandiant tracks specific threat groups' backdoors is to track portable executable (PE) imports. Imports are the functions that a piece of software (in this case, the backdoor) calls from other files (typically various DLLs that provide functionality to the Windows operating system). To track these imports, Mandiant creates a hash based on library/API names and their specific order within the executable. We refer to this convention as an "imphash" (for "import hash"). Because of the way a PE's import table is generated (and therefore how its imphash is calculated), we can use the imphash value to identify related malware samples. We can also use it to search for new, similar samples that the same threat group may have created and used.
Though Mandiant has been leveraging this technique for well over a year internally, we aren't the first to publicly discuss this. An imphash is a powerful way to identify related malware because the value itself should be relatively unique. This is because the compiler's linker generates and builds the Import Address Table (IAT) based on the specific order of functions within the source file. Take the following example source code:
#include
#include
#include
#include
#pragma comment(lib, "ws2_32.lib")
#pragma comment(lib, "wininet.lib")
int makeMutexA()
{
CreateMutexA(NULL, FALSE, "TestMutex");
return 0;
}
int makeMutexW()
{
CreateMutexW(NULL, FALSE, L"TestMutex");
return 0;
}
int makeUserAgent()
{
HANDLE hInet=0, hConn=0;
char buf[sizeof(struct hostent)] = {0};
hInet = InternetOpenA("User-Agent: (Windows; 5.1)", INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
hConn = InternetConnectA(hInet, "www.google.com", 443, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
WSAAsyncGetHostByName(NULL, 3, "www.yahoo.com", buf, sizeof(struct hostent));
return 0;
}
int main(int argc, char *argv[])
{
makeMutexA();
makeMutexW();
makeUserAgent();
return 0;
}
When that source file is compiled, the resulting import table looks as follows:
ws2_32.dll
ws2_32.dll.WSAAsyncGetHostByName
wininet.dll
wininet.dll.InternetOpenAwininet.dll.InternetConnectA
kernel32.dll
kernel32.dll.InterlockedIncrement
kernel32.dll.IsProcessorFeaturePresent
kernel32.dll.GetStringTypeW
kernel32.dll.MultiByteToWideChar
kernel32.dll.LCMapStringW
kernel32.dll.CreateMutexA kernel32.dll.CreateMutexW
kernel32.dll.GetCommandLineA
kernel32.dll.HeapSetInformation
kernel32.dll.TerminateProcess
Imphash: 0c6803c4e922103c4dca5963aad36ddf
We abbreviated the table to save space, but the bolded APIs are the ones referenced in the source code. Note the order in which they appear in the table, and compare that to the order in which they appear in the source file.
If an author were to change the order of the functions and/or the order of the API calls in the source code, this would in turn affect the compiled import table. Take the previous example, modified:
#include
#include
#include
#include
#pragma comment(lib, "ws2_32.lib")
#pragma comment(lib, "wininet.lib")
int makeMutexW()
{
CreateMutexW(NULL, FALSE, L"TestMutex");
return 0;
}
int makeMutexA()
{
CreateMutexA(NULL, FALSE, "TestMutex");
return 0;
}
int makeUserAgent()
{
HANDLE hInet=0, hConn=0;
char buf[sizeof(struct hostent)] = {0};
hConn = InternetConnectA(hInet, "www.google.com", 443, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
hInet = InternetOpenA("User-Agent: (Windows; 5.1)", INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
WSAAsyncGetHostByName(NULL, 3, "www.yahoo.com", buf, sizeof(struct hostent));
return 0;
}
int main(int argc, char *argv[])
{
makeMutexA();
makeMutexW();
makeUserAgent();
return 0;
}
In this example, we have reversed the order of makeMutexW and makeMutexA, and of InternetConnectA and InternetOpenA. (Note that this would be an invalid sequence of API calls, but we use it here to illustrate the point.) Below is the import table generated from this modified source code (again abbreviated); note the changes when compared to the original IAT, above, as well as the different imphash value:
ws2_32.dll
ws2_32.dll.WSAAsyncGetHostByName
wininet.dll
wininet.dll.InternetConnectAwininet.dll.InternetOpenA
kernel32.dll
kernel32.dll.InterlockedIncrement
kernel32.dll.IsProcessorFeaturePresent
kernel32.dll.GetStringTypeW
kernel32.dll.MultiByteToWideChar
kernel32.dll.LCMapStringW
kernel32.dll.CreateMutexWkernel32.dll.CreateMutexA
kernel32.dll.GetCommandLineA
kernel32.dll.HeapSetInformation
kernel32.dll.TerminateProcess
Imphash: b8bb385806b89680e13fc0cf24f4431e
The final example shows how the ordering of included files at compile time will affect the resulting IAT (and thus the resulting imphash value). We'll expand on our original example by adding files imphash1.c
and imphash2.c
, to be included with our original source file imphash.c
:
-- imphash1.c --
int makeNamedPipeA()
{
HANDLE ph = CreateNamedPipeA("\\.\pipe est_pipe", PIPE_ACCESS_DUPLEX,
PIPE_TYPE_MESSAGE, 1, 128,
64, 200, NULL);
return 0;
}
-- imphash2.c --
int makeNamedPipeW()
{
HANDLE ph2 = CreateNamedPipeW(L"\\.\pipe est_pipeW", PIPE_ACCESS_DUPLEX,
PIPE_TYPE_MESSAGE, 1, 128,
64, 200, NULL);
return 0;
}
-- imphash.c --
#include
#include
#include
#include
#include "imphash1.h"
#include "imphash2.h"
#pragma comment(lib, "ws2_32.lib")
#pragma comment(lib, "wininet.lib")
int makeMutexW()
{
CreateMutexW(NULL, FALSE, L"TestMutex");
return 0;
}
int makeMutexA()
{
CreateMutexA(NULL, FALSE, "TestMutex");
return 0;
}
int makeUserAgent()
{
HANDLE hInet = 0, hConn = 0;
char buf[sizeof(struct hostent)] = {0};
hConn = InternetConnectA(hInet, "www.google.com", 443, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
hInet = InternetOpenA("User-Agent: (Windows; 5.1)", INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
WSAAsyncGetHostByName(NULL, 3, "www.yahoo.com", buf, sizeof(struct hostent));
return 0;
}
int main(int argc, char *argv[])
{
makeMutexA();
makeMutexW();
makeUserAgent();
makeNamedPipeA();
makeNamedPipeW();
return 0;
}
Using the following command to build the EXE:
cl imphash.c imphash1.c imphash2.c /W3 /WX /link
The resulting IAT is:
ws2_32.dll
ws2_32.dll.WSAAsyncGetHostByName
wininet.dll
wininet.dll.InternetConnectA
wininet.dll.InternetOpenA
kernel32.dll
kernel32.dll.TlsFree
kernel32.dll.IsProcessorFeaturePresent
kernel32.dll.GetStringTypeW
kernel32.dll.MultiByteToWideChar
kernel32.dll.LCMapStringW
kernel32.dll.CreateMutexW
kernel32.dll.CreateMutexA
kernel32.dll.CreateNamedPipeAkernel32.dll.CreateNamedPipeW
kernel32.dll.GetCommandLineA
kernel32.dll.HeapSetInformation
kernel32.dll.TerminateProcess
Imphash: 9129bdbc18cfd1aba498c94e809567d5
Changing the order of includes for imphash1.h
and imphash2.h
within the source file imphash.c
will have no effect on the ordering of the IAT. However, changing the order of the files on the command line and recompiling will affect the IAT; note the re-ordering of CreateNamedPipeW
and CreateNamedPipeA
:
cl imphash.c imphash2.c imphash1.c /W3 /WX /link
ws2_32.dll
ws2_32.dll.WSAAsyncGetHostByName
wininet.dll
wininet.dll.InternetConnectA
wininet.dll.InternetOpenA
kernel32.dll
kernel32.dll.TlsFree
kernel32.dll.IsProcessorFeaturePresent
kernel32.dll.GetStringTypeW
kernel32.dll.MultiByteToWideChar
kernel32.dll.LCMapStringW
kernel32.dll.CreateMutexW
kernel32.dll.CreateMutexA
kernel32.dll.CreateNamedPipeWkernel32.dll.CreateNamedPipeA
kernel32.dll.GetCommandLineA
kernel32.dll.HeapSetInformation
kernel32.dll.TerminateProcess
Imphash: c259e28326b63577c31ee2c01b25d3fa
These examples show that both the ordering of functions within the original source code - as well as the ordering of source files at compile time - will affect the resulting IAT, and therefore the resulting imphash value. Because the source code is not organized the same way, two different binaries with exactly the same imports are highly likely to have different import hashes. Conversely, if two files have the same imphash value, they have the same IAT, which implies that the files were compiled from the same source code, and in the same manner.
For packed samples, simple tools or utilities (with few imports and, based on their simplicity, likely compiled in the same way), the imphash value may not be unique enough to be useful for attribution. In other words, it may be possible for two different threat actors to independently generate tools with the same imphash based on those factors.
However, for more complex and/or custom tools (like backdoors), where there are a sufficient number of imports present, the imphash should be relatively unique, and can therefore be used to identify code families that are structurally similar. While files with the same imphash are not guaranteed to originate from the same threat group (it's possible, for example, for the files were generated by a common builder that is shared among groups) the files can at least be reasonably assumed to have a common origin and may eventually be attributable to a single threat group with additional corroborating information.
Employing this method has given us great success for verifying attacker backdoors over a period of time and demonstrating relationships between backdoors and their associated threat groups.
Mandiant has submitted a patch that enables the calculation of the imphash value for a given PE to Ero Carrera's pefile (http://code.google.com/p/pefile/).
Example code:
import pefile
pe = pefile.PE(sys.argv[1])
print "Import Hash: %s" % pe.get_imphash()
Mandiant uses an imphash convention that requires that the ordinals for a given import be mapped to a specific function. We've added a lookup for a couple of DLLs that export functions commonly looked up by ordinal to pefile.
Mandiant's imphash convention requires the following:
- Resolving ordinals to function names when they appear
- Converting both DLL names and function names to all lowercase
- Removing the file extensions from imported module names
- Building and storing the lowercased string . in an ordered list
- Generating the MD5 hash of the ordered list
This convention is implemented in pefile.py version 1.2.10-139 starting at line 3618.
If imphash values serve as relatively unique identifiers for malware families (and potentially for specific threat groups), won't discussing this technique alert attackers and cause them to change their methods? Attackers would need to modify source code (in a way that did not affect the functionality of the malware itself) or change the file order at compile time (assuming the source code is spread across multiple files). While attackers could write tools to modify the imphash, we don't expect many attackers to care enough to do this.
We believe it is important to add imphash to the lexicon as a way to discuss malware samples at a higher level and to exchange information about attackers and threat groups. For example, incident responders can use imphash values to discuss malware without specifically disclosing which exact sample (specific MD5) is being discussed.
Consider a scenario where an attacker compiles 30 variants of its backdoor with different C2 locations and campaign IDs and deploys them to various companies. If a blog post comes out stating that a specific MD5 was identified as part of a campaign, then based on that MD5 the attacker immediately knows what infrastructure (such as C2 domains or associated IP addresses) is at stake and which campaign may be in jeopardy. However, if the malware was identified just by its imphash value, it is possible that the imphash is shared across all 30 of the attacker's variants. The malware is still identifiable by and can be discussed within the security community, but the attacker doesn't know which specific samples have been identified or which parts of their infrastructure are in jeopardy.
To demonstrate the effectiveness of this analysis method, we've decided to share the imphash values of a few malware families from the Mandiant APT1 report:
We calculated the above malware families and corresponding imphash values over the set of malware from the Mandiant APT1 report released in February 2013. Using the imphash method described above, we calculated imphash values over all the samples, and then counted the total number of samples that matched on each imphash. Using 356 total samples from the report, we were able to identify 11 imphash values that provided significant coverage of their respective families. Pivoting from these imphash values, we were able to identify additional malware samples that further analysis showed were part of the same malware families and attributable to the same threat group.
Imphash analysis, like any other method, has its limitations and should not be considered a single point of success. Just because two binaries have the same imphash value does not mean they belong to the same threat group, or even that they are part of the same malware family (though there is an increased likelihood that this is the case). Imphash analysis is a low-cost, efficient and valuable way to triage potential malware samples and expand discovery by identifying "interesting" samples that merit further analysis. The imphash value gives analysts another pivot point when conducting discovery on threat groups and their tools. Employing this method can also yield results in tracking and verifying attacker backdoors over time, and it can assist in exposing relationships between backdoors and threat groups. Happy Hunting!