Shellcoding with MSVC
When it comes to shellcode, we don’t normally think of the MSVC toolchain. Maybe we’ll think of MASM, MSVC’s assembler, but we certainly won’t think of the C compiler. We typically prefer NASM, an extremely flexible x86/x64 assembler for binary creatives.
As badass as writing an assembly shellcode payload for Windows can be, the sheer size of the eventual payload makes it tedious and painful. You have to re-roll standard PE file structures by hand and remember the secret sauce of each one. Not to mention that if you don’t use assembly structures, you have to memorize specific offsets into common data structures. Over time, the badass feeling of dropping down to assembly gets drowned out by the sheer tedium of actually writing the assembly payload.
Gone are the classic days of writing everything in assembly; now we have higher-level languages to get our code done. But the cost of being in a higher-level language is the amount of hand-holding the compiler and linker do to our memory layout. It is not exactly clear how we can control how variables are laid out within a program. This continues the tradition of sometimes unclear and incomplete documentation from Microsoft on their flagship operating system and how to develop for it. So I’m going to continue the tradition I’m building on this blog of clarifying and documenting how to do perhaps somewhat questionable things with the MSVC compiler.
Tutorials have been written on this before, but they rely on the user manipulating flags from the GUI. Not to mention there are specific tactics I want to show which are useful for a CMake pipeline. Plus the other tutorial encourages you to turn off optimizations, to which I say “fuck that.” So in this tutorial, I’ll be teaching you how to manipulate the compiler to lay out your memory how you want it and how to prepare the dev environment via CMake.
Shellcode
First of all, what is shellcode? Shellcode is an executable binary blob intended to run directly as some kind of payload. It is typically employed in software exploitation and got its name from the fact that many of these binary blobs simply spawned a command shell, like /bin/bash. Perhaps one of the most famous exploits is the buffer overflow, typically demonstrated by exploiting the C function strcpy.
Each individual kind of shellcode comes with its own challenges. For example, with strcpy exploits, the shellcode has to survive being copied by the strcpy function. strcpy works by iterating over a C string until it hits a null byte. Naturally, this means the shellcode payload cannot have a null byte in it. Compiling an assembly payload without a null byte requires precise assembly manipulation, selecting specific instructions to avoid encoding a null byte anywhere in the payload.
For exploits, shellcoding with MSVC is absolutely idiotic. You simply do not have the precise control you need to craft an exploit payload for the specific constraints your exploit requires. However, there is one style of shellcoding where MSVC comes in handy and genuinely speeds things up: reflective payloads.
Reflective payloads are payloads which can be loaded independently in arbitrary memory situations. Shellcode is a type of reflective payload. Think things like thread injection into an arbitrary process, or the reflective payloads of malware leading to later stages of the infection. Writing your shellcode in C helps avoid the tedium of shellcoding in an assembler at the expense of the extreme precision the assembler gives you. If you don’t need that precision, it’s an absolute godsend.
Coding Tactics
Laying out memory
If you’re familiar with assemblers, you know some of them, like NASM, give you control over the entire layout of your code, freely letting you allocate data in an arbitrary section of the target binary. Without any manipulation, the MSVC compiler will put code and data in the expected sections: .text for code, .data for writable data, .reloc for the relocation directory, and so on. We don’t want sections, though; we want a single blob of binary. In other words, we want a single section. The Coding Project tutorial mentions merging .rdata with .text via linker flags, but I like using a different tactic to lay out the memory space. Enter the section pragma.
As an MSVC coder, you are probably familiar with #pragma once or even #pragma comment. #pragma section allows you to define a section within the PE file generated by your code. The section name only gives you eight characters to work with, so don’t go wild naming your sections. This is especially annoying once you know how sections are ordered by the linker.
If you ever look at a COFF object file with a program like 010 Editor (a highly recommended hex editor), you’ll notice the sections actually seem to have extra text that doesn’t show up in the eventual PE file. This is because the linker organizes the sections by splitting their names on a dollar sign and sorting them alphabetically. It then lays out the sections in the order dictated by those names.
Let’s say the compiler has generated two sections destined for the final .text section: .text$a and .text$b. When the linker eventually builds the binary with these sections, it sorts the section data for .text and lays it out in order, discarding everything past the dollar sign.
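To make that concrete, here’s a tiny illustrative sketch of my own (not part of the sheep payload): both functions end up inside the final .text section, alpha before beta, and the $a/$b suffixes never appear in the PE file.
// illustrative only: the $ suffix controls ordering within the merged .text section
__declspec(code_seg(".text$a")) int alpha(void) { return 1; } // laid out first
__declspec(code_seg(".text$b")) int beta(void) { return 2; }  // laid out second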
With this sorting tactic in mind, we can control how our data and code get laid out within a target section. Thus, for the best control, we can allocate a section with the section pragma and stuff both data and code inside it! Defining functions and data in MSVC with this tactic in mind looks like this:
// FNV-1a hash, used later to resolve exports by name hash
#pragma section(".sc$008", read, execute)
__declspec(code_seg(".sc$008")) uint32_t fnv321a(const char *str) {
    uint32_t hash = 0x811c9dc5;

    while (*str != 0) {
        hash ^= *str;
        hash *= 0x1000193;
        ++str;
    }

    return hash;
}
At this point I should tell you to follow along with the code example, a shellcode payload that puts a little sheep friend on your desktop. This little code snippet will allocate the fnv321a function at .sc$008, just after the handful of functions above it. We unfortunately can’t tell the compiler to make our section RWX with the section pragma, but we have a way around this that we’ll cover later on. To make our shellcode easily executable, we put its main function at the very top of the .sc section stack, at .sc$000.
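Here’s roughly what that top slot might look like (a hedged sketch of my own; the function’s name and signature are placeholders rather than whatever the example repo actually uses):
#pragma section(".sc$000", read, execute)
// sorted to the very start of .sc, so the first byte of the extracted blob is our entrypoint
__declspec(code_seg(".sc$000")) int payload_main(void *image_base) {
    // resolve kernel32 exports by hash, then drop our sheep friend
    return 0;
}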
But what about data? Well, that actually relates to another situation we have to deal with: relocations and relative addressing.
Accessing data
If you’re a Gen Z hacker who initially cut their teeth on x86-64 assembly, you’re probably not aware of the shellcoding limitations of x86-32. See, on x86-64 you’ve got this fancy thing called RIP-relative addressing. That basically means you can place your data anywhere nearby, since the data’s address is computed relative to the address of the instruction accessing it.
Believe it or not, x86-32 kinda-sorta has this. It’s only for call instructions, however, not data accesses. That means using data that wasn’t allocated on the stack results in a relocation, thus rendering your shellcode immobile. Traditionally, we’ve been taught to perform the call/pop technique on an address actually containing our data. We can still do this in MSVC, because x86-32 has an inline assembler, just in a different way. Here we can see the RIP-relative version of accessing the data as well as the x86-32-compliant shellcode tactic:
#if defined(_M_AMD64)
#pragma section(".sc$004", read, execute)
// x86-64: the compiler emits a RIP-relative access, so no relocation is needed
__declspec(code_seg(".sc$004")) char *target_filename(void) {
    return &TARGET_FILENAME[0];
}
#elif defined(_M_IX86)
#pragma section(".sc$004", read, execute)
// x86-32: call/pop to get our own address, then add the size of the remaining
// instructions (5 bytes) to land on TARGET_FILENAME in .sc$005
__declspec(code_seg(".sc$004"), naked) char *target_filename(void) {
    __asm {
        call eip_call
    eip_call:
        pop eax
        add eax, 5
        ret
    }
}
#pragma comment(linker, "/INCLUDE:_TARGET_FILENAME")
#endif

#pragma section(".sc$005", read, execute)
__declspec(allocate(".sc$005")) char TARGET_FILENAME[] = "C:\\ProgramData\\sheep.exe";
Since we know where our data is going to be allocated, and we know where our function is going to be allocated, we can use the assembly tactic of call/pop to get the address of our accessor function and then advance it to the address of our allocated data. Both beneficial and detrimental for us, the compiler doesn’t know we’re actually accessing this allocated data, and the optimizer will wipe it out if we don’t tell it to keep it. So for 32-bit, we use a #pragma comment to tell the linker to include our data. The benefit of writing our naked function this way is that the optimizer doesn’t know how to optimize it, so it keeps the function as-is, meaning we don’t need the include trick to preserve the function itself.
If you’ve been following my PE blogs, or if you’re here because you’re already a Windows shellcoder, you know that the import table lives in its own section. We can’t exactly rely on that section being present with our imports when we execute our payload; we need to dynamically resolve the APIs we need at the top of our shellcode. As is tradition on this blog, here’s some forbidden content: the undocumented headers of the PEB and PEB_LDR_DATA data structures.
typedef struct _PEB_LDR_DATA_EX {
    ULONG Length;
    ULONG Initialized;
    PVOID SsHandle;
    LIST_ENTRY InLoadOrderModuleList;
    LIST_ENTRY InMemoryOrderModuleList;
    LIST_ENTRY InInitializationOrderModuleList;
} PEB_LDR_DATA_EX, *PPEB_LDR_DATA_EX;

typedef struct _PEB_EX {
    BOOLEAN InheritedAddressSpace;
    BOOLEAN ReadImageFileExecOptions;
    BOOLEAN BeingDebugged;
    BOOLEAN Spare;
    HANDLE Mutant;
    PVOID ImageBase;
    PPEB_LDR_DATA_EX LoaderData; // what we walk to enumerate loaded modules
    PVOID ProcessParameters;
    PVOID SubSystemData;
    PVOID ProcessHeap;
    PVOID FastPebLock;
    PVOID FastPebLockRoutine;
    PVOID FastPebUnlockRoutine;
    ULONG EnvironmentUpdateCount;
    PVOID* KernelCallbackTable;
    PVOID EventLogSection;
    PVOID EventLog;
    PVOID FreeList;
    ULONG TlsExpansionCounter;
    PVOID TlsBitmap;
    ULONG TlsBitmapBits[0x2];
    PVOID ReadOnlySharedMemoryBase;
    PVOID ReadOnlySharedMemoryHeap;
    PVOID* ReadOnlyStaticServerData;
    PVOID AnsiCodePageData;
    PVOID OemCodePageData;
    PVOID UnicodeCaseTableData;
    ULONG NumberOfProcessors;
    ULONG NtGlobalFlag;
    BYTE Spare2[0x4];
    LARGE_INTEGER CriticalSectionTimeout;
    ULONG HeapSegmentReserve;
    ULONG HeapSegmentCommit;
    ULONG HeapDeCommitTotalFreeThreshold;
    ULONG HeapDeCommitFreeBlockThreshold;
    ULONG NumberOfHeaps;
    ULONG MaximumNumberOfHeaps;
    PVOID** ProcessHeaps;
    PVOID GdiSharedHandleTable;
    PVOID ProcessStarterHelper;
    PVOID GdiDCAttributeList;
    PVOID LoaderLock;
    ULONG OSMajorVersion;
    ULONG OSMinorVersion;
    ULONG OSBuildNumber;
    ULONG OSPlatformId;
    ULONG ImageSubSystem;
    ULONG ImageSubSystemMajorVersion;
    ULONG ImageSubSystemMinorVersion;
    ULONG GdiHandleBuffer[0x22];
    ULONG PostProcessInitRoutine;
    ULONG TlsExpansionBitmap;
    BYTE TlsExpansionBitmapBits[0x80];
    ULONG SessionId;
} PEB_EX, *PPEB_EX;
We access this data structure with a memory read at [fs:0x30] for 32-bit and [gs:0x60] for 64-bit. We do not have access to an inline assembler on 64-bit, but there are intrinsics provided by Microsoft for reading this memory in particular: __readfsdword and __readgsqword:
#if defined(_M_IX86)
    // 32-bit: the PEB pointer lives in the TEB at fs:[0x30]
    iat()->peb = ((PPEB_EX)__readfsdword(0x30));
#elif defined(_M_AMD64)
    // 64-bit: the PEB pointer lives in the TEB at gs:[0x60]
    iat()->peb = ((PPEB_EX)__readgsqword(0x60));
#endif
With the PEB acquired (and equipped with our Secret Structures That I Can’t For The Life Of Me Remember Where I Copied Them From), we can enumerate the currently loaded modules in whichever program our payload winds up in. We ultimately want to get to kernel32 for some crucial functions. Predictably, kernel32 is the second library loaded into any program, the first being ntdll; the very first entry in the load-order list is the program itself. So we take advantage of InLoadOrderModuleList and InLoadOrderLinks, walking three entries deep to reach kernel32 and stepping over ntdll because we don’t need it.
PPEB_LDR_DATA_EX ldr_ex = (PPEB_LDR_DATA_EX)iat()->peb->LoaderData;

// first entry in the load-order list is the host executable itself...
PLDR_DATA_TABLE_ENTRY_EX list_entry = (PLDR_DATA_TABLE_ENTRY_EX)ldr_ex->InLoadOrderModuleList.Flink;
// ...followed by ntdll...
PLDR_DATA_TABLE_ENTRY_EX ntdll_entry = (PLDR_DATA_TABLE_ENTRY_EX)list_entry->InLoadOrderLinks.Flink;
// ...followed by kernel32, whose base address is what we're after
PLDR_DATA_TABLE_ENTRY_EX kernel32_entry = (PLDR_DATA_TABLE_ENTRY_EX)ntdll_entry->InLoadOrderLinks.Flink;
uint8_t *kernel32 = (uint8_t *)kernel32_entry->DllBase;
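The walk above also leans on a PLDR_DATA_TABLE_ENTRY_EX type that isn’t reproduced here. A minimal sketch covering just the fields we touch (the usual semi-documented layout, with UNICODE_STRING coming from winternl.h) looks like this; the cast from the list’s Flink works because InLoadOrderLinks sits at offset zero:
typedef struct _LDR_DATA_TABLE_ENTRY_EX {
    LIST_ENTRY InLoadOrderLinks;           // at offset 0, so the Flink casts above are valid
    LIST_ENTRY InMemoryOrderLinks;
    LIST_ENTRY InInitializationOrderLinks;
    PVOID DllBase;                         // module base address
    PVOID EntryPoint;
    ULONG SizeOfImage;
    UNICODE_STRING FullDllName;
    UNICODE_STRING BaseDllName;
    // ...plenty more fields we don't need here...
} LDR_DATA_TABLE_ENTRY_EX, *PLDR_DATA_TABLE_ENTRY_EX;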
We can then take advantage of another technique I’ve covered and load our target exports by hash. In our example, we only need two functions. LoadLibraryA was needed for a previous version of the payload that attempted to use URLDownloadToFileA, but something’s up with that function for some reason.
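The hash lookup itself isn’t reproduced in the post, but a rough reconstruction of the idea (my own sketch: it reuses the fnv321a function from earlier and the standard winnt.h export structures, ignores forwarded exports, and in the real payload would also need a code_seg declspec so it lands inside .sc) looks like this:
#include <windows.h>
#include <stdint.h>

// Sketch: resolve an export from an in-memory module by the FNV-1a hash of its name.
static void *find_export_by_hash(uint8_t *module, uint32_t name_hash) {
    PIMAGE_DOS_HEADER dos = (PIMAGE_DOS_HEADER)module;
    PIMAGE_NT_HEADERS nt = (PIMAGE_NT_HEADERS)(module + dos->e_lfanew);
    IMAGE_DATA_DIRECTORY dir = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    PIMAGE_EXPORT_DIRECTORY exports = (PIMAGE_EXPORT_DIRECTORY)(module + dir.VirtualAddress);
    DWORD *names = (DWORD *)(module + exports->AddressOfNames);
    WORD *ordinals = (WORD *)(module + exports->AddressOfNameOrdinals);
    DWORD *functions = (DWORD *)(module + exports->AddressOfFunctions);

    for (DWORD i = 0; i < exports->NumberOfNames; ++i)
        if (fnv321a((const char *)(module + names[i])) == name_hash)
            return module + functions[ordinals[i]]; // forwarded exports not handled

    return NULL;
}
Something like find_export_by_hash(kernel32, fnv321a("VirtualAlloc")) then hands back a callable pointer without ever touching the import table.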
Dealing with the compiler
We know how to lay out our code and data now, but what about compiler options? There’s a reason a lot of shellcoding tutorials tell you to turn off various compiler flags when doing this: the compiler likes to include a lot of things that are actually detrimental to shellcoding, and might even optimize away your code and data! For example, we need our shellcode section to be read-write-execute for testing purposes. Additionally, in debug mode, without adjusting the default flags, memory won’t be laid out in the positions we predicted, and functions get reached through incremental-linking jump thunks we cannot control. We set the /INCREMENTAL:NO flag to make the memory layout sane again.
# setting the shellcode section RWX makes testing easier
# disable incremental linking to make the memory map similar between debug and release builds
target_link_options(shellcode PUBLIC "/SECTION:.sc,RWE" "/INCREMENTAL:NO")
Let’s talk about the extra code that MSVC includes. As reversers know, in modern Windows binaries, main is not the same as the entrypoint of the binary. The entrypoint eventually calls the main routine, but they are not the same function, because the C runtime prepares the environment first. Part of that runtime is making sure the stack isn’t smashed. The stack-check machinery inserts calls to functions we cannot relocate into our own functions, so we need to remove it by adding the /GS- switch to the compiler. In addition to this, we need to remove runtime checks (the /RTC switch) from the debug compiler flags, because those too create additional function calls nested in our code which we cannot control. There’s no switch to just turn them off, so we have to zap them out of CMake’s default flags for Debug mode. For 32-bit, we also need to turn off SSE/XMM code generation, because those optimizations create relocations and data outside our shellcode segment.
# disable stack checks because we can't relocate those functions
target_compile_options(shellcode PUBLIC "/GS-")

# remove runtime checks from the build flags
string(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_C_FLAGS "${CMAKE_C_FLAGS}")
string(REGEX REPLACE "/RTC(su|[1su])" "" CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG}")

if(CMAKE_GENERATOR_PLATFORM STREQUAL "Win32")
    target_compile_options(shellcode PUBLIC "/arch:IA32") # disable xmm registers
    set(SHELLCODE_TEST_VAR "_SHELLCODE_DATA")
else()
    set(SHELLCODE_TEST_VAR "SHELLCODE_DATA")
endif()
Annoyingly for us, MSVC can identify when you’re essentially writing a memset routine and will helpfully insert a reference to the memset C runtime symbol. This again injects a function outside our shellcode segment, so we need to implement and override the memset symbol that gets injected. I believe these are called intrinsics. More investigation might be required to see what other intrinsics get injected into your code.
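One way to deal with it (a sketch under my own assumptions; the $-slot name is made up, and this isn’t necessarily how the example repo handles it) is to define memset yourself inside the shellcode section and use #pragma function so the compiler emits real calls instead of its intrinsic form:
#include <stddef.h>

#pragma function(memset)                  // emit real calls to memset instead of the intrinsic form
#pragma section(".sc$007", read, execute) // hypothetical slot inside our shellcode section
__declspec(code_seg(".sc$007")) void *memset(void *dest, int value, size_t count) {
    unsigned char *bytes = (unsigned char *)dest;

    while (count--)
        *bytes++ = (unsigned char)value;

    return dest;
}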
Extracting and testing the shellcode
You’ll note that I built the shellcode binary as a DLL. Making our shellcode payload a DLL export keeps it isolated from the main routine the C runtime wires up for the host binary. In addition to keeping the C runtime from messing with our shellcode, we force the optimizer to keep our main shellcode function without any muckery: an exported symbol has an implicit reference simply by being accessible to other programs. This also gives us alternative execution opportunities with our shellcode; perhaps you want to load the DLL with PowerShell and call the shellcode export?
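One way that might look (a hedged sketch under my naming assumptions, not necessarily the repo’s exact declarations): the payload entry from the earlier sketch gains a dllexport, while DllMain stays in the ordinary .text section and never makes it into the extracted blob.
#include <windows.h>

// hypothetical name; the export gives the payload entry an implicit reference
#pragma section(".sc$000", read, execute)
__declspec(dllexport) __declspec(code_seg(".sc$000")) int payload_main(void *image_base) {
    // ...payload body as before...
    return 0;
}

// DllMain lives in the default .text section, outside the extracted .sc blob
BOOL WINAPI DllMain(HINSTANCE instance, DWORD reason, LPVOID reserved) {
    return TRUE;
}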
The extraction part is easy. If you look at the section table of the compiled shellcode DLL, you’ll see our .sc section sitting alongside the usual suspects. That section contains all our shellcode’s code and data, and simply needs to be copied into a bin file. We add the following program to our build pipeline like so:
add_custom_command(TARGET shellcode
    POST_BUILD
    COMMAND "$<TARGET_FILE:extract_section>" "$<TARGET_FILE:shellcode>" .sc "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/shellcode.bin" "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/shellcode_gen.h"
    VERBATIM)
This will extract our shellcode into a binary file for further processing. Now we want to include our shellcode in another program. There’s a rather interesting tool called binobj which converts a binary file into an .obj file for inclusion in a C program. We use binobj in the pipeline as well, eventually linking the result into our test program.
add_custom_command(TARGET shellcode
    POST_BUILD
    COMMAND "$<TARGET_FILE:binobj>" "${SHELLCODE_TEST_VAR}" 1 "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/shellcode.bin" "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/shellcode.obj"
    VERBATIM)

enable_testing()

add_executable(test_shellcode test.c)
target_include_directories(test_shellcode PUBLIC
    "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>")
add_dependencies(test_shellcode shellcode)
target_link_libraries(test_shellcode "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/shellcode.obj")
add_test(NAME test_shellcode COMMAND test_shellcode)
We then proceed to test the shellcode by reflectively loading it and executing it.
#include <windows.h>
#include <stdint.h>
#include <string.h>

extern uint8_t SHELLCODE_DATA[]; // provided by the binobj-generated object file

int main(int argc, char *argv[]) {
    // SHELLCODE_SIZE presumably comes from the generated shellcode_gen.h header
    uint8_t *valloc_buffer = VirtualAlloc(NULL, SHELLCODE_SIZE, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (valloc_buffer == NULL)
        return 1;

    // copy the extracted blob into the RWX buffer and call it like a function
    memcpy(valloc_buffer, &SHELLCODE_DATA[0], SHELLCODE_SIZE);
    return ((int (*)(LPVOID))valloc_buffer)(GetModuleHandle(NULL)) != 0;
}
We then verify the shellcode works by running ctest -C Debug or ctest -C Release to see if the shellcode was properly lifted from the binary and doesn’t have any relocation or execution issues. Et voilà, you have successfully written a reflective payload on Windows without writing assembly.