Stack Cutting
This is a Simple Loader (Hooking) module to push sensitive Win32 API calls through a stack-cutting call proxy.
Project Files
NOTES
Inherited Architecture
This is a module for Simple Loader (Hooking). To run it, use:
./link loader.spec demo/test.x64.dll out.bin %HOOKS="modules/stackcutting/stackcutting.spec"
The Proxy
proxy.c is our call proxy. It's position-independent code. Its only job is to munge the stack, call a Win32 API function with the right arguments, fix the stack, and return the result.
Our goal, in this fixing process, is to create the illusion of a full stack (e.g., RtlUserThreadStart on down) without the walked frames containing our DLL loader, injected DLL, and other things in between. I call this stack cutting, because I'm cutting the bad frames out.

Finding a Valid Frame
For this to work, we need a valid stack frame to point things to. In this example, I simply grabbed the return address and frame pointer for our DLL loader's caller. I then propagate these to eventually reach the proxy function.
This depends on the loader executing from a context where sane frames come before it.
If this loader is fired via CreateRemoteThread, we may find ourselves in a context where we don't have a good frame behind us. If we spam a return address without a good frame pointer, our stack unwinding becomes less than predictable. This implementation detects this situation, and opts to NULL the return address and frame pointer, when the caller context frame pointer is NULL. This won't give us the illusion of a complete callstack, but in many situations, the stack unwinding should stop at our proxy function--giving us a truncated stack.
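A minimal sketch of the fallback rule described above, written as portable C rather than the project's actual code. The type `spoof_frame_t` and the function `choose_spoof_frame` are hypothetical names invented for this illustration. The sketch assumes the caller's return address and frame pointer have already been captured somewhere else.

```c
#include <stddef.h>

/* Hypothetical container for the values propagated to the call proxy. */
typedef struct {
    void *ret_addr;   /* return address the proxy will present */
    void *frame_ptr;  /* frame pointer the proxy will present */
} spoof_frame_t;

/*
 * Decide what the proxy should show as its caller. If the loader's caller
 * context has no frame pointer, a return address alone is not trustworthy,
 * so both values are cleared and the walk is expected to end at the proxy.
 */
static spoof_frame_t choose_spoof_frame(void *caller_ret, void *caller_fp)
{
    spoof_frame_t out = { NULL, NULL };

    if (caller_fp != NULL) {
        out.ret_addr  = caller_ret;
        out.frame_ptr = caller_fp;
    }
    return out;
}
```

With GCC or Clang, `__builtin_return_address(0)` and `__builtin_frame_address(1)` are one possible way to capture the two inputs. That is an assumption for the sketch, not a statement about how this module captures them.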
Hiding the Call Proxy
Even with a valid frame and some stack munging, we still have a problem. Our proxy function WILL show up in the callstack. It's position-independent code. We want to place it somewhere more advantageous than private memory. My solution was to find a code cave (e.g., slack space between the end of a module's .text section and the next page boundary) to store the PIC. Here, I try the executable module itself, and if its cave is not big enough I try kernel32. If neither works, I just VirtualAlloc unbacked memory. I wouldn't choose kernel32 for production use, but this code demonstrates the pattern of trying modules until you find one that works.
The go function in stackcut_setup.c is where the call proxy is set up. This happens before the loader's init logic is called.
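The slack-space arithmetic behind the code cave search can be sketched as follows. This is an illustration only, not the module's code: it assumes 4 KiB pages, it assumes the caller has already read each candidate module's .text virtual size (the `Misc.VirtualSize` field of `IMAGE_SECTION_HEADER`), and the names `cave_size`, `find_cave`, and `pick_module` are hypothetical.

```c
#include <stdint.h>

#define CAVE_PAGE_SIZE 0x1000u  /* assumption: 4 KiB pages */

/* Round value up to the next multiple of CAVE_PAGE_SIZE. */
static uint32_t page_align_up(uint32_t value)
{
    return (value + CAVE_PAGE_SIZE - 1u) & ~(CAVE_PAGE_SIZE - 1u);
}

/* Slack bytes between the end of the section's data and the end of its last page. */
static uint32_t cave_size(uint32_t virtual_size)
{
    return page_align_up(virtual_size) - virtual_size;
}

/* Offset of the cave from the section start, or -1 if the PIC does not fit. */
static int64_t find_cave(uint32_t virtual_size, uint32_t pic_len)
{
    if (cave_size(virtual_size) < pic_len)
        return -1;
    return (int64_t)virtual_size;
}

/*
 * Try candidate modules in order. Returns the index of the first module whose
 * cave can hold the PIC, or -1 to signal "fall back to freshly allocated memory".
 */
static int pick_module(const uint32_t *text_sizes, int count, uint32_t pic_len)
{
    for (int i = 0; i < count; i++) {
        if (find_cave(text_sizes[i], pic_len) >= 0)
            return i;
    }
    return -1;
}
```

A caller would list the executable module first and any fallback modules after it, then treat -1 as the cue to allocate memory instead.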
Cutting the Stack
When I started this project, I sought to reproduce Mariusz Banach's results from ThreadStackSpoofer. I expected that I could set my call proxy's stack-stored frame address and return address to 0 and this would stop the stack walking. Better, I hypothesized that if I updated both to VALID values, much further up the stack, I could create the illusion of a complete callstack.
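The hypothesis is easiest to see with a toy model of a classic frame-pointer chain walk. The sketch below is a simulation in plain C, not real stack manipulation. The `frame_t` record and the `walk_frames`, `cut_terminate`, and `cut_to` helpers are hypothetical names, and the walker is a deliberately simple stand-in for what a real unwinder does.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy frame record: what a frame-pointer walk reads at each step. */
typedef struct frame {
    struct frame *prev;      /* saved frame pointer of the caller */
    uintptr_t     ret_addr;  /* saved return address into the caller */
} frame_t;

/*
 * Follow the chain and collect return addresses. The walk stops at a NULL
 * frame pointer or a zero return address. Returns the number collected.
 */
static size_t walk_frames(const frame_t *fp, uintptr_t *out, size_t max)
{
    size_t n = 0;
    while (fp != NULL && fp->ret_addr != 0 && n < max) {
        out[n++] = fp->ret_addr;
        fp = fp->prev;
    }
    return n;
}

/* Truncate: zero both values so the walk ends at the proxy. */
static void cut_terminate(frame_t *proxy)
{
    proxy->prev = NULL;
    proxy->ret_addr = 0;
}

/* Splice: make the proxy look as if a frame further up the stack called it. */
static void cut_to(frame_t *proxy, frame_t *good_frame, uintptr_t good_ret)
{
    proxy->prev = good_frame;
    proxy->ret_addr = good_ret;
}
```

In this model, splicing the proxy's record to the loader's saved frame pointer and return address removes the loader and hook frames from the walk, while the frames above the loader stay visible.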
My hypothesis didn't go as planned.
For x86 things worked as expected. x64? Not so much.
Why? The call proxy lives in a position-independent context. Its stack does not unwind the same way as a function baked into a compiled executable. Why? Because there's no .pdata section with UNWIND_INFO structures to assist this process. Why do we have a .pdata section? Because unwinding an x64 stack without it requires a lot of guesswork. Commonly, this guesswork is carried out by APIs like StackWalk64.
To fool the stack walking algorithm (when there's no .pdata), I found I had to set the frame address and return address as I did before BUT I also had to place the desired return address at the top (the lowest memory address) of my proxy function's frame.
This works, but there's another problem. This space at the top of the frame is the x64 shadow space, for the callee function to use (as it wishes), to save register content. If the callee overwrites our spammed return value, the illusion breaks. Sometimes, it breaks in a way where the unwinding just terminates at our proxy function. Sometimes, it breaks in a way that's more suspicious. It depends on what the callee puts into that first slot.
Sometimes, we can work around the above problem. For example, KERNEL32$Sleep stomps this slot with the rbx register content. This is easy to work with. We just write our desired return address to rbx and KERNEL32$Sleep will put it where we want it.
But, some functions (e.g., VirtualAlloc) stomp this slot with a register meant for argument passing. We can't overwrite that. Fortunately, in the VirtualAlloc case, the result is to terminate our stack unwinding at our proxy function. One way to work around this might be to call another function that cooperates better (e.g., going directly to NtAllocateVirtualMemory).
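The two cases above can be modeled with a small simulation. This is a toy in plain C, not real register or stack access. It assumes the callee's prologue spills exactly one register into the first shadow slot, and the names `toy_regs_t`, `spill_kind_t`, `callee_prologue`, and `illusion_survives` are hypothetical.

```c
#include <stdint.h>

enum { SHADOW_SLOTS = 4 };  /* x64 Windows reserves 32 bytes of shadow space */

/* Toy register file: only the two registers this example cares about. */
typedef struct {
    uint64_t arg_reg;      /* argument-passing register, fixed by the API call */
    uint64_t scratch_reg;  /* register the proxy is free to preload */
} toy_regs_t;

typedef enum {
    SPILL_SCRATCH,  /* callee saves a register we control (the Sleep case) */
    SPILL_ARG       /* callee saves an argument register (the VirtualAlloc case) */
} spill_kind_t;

/* Model a callee prologue that stores one register into shadow slot 0. */
static void callee_prologue(uint64_t *shadow, const toy_regs_t *r, spill_kind_t kind)
{
    shadow[0] = (kind == SPILL_SCRATCH) ? r->scratch_reg : r->arg_reg;
}

/* Returns 1 if the spoofed return address is still in slot 0 after the prologue. */
static int illusion_survives(uint64_t want_ret, uint64_t arg, spill_kind_t kind)
{
    uint64_t shadow[SHADOW_SLOTS] = { want_ret, 0, 0, 0 };
    toy_regs_t regs = { arg, want_ret };  /* proxy preloads the scratch register */

    callee_prologue(shadow, &regs, kind);
    return shadow[0] == want_ret;
}
```

Which register a given API actually spills, if any, still has to be confirmed in a debugger for each function and Windows build.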
I share the above to warn that this implementation is by no means universal and it doesn't give the stack illusion with every Win32 API call. It requires some quality time in a debugger validating the result.
Hooking Win32 APIs
stackcutting.spec implements the callable labels setup and hooks to satisfy Simple Loader (Hooking)'s contract to layer tradecraft on top of a base loader.
This module hooks VirtualAlloc, VirtualProtect, LoadLibraryA, Sleep, and MessageBoxA. The only task for these hooks is to push the hooked functions and their arguments through the stack cutting call proxy.
One piece of gymnastics is that stack cutting hard requires that its setup function is the entry point for the loader. stackcutting.spec uses remap to get rid of the original go() function before merging its code (with a new go() function). redirect is used as well to change local references to the renamed go() to the new go() function.
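The shape of such a hook can be sketched in portable C. This is not the module's code: the real proxy is position-independent code that munges the stack around the call, which this sketch only marks with comments. The `proxy_call_t` descriptor, `proxy_dispatch`, and the stand-in `fake_alloc` function are all hypothetical, and every function is assumed to take four pointer-sized arguments to keep the example portable.

```c
#include <stdint.h>

#define PROXY_MAX_ARGS 4

/* Assumption: every proxied function takes four pointer-sized arguments. */
typedef uintptr_t (*api4_t)(uintptr_t, uintptr_t, uintptr_t, uintptr_t);

/* Hypothetical call descriptor: the target function plus its arguments. */
typedef struct {
    api4_t    function;
    uintptr_t args[PROXY_MAX_ARGS];
} proxy_call_t;

/* Stand-in for the call proxy: call the target and hand back its result. */
static uintptr_t proxy_dispatch(const proxy_call_t *call)
{
    /* The real proxy would rewrite its frame here, before the call. */
    uintptr_t result = call->function(call->args[0], call->args[1],
                                      call->args[2], call->args[3]);
    /* The real proxy would restore its frame here, after the call. */
    return result;
}

/* Stand-in for a hooked API so the sketch runs anywhere. */
static uintptr_t fake_alloc(uintptr_t addr, uintptr_t size,
                            uintptr_t type, uintptr_t protect)
{
    (void)addr; (void)type; (void)protect;
    return size * 2;
}

/* The hook does nothing except package the call and push it through the proxy. */
static uintptr_t hook_fake_alloc(uintptr_t addr, uintptr_t size,
                                 uintptr_t type, uintptr_t protect)
{
    proxy_call_t call = { fake_alloc, { addr, size, type, protect } };
    return proxy_dispatch(&call);
}
```

Keeping each hook this thin means all of the stack handling lives in one place, the proxy.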
PIC Hooking Inception
The hooking in this tradecraft isn't just for the DLL though! attach "KERNEL32$VirtualAlloc" "_cVirtualAlloc" (stackcut_setup.spec) rewrites VirtualAlloc API calls in our DLL loader to local _cVirtualAlloc calls. The same happens with LoadLibraryA too. This is our tradecraft getting underneath its own program to push these calls through our stack cutting proxy.
While this self-hooking inception is cool, it has risks. stackcut.spec's setup attaches to several functions. SetupProxy calls VirtualProtect and (potentially) VirtualAlloc. Both are hooked. But, we don't want these calls in SetupProxy hooked because the proxy isn't initialized yet. optout "SetupProxy" "_cVirtualAlloc, _cVirtualProtect" mitigates this by preventing Stack Cutting's hooks from affecting this function. Crucially, optout still allows other tradecraft modules to instrument SetupProxy.
Design Decision: Global Variables in PIC
I made the decision to use global variables in this rewrite of stack cutting. These variables keep track of our spoofed return address, frame address, and the call proxy itself.
They're not necessary to set up the Stack Cutting feature from PIC. You could call SetupProxy from PIC and pass its result on to the hooking PICO without a global.
What we gain from using global variables is that our stackcut.c code works as-is in our DLL loader AND merged with our PICO hooking module.
While PIC does not normally have transparent access to global variables, Crystal Palace can restore .bss global variables with a fixbss helper function. The specific tradecraft choice and implementation is handled by the PIC services module run by loader.spec.
In some operations contexts, you might choose to give up some modularity to avoid global variables in a loader. In a tradecraft demonstration context (e.g., purple teaming), I believe this modularity, speed of implementation, and easy inception to unify PIC and DLL tradecraft is worth using globals. YMMV.
Conversation
- Writing a Debugger From Scratch - DbgRs Part 6 - Stacks (2023) by Tim Misiak. A gentle introduction to stack unwinding, Frame Pointer Omission, and using .pdata/UNWIND info written from the context of writing a debugger.
- ThreadStackSpoofer (2021) by Mariusz Banach is a similar POC to hook the Sleep function, stomp the return address to 0, call SleepEx, and return.
License
This project is licensed under the BSD License.