Enter Sandbox part 20: Intercepting Buffers, f.ex. Python code from compiled binaries

In my previous post in this series I mentioned that looking at ‘dynamic’ strings processed by the analyzed sample adds a lot of value.

We shouldn’t really think of strings as strings. We should think of them as buffers. As such, intercepting interesting buffers is actually what makes sandboxes so useful. Strings are a big part of it, but as usual, there is more.

In some older posts I have already demonstrated that knowing where to look often lets us extract very interesting buffers and, quite often, the actual code of the hidden program or script.

It applies to:

  • Delphi programs – hooking inline comparison functions helps with extracting information about the command-line arguments accepted by the program (manual analysis would be quite painful, even with IDR or dedicated IDA scripts and FLIRT signatures; these are, admittedly, a game-changer for static analysis of such binaries, but why can’t we just extract this data with a sandbox?)
  • Nullsoft Installers – intercepting actual Nullsoft installation scripts
  • Perl2Exe – PoS malware is easy to analyze when you extract the Perl script that _is_ the actual malware

The very same applies to WinBatch and many other ‘script to exe’ solutions that basically try to hide the script using the good ol’ security-by-obscurity method.

And anyone who has looked at modern (emphasis on ‘interesting’) malware knows that most of the juicy code is hidden in memory buffers allocated temporarily at run-time, or in tons of randomly generated garbage code, or in code that is virtualized. No matter what technique is used to slow down the analysis, though, tracking these buffers is often the key to quickly determining what the sample is doing.

Admittedly, it is relatively easy to monitor the copypasta code, but much harder for creations coming from more advanced malware authors, who actively try to make this tracking work harder. They strip the MZ/PE headers and section names, sometimes use their own PE loader, and some ship shellcode-only code. Some use hundreds of small buffers that are hard to keep track of. And then there are noise generators that make the analysis of events intercepted by even the best-placed hook really hard (e.g. string operations that don’t mean a thing, but may trigger various detections, or whose log will simply be truncated because of the sheer number of API calls). The latter is actually another anti-analysis trick: call an API enough times and it will stop being logged. For every clever monitoring idea, there is a way to make it less clever.
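To make the "call an API enough times and it stops being logged" problem concrete, here is a minimal sketch of a per-API logging cap. It is not taken from any particular sandbox: the class name `CappedLogger`, the `max_per_api` parameter and the logged values are all made up for illustration. The one design choice worth noting is that it counts what it drops, so the truncation itself stays visible to the analyst.

```python
from collections import Counter


class CappedLogger:
    """Toy event logger with a per-API cap (hypothetical, for illustration)."""

    def __init__(self, max_per_api=3):
        self.max_per_api = max_per_api
        self.seen = Counter()   # total calls observed per API
        self.events = []        # (api, buffer) pairs that were actually kept

    def log(self, api, buffer):
        self.seen[api] += 1
        if self.seen[api] <= self.max_per_api:
            self.events.append((api, buffer))

    def suppressed(self):
        """Return {api: number of calls dropped because of the cap}."""
        return {api: n - self.max_per_api
                for api, n in self.seen.items() if n > self.max_per_api}


logger = CappedLogger(max_per_api=3)
# A noise generator: many meaningless calls, then the one that matters.
for i in range(1000):
    logger.log("lstrcmpA", "junk%d" % i)
logger.log("lstrcmpA", "/install")   # lost: the cap was already reached
logger.log("lstrlenA", "hello")      # a different API is still logged
```

In this toy run the interesting `/install` comparison never reaches the log, while `suppressed()` at least reports that 998 `lstrcmpA` calls were dropped.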

Anyway… talking about buffers is a subject for another post. In this short text I will show how placing a good hook works very well with some Python programs that got converted to .exe. In this particular instance – I will describe my thought process for analysis of an old PyInstaller-ed sample (note, it may not apply to all versions of PyInstaller; the sample I am talking about is from ~6 years ago!).

I remember looking at this particular sample a few years back and scratching my head. I knew it was a wrapper, but was not sure how to bite it. At that time there was not much of a body of knowledge on how to analyze this sort of sample, and no good static decrypters/code extractors were available (at least the ones I tried didn’t work), so I was looking for some quick wins using the good ol’ reversing trick: cheating.

I quickly noticed that python27.dll was loaded early during the program execution. Looking at the function names resolved by the program via GetProcAddress I hypothesized that some of them could be monitored to retrieve the source code that I assumed was present inside the sample:

Py_NoSiteFlag, Py_OptimizeFlag, Py_VerboseFlag, Py_Initialize, Py_Finalize, Py_IncRef, Py_DecRef, PyImport_ExecCodeModule, PyRun_SimpleString, PyString_FromStringAndSize, PySys_SetArgv, Py_SetProgramName, PyImport_ImportModule, PyImport_AddModule, PyObject_SetAttrString, PyList_New, PyList_Append, Py_BuildValue, PyFile_FromString, PyString_AsString, PyDict_GetItemString, PyErr_Clear, PyErr_Occurred, PyErr_Print, PyObject_CallObject, PyObject_CallMethod, PyThreadState_Swap, Py_NewInterpreter, Py_EndInterpreter, PyInt_AsLong, PySys_SetObject

My attention immediately focused on the PyRun_SimpleString function:

int PyRun_SimpleString(const char *command)

This is a simplified interface to PyRun_SimpleStringFlags() below, leaving the PyCompilerFlags* argument set to NULL.

int PyRun_SimpleStringFlags(const char *command, PyCompilerFlags *flags)

Executes the Python source code from command in the __main__ module according to the flags argument.
[...]

I hypothesized (a.k.a. hoped) that monitoring it would get me the Python code executed by the program. I added a quick hook for this function to my program and, lo and behold, I was immediately able to see the results:

PyRun_SimpleString:
import sys

PyRun_SimpleString:
del sys.path[:]

PyRun_SimpleString:
sys.path.append(r"<path>")

PyRun_SimpleString:
sys.path.append(r"<other path>")

PyRun_SimpleString:
# Copyright (C) —
<some bootstrap pyinstaller code>

[…]

PyRun_SimpleString:
from Crypto.Cipher import AES;
from base64 import b64decode as hAtayw;
import os;
import base64;
import ctypes;
from Crypto.Cipher import AES as Ahquye
exec(hAtayw("

[…]
The actual encoded malicious code followed! From there it was easy-peasy…
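Once such a buffer is intercepted, the outer layer can be peeled off without running anything. The sketch below assumes the buffer has the same shape as the one above (an aliased `b64decode` wrapped in `exec`); the helper name `peel_base64_layer` and the harmless stand-in payload are mine, not part of the sample. The AES imports suggest the real sample had further layers, which this sketch does not attempt to handle.

```python
import base64
import re

# Stand-in for an intercepted PyRun_SimpleString buffer. The payload is a
# harmless print statement; the real sample's payload is not reproduced here.
payload = base64.b64encode(b"print('hello from the inner layer')").decode("ascii")
intercepted = (
    "from base64 import b64decode as hAtayw;\n"
    'exec(hAtayw("' + payload + '"))\n'
)

# Step 1: find the alias given to b64decode.
ALIAS_RE = re.compile(r"from\s+base64\s+import\s+b64decode\s+as\s+(\w+)")


def peel_base64_layer(source):
    """Return the decoded bytes of exec(<alias>("...")), or None if absent."""
    alias = ALIAS_RE.search(source)
    if alias is None:
        return None
    # Step 2: find the string literal passed to the aliased decoder.
    call = re.search(
        r"exec\(\s*" + re.escape(alias.group(1))
        + r"\(\s*[\"']([A-Za-z0-9+/=\s]+)[\"']",
        source,
    )
    if call is None:
        return None
    # Decode only; the result is never passed to exec().
    return base64.b64decode(call.group(1))


inner = peel_base64_layer(intercepted)
```

Here `inner` holds the decoded bytes of the next layer, ready to be inspected as text.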

This simple hook served me many times since, and I was able to quickly analyze many samples that were ‘protected’ this way.

Sometimes the simplest things work.

Monitoring crucial functions is not one of these things, unfortunately, because you need to first discover what these crucial functions are.

I hope this post and others in this series help…

Enter Sandbox part 19: The string theory is cool, but the practice is not

Monitoring string functions inside a sandbox is really helpful. This is because strings are probably the most important buffers that we want to see being processed by the analyzed programs. They sit at such a nice, high level of abstraction that we intuitively understand their meaning, and often even their context, without much effort. This saves us the time otherwise needed to understand the inner workings of samples.

By peeking at strings we can extract a lot of valuable information, e.g. about how the program processes its command-line arguments (the number of arguments, or the available options when strings are compared directly), how dynamic strings are built, how conversions are done, and what conditions are tested, including a variety of anti-* tricks. Monitoring strings also helps with automated extraction of additional IOCs, e.g. URLs that are not actively used, yet have been built and stored in memory.
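As a rough illustration of the IOC angle, the sketch below pulls URLs out of a list of intercepted strings. The log format, the sample strings and the helper name `extract_urls` are hypothetical, and the regular expression is deliberately simple rather than a complete URL grammar.

```python
import re

# Deliberately simple: scheme, then anything up to whitespace, quotes or <>.
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(buffers):
    """Return unique URLs from intercepted strings, in first-seen order."""
    seen = []
    for text in buffers:
        for url in URL_RE.findall(text):
            if url not in seen:
                seen.append(url)
    return seen


# Hypothetical strings, as a string-function hook might have logged them.
captured = [
    "lstrcpyA: http://",
    "lstrcatA: http://example.com",
    "lstrcatA: http://example.com/gate.php",
    "lstrcmpiA: /install",
    "wsprintfA: id=%d&url=HTTP://example.org/a?b=1",
]
urls = extract_urls(captured)
```

Note that a string built piece by piece shows up in its intermediate forms too (`http://example.com` before `http://example.com/gate.php`), so some post-processing is still needed to keep only the final version.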

The problem is… it is almost impossible to cover them all! Hmm let me rephrase that last sentence – it’s almost impossible to cover even the basics!

If you ask any programmer how many string functions are offered by their favorite programming language, I bet the answer will oscillate between a few dozen and… say… a hundred? And if they are aware of the more archaic issues related to character encoding, and have worked with various programming languages… maybe they will double that number, maybe even triple it.

It turns out that the ‘old’ Windows API framework alone offers more than 380 of these. And together with libs offered by various programming frameworks this number easily multiplies.

Over the years I made many attempts to build a comprehensive list of all string functions that are offered via the Windows API and popular libraries.

It should be easy, right?

  • Pick a number of the most commonly imported DLLs, e.g.:
    • msvcrt.dll and its variants
    • kernel32.dll
    • user32.dll
    • advapi32.dll
    • oleaut32.dll
    • ole32.dll
    • maybe throw in the ntdll.dll as well
  • look at their exported functions
  • cherry-pick those APIs that are associated with string processing
  • and… you are done.
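The naive version of these steps can be sketched in a few lines. Everything here is an assumption made for illustration: the export lists are small hardcoded samples (in practice they would come from parsing each DLL's export table), and the name pattern `STRING_HINTS` is a guess at what "associated with string processing" might look like.

```python
import re

# Name fragments that hint at string processing. A guess, for illustration.
STRING_HINTS = re.compile(
    r"^(lstr|_?str|_?wcs|_?mbs|_?isw?(alpha|digit|space|upper|lower|alnum))"
    r"|String"
)

# Small hardcoded samples standing in for real export tables.
exports = {
    "kernel32.dll": ["CreateFileW", "lstrcmpiA", "CompareStringW",
                     "IsDebuggerPresent", "lstrlenW"],
    "msvcrt.dll": ["strstr", "malloc", "_mbsicmp", "wcsncpy",
                   "isalpha", "fopen"],
    "oleaut32.dll": ["SysAllocString", "VariantClear"],
}


def cherry_pick(exports_by_dll):
    """Keep only the export names that match the string-related pattern."""
    return {
        dll: [name for name in names if STRING_HINTS.search(name)]
        for dll, names in exports_by_dll.items()
    }


picked = cherry_pick(exports)
```

A pattern like this finds the obvious candidates, but it would silently miss functions such as ToUnicodeEx or IdnToAscii, whose names do not follow any of the guessed conventions.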

BUT

Such an exercise brings up more questions than answers. Which ones do we include? How many function sets are actually out there?

Let’s see…

Copying, moving, length calculation, comparison, substring extraction, character position finding from left and right side, parsing, tokenization, concatenation, reversing, searching, replacement, regexes, ANSI versions, Unicode versions, UTF-8 versions, NT API, Windows API, case sensitivity variants, string memory allocation and release functions, integer to string, string to integer, long integer to string, string to long integer, time/date to string, string to time/date, string formatting, string trimming, lower case, upper case and many other conversion functions, a gazillion wrappers to cater for various character sets, endianness, or to address certain classes of vulnerabilities, or specific ad-hoc needs of programmers or frameworks (e.g. Variant type, strings used by COM, etc.).

And these are just ‘standard’ APIs, without:

  • C++ functions (with its overloaded constructors)
  • Visual Basic
  • Visual Basic for Applications
  • Visual Basic Script
  • JavaScript
  • tons of other wrappers that exist either within native OS libs and programs (.NET, PowerShell, Office, Shell APIs, crypto APIs, OLE/COM wrappers, multiple versions of msvcrt, kernelbase and api- wrappers)
  • popular frameworks (Qt)
  • popular libraries (PCRE, SQLite3)
  • exports from popular DLLs supporting programming environments like python, perl (yes, we can monitor these too!)
  • different compilers e.g. MinGW, or Borland/Delphi/CodeGear/Embarcadero that rely heavily on inline functions
  • kernel functions
  • internal functions that can be recognized via debug symbols
  • inline/internal functions re-used by malware, if signatures and hooks can be applied to them (e.g. string encryption/decryption functions)
  • plus… tons of duplicated code, thanks to static linking, and your good ol’ copypasta

One can argue that many window-oriented functions and messages could also be included, as they offer extra insight into the program’s inner workings. Hence functions operating on resources that are the building blocks of the UI (ribbons, menus, labels, buttons, etc.), dynamically created UI elements, as well as any messages that have to do with text (WM_*, EM_*, etc.) could be included too. Going further, we can also include more advanced, or shall we say higher-level, functions, e.g. XML processing APIs, database APIs, and any APIs or methods that process a syntax (e.g. WQL in WMI). And yes, we can argue that many of them will eventually reach out to the lower-level string APIs that operate on actual text, but hey… an API tree like this would be a great time-saver.

If we take a step further, the need to monitor all strings can be more precisely defined as seeing all strings processed at the highest possible level (i.e. at the program nesting level, not in intermediate libraries). It is extremely difficult to do, but perhaps one day… in Sandboxes 3.0.

You see where it is going?

The other end of the rope is the inevitable noise, and the performance hit that such in-depth monitoring would certainly bring… Still, for specific samples such in-depth analysis would give a lot of time back to reversers who otherwise need to manually deconstruct the business logic of the samples.

Nearly 390 string functions are listed below. There are more, but I can’t list them all, because if you program sandboxes, you need to do your homework yourself 🙂

IsTextUnicode, CompareStringA, CompareStringEx, CompareStringOrdinal, CompareStringW, IdnToAscii, IdnToUnicode, lstrcat, lstrcatA, lstrcatW, lstrcmp, lstrcmpA, lstrcmpW, lstrcmpi, lstrcmpiA, lstrcmpiW, lstrcpy, lstrcpyA, lstrcpyW, lstrcpyn, lstrcpynA, lstrcpynW, lstrlen, lstrlenA, lstrlenW, __isascii, __toascii, _isalnum_l, _isalpha_l, _isatty, _iscntrl_l, _isctype, _isctype_l, _isdigit_l, _isgraph_l, _isleadbyte_l, _islower_l, _ismbbalnum, _ismbbalnum_l, _ismbbalpha, _ismbbalpha_l, _ismbbgraph, _ismbbgraph_l, _ismbbkalnum, _ismbbkalnum_l, _ismbbkana, _ismbbkana_l, _ismbbkprint, _ismbbkprint_l, _ismbbkpunct, _ismbbkpunct_l, _ismbblead, _ismbblead_l, _ismbbprint, _ismbbprint_l, _ismbbpunct, _ismbbpunct_l, _ismbbtrail, _ismbbtrail_l, _ismbcalnum, _ismbcalnum_l, _ismbcalpha, _ismbcalpha_l, _ismbcdigit, _ismbcdigit_l, _ismbcgraph, _ismbcgraph_l, _ismbchira, _ismbchira_l, _ismbckata, _ismbckata_l, _ismbcl0, _ismbcl0_l, _ismbcl1, _ismbcl1_l, _ismbcl2, _ismbcl2_l, _ismbclegal, _ismbclegal_l, _ismbclower, _ismbclower_l, _ismbcprint, _ismbcprint_l, _ismbcpunct, _ismbcpunct_l, _ismbcspace, _ismbcspace_l, _ismbcsymbol, _ismbcsymbol_l, _ismbcupper, _ismbcupper_l, _ismbslead, _ismbslead_l, _ismbstrail, _ismbstrail_l, _isspace_l, _isupper_l, _iswalnum_l, _iswalpha_l, _iswcntrl_l, _iswctype_l, _iswdigit_l, _iswgraph_l, _iswlower_l, _iswprint_l, _iswpunct_l, _iswspace_l, _iswupper_l, _iswxdigit_l, _isxdigit_l, _mbcasemap, _mbccpy, _mbccpy_l, _mbccpy_s, _mbccpy_s_l, _mbcjistojms, _mbcjistojms_l, _mbcjmstojis, _mbcjmstojis_l, _mbclen, _mbclen_l, _mbctohira, _mbctohira_l, _mbctokata, _mbctokata_l, _mbctolower, _mbctolower_l, _mbctombb, _mbctombb_l, _mbctoupper, _mbctoupper_l, _mbctype, _mblen_l, _mbsbtype, _mbsbtype_l, _mbscat, _mbscat_s, _mbscat_s_l, _mbschr, _mbschr_l, _mbscmp, _mbscmp_l, _mbscoll, _mbscoll_l, _mbscpy, _mbscpy_s, _mbscpy_s_l, _mbscspn, _mbscspn_l, _mbsdec, _mbsdec_l, _mbsdup, _mbsicmp, _mbsicmp_l, _mbsicoll, _mbsicoll_l, _mbsinc, _mbsinc_l, _mbslen, _mbslen_l, 
_mbslwr, _mbslwr_l, _mbslwr_s, _mbslwr_s_l, _mbsnbcat, _mbsnbcat_l, _mbsnbcat_s, _mbsnbcat_s_l, _mbsnbcmp, _mbsnbcmp_l, _mbsnbcnt, _mbsnbcnt_l, _mbsnbcoll, _mbsnbcoll_l, _mbsnbcpy, _mbsnbcpy_l, _mbsnbcpy_s, _mbsnbcpy_s_l, _mbsnbicmp, _mbsnbicmp_l, _mbsnbicoll, _mbsnbicoll_l, _mbsnbset, _mbsnbset_l, _mbsnbset_s, _mbsnbset_s_l, _mbsncat, _mbsncat_l, _mbsncat_s, _mbsncat_s_l, _mbsnccnt, _mbsnccnt_l, _mbsncmp, _mbsncmp_l, _mbsncoll, _mbsncoll_l, _mbsncpy, _mbsncpy_l, _mbsncpy_s, _mbsncpy_s_l, _mbsnextc, _mbsnextc_l, _mbsnicmp, _mbsnicmp_l, _mbsnicoll, _mbsnicoll_l, _mbsninc, _mbsninc_l, _mbsnlen, _mbsnlen_l, _mbsnset, _mbsnset_l, _mbsnset_s, _mbsnset_s_l, _mbspbrk, _mbspbrk_l, _mbsrchr, _mbsrchr_l, _mbsrev, _mbsrev_l, _mbsset, _mbsset_l, _mbsset_s, _mbsset_s_l, _mbsspn, _mbsspn_l, _mbsspnp, _mbsspnp_l, _mbsstr, _mbsstr_l, _mbstok, _mbstok_l, _mbstok_s, _mbstok_s_l, _mbstowcs_l, _mbstowcs_s_l, _mbstrlen, _mbstrlen_l, _mbstrnlen, _mbstrnlen_l, _mbsupr, _mbsupr_l, _mbsupr_s, _mbsupr_s_l, _mbtowc_l, _strcmpi, _strcoll_l, _strdate, _strdate_s, _strdup, _strdup_dbg, _strerror, _strerror_s, _stricmp, _stricmp_l, _stricoll, _stricoll_l, _strlwr, _strlwr_l, _strlwr_s, _strlwr_s_l, _strncoll, _strncoll_l, _strnicmp, _strnicmp_l, _strnicoll, _strnicoll_l, _strnset, _strnset_s, _strrev, _strset, _strset_s, _strtime, _strtime_s, _strtod_l, _strtoi64, _strtoi64_l, _strtol_l, _strtoui64, _strtoui64_l, _strtoul_l, _strupr, _strupr_l, _strupr_s, _strupr_s_l, _strxfrm_l, _tolower, _tolower_l, _toupper, _toupper_l, _towlower_l, _towupper_l, isalnum, isalpha, iscntrl, isdigit, isgraph, isleadbyte, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswascii, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, strcat, strcat_s, strchr, strcmp, strcoll, strcpy, strcpy_s, strcspn, strerror, strerror_s, strftime, strlen, strncat, strncat_s, strncmp, strncpy, strncpy_s, strnlen, strpbrk, strrchr, strspn, strstr, strtod, 
strtok, strtok_s, strtol, strtoul, strxfrm, wcscat, wcscat_s, wcschr, wcscmp, wcscoll, wcscpy, wcscpy_s, wcscspn, wcsftime, wcslen, wcsncat, wcsncat_s, wcsncmp, wcsncpy, wcsncpy_s, wcsnlen, wcspbrk, wcsrchr, wcsrtombs, wcsrtombs_s, wcsspn, wcsstr, wcstod, wcstok, wcstok_s, wcstol, wcstombs, wcstombs_s, wcstoul, wcsxfrm, SysAllocString, SysAllocStringByteLen, SysAllocStringLen, SysFreeString, SysReAllocString, SysReAllocStringLen, SysReleaseString, SysStringByteLen, SysStringLen, ToAscii, ToAsciiEx, ToUnicode, ToUnicodeEx, WCSToMBEx