Enter Sandbox – part 5: In search for Deus Ex Machina II

My last post focused on limitations that are typically a result of the ingenuity of software developers and malware authors. It is also a result of our inability to comprehend the vast number and variety of files created en masse every day. Apart from evasions, these quirky productions are what often limits sandboxes’ ability to clearly distinguish good from bad. But that’s not all.

Today we will talk about limitations introduced by a sandbox doing its job the best way it can. It may do its job so well that malware ends up exploring only a single path of execution – one that prevents other paths from being explored. A perfect example here is the Upatre malware (I am talking about its slightly older versions from last year). When such an Upatre sample is run inside a sandbox, it attempts to download the second part – the actual payload (e.g. Dyreza). To do so, it contacts a list of sites that host the payload. If one of the sites is down, the malware will attempt to download from another. It usually stores 2-3 different URLs inside its body and walks through them one by one in order to (attempt to) download the payload.

Since most sandboxes allow online connectivity (typically proxied/anonymized), a successful download of the payload from the first domain leads to payload execution, and the other domains never get accessed. Had the sample been executed offline, the sandbox could observe attempts to download from several different domains. As a result, all of them could get blacklisted, and a potential behavioral rule like “downloads from more than one domain” could trigger. Otherwise the behavior is not that different from that of a typical legitimate downloader.

The example below shows a log of connection attempts from one such Upatre sample running offline:

InternetConnectW: 62[.]210[.]204[.]149:80
HttpOpenRequestW: GET, /0912uk11/<hostname>/0/51-SP3/0/
HttpOpenRequestW: GET, /0912uk11/<hostname>/1/0/0/
InternetConnectW: coimbatoreholidays(.)com:80
HttpOpenRequestW: GET, /images/viny11.pnd
InternetConnectW: macintoshfilerecovery(.)biz:80
HttpOpenRequestW: GET, /images/viny11.pnd
InternetConnectW: coimbatoreholidays(.)com:80
HttpOpenRequestW: GET, /images/viny11.pnd
InternetConnectW: macintoshfilerecovery(.)biz:80
HttpOpenRequestW: GET, /images/viny11.pnd

And this is an excerpt from a behavioral log from VT:

HTTP requests
URL: hxxp://coimbatoreholidays(.)com/images/viny11.pnd
TYPE: GET
USER AGENT: rupdate
URL: hxxp://www(.)coimbatoreholidays(.)com/images/viny11.pnd
TYPE: GET
USER AGENT: rupdate
DNS requests
coimbatoreholidays(.)com (192[.]185[.]97[.]96)
www(.)coimbatoreholidays(.)com (192[.]185[.]97[.]96)

The macintoshfilerecovery(.)biz domain is not mentioned at all.

Interestingly, the log from virscan.org shows both coimbatoreholidays and macintoshfilerecovery. It could be either that they test samples offline, or that the coimbatoreholidays.com domain had already died as a payload provider by the time the sample was tested.

This may lead us to the following conclusion: a good sandbox should test samples both online and offline.
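To make the “downloads from more than one domain” idea concrete, here is a minimal sketch of such a behavioral rule in Python. The log format, the function names, the threshold and the placeholder domains are my own assumptions for illustration – a real sandbox obviously has its own representation of HTTP activity:

from urllib.parse import urlsplit

def distinct_download_hosts(http_requests):
    """Return the set of hosts a sample tried to download from."""
    hosts = set()
    for url in http_requests:
        host = urlsplit(url).hostname
        if not host:
            continue
        host = host.lower()
        if host.startswith("www."):      # treat www.example.com and example.com as one
            host = host[4:]
        hosts.add(host)
    return hosts

def rule_multiple_download_domains(http_requests, threshold=2):
    """Fire when the sample walks through more than one payload host."""
    return len(distinct_download_hosts(http_requests)) >= threshold

# Offline run (fallback domains observed) vs. online run (first one succeeds)
offline = ["http://first-payload-host.example/images/viny11.pnd",
           "http://second-payload-host.example/images/viny11.pnd"]
online = ["http://first-payload-host.example/images/viny11.pnd"]

print(rule_multiple_download_domains(offline))   # True  - rule triggers
print(rule_multiple_download_domains(online))    # False - rule stays silent

Run against an offline session it fires; run against an online session where the first domain answers, it never gets the chance – which is exactly the problem described above.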

But. Let’s remember that this is just one of many conditional layers.

Nowadays samples are executed on a variety of systems. Gone are the days when you could just run a sample inside Windows XP and extract all the juice. Nowadays payloads include hybrid binaries targeting Windows XP, Windows 7, 32-bit and 64-bit platforms. That gives us at least 3 extra OSs to cover (XP 64-bit not included). Windows 8.x has been out there for quite some time already and Windows 10 is around the corner as well. They do have their quirks, and some malware targeting them won’t work on older systems. With regards to XP – some samples compiled with modern compilers won’t even run on it anymore, since their internal PE headers set a requirement for the OS to be at least Windows Vista/7 or newer (an example is JackPOS, described by Josh in this post from Feb 2014). Others simply statically link to APIs not present on older OS versions. And it’s often done w/o malicious purpose – people who write code often don’t even realize that they are breaking compatibility with older OSes simply by using APIs introduced in newer OS versions (f.ex. ransomware is often linked with the bcrypt.dll library that is not present on XP).
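A quick way to pre-sort samples by the OS images they need is to look at exactly these two things: the minimum OS version declared in the PE header and static imports of DLLs that do not exist on older systems. Below is a rough triage sketch using the third-party pefile module; the (6, 0) threshold and the bcrypt.dll check are just the examples from this paragraph, not an exhaustive list:

import pefile

def detonation_hints(path):
    pe = pefile.PE(path, fast_load=True)
    pe.parse_data_directories(
        directories=[pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_IMPORT"]])

    # Subsystem version 6.0+ means the loader requires Vista or newer - XP won't run it
    needs_vista_or_newer = (pe.OPTIONAL_HEADER.MajorSubsystemVersion,
                            pe.OPTIONAL_HEADER.MinorSubsystemVersion) >= (6, 0)

    # Static imports of DLLs missing on XP (bcrypt.dll being the example above)
    imports = {entry.dll.decode().lower()
               for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])}
    breaks_xp = bool({"bcrypt.dll"} & imports)

    is_64bit = pe.FILE_HEADER.Machine == 0x8664  # IMAGE_FILE_MACHINE_AMD64
    return {"needs_vista_or_newer": needs_vista_or_newer,
            "breaks_xp": breaks_xp,
            "is_64bit": is_64bit}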

Back to the discussion about what the sandbox does so well – which is also its weakness.

Logging in the sandbox is everything. And logging takes time. The more you hook/intercept, the more you see, but the slower it gets. No shortcuts here. Hook NT functions for a .NET, Visual Basic or AutoIt executable, and you end up with a gazillion of logs, malware not even partially executed, and the session ending w/o reaching the actual payload (let alone a proper decision about its ‘badness’). Hook and log a function that is executed 100000 times for stalling purposes, and malware wins. And so on, and so forth. One can extend the session time, but this is a naive concept. If you happen to log the wrong function, it will take forever to execute (examples include Visual Basic-based wrappers that do a lot of string operations before building shellcode that launches a RunPE payload).
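One partial mitigation (and it is only that) is to collapse repeated, identical calls into a counter instead of logging each one. A minimal sketch of the idea, with a made-up cap and a simplified trace format:

from collections import Counter

MAX_LOGGED_PER_CALL = 1000   # assumed cap; purely illustrative

class CallLog:
    def __init__(self):
        self.counts = Counter()
        self.entries = []

    def record(self, api, args):
        key = (api, args)
        self.counts[key] += 1
        if self.counts[key] <= MAX_LOGGED_PER_CALL:
            self.entries.append((api, args))
        # Beyond the cap only the counter grows - the 100000-iteration stalling
        # loop stays cheap, and the count itself is still useful as a signal.

    def stalling_suspects(self, threshold=10000):
        return [key for key, n in self.counts.items() if n >= threshold]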

In other words: to log, or not to log is a really hard puzzle to crack.

Take processes as an example: every new process spawned gets added for monitoring (more logging to come). The problem is that it’s often not necessary. A sample spawning cmd.exe, net.exe, or reg.exe to do some simple task is very common, and hooking such processes doesn’t add any value to the analysis. Quite the opposite – the logs grow exponentially and nothing good comes out of it. And for automated analysis it is a very difficult problem to handle.

Still, should we monitor, or not?

One could build a whitelist of processes (or more precisely: processes and recognized command lines) one doesn’t want to monitor, but this is much harder than one may think and would probably get exploited immediately by malware authors, who could leverage such non-monitored processes for evasion purposes. With regards to the trouble with building such a whitelist, consider the following:

  • additional blank characters
  • multiple commands in one go (separated by ampersands, or using conditions)
  • batch files instead of command executed directly
  • presence of file extensions in the command (or a lack thereof)
  • capitalization of letters that form the commands
  • multiple alternatives to do the same thing (del/erase, net/net1/sc, regedit/reg, copy/xcopy, etc.)
  • command line switches can be shuffled around within a command
  • commands executing processes only to run DDE commands (yes, it still happens 🙂 )
  • processes launched using shell verbs and their variety
  • paths used by the commands can be obfuscated and played around with in many ways (e.g. environment variables, ‘/’ in paths working as a replacement for ‘\’, substed paths, junctions, etc.)
  • new commands introduced with new OSes (cacls, xcacls, icacls, etc.)

Still, it might be worth building a list of ‘safe’, non-monitored processes, f.ex.:

  • netsh firewall set allowedprogram …
  • net stop …
  • taskkill …
  • iexplore www_getwindowinfo

Of course, to be useful, each command would require a dedicated parser or rules to extract juicy information for the high-level report.
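To give a feel for how much work hides behind “recognized command lines”, here is a deliberately incomplete normalization-and-matching sketch. The safe prefixes come from the list above; everything else (the normalization steps, the helper names) is a simplifying assumption, and most of the bullet points above are exactly the cases it does not handle:

import os
import shlex

SAFE_PREFIXES = [
    ("netsh", "firewall", "set", "allowedprogram"),
    ("net", "stop"),
    ("taskkill",),
]

def normalize(cmdline):
    cmdline = os.path.expandvars(cmdline)               # expand %SystemRoot% and friends (on Windows)
    tokens = shlex.split(cmdline.lower(), posix=False)  # collapse extra blanks, keep quoting
    # Strip path and extension from the image name: C:\Windows\net.exe -> net
    image = tokens[0].strip('"').replace("/", "\\").rsplit("\\", 1)[-1]
    tokens[0] = image[:-4] if image.endswith(".exe") else image
    return tuple(tokens)

def is_whitelisted(cmdline):
    tokens = normalize(cmdline)
    return any(tokens[:len(prefix)] == prefix for prefix in SAFE_PREFIXES)

print(is_whitelisted(r'C:\Windows\System32\NET.EXE   stop  "SomeService"'))  # True
print(is_whitelisted('cmd /c del payload.tmp & reg add ...'))                # False

Batch files, del/erase-style aliases, DDE tricks or shell verbs are enough to slip past a matcher like this – which is the point.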

Last but not least – the variety of problems sandboxes witness when they try to execute random samples and actually make them do something is the reason why a compensating approach, a.k.a. yara rules or any other signature-based approach, must be taken to ensure detection of ‘difficult cases’. I always refer to yara sigs as a “poor man’s AV engine”. They are extremely handy and easy to write, but they do not scale well, since each sig is a separate entity w/o any optimization, and they are extremely FP-prone, since testing is typically done on a very limited sample set. Antivirus engines are doing the very same thing, but in a highly optimized fashion and with a much lower FP rate. Whether we like it or not, a sandbox needs to rely on signatures, or rule-based classification. Maybe it is what it is, and perhaps a good sandbox needs to work like a good antivirus first.
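For completeness, this is roughly what the “poor man’s AV engine” looks like in practice – a handful of rules compiled and run with the yara-python module. The rule below is a made-up illustration built from the strings seen in the VT log earlier, not a real signature:

import yara

RULES_SOURCE = r'''
rule fake_downloader_example
{
    strings:
        $ua   = "rupdate" ascii
        $path = "/images/viny11.pnd" ascii
    condition:
        all of them
}
'''

rules = yara.compile(source=RULES_SOURCE)

def classify(path):
    # Every matching rule name becomes a 'verdict' - plain rule-based classification
    return [match.rule for match in rules.match(path)]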

Enter Sandbox – part 4: In search for Deus Ex Machina

When we talk about sandboxing we can’t avoid talking about its limitations. The first thing that comes to mind in this context is usually evasion (or evasions, really, since there are so many of them), but it’s actually not the most important part. The important (and depressing) part is that sandboxes are actually nearly identical to antivirus in many aspects. They share the very same flawed concept of relying on a hope that detection of badness can somehow be codified. They are reactive, even if it is masked by the depth of analysis they can provide and their ability to actually “see” what a given sample is doing. And mind you, I am not making a point here to bash sandboxes (they are extremely useful), but about being _realistic_ with regards to what they can and cannot do. If you buy one, you need to make an informed decision.

Let me explain.

First, a bit of history. In 2007 or so I was responsible for processing large volumes of samples coming from a new behavioral engine implemented by my employer at that time. My first reaction to this avalanche of suspicious files was that there is really a lot of crap out there that I would never have imagined existed. I bet this is the first thought of anyone who has ever had to deal with large quantities of samples and… it gets worse. The samples I was getting were samples that had already been somehow _highlighted_ by the engine as suspicious.

Who knows what else is out there.

Let’s face it. There are gazillions of files out there that defy logic and assumptions, and fool your parsers and rules. And for every million samples that you have just ‘covered’ with your new engine update there is another million of… people doing some weird stuff. They are either coding their legitimate apps in some very unique, creative way and coming up with some “clever” software choices, or intentionally trying to obfuscate and break stuff to make your life difficult (as a malware analyst, let alone for automated systems).

It is one thing to sandbox a sample and see what it does [a.k.a. manually review the report], it is another to automatically decide whether it is good or bad. In my first attempts, I started looking for patterns on a file level [static analysis]. Obviously, it didn’t go very far, thanks to wrappers, protectors, and packers of all sorts (often used by legitimate apps as well). I remember I went as far as implementing a primitive version of fuzzy file comparison based on rules created specifically for dedicated families (e.g. if the only difference between 2 files was just a URL, or a small area of config, then my rule would calculate a hash of the file excluding this area and mark the file as identical if such a hash had already been ‘seen’). I later got help from a very talented developer who took these ideas much further (he was a much better coder than I am) and added a lot of interesting ‘detection’ features to the ‘sample sorting’ script, but at the end of the day we both felt that it was pretty mundane work. Yes, it worked for many stupid installers and samples – and it still works today – but it has yet to prove itself as a reliable decision maker.
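The masked-hash idea is trivial to sketch; the hard (and per-family) part is knowing which region to exclude. The offsets and names below are purely illustrative:

import hashlib

def masked_sha256(data, masked_ranges):
    """Hash the file with the family-specific config/URL region zeroed out."""
    buf = bytearray(data)
    for start, length in masked_ranges:
        buf[start:start + length] = b"\x00" * length
    return hashlib.sha256(buf).hexdigest()

seen_digests = set()

def is_known_variant(data, masked_ranges):
    """Two samples differing only inside the masked region collapse to one digest."""
    digest = masked_sha256(data, masked_ranges)
    already_seen = digest in seen_digests
    seen_digests.add(digest)
    return already_seen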

Static analysis on a file format level can’t take us very far. So I started running the samples in my own primitive sandbox, and the resulting log helped me cherry-pick similar actions carried out by various samples. This definitely improved the ‘family’ or ‘outbreak’ detections and I can even claim some successes there, but it was not even close to being strong enough to clearly distinguish between good and evil. I kept expanding on it further and further and started defining rules that would flag samples for further review. To give you an example: if CreateProcess/WriteProcessMemory/CreateRemoteThread happen, then flag it as potentially bad (code injection). More and more rules, and more and more ambiguity.
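The injection rule above is about as simple as behavioral rules get. A sketch of it, assuming a simplified trace of (API name, target pid) pairs:

INJECTION_SEQUENCE = ("CreateProcess", "WriteProcessMemory", "CreateRemoteThread")

def flags_code_injection(trace):
    """trace: iterable of (api_name, target_pid) tuples, in call order."""
    progress = {}                                 # pid -> how far along the triad we are
    for api, pid in trace:
        step = progress.get(pid, 0)
        if step < len(INJECTION_SEQUENCE) and api == INJECTION_SEQUENCE[step]:
            progress[pid] = step + 1
            if progress[pid] == len(INJECTION_SEQUENCE):
                return True
    return False

trace = [("CreateProcess", 4242),
         ("WriteProcessMemory", 4242),
         ("CreateRemoteThread", 4242)]
print(flags_code_injection(trace))   # True - and a debugger or installer may trip it too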

Like many clever researchers who constantly prove that AV is a piece of mierda, these samples were doing the same to my efforts w/o any security-related research behind it. They were already out there, often for many years, and it is just the fact that I had not seen them before that made my life miserable, because I had to catch up. And catching up with all of this requires a lot of resources.

To conclude: neither a sandbox nor an AV can provide enough insight (even if a sandbox takes us further) to make a good decision. Maybe it’s just that detecting bad stuff is really a terrible idea. While I am not the biggest fan of whitelisting, there are moments when I think that going totalitarian on all unapproved software and files is really the only way to go. Perhaps there is no room for democracy in the security industry.

So, I spent a couple of paragraphs talking about limitations of sandboxes that come as a natural consequence of the ingenuity of software developers (whether on the bad or the good side of the fence, it doesn’t matter) and without mentioning a single evasion. Even if I didn’t think of it and would never have admitted it at that time, efficient sandboxing cannot exist w/o actually creating behavioral signatures the very same way you do in your regular AV work. Find the pattern, codify it, move on. That’s why I am now returning to what I said earlier: sandboxes are actually nearly identical to antivirus in many aspects. Yes, they have some advantages, but it’s naive to see them as a solution to all the problems. It’s just yet another security control out there. And it is often bypassed – the funniest part is that this is probably done incidentally and unintentionally more often than by relying on ‘anti-sandbox’ tricks. I will come back to this topic in the future.

Lots of babbling requires some specific examples to highlight the issues I was talking about:

  • static analysis fails not only because of wrappers and protectors, but also because executable file properties are inspected at run-time, and affecting them at run-time leads to code paths that are established only when the code is actually running
  • dynamic analysis is limited by the business logic of the application being run; to date, most of it is idiot logic relying on ‘get there asap’ and doing all the malicious stuff w/o much thinking; occasional evasions are just a distraction from the general trend
  • a couple of non-evasive ideas that break most of the sandboxes (and reversers ;)); these ideas are all parts of legitimate software you can find everywhere:
    • command line arguments (see the sketch after this list)
    • give an application any sense of interactivity and it fools every single sandbox on the planet
    • using any proprietary UI framework kills all autoclickers
    • using non-English language in your application will (with a few exceptions) instantly confuse any western reverser and also kill the autoclickers
    • APIs relying on ANSI code pages make life difficult if combined with the usage of non-English languages; guessing which ANSI code page is used is not easy and requires a dedicated engine
    • Non-Latin alphabets are an instant kill for many reversers
    • Scripting languages are hard to cope with properly (monitoring native functions won’t help on such a high level), f.ex. AutoIt
    • There are legitimate cases for injecting data into a child process using memory writing functions (one example is writing a copy of the environment block to a child process’ memory)
    • There are gazillions of versions of libraries used by software – they are often compiled from the very same (or slightly modified) source code, but with various options; it is hard to distinguish (in a generic way) whether they handle ANSI or Unicode, and attempts to intercept inline functions require dedicated signatures (note that compilation may turn them into many different forms, e.g. code optimized /in many ways, depending on options/, targeting a specific processor, architecture, etc.); using various versions of the same compiler may also produce different results
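As an illustration of the first item on the list above (command line arguments), consider how little it takes: the gate below is completely benign-looking, yet a sandbox that detonates the bare sample with no arguments will never reach the interesting branch. The switch name and the helper are obviously made up:

import sys

def run_real_payload():
    print("doing the real work only when asked to")   # hypothetical placeholder

def main():
    # A typical automated run executes the sample with no arguments at all,
    # so this is the only path the sandbox will ever see.
    if len(sys.argv) < 2 or sys.argv[1] != "/install":
        print("usage: updater.exe /install")
        return
    run_real_payload()

if __name__ == "__main__":
    main()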