Nowadays many websites offer services that could be called ‘malware analysis for the crowd’. Sites like VirusTotal, ThreatExpert, JSUnpack and many others provide file scan/analysis functionality utilizing multiple antivirus scanners and/or sandbox/live analysis, bundled with a bunch of other tools, e.g. file format analyzers, packer detectors, and so on. They actually do a really great job, and submitting samples to these services is one of the very first steps taken by many Incident Response handlers and Forensic Investigators all over the world. This post is my attempt to summarize my thoughts on automated malware analysis in general, and on consensual submission of files to a website owned by a third party.
You see… while it is a great source of immediate intel, submitting samples to publicly available services is not always the best choice. There are real-life situations where it is not only a bad idea, but may also be very costly to your company or your customer – both on the PR and the financial side of things. So, while I do not oppose these services, I do believe that some serious thought needs to be given to it first, and of course _before_ the submission. It is also my strong belief that you can’t rely on information you cannot verify yourself (if asked to). And if you do, you not only deprive yourself of the pleasure of finding things out, but also risk drawing incorrect conclusions.
The list below is obviously far from being complete:
- The sample may be part of a targeted attack
- Samples submitted to these services are shared; they are shared for a good purpose of course, to produce AV signatures and provide better detection, but… sooner or later one of these sensitive samples may fall into the hands of a person who will eagerly write a cool blog about it (and frankly speaking, that will be a great blog entry!)
- Malware including passwords, credentials used for data exfiltration, as well as data that would clearly identify the customer is getting more and more common; trust me, there are many malicious samples out there that contain very sensitive data inside their code and you really don’t want them to be shared; researchers working for security companies know about it – they actively look for interesting samples, because any new technique, new Rustock, Stuxnet, etc. will not only boost the company’s profile and the researcher’s own personal image – more importantly, it also allows them to escape the daily routine of writing signatures and focus on the cool stuff (you know who you are)
- AV scan is helpful to identify the malware, yet…
- With the number of malware samples collected by AV companies being extremely high, it’s easy for a particular file to be detected incorrectly
- Many AV companies use generic names like ‘trojan horse’, ‘trojan generic’, ‘heuristic badness’ etc.; this doesn’t really answer the question ‘what does this malware do’
- AV companies may use other AV vendors’ scanners to automatically process large sample sets; a mistakenly classified sample can easily transfer its incorrect classification to other vendors (a fun fact: in 2010, one of the leading AV vendors played a prank on other vendors by generating 20 dummy malware samples, creating detections for them, and submitting these samples to VirusTotal; within less than 2 weeks, more than 10 vendors detected these files as malicious!)
- Even scans with products from multiple AV vendors don’t guarantee detection – most AV engines do not detect new samples fast enough; you will often be left on your own with new or targeted malware (take note of this point: AV is still more a reactive service than a proactive one – someone needs to submit the sample first for a signature to be created)
- False Positives are still there
- Sandbox/live analysis is by its nature limited
- It is not interactive, or interaction is very limited; it is easy to use, but this is its trade-off; you only see a data dump and a subset of artifacts, without understanding the code and the context in which these artifacts have been created (this is often enough to answer ‘is it malicious?’, but not ‘what does it really do?’)
- It doesn’t rely on your company’s baseline build; thus, the tested malware runs in an environment completely different from your company’s and may behave differently; practically speaking, if you are an incident responder interested in domains to block, or a forensic investigator, you can’t rely on the result of this analysis alone; you may miss some of the artifacts that the malware would have produced had it got a chance to execute in a slightly different environment or at a different time
- Many malicious samples come with anti-sandboxing technology; it is very simple to use and quite hard to bypass
- Dynamic analysis in general is also very limited by its nature
- It misses a lot of code branches, including dead code (some malware authors still use older compilers, which can leave such code in the executable); in some cases dead code helps to find crucial information about malware authors or their modus operandi
- It misses a lot of code/data generated at runtime, decrypted at runtime, etc.
- It misses the metadata associated with the sample – coding style, copied&pasted routines, hidden messages, config data, etc.
- It assumes malware immediately does its dirty work; this can easily be delayed by a long sleep or other tricks, e.g. a built-in ‘expiration date’ or a system/hardware ID check (that is, some malware is pre-compiled to work on a specific system only)
- Many malware samples used in targeted attacks won’t work in an environment lacking specific files/paths/registry keys and will immediately exit; Stuxnet and credit card dumpers are good examples
- Certain functions of malware are executed only if a specific application is running (e.g. browser, IM software)
- It doesn’t work well for components e.g. DLL files (if they export functions, you don’t know what arguments to pass)
- It doesn’t work well for kernel mode drivers, or for PDF, SWF, Java, DEX, SIS, and hundreds of other file formats that you will come across in your career
- It doesn’t work for server-side malware
- It also doesn’t work well for malware that expects… command line arguments
- and a million other reasons…
- Last, but not least – if you are using an older browser, you may be providing the website with the full path to the sample’s location on your hard drive; this may look innocent, but you may be revealing information about your customer, your current case, or even your own company or credentials (%USERPROFILE%\Desktop\ACMECASE\sample.zip is a really bad place to keep your samples)
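One simple habit that limits the path-disclosure problem above is to never handle samples under their original, case-revealing names: copy them into a working directory under their hash before doing anything else. A minimal Python sketch – `sanitize_sample`, the directory layout and the `.bin` suffix are my own illustrative choices, not any standard tooling:

```python
import hashlib
import os
import shutil

def sanitize_sample(path, dest_dir):
    """Copy a sample into dest_dir under its SHA-256 hash, so the original
    (possibly case-revealing) file name never leaves your machine."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large samples don't have to fit in memory
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, sha256.hexdigest() + ".bin")
    shutil.copy2(path, dest)
    return dest
```

As a bonus, the hash-based name doubles as a stable identifier you can use to cross-reference the sample in reports and lookups without ever exposing the original path.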
As you can see, there are many reasons why you should be careful when you handle samples extracted from your or your customers’ systems. There are companies out there that have been exposed because the samples targeting their systems leaked to the public.
It also makes sense to invest time and learn how to do in-depth malware analysis in-house, or at least find a trusted specialist to help you with this task. You can then stand by any claim coming out of your analysis, and more importantly – you will also have a lot of fun while cracking the malware.
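Starting in-house doesn’t require expensive tooling, either. Even the Python standard library is enough for a first static triage pass – hashes for cross-referencing, plus a quick ASCII strings dump of the kind that often surfaces URLs or config data. A minimal sketch (the `triage` helper and the 4-byte minimum string length are just illustrative choices):

```python
import hashlib
import re

def triage(path, min_len=4):
    """Very basic static triage: common hashes plus an ASCII strings dump."""
    with open(path, "rb") as f:
        data = f.read()
    hashes = {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    # Runs of printable ASCII, similar to the Unix `strings` tool
    strings = [m.decode("ascii")
               for m in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)]
    return hashes, strings
```

This obviously won’t replace a debugger and a disassembler, but it gives you results you produced and can verify yourself – which is the whole point.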
The bottom line is:
- Use automation as much as you can
- Think twice before you submit samples to websites owned by a third party and, more importantly, assume and accept the fact that you lose control over the distribution of your samples
- Use data from multi AV scan/sandbox/live analysis as a foundation for further analysis, not as a final conclusion
- Do not trust threat names provided by automated tools, and understand that the differences between threats are getting more and more blurry; even if some malware is called a virus or a trojan, it may also include worm capabilities, rootkit functionality and MBR infection routines
- If you add results of automatic analysis to your reports, do your homework and confirm findings manually, or state that it is impossible (and provide the reasons)
- Do learn and use in-depth malware analysis techniques, but also understand that they have limitations as well – some malware takes months to develop and is improved over time, often reaching a level of complexity that makes its analysis really hard; sometimes it is just not worth it
- Read other blogs – just because one guy says something doesn’t mean it is correct – learn to question everything and trust only stuff that is peer reviewed