Finding good keywords

When we analyze malware, one of the best tools at our disposal is signatures (in any form; they don’t need to be yara or peid). Generating signatures is an art, and one that takes a lot of human cycles, so it would be interesting to automate the process. While binary signatures are kinda hard to extract automatically (you typically need to eyeball the code first), ANSI/Unicode strings are pretty straightforward.

In this post I will run through a couple of strategies one can use to generate good keyword sets for malware research.

The first thing we can do is to focus on whitelisting. If we just run strings over a whole clean OS install, we will end up with a long list of strings that should not be included in production signatures. Once we have such a keyword list, we can run strings on a malware sample and exclude all the whitelisted strings from the output.
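
As a rough illustration of that first step, here is a minimal Python sketch. It is not the tool used for this post: the helper names (extract_strings, filter_whitelisted), the 4-character minimum and the sample bytes are all assumptions made for the example, and the extractor only mimics what strings does for printable ASCII and UTF-16LE runs.

```python
import re


def extract_strings(data, min_len=4):
    """Return printable ASCII strings, then UTF-16LE strings, found in `data`."""
    # Runs of printable ASCII bytes (0x20-0x7e), at least `min_len` long.
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Same characters stored as UTF-16LE: each one followed by a zero byte.
    wide_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_run.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_run.finditer(data)]
    return found


def filter_whitelisted(candidates, whitelist):
    """Keep only the strings that never showed up in the clean corpus."""
    return [s for s in candidates if s not in whitelist]


# Hypothetical stand-ins for "a clean OS file" and "a malware sample".
clean = b"\x00kernel32.dll\x00LoadLibraryA\x00"
sample = (b"\x90\x90kernel32.dll\x00evil-panel.example\x00\x01"
          + "secret".encode("utf-16-le"))

whitelist = set(extract_strings(clean))
leftovers = filter_whitelisted(extract_strings(sample), whitelist)
print(leftovers)
```

Keeping the whitelist in a set makes each lookup cheap, which matters once the list grows to the size a full OS install produces.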

Of course, this is not a very good idea, because the ‘good’ strings will include lots of important stuff we actually want to see, e.g. API names. The list will also be quite bulky, because every file, including non-PE files, will add tons of strings. All these useless keywords will affect performance.

To optimize this process we can narrow our focus to PE files only. We can also try to be a little more specific: we can add some context to each string. This is a bit more time-consuming, because we need to write a tool that can provide string metadata. For example, a tool like PESectionExtractor.pl can help: it parses PE files and extracts strings in the context of each PE file section.
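
To make the ‘context’ idea concrete, here is a small Python sketch. It does not reproduce PESectionExtractor.pl; it simply assumes that some PE parser has already handed us a mapping of section name to raw section bytes, and the function names (section_strings, exclude_in_context) and the sample data are made up for the example.

```python
import re

# Printable ASCII runs of 4+ bytes; the threshold is an arbitrary choice.
_ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def section_strings(sections):
    """Return a set of (section_name, string) pairs.

    `sections` maps a section name to its raw bytes; producing that mapping
    (i.e. the actual PE parsing) is assumed to happen elsewhere.
    """
    pairs = set()
    for name, data in sections.items():
        for match in _ASCII_RUN.finditer(data):
            pairs.add((name, match.group().decode("ascii")))
    return pairs


def exclude_in_context(sample_pairs, whitelist_pairs):
    """Drop a string only if it was whitelisted for that very section."""
    return sample_pairs - whitelist_pairs


# Hypothetical data: an API name is normal in .idata, less so in .data.
clean = {".idata": b"\x00CreateFileA\x00"}
sample = {
    ".idata": b"CreateFileA\x00",
    ".data": b"CreateFileA\x00passwords.txt\x00",
}

remaining = exclude_in_context(section_strings(sample), section_strings(clean))
print(sorted(remaining))
```

The point of keying on the pair rather than on the string alone is that an API name sitting where imports normally live gets excluded, while the same name showing up in an unexpected section stays visible.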

Okay, so now we have a list of all good strings, with a context.

This is good for exclusions, but not for inclusions. If we want to write e.g. a yara signature, or our own signature scanner, we need to find the juicy stuff.

How to do it?

In my experimental string extraction tool I introduced the concept of string islands. It exploits the fact that important strings are typically grouped together in close proximity to each other inside the samples, both in genuine, legitimate software and in malware. The easiest examples where this principle works are PE file resources: most resource strings are obviously grouped together in this place. Import tables and export tables follow the same principle. And depending on the compiler, we will often find many strings used by the program in a very specific place in the file (e.g. the .data section).

So… finding new keywords that could flag a file as malicious can start with parsing the file, extracting its sections, extracting islands within each section, extracting strings within each island, and then using a short list of ‘needle’ keywords to determine if that specific island is ‘interesting’. We can use whitelisted strings as an exclusion as well (and if we have the context, e.g. the section they come from, we can apply surgical exclusions only to matching sections).
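
The island step can be sketched in Python as follows. This is only one possible reading of the idea, not the logic of my experimental tool: here an island is assumed to be a run of strings separated by at most max_gap bytes, with at least min_strings members, and both thresholds, the function names and the test bytes are invented for the example. The input is meant to be the raw bytes of a single section.

```python
import re

_ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def find_islands(data, max_gap=16, min_strings=3):
    """Group strings that sit within `max_gap` bytes of each other."""
    islands, current, prev_end = [], [], None
    for match in _ASCII_RUN.finditer(data):
        # A large gap since the previous string closes the current island.
        if prev_end is not None and match.start() - prev_end > max_gap:
            if len(current) >= min_strings:
                islands.append(current)
            current = []
        current.append(match.group().decode("ascii"))
        prev_end = match.end()
    if len(current) >= min_strings:
        islands.append(current)
    return islands


def interesting_islands(islands, needles, whitelist=frozenset()):
    """Return islands (minus whitelisted strings) that contain any needle."""
    lowered = [n.lower() for n in needles]
    hits = []
    for island in islands:
        kept = [s for s in island if s not in whitelist]
        if any(n in s.lower() for s in kept for n in lowered):
            hits.append(kept)
    return hits


# Hypothetical section content: two islands and one isolated string.
data = (b"\x00" * 64
        + b"FileZilla\x00recentservers.xml\x00Password\x00WinSCP\x00"
        + b"\x00" * 64
        + b"GetProcAddress\x00LoadLibraryA\x00VirtualAlloc\x00"
        + b"\x00" * 64
        + b"lonely string")

islands = find_islands(data)
hits = interesting_islands(islands, ["password", "filezilla"])
print(hits)
```

Matching needles as case-insensitive substrings is a deliberate choice for the sketch; it is noisy, but noise at this stage is cheap because whole islands are reviewed by eye afterwards.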

Now we have a very rich data set to work with. We excluded tons of non-interesting strings. We can do some stats, add interesting keywords back to the ‘needles’ pool and repeat the process. After a few iterations you will observe a nice pattern emerging and your keyword list will quickly improve.
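
One way to picture the ‘stats, then feed back’ loop is the Python sketch below. The promotion rule (a string must appear in at least min_count interesting islands) is an assumption chosen for illustration, as are the function name promote_needles and the sample islands; in practice a human would still review the candidates before they join the pool.

```python
from collections import Counter


def promote_needles(hit_islands, needles, whitelist=frozenset(), min_count=2):
    """Return the needle pool extended with strings seen in many islands."""
    known = {n.lower() for n in needles}
    ignored = {w.lower() for w in whitelist}
    counts = Counter()
    for island in hit_islands:
        # Count each keyword once per island so one noisy island
        # cannot push a string over the threshold on its own.
        counts.update({s.lower() for s in island})
    promoted = {
        s for s, seen in counts.items()
        if seen >= min_count and s not in known and s not in ignored
    }
    return known | promoted


# Hypothetical islands that were flagged as interesting in one pass.
hit_islands = [
    ["Password", "FileZilla", "sitemanager.xml"],
    ["password", "WinSCP", "sitemanager.xml"],
    ["Password", "kernel32.dll", "kernel32.dll"],
]

pool = promote_needles(hit_islands, {"password"}, whitelist={"kernel32.dll"})
print(sorted(pool))
```

Running the island search again with the returned pool is what closes the loop described above.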

Using this principle I extracted thousands of very useful keywords and artifacts. The most attractive findings are islands where we can find clusters of keywords belonging to a single category. This helps to classify them on the spot.

An example of a ‘good’ island found this way is shown below. By the look of it, it’s a typical infostealer. Using its built-in strings identifying targeted programs / passwords, we can collect lots of juicy keywords in one go. These can go directly into automatically generated yara rules and, of course, our ‘needle’ pool.

We are the robots

Robots.txt is an interesting file. For years it has been exploited by hackers, pentesting tool writers, crawlers, web scrapers, SEOs, etc. I was wondering the other day what sort of data robots.txt stores today. It’s been a good few years since I last looked at some examples, so I decided to do some digging. I downloaded a few top1m domain lists, post-processed them to only look at TLDs (and SLDs where necessary), and kicked off a lengthy process of downloading it all.

At the time of writing it’s still running, but I already have some interesting findings to report from the first 100K domains:

  • The number of domains that went dead over the last few years is crazy. Many domains once listed in Alexa 1M are no longer there today.
  • The number of websites that don’t use robots.txt is staggering. I was really shocked how many don’t use it at all. One can argue that it’s not necessary, but if you can use it to manage legitimate crawlers… why not?
  • The number of domains that redirect to some random stuff when robots.txt is requested is yet another phenomenon. As a result, many downloaded files are just junk HTML pages.
  • The number of sites that include server-side programming snippets in the output is also very interesting; you can literally see PHP code present inside the downloaded pages. Not good security hygiene right there.
    • Interestingly, some of the leaked snippets are clearly SEO tricks that inject links to some sites ONLY when the user-agent is googlebot; most likely malicious SEO at work.
  • The number of sites that are actually most-likely-pwned is also surprising. Apart from the aforementioned malicious SEO snippets, browsing through the downloaded HTML pages reveals many instances of the very same ASCII art hidden inside comments on many unrelated sites; it could be just simple hacktivism or vandalism, but it somehow got there.
  • The commented entries, error messages, or entries clearly introduced using a web editor (they contain HTML tags) are an interesting read too.
  • The length of some of these files (listing hundreds or thousands of entries) shows that the authors don’t know what wildcards are, or what the purpose of robots.txt really is 🙂
  • The file is offered in various encodings: ASCII, UTF-8, UTF-16, even if the semi-official agreement is that it should be either ASCII or UTF-8.
    • Another localization fun fact: many ASCII robots.txt files include non-ASCII characters, e.g. in German, French, Russian, Chinese 🙂
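
Several of the observations above can be checked mechanically. The Python sketch below is not the script used for this crawl; it is a rough triage function written for illustration, and its name (triage_robots), the marker strings it looks for and the directive list are all assumptions. It only inspects bytes that have already been downloaded.

```python
import codecs
import re

# Common robots.txt directives; the list is illustrative, not exhaustive.
_DIRECTIVE = re.compile(
    r"^[ \t]*(user-agent|disallow|allow|sitemap|crawl-delay)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def triage_robots(body):
    """Return a few quick facts about a downloaded robots.txt body (bytes)."""
    if body.startswith(codecs.BOM_UTF8):
        encoding, text = "utf-8 (BOM)", body.decode("utf-8-sig", errors="replace")
    elif body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding, text = "utf-16", body.decode("utf-16", errors="replace")
    else:
        # No BOM: either plain ASCII or something we can only guess at.
        encoding = "ascii" if body.isascii() else "non-ascii, no BOM"
        text = body.decode("utf-8", errors="replace")
    lowered = text.lower()
    return {
        "encoding": encoding,
        "non_ascii": any(ord(ch) > 127 for ch in text),
        "looks_like_html": "<html" in lowered or "<!doctype" in lowered,
        "leaked_php": "<?php" in lowered,
        "directives": len(_DIRECTIVE.findall(text)),
    }


# Hypothetical bodies resembling the cases described in the list.
plain = triage_robots(b"User-agent: *\nDisallow: /admin/\n")
junk = triage_robots(b"<!DOCTYPE html><html><body><?php echo 1; ?></body></html>")
wide = triage_robots(codecs.BOM_UTF16_LE + "User-agent: *\n".encode("utf-16-le"))
german = triage_robots("Disallow: /\u00fcber/\n".encode("utf-8"))
print(plain, junk, wide, german, sep="\n")
```

A file with zero directives that also looks like HTML is almost certainly one of the junk redirect pages, and a very high directive count is a cheap way to spot the files that list every path instead of using wildcards.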

Quite frankly… it’s quite a mess. One that only a human could make 😉