PE Section names – re-visited, again

July 26, 2019 in Batch Analysis, Malware Analysis

In my old post I listed lots of different, unique, characteristic PE Section names. I have updated that post (and its predecessor) a number of times over the years.

For a long time I sat in a comfort zone, thinking that this data had to be a superset of most, if not all, PE section names one would expect to find in the wild….

Wrong. A classic availability error.

The thing is, the list was sourced from a large malicious sample set and a small set of well-known clean files. There are, it seems, a lot of files that I missed.

In an effort to address this bias (in my defense, I suspected it existed, which is why this post is here), I started mass-downloading clean samples ~5 years ago. Now I have tons of them. After running various statistical analyses on them, I can confidently say that my original PE section set is not complete. Far from it. My point is supported by the superficial metadata analysis that follows.
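For the curious, a minimal sketch of this kind of section-name survey, using the pefile library (the corpus path is illustrative; point it at your own clean-sample collection):

```python
# Count PE section names across a corpus of samples.
# The "clean_samples" path is an illustrative assumption.
import os
from collections import Counter

import pefile

def section_names(path):
    """Yield the (NUL-stripped) section names of a single PE file."""
    try:
        # fast_load skips parsing the data directories we don't need here.
        pe = pefile.PE(path, fast_load=True)
    except pefile.PEFormatError:
        return  # not a PE file, skip it
    try:
        for section in pe.sections:
            # Section names are 8 bytes, NUL-padded.
            yield section.Name.rstrip(b"\x00").decode("latin-1", "replace")
    finally:
        pe.close()

counts = Counter()
for root, _, files in os.walk("clean_samples"):
    for name in files:
        counts.update(section_names(os.path.join(root, name)))

for name, count in counts.most_common(50):
    print(f"{count:8d}  {name}")
```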

Surprisingly, I have never listed these sections:

  • RT_CODE
  • RT_DATA
  • RT_CONST
  • RT_BSS

I am shocked, because they are actually very common in clean files!

Same goes for IPP* sections (used by OpenCV):

  • IPPCODE
  • IPPDATA

and Hewlett-Packard sections:

  • TulipLog – HP test/verification tools

and an NVidia section:

  • _NVTEXT3 – unknown purpose; code?

A couple of ‘obvious ones’ whose purpose we can guess just by looking at the names:

  • .SHAREDS
  • _LTEXT
  • _LDATA
  • COMPRESS
  • FlashPix
  • NONPAGED
  • INITCONS
  • COMMONDA
  • PRIVATE
  • ApiHooks

And then the whole collection of PAGE* sections:

  • PAGECONS, PAGEDATA, PAGE_COM, PAGE_INI, PAGEDC11, PAGE_DDC, PAGEDC80, PAGEDFER, PAGECFER, PAGE_CAI, PAGE_ISR, PAGEDC60, PAGEDC10, PAGESER, PAGEDC50, PAGEDC40, PAGEcKPL, PAGEcFRM, PAGE_DAL, PAGEcRMA, PAGEcRM, PAGE_MCM, PAGEdMXL, PAGEdKPL, PAGEdFRM, PAGEcMXL, PAGE_RW, PAGE_RO, PAGE_CPR, PAGE_CPC, PAGE_PPL, PAGEDTES, PAGEDNLG, PAGECTES, PAGECNLG, NON_PAGE, PAGESRP0, PAGEdreg, PAGEdjaw, PAGEcsrv, PAGEcjaw, PAGEcsec, PAGEcTSL, PAGEdctw, PAGEcctw, PAGEcwfd, PAGEcpsm, PAGEcnlo, PAGEcast, PAGELK, PAGEdsv_, PAGEdcln, PAGEcsv_, PAGEccln, PAGE_DEV, PAGEdStn, PAGE_IVI, PAGE_ISI, PAGE_IKV, PAGE_IIL, PAGE_ICZ, PAGE_ICI, PAGEdscn, PAGEdimg, PAGEdSnF, PAGEcimg, PAGEDC12, PAGE_ITN, PAGE_ILN, PAGE_IEG, PAGE_IBT, PAGEdoid, PAGEDC41, PAGE_WSV, PAGEdwi2, PAGEdwi1, PAGE_CRM, PAGEdPSL, PAGEcPSL, PAGEdPsr, PAGErPSL, PAGErMXL, PAGErKPL, PAGErFRM, PAGEdTSL, PAGE_PWR, PAGE_TOP, PAGE_PMC, PAGE_MEM, PAGE_DBG, PAGED, PAGE_OSS, PAGECODE, PAGEDLEG, PAGECLEG, PAGEcwkp, PAGEcptw, PAGE_LK, PAGE_IGN, PAGEdSnd, PAGE_DAT, PAGEdWsP, PAGEdrlg, PAGEKD, PAGE_IRV, PAGEipp, PAGEABLE, PAGEdtyl, PAGEdpma, PAGEdkmr, PAGEdcpk, PAGEctyl, PAGEcpma, PAGEckmr, PAGEccpk, PAGED_DA, PAGEcLGC, PAGEI028, PAGEI027, PAGEI026, PAGEI025, PAGEI024, PAGEI023, PAGEI022, PAGEI021, PAGEI020, PAGEI019, PAGEI018, PAGEI017, PAGEI016, PAGEI015, PAGEI014, PAGEI013, PAGEI012, PAGEI011, PAGEI010, PAGEI009, PAGEI008, PAGEI007, PAGEI006, PAGEI005, PAGEI004, PAGEI003, PAGEI002, PAGEI001, PAGEI000, PAGE_BIO, PAGEVRFY, PAGED_CO, PAGEPARW, PAGEVRFD, PAGEVRFC, PAGEHDLS, PAGEWMI, PAGESPEC, PAGE_VCN, PAGE_SMU, PAGE_PSP, PAGE_ISP, PAGE_GVM, PAGE_GC_, PAGE_BGM, PAGE0003, PAGE0002, PAGE0001, PAGEdQua, PAGESRP, PAGESENM, PAGE_NO_, PageIVUE, PAGErVLT, PAGEdVLT, PAGEccpt, PAGEcVLT, PAGELKCO, PAGE_DF_, PAGEdThP, PAGE_VCE, PAGE_UVD, PAGEI029, PAGECNST, PAGELKD, PAGEtext, PAGErdat, PAGEdata, PAGE_IOM, PAGEnPSL, PAGEnMXL, PAGEnKPL, PAGEnFRM, PAGE_DYN, PAGEUSBS, PAGEPOWR, PAGEWdfV, PAGEiVAC, PAGESPR0, PAGE_M, PAGE_IOC, PAGE_DIS, PAGE_CX, PAGEWCE1, PAGEWCE0, PAGEUBS0, PAGEcrea, PAGEDNLD, PAGErGEN, PAGEfull, PAGESCAN, PAGER32R, PAGER32C, PAGELK16, PAGEBTTS, NOPAGED, .no_page, nonpage, PAGEopen, PAGE_INV, PAGE_ATA, PAGE_AFP, PAGEVRFB, PAGEUSB, PAGEUMDM, PAGESAN, PAGENDSW, PAGENDST, PAGENDSM, PAGENDSI, PAGENDSF, PAGENDSE, PAGENDSA, PAGEMOUC, PAGELOCK, PAGEIPMc, PAGEI042, PAGEI041, PAGEI040, PAGEI039, PAGEI038, PAGEI037, PAGEI036, PAGEI035, PAGEI034, PAGEI033, PAGEI032, PAGEI031, PAGEI030, PAGEEAWR, PAGEEADS, PAGEC, PAGEBGFX, PAGEAFD

Finally, sections named in a somewhat intriguing way:

  • .secure
  • .DllShar
  • .DllDebu
  • HookShar
  • DebugDat
  • DebugCod
  • DeathAnd
  • .ELIOT
  • EWTPHOOK
  • FINDSHAR
  • .Process
  • .PwrMoni
  • .remotep
  • .remoteF
  • .HOOKVAR
  • .DLLShar

There are also tons of randomly named sections, indicating that vendors do not shy away from using crypters/virtualizers. While this makes a lot of sense (code/IP protection), it also makes it harder to incorporate these ‘anomalies’ into a proper Machine Learning/AI model.

I actually suspect that a careful sample-set analyst will be in a position to fool any ‘AI-driven’ or ‘next-gen’ antivirus by manipulating PE file properties alone. We have already seen a good example of such work, e.g. by Skylight Cyber, but it’s just the tip of the iceberg.

Finding good keywords

July 21, 2019 in Batch Analysis, Clustering, Malware Analysis

When we analyze malware, one of the best tools we have at our disposal is signatures (in any form; it doesn’t need to be yara or PEiD). Generating signatures is an art, one that takes a lot of human cycles, and it would be interesting to automate the process. While binary signatures are kinda hard to extract automatically (you typically need to eyeball the code first), ANSI/Unicode strings are pretty straightforward.

In this post I will run through a couple of strategies one can use to generate good keyword sets for malware research.

The first thing we can do is to focus on whitelisting. If we just run strings over a whole clean OS install, we will end up with a long list of strings that we should not include in our production signatures. Once we have such a keyword list, we can run strings on a malware sample and exclude all the whitelisted strings from the output.
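Something like this would do as a first cut (a sketch only; the clean-install path, the ASCII-only regex, and the minimum string length are my assumptions):

```python
# Naive whitelist filtering: harvest strings from known-clean files,
# then subtract them from a sample's strings.
import re
from pathlib import Path

STRING_RE = re.compile(rb"[\x20-\x7e]{5,}")  # printable-ASCII runs, len >= 5

def extract_strings(path):
    """Return the set of ASCII strings found in a file."""
    try:
        data = Path(path).read_bytes()
    except OSError:
        return set()
    return {m.group().decode("ascii") for m in STRING_RE.finditer(data)}

# Build the whitelist once, over a clean OS install (illustrative path).
whitelist = set()
for f in Path("C:/Windows/System32").rglob("*"):
    if f.is_file():
        whitelist |= extract_strings(f)

# Filter a sample's strings against it.
interesting = extract_strings("sample.exe") - whitelist
```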

Of course, this is not a very good idea, because the ‘good’ strings will include lots of important stuff we actually want to see, e.g. API names. The list will also be quite bulky, because every file, including non-PE files, will add tons of strings. All these useless keywords will affect performance.

To optimize this process we can narrow our focus to PE files only. We can also try to be a little more specific and add some context to each string. This is a bit more time-consuming, because we need to write a tool that can provide string metadata. For example, a tool like PESectionExtractor.pl can help: it parses PE files and extracts strings in the context of each PE section.
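The snippet below is not that tool, just a rough Python sketch of the same idea, with pefile doing the parsing (the string regex and length cutoff are again assumptions):

```python
# Per-section string extraction: the same strings as before, but each
# one tagged with the PE section it came from.
import re

import pefile

STRING_RE = re.compile(rb"[\x20-\x7e]{5,}")

def strings_with_context(path):
    """Map each PE section name to the ASCII strings found inside it."""
    pe = pefile.PE(path)
    result = {}
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode("latin-1", "replace")
        result[name] = [m.group().decode("ascii")
                        for m in STRING_RE.finditer(section.get_data())]
    pe.close()
    return result
```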

Okay, so now we have a list of all the good strings, with context.

This is good for exclusions, but not for inclusions. If we want to write e.g. a yara signature, or our own signature scanner, we need to find the juicy stuff.

How to do it?

In my experimental string extraction tool I introduced the concept of string islands. It exploits the fact that important strings are typically grouped together in close proximity to each other inside samples, both in genuine, legitimate software and in malware. The easiest example of this principle is PE file resources: most resource strings are obviously grouped together there. Import tables and export tables follow the same principle. And depending on the compiler, we will often find many of the strings used by the program in a very specific place in the file (e.g. the .data section).
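One simple way to approximate such islands is to cluster string hits by file-offset proximity. In the sketch below, the 256-byte gap threshold is an arbitrary assumption:

```python
# Approximate 'string islands' by clustering string hits whose file
# offsets sit close together.
import re

STRING_RE = re.compile(rb"[\x20-\x7e]{5,}")
MAX_GAP = 256  # max byte gap between neighbours within one island

def find_islands(data):
    """Group the strings found in `data` into proximity-based islands."""
    islands, current, last_end = [], [], None
    for m in STRING_RE.finditer(data):
        if last_end is not None and m.start() - last_end > MAX_GAP:
            islands.append(current)   # gap too big: close the island
            current = []
        current.append(m.group().decode("ascii"))
        last_end = m.end()
    if current:
        islands.append(current)
    return islands
```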

So… finding new keywords that could mark a file as malicious can start with parsing the file, extracting its sections, extracting islands within each section, extracting strings within each island, and then using a short list of ‘needle’ keywords to determine whether a specific island is ‘interesting’. We can use whitelisted strings as an exclusion as well (and if we have context, e.g. the section they come from, we can apply surgical exclusions only to matching sections).
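In code, that last matching step might look something like this (the needle keywords and the two-hit threshold are purely illustrative):

```python
# Score each island by how many 'needle' keywords it contains; islands
# clearing the threshold are worth a human look.
NEEDLES = {"password", "wallet", "keylog", "CreateRemoteThread"}  # examples

def interesting_islands(islands, needles=NEEDLES,
                        whitelist=frozenset(), min_hits=2):
    """Return islands with at least `min_hits` needle matches, after
    removing whitelisted strings from each island."""
    out = []
    for island in islands:
        strings = [s for s in island if s not in whitelist]
        hits = sum(any(n.lower() in s.lower() for n in needles)
                   for s in strings)
        if hits >= min_hits:
            out.append(strings)
    return out
```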

Now we have a very rich data set to work with. We have excluded tons of non-interesting strings. We can do some stats, add interesting keywords back to the ‘needles’ pool, and repeat the process. After a few iterations a nice pattern emerges and the keyword list quickly improves.
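The loop itself can be as simple as the sketch below; it reuses interesting_islands from the previous snippet, and the round count and promotion threshold are made-up numbers:

```python
# Feedback loop: strings that keep appearing in interesting islands get
# promoted into the needle pool for the next pass.
from collections import Counter

def refine_needles(all_islands, needles, rounds=3, promote_at=10):
    needles = set(needles)
    for _ in range(rounds):
        counts = Counter()
        for island in interesting_islands(all_islands, needles):
            counts.update(island)
        # Promote strings seen in many interesting islands.
        needles |= {s for s, n in counts.items() if n >= promote_at}
    return needles
```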

Using this principle I extracted thousands of very useful keywords and artifacts. The most attractive findings are islands containing clusters of keywords that belong to a single category; this helps to classify them on the spot.

An example of a ‘good’ island found this way is shown below. By the look of it, it’s a typical infostealer. Using its built-in strings identifying targeted programs/passwords, we can collect lots of juicy keywords in one go. These can make it directly into automatically generated yara rules and, of course, our ‘needle’ pool.