{"id":8014,"date":"2022-03-04T23:27:20","date_gmt":"2022-03-04T23:27:20","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=8014"},"modified":"2022-03-05T18:49:35","modified_gmt":"2022-03-05T18:49:35","slug":"good-file-what-is-it-good-for-part-1","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2022\/03\/04\/good-file-what-is-it-good-for-part-1\/","title":{"rendered":"Good file&#8230;  (What is it good for) Part 1"},"content":{"rendered":"\n<p>Most (anti-)malware researchers focus on malware samples, because&#8230; it&#8217;s only natural in this line of work. For a while now I have been trying to focus on the opposite &#8211; the good, &#8216;clean&#8217; files (primarily in the PE file format). While it may sound boring &amp; mundane, maybe even somewhat trivial, this is actually a very difficult task! <\/p>\n\n\n\n<p>Why? Hear me out!<\/p>\n\n\n\n<p>There are no samplesets available out there (at least that I know of), and any samplesets expire really fast (who cares about drivers for XP or Vista, or even any 32-bit files anymore). The &#8216;availability&#8217; bit is tricky too (apart from some driver distros originating mostly from Russia it&#8217;s hard to download anything &#8216;in bulk&#8217;), and yes&#8230; in the end you are pretty much on your own when you want to collect some new &#8216;good&#8217; samples&#8230;<\/p>\n\n\n\n<p>And &#8216;good&#8217; companies generate a lot of these&#8230; and many of them are not even interesting for us, and&#8230; you may ask yourself&#8230; what are all these good files really good for?<\/p>\n\n\n\n<p>From the offensive perspective &#8212; it&#8217;s easy: find good files, see if they are vulnerable, find those that are, write POC exploits &amp; either submit CVEs or sell 0days to exploit brokers. Oh wait, it&#8217;s not really &#8216;good&#8217;, is it? 
Let&#8217;s sit that one on a fence for the time being.<\/p>\n\n\n\n<p>What about &#8216;le&#8217; defense?<\/p>\n\n\n\n<p>Basic analysis of any clean Windows sampleset from the last 20 years can tell us that the most common number of PE sections inside these &#8216;good&#8217; executable files is 5 (31%), followed by 4 and 6 (both 13%), then 3 (12%), 2 (10%), and 1 (5%). <\/p>\n\n\n\n<p>Note: like with all statistics, these % numbers are not to be trusted, because they come from a relatively small set of clean files, many of which are from the PAST (2000-2020). Still, it&#8217;s something we can at least initiate a conversation with, right? And I doubt the percentages will vary much in larger samplesets, because good files are what they are &#8212; something that is a product of compilers and they tend to follow a template&#8230; <\/p>\n\n\n\n<p>We can exploit that. And we should.<\/p>\n\n\n\n<p>My hypothesis is that no matter what cluster of samples we look at, most good PE files will oscillate around that 5 PE sections mark by default. Oh&#8230; wait.. Newer compilers may actually shift that number a bit higher &#8211; this is because of the inclusion of additional sections that we now see added &#8216;by default&#8217; e.g. <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/win32\/secbp\/pe-metadata\">&#8216;.pdata&#8217; and &#8216;.didat&#8217; sections<\/a>. And to bring up a good example here &#8212; Windows 10&#8217;s Calculator (stub) has 5, and Notepad has 7 sections. <\/p>\n\n\n\n<p>So&#8230; 5..7 range it is. <\/p>\n\n\n\n<p>Anything outside of it is probably&#8230; mildly interesting. Why &#8216;mildly&#8217;? These numbers are good for Microsoft compilers\/PE files, but files built with non-Microsoft compilers will have to fall into a different bucket. Compiler detection is critical here, and only once we have it can we work out the average number of PE sections in &#8216;good&#8217; files generated by that specific compiler. 
Think Delphi\/Embarcadero, mingw, Go, Nim, Zig, Rust, PyInstaller, etc. Non-trivial \ud83d\ude41<\/p>\n\n\n\n<p>PE Sections are for beginners tho. Really no point spending much time on them, because the PE profiling landscape has changed a lot over the last 15 years. Luckily, there are many other properties of &#8216;good&#8217; files that are worth discussing.<\/p>\n\n\n\n<p>First, the PDB paths. Same as with malware, we can collect a large corpus of these and create a cluster of &#8216;reverse-logic&#8217; yara rules. That is, if the yara rule hits and the file contains one of the legitimate-looking PDB file names\/paths it is most likely good! It is a terribly naive assumption of course, I mean to believe that all files with legitimate PDB paths are good, but&#8230; why not, for starters. <\/p>\n\n\n\n<p>For instance, if the file is not detected by any AV, and contains unique PDB strings that look like one of these clean PDB paths (on a curated list) then it&#8217;s quite likely that it is, indeed, a clean one! Right?<\/p>\n\n\n\n<p>And together with other characteristics of good PE files we may craft slightly more complex yara rules that could all be good indicators of file goodness w\/o losing flexibility (e.g. 
work across multiple versions of the same file).<\/p>\n\n\n\n<p>I don&#8217;t want to burn &#8216;good&#8217; yara rules in public, because this kills the whole idea described above, so I won&#8217;t be posting too many examples (more about it later as well), but let&#8217;s have a look at this PDB path:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/pdb1.png\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/pdb1.png\" alt=\"\" class=\"wp-image-8017\" width=\"648\" height=\"34\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/pdb1.png 648w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/pdb1-300x16.png 300w\" sizes=\"(max-width: 648px) 100vw, 648px\" \/><\/a><\/figure><\/div>\n\n\n\n<p>If it was malware, you would write a Yara signature for it, right?<\/p>\n\n\n\n<p>You can do the same for &#8216;good&#8217; files.<\/p>\n\n\n\n<p>The second idea is focused on GUIDs. <\/p>\n\n\n\n<p>Many clean files come as COM libraries, and the GUIDs referenced by their type libraries are unique. One could create another set of &#8216;reverse-logic&#8217; yara rules for these, and this way discover &#8216;good&#8217;, clean files that reference them. A rule could match an occurrence of the GUID in string format, either ANSI or Unicode, as well as its binary representation. <\/p>\n\n\n\n<p>Again, the assumption here is that bad guys don&#8217;t use the same GUIDs in their poly-\/meta-morphic generators (yet). 
See for yourself &#8211; the GUID below has only a <a href=\"https:\/\/www.google.com\/search?q=305AFD76-ADD0-417E-AA99-3AC4FDB22B21\">few Google hits<\/a> and (until this post) was a good indicator of &#8216;goodness&#8217;:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"163\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1-1024x163.png\" alt=\"\" class=\"wp-image-8018\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1-1024x163.png 1024w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1-300x48.png 300w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1-768x122.png 768w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2022\/02\/guid1.png 1076w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p>Next, we can look at resources. Many legitimate executable files embed &#8216;branded&#8217; icons. We know these are already being leveraged by some bad guys, but having a large set of these &#8216;good&#8217; icons extracted from many clean samples can help to push samples that include them into different pipelines. <\/p>\n\n\n\n<p>And these, especially if matched with characteristics of import\/export tables, their hashes, or other basic file properties like size, number of sections and their names, or even a matching subset of strings, plus version information, number of localized strings, entropy, signatures, etc., can form unique descriptors of &#8216;goodness&#8217;. <\/p>\n\n\n\n<p>Can these be abused? 100%. This is why I am not making results of this research public.<\/p>\n\n\n\n<p>And it is certain that &#8216;good&#8217; files follow patterns; they don&#8217;t change that much, and we can exploit that. 
And since these properties can be extracted automatically, this is, in fact, a great place for machine learning (unlike actual malware)! What if what we need is an ML\/AI algo that learns from &#8216;good files&#8217;? Yes, it&#8217;s not a new concept, but how much of this kinda research is actually made public (especially the algorithmic part)? With this series I plan to bring some of my personal research to the public eye in the hope that it can inspire more work in this space. <\/p>\n\n\n\n<p>And coming back to what I mentioned earlier &#8211; I do face a dilemma. I have collected many of these artifacts and statistics during my spelunking, but I don&#8217;t think it&#8217;s a good idea to share them publicly. I think there actually is scope for a DST debate, same as there is for OST, and at this moment in time I believe some artifacts or their collections, &#8220;defensive&#8221; findings if you will, should be shared within trusted circles only!!!<\/p>\n\n\n\n<p>Yes, it&#8217;s a 180-degree change of my stance compared to say 10 years ago, but we live in strange times. If I publish a list of all clean PDB paths, clean GUIDs, clusters of legitimate icons, it&#8217;s a given that the next generation of malware will immediately re-use them in their creations! And some Red Teams may use that too. <\/p>\n\n\n\n<p>So, it&#8217;s a No. <\/p>\n\n\n\n<p>I am also worried about unfair players in the corporate space who will simply acquire this data for free and use it in their commercial offerings, on both the defensive and offensive sides.<\/p>\n\n\n\n<p>So, this is a No, too. <\/p>\n\n\n\n<p>Yup, good sharing times are over, sorry.<\/p>\n\n\n\n<p>Where does it leave us?<\/p>\n\n\n\n<p>I guess there is not much that can be done here&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most (anti-)malware researchers focus on malware samples, because&#8230; it&#8217;s only natural in this line of work. 
For a while now I have been trying to focus on the opposite &#8211; the good, &#8216;clean&#8217; files (primarily in the PE file format). While it &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/03\/04\/good-file-what-is-it-good-for-part-1\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[90],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8014"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=8014"}],"version-history":[{"count":7,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8014\/revisions"}],"predecessor-version":[{"id":8023,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8014\/revisions\/8023"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=8014"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=8014"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=8014"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}