{"id":8024,"date":"2022-03-11T23:09:51","date_gmt":"2022-03-11T23:09:51","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=8024"},"modified":"2022-03-12T23:13:52","modified_gmt":"2022-03-12T23:13:52","slug":"good-file-what-is-it-good-for-part-2","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2022\/03\/11\/good-file-what-is-it-good-for-part-2\/","title":{"rendered":"Good file&#8230;  (What is it good for) Part 2"},"content":{"rendered":"\n<p>This <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/03\/04\/good-file-what-is-it-good-for-part-1\/\" data-type=\"post\" data-id=\"8014\">series<\/a> talks about &#8216;good&#8217; files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, supply-chain attacks or bothered by other err&#8230; minor inconveniences :-).<\/p>\n\n\n\n<p>Say you have amassed a bunch of &#8216;good&#8217; files and declare your first &#8216;goodware&#8217; collection as &#8216;ready for processing&#8217;. <\/p>\n\n\n\n<p>What do you do with it?<\/p>\n\n\n\n<p>The easiest way to start processing this sample set is by applying advanced file typing first. We want to know what files we collected, and mind you, I mean not only a distinction between your random media file (jpeg, gif) and PE\/ELF\/MACH-O, but also binary vs. scripting, 32- vs. 64-bit, EXE vs. DLL, user mode vs. kernel mode, standalone exe vs. .NET, signed vs. unsigned, standalone vs. installer, installer\/packer\/protector types: autoit, pyinstaller, perl2exe, legacy PE file protection layers (mpress, pecompact, themida, etc.), MSI, and a gazillion existing installer types that typically store the installation information as compressed\/encrypted data appended behind the generic executable installer stub (Nullsoft, InnoSetup, Wise, etc.). 
<\/p>\n\n\n\n<p>To perform this task I use a combination of <a href=\"https:\/\/github.com\/horsicq\/Detect-It-Easy\">DiE<\/a> and my own spaghetti-code script that I have been improving over the last 17 years (sorry, not for sharing, it&#8217;s absolutely disgusting!).<\/p>\n\n\n\n<p>Once we know what we are dealing with we can try to unpack stuff.<\/p>\n\n\n\n<p>Why unpack, you may ask?<\/p>\n\n\n\n<p>Because what you download from vendors&#8217; sites are often installers that internally store many additional files, many of which are OF INTEREST. Yup, additional embedded installers, standalone EXE\/DLL\/OCX\/SYS, redistributables, etc.<\/p>\n\n\n\n<p>How to do it?<\/p>\n\n\n\n<p><a href=\"https:\/\/www.7-zip.org\/download.html\">7-zip<\/a> is a natural candidate, but we need to be careful. I suspect <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/02\/04\/analysing-nsrl-data-set-for-fun-and-because-curious\/\" data-type=\"post\" data-id=\"7968\">NSRL analysts use it extensively<\/a> and they unpack everything that has &#8216;a binary pulse&#8217; recursively until 7z returns an error. As a result, they get tons of executable file sections&#8217; metadata sneaking into their final hash set, and&#8230; in the end no one wins. <\/p>\n\n\n\n<p>The other natural candidate is <a href=\"https:\/\/www.legroom.net\/software\/uniextract\">Universal Extractor<\/a> (and the updated versions of it, e.g. <a href=\"https:\/\/github.com\/Bioruebe\/UniExtract2\">Universal Extractor 2<\/a>). You can&#8217;t win this battle either, because it&#8217;s too complex and you lose control of what gets unpacked and what ends up in your final metadata set, same as with an overzealous use of 7z.<\/p>\n\n\n\n<p>We should definitely use these tools though; we just need to apply some moderation. 
<\/p>\n\n\n\n<p>For instance, we can stop 7-zip from extracting PE file internals by using its <a href=\"https:\/\/sevenzip.osdn.jp\/chm\/cmdline\/switches\/stx.htm\">stx<\/a> command line argument:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">7z x -stxPE foo.exe<\/pre>\n\n\n\n<p>With that, you could run this recursively on all files (including traversing unpacked directories) and get a decent list of internal files, but without unpacking PE files! <\/p>\n\n\n\n<p>For Universal Extractor-based tools their best use is &#8230; analysis of their code. You will find info on both the syntax and the required toolkit that helps to unpack less common archive formats. It&#8217;s the best way to study their handling of particular file formats\/installers and how to unpack them so that we can then cherry-pick those that work for us. Again, we don&#8217;t want all of them, as you will end up unpacking lots of useless information, e.g. .MSI files, and generating a lot of poor quality metadata. 
Be choosy.<\/p>\n\n\n\n<p>For InnoSetup there is <a href=\"http:\/\/innounp.sourceforge.net\/\">InnoUnp<\/a>.<\/p>\n\n\n\n<p>For Nullsoft, there is 7z version 15.05, which extracts Nullsoft installer files very neatly, including the <em>[NSIS].nsi<\/em> file that is a decently reconstructed NSIS installation script!<\/p>\n\n\n\n<p>For AutoIT executables you can use ClamAV as <a href=\"https:\/\/twitter.com\/SmugYeti\/status\/1089919472444624896\">pointed out<\/a> by <a href=\"https:\/\/twitter.com\/SmugYeti\">@SmugYeti<\/a>.<\/p>\n\n\n\n<p>Yup, you have to analyze your advanced file typing results first and then &#8230; divide and conquer.<\/p>\n\n\n\n<p>So&#8230;<\/p>\n\n\n\n<p>Imagine you have file typed all the samples and you know how to unpack them. What&#8217;s next?<\/p>\n\n\n\n<p>I think there are two ways to go about the next step &#8211; a script picks up a single file from your repository and either:<\/p>\n\n\n\n<ul><li>recursively unpacks it and its subsequent &#8216;descendant files&#8217;, advanced file-types them, and at the end copies files of interest (PE, MSI, etc.) back to the repository<\/li><li>unpacks the first layer only, then copies files of interest back to the repository for further processing<\/li><\/ul>\n\n\n\n<p>I think both approaches have advantages, with the first one being probably the &#8216;smartest&#8217; (i.e. do it once, well), and the second more optimized for resource usage. Why? With source files often being 200MB in size or more (yes, there are plenty of such installers nowadays!) the whole &#8216;extracted files &amp; directories&#8217; tree may end up being a good few, even a few dozen, GB of data! And that simply implies the necessity of using the slow HDD as &#8216;a working space&#8217;.<\/p>\n\n\n\n<p><em>Side note #1: I am talking about small SOHO hardware investments here! Note #2: I also don&#8217;t advocate using SSDs in your SOHO &#8216;sample processing&#8217; setup as they tend to fail after too many I\/O operations. 
I have had too many issues in the past and don&#8217;t recommend it.<\/em><\/p>\n\n\n\n<p>In the second approach, a RAM disk may be good enough most of the time and it&#8217;s definitely better, performance-wise. <\/p>\n\n\n\n<p>Choose your poison wisely.<\/p>\n\n\n\n<p>I actually use a hybrid approach &#8211; if the file is relatively small I try to unpack it fully on the RAM drive, and if it is an obvious &#8216;fat installer&#8217; I push it to the HDD. And I always try to do &#8216;a full-recursive&#8217;. Saves time and is kinda neat. Again, your personal choice matters here.<\/p>\n\n\n\n<p>So&#8230; now you have extracted, distributed and catalogued all this goodness. <\/p>\n\n\n\n<p>What&#8217;s next?<\/p>\n\n\n\n<p>For starters, keep all logs so you can troubleshoot issues, and then&#8230; this is a series, so there will be another post \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This series talks about &#8216;good&#8217; files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, supply-chain attacks or bothered by other err&#8230; minor inconveniences :-). 
Say you have amassed a bunch &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/03\/11\/good-file-what-is-it-good-for-part-2\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[21,90],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8024"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=8024"}],"version-history":[{"count":10,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8024\/revisions"}],"predecessor-version":[{"id":8034,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8024\/revisions\/8034"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=8024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=8024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=8024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}