The art of artifact collection and hoarding for the sake of forensic exclusivity… – Part 5

If you follow this series, you should know by now that I am obsessing here not about the benefits of piracy, but about a new, powerful forensic capability: a truly actionable summary (extracted from the ever-growing body of evidence…).

In my previous posts in this series I covered a number of different approaches one may take to analyze forensic and telemetry data obtained from an individual system or a cluster of endpoints belonging to a specific org, but here’s one more approach: Wikipedia’s categorization feature.

You may or may not be aware that there are Wikipedia tools available online that allow us to extract a subset of the Wikipedia database that meets certain criteria: f.ex. one can select all Wikipedia pages that are tagged with a certain category and export them to a file. And lo and behold – for our ‘software categorization’ purposes there is a really interesting category we should look at: Lists_of_software:

When we click ‘Add’ we will immediately populate the list of pages:

And when we click ‘Export’ we will get a relatively small XML file listing all the pages of interest…

Now, that list of pages is interesting on its own, because we get a really long list of nice categories – 484 entries (242 unique) as of today – see: wiki_pages.txt (based on the page list) and wiki_pages_unique.txt (based on the <title> entries from the exported XML file).

Secondly, when you parse this exported XML file, you will end up with a list of software names, vendor names, and domains that can now be used to… yes… categorize software we find during forensic exams! Luckily for us, most of these legitimate software packages listed on Wikipedia follow some sort of naming convention scheme – this allows us to recognize them, especially when they are installed to their own, preprogrammed paths.
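For illustration, here is a minimal parsing sketch in Python. It assumes the export is a standard MediaWiki XML dump (the kind Special:Export produces, with namespaced <page>/<title> elements) saved locally as wiki_lists_of_software.xml – the file name is hypothetical:

```python
# Minimal sketch: pull page titles out of a MediaWiki XML export.
# Assumes the export was saved locally as 'wiki_lists_of_software.xml'
# (hypothetical file name) and uses the standard <page>/<title> layout.
import xml.etree.ElementTree as ET

def extract_titles(path):
    titles = []
    # iterparse keeps memory usage low even for large dumps
    for _, elem in ET.iterparse(path, events=("end",)):
        # ignore the MediaWiki XML namespace by matching the local tag name
        if elem.tag.endswith("}title") or elem.tag == "title":
            if elem.text:
                titles.append(elem.text)
        elem.clear()
    return titles

if __name__ == "__main__":
    for title in sorted(set(extract_titles("wiki_lists_of_software.xml"))):
        print(title)
```

Deduplicating and sorting the extracted titles gives you the same kind of view as the wiki_pages_unique.txt file mentioned above.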

Thirdly, when you review that exported XML file, you will quickly realize how many of these thousands of software packages you have never heard of. This is a humbling lesson for any Detection Engineering adept out there – we can’t pretend anymore that we are on top of things. Every single software package is a potential source of a supply-chain attack. Every single software package may be introducing Local Privilege Escalation bugs. Every single software package may include new lolbins. Every single software package may offer functionality that can and will be abused by attackers. And this is just the software listed on Wikipedia. There are gazillions of other software installs out there that have never been looked at, never been scrutinized, never been assessed from a security standpoint.

I will be exploring many of them in my future posts in this series.

The art of artifact collection and hoarding for the sake of forensic exclusivity… – Part 4

In my last post I mentioned the outdated PAD files. Let’s have a closer look at them.

Before we do so, a short comment first — in the era of omnipresent GenAI buzz sometimes it’s really hard to convince yourself to do any research let alone share the results of it. Everything feels ‘old’, the GenAI obviously knows all the answers, and no one can compete with this vast amount of information that can be extracted from these AI models so effortlessly, even if their advice sometimes feels a bit hallucinogenic…

What keeps me going is the good ol’ adage – luck favors a prepared mind. I believe that you can’t utilize GenAI properly if you don’t know the fundamentals, if you don’t do the legwork, if you don’t research on your own. Ironically, in order to use GenAI efficiently and effectively, one has to know far more than before, because GenAI is a very strong, confident assistant that often… turns into an opponent. It may assist us in the best possible meaning of the word, or it can ruin us if we blindly trust its outputs…

The surprising twist is that we can only get better at using the GenAI by first getting better at the ‘non-AI’ stuff aka ‘the old’. And this post is dedicated to ‘the old’.

Yes, no one cares about PAD files anymore, so why bring them up? Well, I hope I will convince you that there is still value out there…

So, the PAD files… (PAD stands for Portable Application Description – an XML format that shareware authors used to describe their programs to download sites.)

It’s hard to download them today, but in the past one could download PAD files from at least these two, now-defunct websites:

  • http://repository.appvisor.com – repo of actual PAD files
  • http://www.qarchive.org/repository – XML file with links to PAD files

Luckily, old copies of the QArchive repository still reside on https://web.archive.org, f.ex. here (28K+ PAD URLs), and I will share 14K+ PAD files from http://repository.appvisor.com below.

When you attempt to download all the PAD files possible, and/or the URLs they point to, you will quickly realize that many links no longer work. Not a surprise – after all, it is a legacy format, and lots of killer-app-wannabe software products never really made it; in the end, their presence online was barely noticeable. For the purpose of our discussion though, it’s worth mentioning that one can still use web.archive.org to download copies of these PAD files from the time right before the websites hosting them closed… so yes, there is a way to collect many of them, even if they are officially long gone.
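To show how that collection step could look, here is a rough sketch using the Wayback Machine availability API; the input file pad_urls.txt is hypothetical, and the API only returns the closest snapshot it knows about, so expect gaps:

```python
# Rough sketch: given a list of PAD URLs (one per line in 'pad_urls.txt',
# hypothetical file name), ask the Wayback Machine availability API for the
# closest archived snapshot of each and print the snapshot URL if one exists.
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available?"

def closest_snapshot(url):
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(WAYBACK_API + query) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

if __name__ == "__main__":
    with open("pad_urls.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url:
                print(url, "->", closest_snapshot(url) or "no snapshot")
```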

Analyzing many PAD files (in bulk) can give us an insight into many interesting aspects of a shrinking, yet still present old-school software distribution model.

For starters, analyzing a repo of many PAD files gives us a quick & dirty software categorization list: whatever is listed inside the Program_Category_Class element is of interest. An example category list extracted from 14K PAD files is shown here. By mapping Program_Category_Class values to the directories that programs are installed to (these can be extracted via installer unpackers), one can build a simple categorization engine for these known combos.
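As a quick illustration of that first step, here is a small Python sketch that tallies Program_Category_Class values across a local directory of PAD files; the directory name pad_repo is hypothetical, and the element is matched by tag name alone, so no particular nesting inside the PAD schema is assumed:

```python
# Quick & dirty sketch: walk a directory of PAD files (PAD files are plain XML)
# and tally every Program_Category_Class value seen. The directory name
# 'pad_repo' is hypothetical, as is the assumption that the files use an
# .xml extension.
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

def category_counts(pad_dir):
    counts = Counter()
    for pad_file in Path(pad_dir).rglob("*.xml"):
        try:
            root = ET.parse(pad_file).getroot()
        except ET.ParseError:
            continue  # plenty of PAD files in the wild are malformed
        for elem in root.iter("Program_Category_Class"):
            if elem.text:
                counts[elem.text.strip()] += 1
    return counts

if __name__ == "__main__":
    for category, n in category_counts("pad_repo").most_common():
        print(f"{n:6d}  {category}")
```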

The Primary_Download_URL, Secondary_Download_URL, Additional_Download_URL_1, Additional_Download_URL_2, DP_Distributive_Primary_URL elements point to actual URLs that you can use to download the latest version of the software. One way to utilize these is to collect a list of (most likely) clean software installers that can be used to build your ‘good samples’ repo. This in turn can be used to tune and quality-test your yara, yara-x, capa rules…
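Continuing the sketch above (same hypothetical pad_repo directory), this snippet harvests only the URL elements named in this paragraph; what you then download, hash and store in your ‘good samples’ repo is up to your own tooling:

```python
# Sketch under the same assumptions as above ('pad_repo' directory of PAD XML
# files): harvest every download URL element so the installers can later be
# fetched into a 'good samples' corpus. Only the element names quoted in the
# post are searched for.
import xml.etree.ElementTree as ET
from pathlib import Path

URL_TAGS = (
    "Primary_Download_URL",
    "Secondary_Download_URL",
    "Additional_Download_URL_1",
    "Additional_Download_URL_2",
    "DP_Distributive_Primary_URL",
)

def download_urls(pad_dir):
    urls = set()
    for pad_file in Path(pad_dir).rglob("*.xml"):
        try:
            root = ET.parse(pad_file).getroot()
        except ET.ParseError:
            continue
        for tag in URL_TAGS:
            for elem in root.iter(tag):
                if elem.text and elem.text.strip().lower().startswith("http"):
                    urls.add(elem.text.strip())
    return sorted(urls)

if __name__ == "__main__":
    for url in download_urls("pad_repo"):
        print(url)
```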

The Company_Info element and its children may help to collect useful info about (most likely) legitimate companies – there are emails, phone numbers, social media accounts, etc.

Believe it or not, many of these software products still exist out there, in the wild. They are installed on actual endpoints, and the information provided inside these PAD files can help us do two things:

  • Recognize them as ‘possibly legit, good software’ -> exclude them from the detections/forensic exam view (see the sketch after this list)!
  • Recognize and categorize them to build a profile of the endpoint and the org (as discussed in my previous post).
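As a toy illustration of the first bullet, the sketch below builds a set of known program names from the PAD corpus (the Program_Name element, same hypothetical pad_repo directory) and uses it to split an endpoint’s software inventory into ‘probably known good’ and ‘needs a closer look’; a real exclusion list would of course also match on vendor names and install paths:

```python
# Illustrative sketch only: build a set of known program names from the PAD
# corpus and use it to triage a forensic software inventory. The inventory
# passed in the usage example below is made up for demonstration purposes.
import xml.etree.ElementTree as ET
from pathlib import Path

def known_program_names(pad_dir):
    names = set()
    for pad_file in Path(pad_dir).rglob("*.xml"):
        try:
            root = ET.parse(pad_file).getroot()
        except ET.ParseError:
            continue
        for elem in root.iter("Program_Name"):
            if elem.text:
                names.add(elem.text.strip().lower())
    return names

def triage(inventory, known):
    """Split a list of program names found on an endpoint."""
    probably_ok = [p for p in inventory if p.lower() in known]
    review = [p for p in inventory if p.lower() not in known]
    return probably_ok, review

if __name__ == "__main__":
    known = known_program_names("pad_repo")
    ok, review = triage(["7-Zip", "TotallyUnknownTool"], known)
    print("probably ok:", ok)
    print("needs review:", review)
```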

Last, but not least – this is an archive of 14K+ PAD files from http://repository.appvisor.com I downloaded in 2021.

Now… for the bad news.

Analysis of PAD files will give you a list of no-longer-existing domains that may still be vetted as trusted by security vendors due to past encounters from when the software was still alive. Secondly, some of the existing, installed software packages that happen to be ‘dead’ by now may still include auto-update functionality. Yes, this offers a supply-chain-attack possibility where one can re-register an expired domain and place a malicious updater on this new site. The next time the legitimate auto-update of the now-defunct software kicks in, it will happily resolve the re-registered domain, download the updater and execute it.
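A hedged sketch of how one could start hunting for such candidates: take the download URLs harvested earlier, reduce them to unique host names, and flag the ones that no longer resolve in DNS. A failed lookup is only a hint that a domain may be expired and re-registrable, not proof, so WHOIS would be the natural next step:

```python
# Hedged sketch: reduce harvested URLs to unique host names and flag hosts
# that no longer resolve. A DNS failure is a hint, not proof, that the
# domain is expired. The sample URLs below are taken from this post.
import socket
from urllib.parse import urlparse

def dead_hosts(urls):
    hosts = {urlparse(u).hostname for u in urls if urlparse(u).hostname}
    dead = []
    for host in sorted(hosts):
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            dead.append(host)
    return dead

if __name__ == "__main__":
    sample = [
        "http://www.qarchive.org/repository",
        "http://repository.appvisor.com/some.pad",  # path is hypothetical
    ]
    for host in dead_hosts(sample):
        print("does not resolve:", host)
```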

Take that, GenAI…