Sysmon – ideas, and gotchas

This post is about ensuring that a sysmon config works as it should. It also introduces a few unusual ideas, and highlights a couple of gotchas that perhaps not everyone thinks of when touching sysmon for the first time.

Despite reading about sysmon capabilities a lot, I only recently seriously looked at it from a threat hunting perspective.

There are many ‘template’ sysmon configs available online. They are an excellent start for anyone who wants to build their own. There are also many presentations about Sysmon and its detection capabilities. If you need to find them, just google around. Many of them cover the basics, and some offer good in-depth food for thought; they should definitely be consumed before attempting to build your own config…

Let’s begin…

Coverage + Versioning

If you deploy your sysmon config to a single system, or install it in a small lab, config deployment and versioning are not a problem. Anytime you change your config, you just reconfigure sysmon manually. And if needed, you can just restart the service, or restart the computer.

In the real world, though, deployment and updates are more complicated. The basic production issues I came across, or heard of, are:

Sysmon service doesn’t automatically start after a system restart
Sysmon accepts a new config, but event forwarding stops working
Sysmon generates tons of events & needs to be switched off, or tuned asap (typically for a small subset of hosts where some new noisy program was installed; ironically, very often it’s a security tool that causes all this noise)
The noise degrades performance of a system; owners are not happy
A good config can still cause performance degradation on virtual machines

There are more scenarios, I am sure of it, but to use our favorite line from corporate jargon: the bottom line is that we want to know:

how many systems we have & their OS versions
what % of these systems have sysmon properly installed
what the sysmon service status is on these systems (running/not running first, and then more detailed service states for troubleshooting)
whether events are being forwarded to a log aggregation system
what version of the sysmon config is being used by each system

Many issues listed earlier are not strictly Sysmon problems. IT dependencies are always tricky. They become even more visible when we deal with a large, multinational company. And despite being a great piece of software, sysmon doesn’t have its own deployment ‘command center’. Everything has to be done by hand, and can be subject to PEBKAC, various corporate rules (change management, business cases, compliance), laws (DAB, GDPR), and politics.

Again, the bottom line is that even if we are not responsible for deployment, we want and should be tracking it all so we can troubleshoot issues as soon as they are spotted. Absence of data is itself an incident.

For config changes, you can use Sysmon Event 16 to track the version of the XML file used to deploy the config. Using data from the event, we can check the config file’s SHA1 and look at its actual path. It may come in handy to include a version number in the sysmon config file name so you can extract it, e.g.:

Event 16
Sysmon config state changed:
 UtcTime: 2019-02-13 20:57:26.626
 Configuration: <path>\sysmon_v1.xml
 ConfigurationFileHash: SHA1=<sha1>
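
To make the file-name trick concrete, here is a minimal Python sketch. It assumes the Event 16 message arrives as plain text in the layout shown above, and that configs follow a sysmon_v<N>.xml naming scheme. The sample path and hash are made up, and parse_event_16 and config_version are hypothetical helper names, not part of any sysmon tooling.

```python
import re
from typing import Optional

# Hypothetical sample message, modelled on the Event 16 layout shown above.
# The path and the SHA1 value are invented for illustration only.
SAMPLE_EVENT_16 = """Sysmon config state changed:
 UtcTime: 2019-02-13 20:57:26.626
 Configuration: C:\\deploy\\sysmon_v1.xml
 ConfigurationFileHash: SHA1=0123456789abcdef0123456789abcdef01234567
"""


def parse_event_16(message: str) -> dict:
    """Split the 'Key: value' lines of an Event 16 message into a dict."""
    fields = {}
    for line in message.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # lines without 'Key: value' (e.g. the title) are skipped
            fields[key] = value
    return fields


def config_version(config_path: str) -> Optional[str]:
    """Extract the version from a name like sysmon_v1.xml (assumed scheme)."""
    match = re.search(r"_v(\d+(?:\.\d+)*)\.xml$", config_path, re.IGNORECASE)
    return match.group(1) if match else None


fields = parse_event_16(SAMPLE_EVENT_16)
version = config_version(fields["Configuration"])
sha1 = fields["ConfigurationFileHash"].split("=", 1)[1]
print(version, sha1)
```

If the file name carries no version, config_version returns None, which is itself worth flagging during a deployment review.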

It’s all nice and cozy, but… imagine that your sysmon config update happens only once a quarter. In the meantime, new systems are added, old systems are being removed, people responsible for deployment change. If you want to find out what version of config you have on all these systems at the end of the quarter you now need to query your log aggregation tool for that whole quarter of data to find the last Event 16 from all systems … Good luck with that.
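
The ‘last Event 16 per host’ reduction described above could be sketched like this. It is only an illustration under assumptions: the rows stand in for whatever your log aggregation tool returns, the host names and paths are made up, and latest_config_per_host and hosts_without_event are hypothetical helpers. It does not remove the need to query the whole period.

```python
from datetime import datetime

# Hypothetical rows as they might come back from a log aggregation query
# for Event 16: (host, UtcTime, Configuration path). All values are made up.
ROWS = [
    ("ws-001", "2019-01-05 08:00:00.000", r"C:\deploy\sysmon_v1.xml"),
    ("ws-001", "2019-02-13 20:57:26.626", r"C:\deploy\sysmon_v2.xml"),
    ("srv-01", "2019-01-07 11:30:00.000", r"C:\deploy\sysmon_v1.xml"),
]


def latest_config_per_host(rows):
    """Return {host: (timestamp, config_path)}, keeping the newest Event 16."""
    latest = {}
    for host, utc_time, config in rows:
        stamp = datetime.strptime(utc_time, "%Y-%m-%d %H:%M:%S.%f")
        if host not in latest or stamp > latest[host][0]:
            latest[host] = (stamp, config)
    return latest


def hosts_without_event(inventory, latest):
    """Hosts from the asset inventory that never reported an Event 16."""
    return sorted(set(inventory) - set(latest))


latest = latest_config_per_host(ROWS)
for host, (stamp, config) in sorted(latest.items()):
    print(host, stamp.isoformat(), config)
print("no Event 16 seen:", hosts_without_event(["ws-001", "srv-01", "ws-002"], latest))
```

The second helper matters as much as the first: a host that is in the inventory but has no Event 16 at all is exactly the coverage gap discussed earlier.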

Of course, you can probably gather info on the deployment status from other sources, e.g. from the release notes, etc. but I am just highlighting a very important problem that has to be taken into account from a planning and maintenance perspective. I don’t know a good, generic solution for that problem at the moment. Any ideas, and advice from the trenches welcome.

And there is one more decision to make. How many configs to use, and how to maintain them? Config used for servers may be different (more sensitive & noisy) than for workstations. That’s yet another dimension to take into account from a deployment, and maintenance perspective.

Of course, lab testing, unit testing of new rules, and gradual release are a different story, but also important.

Access to the config file

Another thing to consider is config security. First of all, who writes and modifies the config? It’s a pretty responsible task and only selected people should have access to it.

Secondly, if you drop a config file on a system where sysmon is deployed, ensure that ACLs are in place for both the file and the Registry key so that no one can read the config in its textual or binary form. Delete the XML file immediately after the config update. While the config data can still be obtained from the Registry, we can always try to make the life of attackers a bit harder.

Rules and tagging

Do yourself a favor and tag everything from the very start. This will help a lot with troubleshooting individual rules, and their classes. Use as many different rule names as possible. Don’t just rely on Mitre techniques. For example, someone running powershell.exe and a program loading the automation DLL are not equivalent, even though both fall under the same technique (T1086: PowerShell).

Also, your rules will overlap. I once spent a lot of time ‘fixing’ my broken rule. Until I realized (after tagging all rules!) that I was looking at the wrong rule. This was actually the reason I started tagging everything.

Rules and their order

XML syntax is unfortunately not the friendliest way to write a config with a few hundred, sometimes even thousands of rules.

I usually take a mixed sorting approach where rules that fall under a common category are put together. They are then sorted by the condition, and then by the actual artifact value. This helps me keep the stuff in some order.

I also use a very simple visual aid. For all clusters, I align the artifact values to the same column e.g.:

<Image condition="image" name="CScript"    >cscript.exe</Image> 
<Image condition="image" name="HTA"        >mshta.exe</Image> 
<Image condition="image" name="PowerShell" >powershell.exe</Image>

It makes reading / reviewing much easier.
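
If you don’t feel like aligning by hand, the padding can be scripted. Below is a rough Python sketch, assuming every rule sits on a single line in the form shown above; align_rules and RULE_RE are hypothetical names, and multi-line or nested elements are not handled.

```python
import re

RULES = [
    '<Image condition="image" name="CScript">cscript.exe</Image>',
    '<Image condition="image" name="HTA">mshta.exe</Image>',
    '<Image condition="image" name="PowerShell">powershell.exe</Image>',
]

# Split a single-line rule into: opening tag without '>', value, closing tag.
RULE_RE = re.compile(r"^\s*(<[^>]*?)\s*>([^<]*)(</[^>]+>)\s*$")


def align_rules(lines):
    """Pad opening tags so that all artifact values start in the same column."""
    parts = []
    for line in lines:
        match = RULE_RE.match(line)
        if match is None:
            raise ValueError(f"not a single-line rule: {line!r}")
        parts.append(match.groups())
    width = max(len(opening) for opening, _, _ in parts)
    return [
        f"{opening.ljust(width)} >{value}{closing}"
        for opening, value, closing in parts
    ]


aligned = align_rules(RULES)
print("\n".join(aligned))
```

The extra spaces land inside the opening tag, before the closing ‘>’, so the XML stays well-formed and the artifact value itself is untouched.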

Rules & ‘end with’ optimization

Sysmon is very busy. It processes a lot of stuff that our rules don’t trigger on. We really want it to bail out on that non-matching stuff as quickly as possible, and catch the stuff specified by rules even faster.

@Swiftonsecurity made an interesting discovery with regard to the way ‘end with’ rules are processed. If used wisely, they may significantly improve the processing speed of your rules.

The reason for this is that the ‘end with’ comparison doesn’t start from the end of the string, as one would expect, but from a position somewhere inside the longer of the compared strings. This reduces the number of comparisons needed to bail out for strings that are clearly different.

For example, if your rule says ‘Path ends with c:\foobar\foo.exe’, and sysmon observes a Path c:\foobar\test.csv, the comparison will bail out immediately after the first comparison, i.e. when the ‘:’ and ‘c’ characters are compared:

c:\foobar\test.csv
 c:\foobar\foo.exe

This is because sysmon finds the earliest position in the longer string where the shorter string should begin (counting from the end of the longer string), and starts the comparison from there.

If the comparison started from the beginning, it would need to walk through the part of the path that is the same in our example, and only fail when ‘t’ is compared against ‘f’:

c:\foobar\test.csv
c:\foobar\foo.exe

So, optimizing rules using this trick is a pretty good idea.
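
The difference is easy to see with a toy model. The Python below is not sysmon’s code, just a sketch of the behavior described above: it counts character comparisons for an ‘end with’ check that starts at the aligned position, and for a plain left-to-right comparison from the beginning. Both function names are made up, and case handling is ignored.

```python
def ends_with_comparisons(observed: str, suffix: str):
    """Model of the 'end with' check: returns (matched, comparisons made)."""
    if len(suffix) > len(observed):
        return False, 0  # the suffix cannot fit at all
    start = len(observed) - len(suffix)  # where the suffix would have to begin
    comparisons = 0
    for offset, expected in enumerate(suffix):
        comparisons += 1
        if observed[start + offset] != expected:
            return False, comparisons  # bail out on the first mismatch
    return True, comparisons


def starts_with_comparisons(observed: str, prefix: str):
    """Left-to-right comparison from the beginning, for contrast."""
    comparisons = 0
    for index, expected in enumerate(prefix):
        comparisons += 1
        if index >= len(observed) or observed[index] != expected:
            return False, comparisons
    return True, comparisons


observed = r"c:\foobar\test.csv"
rule = r"c:\foobar\foo.exe"
print(ends_with_comparisons(observed, rule))    # (False, 1)
print(starts_with_comparisons(observed, rule))  # (False, 11)
```

One comparison versus eleven for the same pair of strings; multiply that by the number of rules and the number of events, and the saving adds up.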

Process Access Rule

I love it and I hate it. This is an extremely tricky rule to use.

Anytime a process opens a handle to another process, it uses a very specific access mask which states what access is being requested. The bitmask defines lots of different privileges. These include the ability to terminate a process, create threads, duplicate handles, and read and/or write the virtual memory of a process. You can read all the gory details in the Microsoft article I linked to.

The good news is that it is an excellent way to detect any sort of code/data injection. The bad news is that legitimate software literally abuses the OpenProcess API. Whether it’s just a result of copypasta, or the beauty of legacy code, it is not uncommon for processes opening handles to other processes to request, and be granted, full-blown access called PROCESS_ALL_ACCESS (0x1FFFFF). Windows Explorer does it all the time, so do many other processes, and this includes AV software too.

The even ‘badder’ news is that sysmon doesn’t support bitmask comparison, so if we want to detect the presence of specific bits, e.g. the ones responsible for reading/writing memory:

PROCESS_VM_READ (0x0010)
PROCESS_VM_WRITE (0x0020)

we need to come up with an idea how to detect these w/o using a bitmask.

The good news is that these values are relatively small numbers, and fit within one byte (256 possible values); we can find all possible values of that byte where either one, or both of these bits (0x10 and/or 0x20) are set. We can then generate rules using the ‘end with’ condition on the GrantedAccess field.

Yup, it means a matrix of all the values with these bits ON:

 condition="end with" 10
 condition="end with" 11
 condition="end with" 12
 ...
 condition="end with" fd
 condition="end with" fe
 condition="end with" ff
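
Nobody wants to type that matrix by hand, so here is a small Python sketch that generates it. It assumes GrantedAccess is matched against its textual hex form, that two trailing hex digits cover the low byte, and that lower-case digits are fine (as in the list above); the exact element layout is my assumption, so adjust it to whatever your config schema expects. low_byte_suffixes and granted_access_rules are hypothetical names.

```python
PROCESS_VM_READ = 0x0010
PROCESS_VM_WRITE = 0x0020
MASK = PROCESS_VM_READ | PROCESS_VM_WRITE


def low_byte_suffixes(mask: int = MASK):
    """All two-digit hex values of the low byte with at least one mask bit set."""
    return [f"{value:02x}" for value in range(0x100) if value & mask]


def granted_access_rules(suffixes):
    """Render one 'end with' rule per suffix (element layout is an assumption)."""
    return [
        f'<GrantedAccess condition="end with">{suffix}</GrantedAccess>'
        for suffix in suffixes
    ]


suffixes = low_byte_suffixes()
rules = granted_access_rules(suffixes)
print(len(rules))  # 192
print("\n".join(rules[:3]))
```

With these two bits, 192 of the 256 possible low-byte values qualify, which is why the list runs from 10 all the way to ff with gaps (0x, 4x, 8x, cx) in between.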

There is more bad news, though. Now that you have limited the rules only to those that read/write memory, you need to exclude all the source/target images that you think should be ignored. This depends on the threat model embraced by your org, and takes a lot of manual analysis of the sysmon logs. It is a good rule of thumb to narrow down the list of target processes to the ones most commonly subjected to memory reading or injection. These include lsass.exe, explorer.exe, svchost.exe, and a couple of others. It’s a hard decision to make really, as we are limiting the visibility.

And… the good news is that you have now significantly limited the number of events going to your log aggregator. From there, you can build more dynamic exclusions that pair more than one field to make a better exclusion.

Get ready for a lot of frustration

Yes.

In my recent post I complained about everyone focusing on detecting mimikatz. It works great in demos, and referring to it is a well-known marketing driver, but there are a lot of things that can be done outside of mimikatz. We need more work on these additional targets.

If you have experience with any commercial EDR, your sysmon research will confirm that the tool is a great, free substitute for these products.
BUT
You are missing a lot of the flexibility that some of the commercial tools have offered for many years.

For example, the ability to build more complex rules needs to be delegated to a log aggregation system. Since we are already stripping down the number of events that sysmon logs using our targeted rules, we now have two levels of filtering to maintain. It’s actually hard to manage w/o making a mistake.

Managing your ruleset is also really challenging. If you start walking through available configs you may notice that some of them contain typos. Don’t be too critical; you will make these typos too. Since it’s XML, you need to use an editor that can at least help you check the validity of the syntax. Otherwise you will be chasing unicorns.
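
If your editor doesn’t validate XML, a few lines of Python can at least tell you whether the file is well-formed. This is a sketch only: it uses the standard xml.etree.ElementTree parser, the <Rules> wrapper in the samples is a made-up fragment rather than the real config layout, and check_well_formed is a hypothetical helper. It will not catch a misspelled field name or condition, only broken syntax.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str):
    """Return None if the text parses as XML, else a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        line, column = exc.position
        return f"line {line}, column {column}: {exc}"
    return None


# Made-up fragments for illustration; <Rules> is not a real sysmon element.
GOOD = '<Rules><Image condition="image" name="HTA">mshta.exe</Image></Rules>'
BAD = '<Rules><Image condition="image" name="HTA">mshta.exe</Imag></Rules>'

print(check_well_formed(GOOD))
print(check_well_formed(BAD))
```

Running a check like this before every deployment is cheap insurance against a config that sysmon refuses to load.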

Where the rules are absolute (e.g. full path, or registry entry), it’s relatively easy to keep track of them, e.g. in a separate sheet. When you start using keywords, infixes, and more ‘wide’ rules… avoiding duplicates will become really hard. There is also additional complexity with regard to the WOW64 subsystem. For many rules it’s handy to mirror system32 and syswow64, as well as Program Files and Program Files (x86) rules.
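
The mirroring itself can be generated rather than typed. A minimal Python sketch, assuming rules are plain path strings; MIRRORS and mirror_paths are hypothetical names, and deciding which rules actually deserve a mirrored twin remains a manual call.

```python
# Directory pairs to mirror; matching is done on the lower-cased path.
MIRRORS = [
    ("\\System32\\", "\\SysWOW64\\"),
    ("\\Program Files\\", "\\Program Files (x86)\\"),
]


def mirror_paths(path: str):
    """Return the path plus its 32-bit/64-bit counterpart, if one applies."""
    lowered = path.lower()
    variants = [path]
    for native, wow in MIRRORS:
        for src, dst in ((native, wow), (wow, native)):
            idx = lowered.find(src.lower())
            if idx != -1:
                # Swap the directory component, keep the rest of the path as-is.
                variants.append(path[:idx] + dst + path[idx + len(src):])
    return variants


print(mirror_paths(r"C:\Windows\System32\cscript.exe"))
print(mirror_paths(r"C:\Program Files (x86)\Tool\tool.exe"))
```

Paths that match none of the pairs come back unchanged as a one-element list, so the helper can be run over a whole sheet of rules.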

Testing rules is also not that easy. For trivial cases where we trigger on a file name we can do it on the spot. For code injection, accessing lsass.exe, running some lolbins, running processes as SYSTEM, accessing registry keys protected by ACLs, etc. you may need to dedicate a solid amount of time to make it work, and to get appropriate approvals to start testing…

Good luck!

Hexacorn