Get your logging act together, loggers!

If I had to name the single most frustrating blocker/issue that makes my work hard, the one that makes me cringe every time I encounter it, it would probably be… bad logs.

Over the years I have come across so many bad logs that I sometimes wonder… how is it even possible that DFIR exists? And how can it progress further when the core data we use on a daily basis keeps us in a constant state of ‘I guess this is what it is/means(?)’.

A quick anecdote: while on one of my early DFIR cases, I encountered a set of logs that clearly identified malicious activity (IP connections to C2). The problem was that the logs didn’t include any timestamps, only stages of connections 🙁

What the actual…

The only thing I could write in my report was that there was evidence of the alleged activity occurring… I just don’t know… when. Bye bye, trust. Not only does such a statement make me look like an idiot in front of a customer, it also makes it really hard to prove activity in cases where it really, really matters (e.g. cases that go to court; mine didn’t).

There are many types of logs, logging systems, standards, and log converters; many are vendor-specific, some are open-source, some are derived from homemade tools for parsing forensic artifacts (lots of them started with a lot of guesswork), some are ‘native’ to the solutions being used, some are ad-hoc log formats created by developers, and some are the result of blue teamers parsing org-specific stuff on their own (again, often lots of guesswork!).

And then there is a whole avalanche of application-specific logs that can only be described as ‘custom’. Under the umbrella of this ugly word we find verbose logs, debug logs, telemetry, and troubleshooting logs, as well as any other log salad no sane person would ever imagine was even possible. They are often created not for consumption by the security industry, but for testing/QA, and sometimes even for sales and marketing purposes. And most of the time, they are contextually very poor. In some industries that are catching up with modern technologies, there is also an additional layer that focuses on combining data from a large number of vendors, normalizing that data, and then standardizing it into a single, unified log stream. A TRULY challenging effort.

SO…

Lots of random.

If there is any way to influence / encourage vendors and coders to improve their logging quality, it would really, really help us all.

In particular:

  1. collecting logs in a way that makes them more actionable
    [we don’t even need to care about one standard, per se, just include all possible information that can make these logs useful in a decision-making process]
  2. keeping them properly formatted so the distinction between field names and values is clear (see the short example after this list)
  3. keeping them properly placed
    who cares that a ‘service started’ or a ‘service stopped’, when there are 50,000 entries like this in the Windows Security Event Log, a log that was NEVER created to hold a particular application’s operational records; don’t steal resources when troubleshooting or solution-specific logs like these can go into their own file
  4. keeping them relevant
    not only is mixing admin/management/content-update/operational logs bad for the analysts’ BAU, it actually makes them question the value of that particular security control
    we need ‘easy’, ‘fast’, ‘actionable’, and no one wants or has time to do DIY log analysis anymore
    again, it’s 2018; we need to look at logs, not painstakingly learn what to parse, how, and how to interpret the results (and the context)
  5. keeping them not only relevant, but also short, minimalistic, in an almost Scandinavian furniture way
    • perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
  6. documenting them in a way that makes them easy to interpret, even by juniors
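
To make points 1 and 2 a bit more concrete, here is a tiny sketch in Python; every field name and value in it is hypothetical and only meant to show the idea, not copied from any real product. The free-text line forces every consumer to guess where field names end and values begin, while the structured record can be indexed and queried directly:

    import json
    from datetime import datetime, timezone

    # BAD (hypothetical free-text line): ambiguous separators, combined ip:port, no timezone
    bad = "05/03/2018 10:22 conn 10.0.0.5:49213->203.0.113.7:443 ok"

    # BETTER (hypothetical structured record): every field named, atomic, UTC timestamp
    good = {
        "ts_utc": datetime.now(timezone.utc).isoformat(),
        "event": "net_conn",
        "src_ip": "10.0.0.5",
        "src_port": 49213,
        "dst_ip": "203.0.113.7",
        "dst_port": 443,
        "outcome": "allowed",
    }
    print(json.dumps(good))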

Let’s look at a few examples of what can be done:

  • think of logs as a part of a timeline, not an atomic event
  • think of who is reading logs, and why
  • think of the volumes of logs; reduce volumes, if they can be reduced
  • think of making logs highly configurable
    • introduce configurable level of verbosity, and overall be more generous with detailed logs
  • enrich data at source; there is plenty of information to gather from the system; do it once, or every once in a while, but make it available in the logs
  • reach out to the security community, ask what should be logged apart from what you are logging today
  • again, think of the audience – not only troubleshooting, but also forensic analysis
  • split logs into
    • security control admin/management logs (access, role management, availability, etc.)
    • security control content management logs (definition updates, etc.)
    • security control operational logs (actual events)
    • if needed, split the logs into subsets, e.g. an endpoint solution can maintain separate log branches for
      • removable devices
      • AV detections
        • ‘bad malware’ i.e. ‘more advanced toolkit’ — hacking tools, rootkits, information stealers, mimikatz, powersploit, nopowershell
        • commodity malware
        • RATs
        • spyware
        • adware, pup, trackware
        • dual-purpose tools like psexec, nmap, nirsoft tools
        • cryptomining
        • etc.
      • client IDS
      • ‘noisier’ FP-prone heuristics (e.g. reputation)
      • etc.
  • talk to other vendors and standardize
    AFAIK there is NOT a single standard for how AV detections are logged in e.g. Windows Event Logs
    most admins / blue teamers are forced to learn which events are important by testing how the AV logs them and how it names or classifies them
    what a waste of time!
    not only is the log of existing events not comprehensive (you only see what was detected, not the full set of possibilities), there is also no EICAR-like file set for the various types of malware/payloads (perhaps a good idea to develop one?)
  • help to prioritize
    one of the goals is to make it easy to distinguish between high-fidelity and low-fidelity alerts, i.e. those that are immediately actionable and those that we can keep on the back burner for a while
  • use ‘portable’ character encoding
    • UTF-8/Unicode if possible; avoid ASCII/ANSI/DBCS/MBCS; seriously, it’s already 2018; if your program/OS can’t do UTF-16 or UTF-8, please leave the scene
    • encode blobs with base64
  • write timestamps with the highest possible resolution
    • don’t truncate timestamps!
    • you can use a binary format too (serialized/encoded), or a simple string (can be converted to anything)
    • I am personally not the biggest fan of formatted timestamp strings
      • they often trim timestamp data [time resolution]
      • it actually takes a lot of time to convert from the raw number to a string and back (multiply the time of a single conversion by the number of timestamps)
      • time formats are random and often highly misleading (e.g. mm/dd/yyyy vs. dd/mm/yyyy); note that your audience follows the Sun more often than not; people from various countries interpret timestamps in their local ‘native’ format by default, and mental conversion between formats is actually VERY difficult and error-prone
  • sometimes it’s better to write binary logs (usually WAY faster)
    in such cases provide full file format documentation, example code, converters
  • log timestamps in UTC; if in local time, indicate it somehow
  • distinguish timestamps related to:
    • time of the actual event
    • time when the event itself was logged locally (e.g. Windows Event Logs)
    • time when it arrived at a collector/centralized log system
    • note that these 3 timestamps can help you deal with various availability issues, or even discover some in the first place
    • add timestamps for all activities that matter
      for example, if you can link a download URL with a file, and then with its execution, and maybe even a possible malware detection by the AV, consider adding more events (one event per row) with various timestamps:

      • time of accessing the download link
      • time of detection/blocking, if link is bad
      • time of actual download completed/saving (file creation)
      • time of file AV detection alert, if applicable
      • time of file AV remediation/quarantine alert, if applicable
      • time of the ‘Open File Security Warning’ prompt triggered by Zone.Identifier, if applicable
      • time of user finally executing file [ShellExecute/CreateProcess]
      • time of UAC prompt being shown, if applicable
      • time of actual program start
      • time of advanced/behavioral AV/EDR detection
      • time of advanced/behavioral AV/EDR remediation
      • time of first network connection
      • time of program exit, if applicable
      • I know it’s a lot, but consider the added value for the analyst:
        • it’s actually really good to start mixing network logs with timeline logs extracted from endpoint forensic artifacts, if available
        • some EDRs already gather lots of such data and some present them in a timeline-ish way
        • you are not triggering individual blinkenlights anymore; you are helping to automate the response process!
  • use more rows to describe an event, if needed
    it’s better to describe operation states instead of using a single, multi-value ‘Frankenstein’s monster’ log that needs to be parsed using a lot of regex and data conversion/analysis tricks (e.g. in Splunk)
  • if you use multiple rows, use some sort of cookie or session ID that binds them together into a single cluster of events (a transaction); see the first sketch after this list
  • properly use standards: CSV, TSV, JSON, XML, whatever else
    seriously, this happens so frequently: please don’t screw up CSV just because you don’t know how to output new lines or escape special characters; please read RFC 4180 and test the corner cases (new lines, multiline data, data with UTF-8 chars, etc.); a minimal sketch using Python’s csv module follows this list
  • for activities that require listing file names – don’t log just the bare file name; there are so many opportunities there:
    • file name
    • file extension
    • file type based on the file extension, if available
    • file type based on the file structure, if available
    • file size
    • full path
    • any other metadata available
      • hashes (MD5, SHA1, SHA256, perhaps even CRC32 for quick comparisons)
      • attributes (hidden, executable, directory, junction, symlink, setuid, etc.)
      • file timestamps (any possible on the specific file system)
      • PE compilation timestamp
      • version information
      • signature information
      • debug information (e.g. PDB path)
      • existing classification (e.g. VT)
      • avoid adding any unconfirmed information
        (e.g. vague statements from tools that ‘estimate’ or ‘guess’ file characteristics)
  • for processes
    • process identifier (PID)
    • process name
    • process command line
    • process characteristics
      • CLI, GUI
      • GS, DEP, ASLR, CFG
      • etc.
    • process current directory
    • parent process identifier (PID)
    • parent process name
    • parent process command line
    • other possibilities
      • section names in file where the process was initialized from
      • section names in memory
      • security token
      • privileges
      • time of start
      • time of exit
      • list of modules at start
      • list of modules at exit
      • list of RWX regions at start, with mappings to images
      • list of RWX regions at exit, with mappings to images
      • main window title, class, if GUI app
      • list of child windows
      • perhaps some flags that can indicate
        • if GUI process, was there any interaction with the user before the program exited?
        • etc.
      • etc.
    • exit code
    • some of these properties can be listed for all events; others can be dumped separately (as distinct events) in a one-off fashion:
      • when the process is started
      • at regular intervals
      • and/or when the monitored property changes
      • when the program exits
        this may save a lot of analyst time; such basic volatile information is often enough to get a basic understanding of what happened
  • for crashes
    • similar info as for the processes
    • + time of crash
    • + address of crash
    • + stack trace, with symbols if available
    • + list of loaded modules
    • etc.
  • for network artifacts, enrich the data if possible
    • interface(s)
    • whether the interface is in promiscuous mode
    • MACs, if needed
    • source and destination IP
    • source and destination ports
    • IP version
    • IP address type (static, DHCP)
    • states (established, listening, etc.)
    • direction (inbound/outbound)
    • sizes of data transfers in/out
    • session ids, if it is within one session
    • host name(s)
    • other metadata of the logged object
      • mime types
      • user agents
      • referers
      • classification
      • user name / region / business unit (if can be mapped)
      • any links to processes involved e.g. PID of the process, process name, etc.
      • CVE
      • CVE scores
      • CVE age/expiration (so we don’t trigger events on perl or php vulns from 2000, or CodeRed anymore)
      • any tags (e.g. VIPs, PCI-DSS systems, DMZ, whatever else)
  • if possible DO NOT COMBINE values e.g.
    • 1 field: ip:port
      is worse than
    • 2 fields: ip and port
  • again, DO NOT COMBINE values
    if we need to use regex to extract basic values from the logs, you are doing it wrong
    – you are making things slow, because every single row needs to be regexed to extract these atomic data items
  • yes, think ATOMIC chunks of data
  • if you provide field names
    • AVOID changing names of the fields over time
    • AVOID changing order of the fields over time
    • don’t use whitespace characters in field names (it makes life easier when the data is imported into other systems)
    • don’t make field names too long (imagine what you see in Excel when you export data, or when you write Splunk queries that have to rely on prefixed fields in the form of
      <120 characters>.file
      <120 characters>.md5)
    • escape characters properly
    • use consistent nomenclature (src_ip and ip_dst is inconsistent)
  • DO NOT DUPLICATE information
    • user vs. account
    • host vs. ComputerName vs. Source vs. src
    • cmd vs. COMMAND
    • arg0 vs a0
    • etc.
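
To tie a few of the points above together (high-resolution UTC timestamps, one row per state change, atomic fields, and a session ID that binds related rows into a transaction, as referenced in the bullets above), here is a minimal Python sketch; the event names, field names, and values are all made up for illustration and do not come from any real product:

    import json
    import uuid
    from datetime import datetime, timezone

    def utc_now():
        # Full-resolution UTC timestamp in ISO 8601; not truncated, not local time
        return datetime.now(timezone.utc).isoformat()

    def emit(event_type, session_id, **fields):
        # One row per state change; atomic fields only, no combined values
        record = {"ts_utc": utc_now(), "event": event_type, "session_id": session_id}
        record.update(fields)
        print(json.dumps(record))

    # One logical transaction (download -> file creation -> execution) emitted as
    # separate, individually timestamped rows bound together by a single session ID
    session = str(uuid.uuid4())
    emit("url_accessed", session, url="http://example.com/payload.exe", src_ip="10.0.0.5")
    emit("file_created", session, path="C:\\Users\\bob\\Downloads\\payload.exe",
         sha256="<computed at source>", size=73216)
    emit("process_started", session, pid=4242, ppid=612,
         image="C:\\Users\\bob\\Downloads\\payload.exe", cmdline="payload.exe /silent")

Each row stays small and queryable on its own, while the session ID lets an analyst (or a SIEM) reassemble the whole download-to-execution chain without regex gymnastics.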
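
And on the CSV point specifically: the corner cases (embedded commas, quotes, newlines, UTF-8) are exactly what standard libraries already handle, so there is rarely a good reason to hand-roll the escaping. A minimal sketch, again with made-up field names, using Python’s built-in csv module:

    import csv
    import sys

    # Short, consistent, whitespace-free field names; src/dst values kept atomic
    FIELDS = ["ts_utc", "event", "src_ip", "src_port", "dst_ip", "dst_port", "detail"]

    writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS, lineterminator="\r\n")
    writer.writeheader()

    # Values containing commas, quotes, or newlines get quoted/escaped per RFC 4180
    # by the library, instead of by hopeful string concatenation
    writer.writerow({
        "ts_utc": "2018-05-03T10:22:41.123456+00:00",
        "event": "net_conn",
        "src_ip": "10.0.0.5",
        "src_port": 49213,
        "dst_ip": "203.0.113.7",
        "dst_port": 443,
        "detail": 'user agent with "quotes", commas,\nand even a newline',
    })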

Then… there is documentation. This really deserves a dedicated post, but just to highlight a few bits:

  • write down the format of the log
    • typical location on the system
    • info on process/library creating them
    • character set used
    • field names
    • field descriptions – not one-liners, but the real meaning + context on how to interpret them
    • order of the fields
    • how characters are escaped
    • how new lines are added
    • read, read again, read one more time
    • test, test, test
  • now add all possible additional details you can think of: context, comparisons, example data, example incident data; also review the source code, then add info on the specific conditions that trigger certain events (code snippets are OK); analysts need to understand EVERY condition that triggers the event logging so they can write playbooks for it
  • provide sample code to parse the logs in at least 1-2 popular languages (Python? Perl? Ruby?); a tiny example follows below
  • provide a sample tool/program to test the logs (e.g. one that will generate various types of events, similar to the way AV uses EICAR)
  • give the community/analysts a platform to provide feedback
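
To illustrate the sample-code point: even a ten-line reference parser removes a lot of guesswork, because it encodes the vendor’s own answer to ‘what are the fields and how are they escaped’. A minimal sketch in Python, assuming the hypothetical JSON-lines format used in the earlier examples:

    import json
    import sys

    def parse_log(stream):
        """Yield one dict per event from a JSON-lines log stream."""
        for lineno, line in enumerate(stream, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except ValueError:
                # Surface broken rows instead of silently dropping them
                print("line %d: unparseable record" % lineno, file=sys.stderr)

    if __name__ == "__main__":
        for event in parse_log(sys.stdin):
            print(event.get("ts_utc"), event.get("event"))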

And finally, only when we have as many logs as possible, in a proper/decent format, enriched in every possible way, and tested to confirm they provide the required data, only then can we start doing MITRE ATT&CK assessments…