{"id":9874,"date":"2025-02-22T10:57:31","date_gmt":"2025-02-22T10:57:31","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=9874"},"modified":"2025-02-22T10:57:31","modified_gmt":"2025-02-22T10:57:31","slug":"optimizing-the-regexes-or-not","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2025\/02\/22\/optimizing-the-regexes-or-not\/","title":{"rendered":"Optimizing the regexes, or not"},"content":{"rendered":"\n<p>Every once in a while we all contemplate solving interesting yet kinda abstract threat hunting problems. This post describes one of these&#8230;<\/p>\n\n\n\n<p>The problem:<\/p>\n\n\n\n<p><strong>Given a relatively large number of strings, how do you write a regular expression that covers them all, but doesn&#8217;t hit on any other string?<\/strong><\/p>\n\n\n\n<p>The context:<\/p>\n\n\n\n<p>I have extracted file names associated with kernel drivers referenced by all the .inf files present inside all of the (unpacked) archives found inside the <a href=\"https:\/\/driverpack.tilda.ws\/main-page\">DriverPack<\/a>.<\/p>\n\n\n\n<p>The rationale:<\/p>\n\n\n\n<p>Hunting for new kernel drivers introduced to the environment may be easier if I can extract kernel driver names from the telemetry, and only report the creation of those that reference files NOT present on the &#8216;list of known good kernel driver file names&#8217;.<\/p>\n\n\n\n<p>The solution:<\/p>\n\n\n\n<p>Looking for existing tools that could help address this problem in a generic way, I came across this Perl module &#8211; <a href=\"https:\/\/metacpan.org\/release\/DANKOGAI\/Regexp-Optimizer-0.15\/view\/lib\/Regexp\/Optimizer.pm\">Regexp::Optimizer<\/a>. To my surprise, it actually works quite nicely.
<\/p>\n\n\n\n<p>I gave it <a href=\"https:\/\/hexacorn.com\/d\/ServiceBinary2su.txt\">7.5K file names<\/a> associated with &#8216;known clean kernel module drivers&#8217; and it produced the following <a href=\"https:\/\/hexacorn.com\/d\/regex.txt\">regex<\/a>. I have tested all the file names from the &#8216;ServiceBinary2su.txt&#8217; file and the regex worked well. This is the test script:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\">\n<p>use strict;<br>use warnings;<br>use utf8;<\/p>\n\n\n\n<p>$| = 1;<\/p>\n\n\n\n<p># read the generated regex from disk<br>my $f='regex.txt';<br>open F,&quot;&lt;$f&quot; or die &quot;Cannot open $f: $!&quot;;<br>binmode F;<br>read F,my $regex,-s $f;<br>close F;<\/p>\n\n\n\n<p># group the pattern and escape the dot so that only a literal &quot;.sys&quot; extension matches<br>my $x=shift;<br>if ($x=~\/^(?:$regex)\\.sys$\/i)<br>{<br>   print &quot;$x matched\\n&quot;;<br>}<br>else<br>{<br>   print &quot;$x didn't match\\n&quot;;<br>}<\/p>\n<\/blockquote>\n\n\n\n<p>The final regex is 52624 bytes long, while the input data is 103317 bytes long (including newlines). That is a &#8216;compression rate&#8217; of about 51% (the regex is roughly half the size of the input), but debugging such a complicated regex pattern sounds like a heck of a job. It would seem that sometimes solving interesting yet kinda abstract threat hunting problems brings more confusion to the process than we anticipate&#8230; And getting fixated on using regexes to solve this kind of problem is actually a bigger problem in itself. Trie structures designed for multi-pattern search are far better suited to this sort of multi-pattern search\/comparison.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every once in a while we all contemplate solving interesting yet kinda abstract threat hunting problems. 
This post describes one of these&#8230; The problem: Given a relatively large number of strings, how do you write a regular expression that covers &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2025\/02\/22\/optimizing-the-regexes-or-not\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[79],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9874"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=9874"}],"version-history":[{"count":4,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9874\/revisions"}],"predecessor-version":[{"id":9878,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9874\/revisions\/9878"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=9874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=9874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=9874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}