{"id":9164,"date":"2024-04-26T23:40:21","date_gmt":"2024-04-26T23:40:21","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=9164"},"modified":"2024-04-26T23:40:21","modified_gmt":"2024-04-26T23:40:21","slug":"a-license-metadata-to-kill-for","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2024\/04\/26\/a-license-metadata-to-kill-for\/","title":{"rendered":"A license (metadata) to kill (for)&#8230;"},"content":{"rendered":"\n<p>Many forensic artifacts can be looked at from many different angles. A few years ago I proposed a concept of <a href=\"https:\/\/www.hexacorn.com\/blog\/2015\/04\/11\/introducing-filighting-and-the-future-of-dfir-tools-part-2\/\" data-type=\"URL\" data-id=\"https:\/\/www.hexacorn.com\/blog\/2015\/04\/11\/introducing-filighting-and-the-future-of-dfir-tools-part-2\/\">filighting<\/a> that tried to solve a problem of finding unusual, orphaned and potentially malicious files dropped inside directories that contain files that DO NOT reference these orphaned files at all.<\/p>\n\n\n\n<p>I really hope that forensic analysis tools will evolve to add more features that will help to automate file system analysis based not only on a list of known hashes and\/or file extensions, but also paths, partial (relative) paths, file names, actual file types based on their content, and ideas that rely on more complex algorithms: using prebuilt artifacts collections, leveraging various correlations (ideas like filighting), and of course machine learning and AI.<\/p>\n\n\n\n<p>Today I want to explore one more angle of looking at file system artifacts &#8212; classes of file content. There are many file formats out there: executables, documents, configuration files, database files, and many other file types. The classification I am focusing on today though is slightly different &#8211; the format itself doesn&#8217;t interest me too much, but the function of the file does&#8230;<\/p>\n\n\n\n<p>My guinea pig will be a license file. 
It is the type of file that is all over the place, but no one reads it. And yes, removing such files from the examiner&#8217;s view (during file system analysis) may not add a lot of value, but the example is used here only to illustrate the idea. There are many other file classes like this that can be classified as noise to the examiners&#8217; eyes, and if we start clustering them together, who knows, maybe we will save some person-hours there&#8230;<\/p>\n\n\n\n<p>I asked myself the following question: <\/p>\n\n\n\n<p>&#8211; having a file system in front of me, how do I find all license files on it?<\/p>\n\n\n\n<p>There are at least a few approaches I can think of:<\/p>\n\n\n\n<ul>\n<li>use hashes of known license files,<\/li>\n\n\n\n<li>use file names typically used by license files,<\/li>\n\n\n\n<li>analyze the content of all files and look for content that resembles a license file.<\/li>\n<\/ul>\n\n\n\n<p>All of them have their own challenges:<\/p>\n\n\n\n<ul>\n<li>the first one needs a lot of prep work to collect good hashes, <\/li>\n\n\n\n<li>the second one is hard to do without some proper analysis of a clean sample set, and <\/li>\n\n\n\n<li>the third one is the most reliable, but it&#8217;s slow &amp; needs even more preparation because it has to take into account a few more aspects: localization issues (licenses in various languages), file encoding issues (Unicode variants, ASCII, MBCS), file formats (TXT, RTF, HTM(L), PDF, DOC(X), etc.), and, of course, performance (reading many files to analyze their content is expensive, plus not every file referencing GPL, LGPL, or GNU is a license file).<\/li>\n<\/ul>\n\n\n\n<p>I am going to focus here on the second one.<\/p>\n\n\n\n<p>Your typical license file is usually called <em>license<\/em>, <em>license.txt<\/em>, <em>eula.txt<\/em>, and in the case of Open Source, we often see files named like <em>gpl.txt<\/em>, <em>license.gpl.txt<\/em>, <em>lgpl.txt<\/em>, etc.<\/p>\n\n\n\n<p>When you start researching file naming 
a bit more, you will soon realize that there are a lot of variations. Many of the issues listed in the third point come into play as well, for example:<\/p>\n\n\n\n<ul>\n<li>file names can be localized, <\/li>\n\n\n\n<li>file extensions can be <em>.txt<\/em>, <em>.rtf<\/em>, <em>.htm(l)<\/em>, <em>.doc(x)<\/em>, <em>.pdf<\/em>, <em>.xml<\/em>,<\/li>\n\n\n\n<li>some of the file names have typos,<\/li>\n\n\n\n<li>many license file names use prefixes or suffixes that identify the licensed software, or the language or code page the license file is written in,<\/li>\n\n\n\n<li>some file names may use compressed-file naming, for example *.tx_ (in installation packages),<\/li>\n\n\n\n<li>some license files may be stored inside archives (including password-protected ones) or installers,<\/li>\n\n\n\n<li>some licenses are embedded inside compiled help files (<em>.hlp<\/em>, <em>.chm<\/em>),<\/li>\n\n\n\n<li>some programs may hide the licensing information in files named with various infixes: <em>copying<\/em>, <em>releasenotes<\/em>, <em>thirdparty<\/em>, <em>copyright<\/em>, and their variants,<\/li>\n\n\n\n<li>some may refer to the software edition (full or trial), for example <em>evaluation<\/em>,<\/li>\n\n\n\n<li>some files with a license in their name refer to the licensing process itself (getting keys, subscriptions, transferring licenses, etc.),<\/li>\n\n\n\n<li>finally, some file names may be available in an 8.3 DOS notation only.<\/li>\n<\/ul>\n\n\n\n<p>As usual, the more you look, the more complex the problem becomes.<\/p>\n\n\n\n<p>For this post I have compiled a large file containing possible license file names. You can download it <a href=\"https:\/\/hexacorn.com\/d\/license.txt\">here<\/a>.<\/p>\n\n\n\n<p>Will it make anybody&#8217;s life easier? <\/p>\n\n\n\n<p>I don&#8217;t know. <\/p>\n\n\n\n<p>What matters is that we learned a little bit more about how difficult the process of automated file system analysis is. 
What started as a trivial and frivolous idea ended up being a Don Quixotish attempt to formalize something that is impossible to tackle, even with a data-heavy approach&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many forensic artifacts can be looked at from many different angles. A few years ago I proposed a concept of filighting that tried to solve a problem of finding unusual, orphaned and potentially malicious files dropped inside directories that contain &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2024\/04\/26\/a-license-metadata-to-kill-for\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[19],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9164"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=9164"}],"version-history":[{"count":2,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9164\/revisions"}],"predecessor-version":[{"id":9166,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/9164\/revisions\/9166"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=9164"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=9164"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=9164"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}