{"id":8867,"date":"2023-11-25T00:27:57","date_gmt":"2023-11-25T00:27:57","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=8867"},"modified":"2023-11-25T00:27:57","modified_gmt":"2023-11-25T00:27:57","slug":"looking-for-the-randomness-in-the-most-non-ai-ml-way","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2023\/11\/25\/looking-for-the-randomness-in-the-most-non-ai-ml-way\/","title":{"rendered":"Looking for the randomness in the most non-AI\/ML way&#8230;"},"content":{"rendered":"\n<p>Here&#8217;s an old-school file name-based research&#8230; it is not game changing, it won&#8217;t bring any immediate solution, but it&#8217;s still worth doing today&#8230;<\/p>\n\n\n\n<p>The software we install (focus here is on Windows, as usual) creates a loooot of files, and while many of them seem to be completely random, whimsical in nature, especially with regard to their file names, they do end up forming a corpus of sorts&#8230; Put differently, when bundled together, all these file names known to be created for legitimate purposes are great material for research.<\/p>\n\n\n\n<p>For this post I collected 1.5M executable file names from Windows. 
It may not be the full set of file names &#8216;out there&#8217;, but it&#8217;s enough to play around with&#8230;<\/p>\n\n\n\n<p>I then looked at the statistics of 2-, 3- and 4-character long infixes (ignoring any non-[a-z] characters).<\/p>\n\n\n\n<p>The results are below:<\/p>\n\n\n\n<ul>\n<li>How often 2-character long infixes appear in these 1.5M file names: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_3.txt\">filename_stats_2.txt<\/a> &#8211; as you can see, not very useful&#8230;<\/li>\n\n\n\n<li>How often 3-character long infixes appear in these 1.5M file names: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_3.txt\">filename_stats_3.txt<\/a> &#8211; not very useful either&#8230;<\/li>\n\n\n\n<li>How often 4-character long infixes appear in these 1.5M file names: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_4.txt\">filename_stats_4.txt<\/a> &#8211; this is better&#8230; we definitely can cherry-pick a lot of 4-character long infixes that never appear in the set: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_4_non-existing.txt\">filename_stats_4_non-existing.txt<\/a><\/li>\n<\/ul>\n\n\n\n<p>Using the latter, we can create regex sets:<\/p>\n\n\n\n<ul>\n<li>1 leading character, 3 following: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_4_non-existing_regex1-3.txt\">filename_stats_4_non-existing_regex1-3.txt<\/a><\/li>\n\n\n\n<li>2 leading characters, 2 following: <a href=\"https:\/\/hexacorn.com\/d\/filename_stats_4_non-existing_regex2-2.txt\">filename_stats_4_non-existing_regex2-2.txt<\/a><\/li>\n<\/ul>\n\n\n\n<p>Using these regex sets you may actually get better at finding randomly named files! You will also find a lot of FPs, of course, but now you have a set of regexes you can tune to your needs&#8230;<\/p>\n\n\n\n<p>Can this be used in ML\/AI research?<\/p>\n\n\n\n<p>Yes, by all means, but the set of file names used as a base should be a loooot larger and collected in a more meaningful way. One can argue that f.ex. 
<a href=\"https:\/\/www.hexacorn.com\/blog\/2015\/01\/05\/when-you-are-a-temp-your-days-are-often-numbered-so-are-your-file-names-part-1\/\">temporary files created by installers<\/a> could be excluded, we could also exclude file names that follow certain patterns (f.ex. starting with a dollar &#8216;$&#8217;, tilde &#8216;~&#8217;, or file names conforming to a pattern &#8216;&lt;GUID>.exe&#8217;), we could reduce the corpus by understanding versioned file names (f.ex. &#8216;FirefoxSetup63.exe&#8217;, &#8216;FirefoxSetup64.0.2.exe&#8217;, etc.), we could ignore non-English file names (&#8216;\u041c\u0435\u043d\u0435\u0434\u0436\u0435\u0440 BIM \u0421\u0435\u0440\u0432\u0435\u0440\u0430 GRAPHISOFT 19.exe&#8217;, &#8216;\u8054\u7cfb\u6c49\u5316\u4f5c\u8005.exe&#8217;, etc.) or artificially created file names that are used by many &#8216;download\/update&#8217; managers (&#8216;ICReinstall_&#8217; as in &#8216;ICReinstall_any_video_converter.exe&#8217;, &#8216;ICReinstall_driver identifier.exe&#8217;, etc.), or &#8230; we could also focus entirely on signed installers, or on ones compiled within a certain timeframe (f.ex. 
last decade).<\/p>\n\n\n\n<p>As I said&#8230;  it is not game changing, it won&#8217;t bring any immediate solution, but it&#8217;s still worth doing today&#8230; <\/p>\n\n\n\n<p>And I will now answer the &#8216;why&#8217;:<\/p>\n\n\n\n<p>&#8211; just to understand how hopeless the whole file name-matching idea is!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s an old-school file name-based research&#8230; it is not game changing, it won&#8217;t bring any immediate solution, but it&#8217;s still worth doing today&#8230; The software we install (focus here is on Windows, as usual) creates a loooot of files, and &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2023\/11\/25\/looking-for-the-randomness-in-the-most-non-ai-ml-way\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[21,79,1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8867"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=8867"}],"version-history":[{"count":5,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8867\/revisions"}],"predecessor-version":[{"id":8896,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8867\/revisions\/8896"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=8867"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=8867"},{"taxonomy":"post_tag","embeddable":
true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=8867"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}