Looking for the randomness in the most non-AI/ML way…

Here’s an old-school file name-based research… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

The software we install (the focus here is on Windows, as usual) creates a lot of files, and while many of them seem completely random and whimsical in nature, especially with regard to their file names, they do end up forming a corpus of sorts… Put differently, when bundled together, all these file names known to be created for legitimate purposes are great material for research.

For this post I collected 1.5M executable file names from Windows. They may not be the full set of file names ‘out there’, but it’s enough to play around with…

I then looked at the statistics of 2-, 3-, and 4-character-long infixes (ignoring any characters outside [a-z]).
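A minimal sketch of how such statistics could be gathered. This is not the script used for the post; it assumes that names are lowercased, that every character outside [a-z] is simply dropped before the sliding window is applied, and that the letters of the extension stay in. The function names are made up for illustration.

```python
import re
from collections import Counter

NON_LETTERS = re.compile(r"[^a-z]")


def normalize(file_name: str) -> str:
    """Lowercase the name and drop every character outside a-z (assumption)."""
    return NON_LETTERS.sub("", file_name.lower())


def infix_counts(file_names, n):
    """Count every n-character window across all normalized file names."""
    counts = Counter()
    for name in file_names:
        letters = normalize(name)
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


# Tiny illustrative sample; the real input would be the 1.5M-name list.
sample = ["FirefoxSetup63.exe", "setup.exe"]
counts = infix_counts(sample, 4)
for infix, hits in counts.most_common(3):
    print(infix, hits)
```

Dropping the non-letters before windowing means an infix can span what used to be a digit or a dot; splitting on those characters instead would be an equally defensible choice.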

The results are below:

  • How often 2-character long infixes appear in these 1.5M file names: filename_stats_2.txt – as you can see, not very useful…
  • How often 3-character long infixes appear in these 1.5M file names: filename_stats_3.txt – not very useful either…
  • How often 4-character long infixes appear in these 1.5M file names: filename_stats_4.txt – this is better… we definitely can cherry-pick a lot of 4-character long infixes that never appear in the set: filename_stats_4_non-existing.txt
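The ‘never appear’ list can be thought of as a set difference: all 26^4 = 456,976 possible 4-letter combinations minus the ones seen in the corpus. A rough sketch under the same normalization assumption (lowercase, non [a-z] characters dropped); the helper names are hypothetical:

```python
import re
from itertools import product
from string import ascii_lowercase


def seen_infixes(file_names, n=4):
    """Collect every n-letter infix that occurs in at least one file name."""
    seen = set()
    for name in file_names:
        letters = re.sub(r"[^a-z]", "", name.lower())
        seen.update(letters[i:i + n] for i in range(len(letters) - n + 1))
    return seen


def missing_infixes(file_names, n=4):
    """Return, sorted, every n-letter combination that never occurs."""
    every = {"".join(p) for p in product(ascii_lowercase, repeat=n)}
    return sorted(every - seen_infixes(file_names, n))


# With a one-name 'corpus' nearly everything is missing; the real list
# only becomes interesting with a large set of legitimate file names.
missing = missing_infixes(["setup.exe"])
print(len(missing))
```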

Using the latter, we can create regex sets:

Using these regex sets you may actually get better at finding randomly named files! You will also find a lot of FPs, of course, but now you have a set of regexes you can tune to your needs…
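To give a feel for how such a regex could be put together and used, here is a small sketch. The three infixes below are hypothetical placeholders, not entries taken from the published list, and matching against the normalized (letters-only, lowercase) name is an assumption.

```python
import re


def build_regex(infixes):
    """Compile one alternation that matches any of the given infixes."""
    infixes = sorted(set(infixes))
    if not infixes:
        # An empty alternation would match every name, so refuse it.
        raise ValueError("need at least one infix")
    return re.compile("(?:" + "|".join(re.escape(s) for s in infixes) + ")")


def looks_random(file_name, pattern):
    """True if the letters-only form of the name contains a 'never seen' infix."""
    letters = re.sub(r"[^a-z]", "", file_name.lower())
    return pattern.search(letters) is not None


never_seen = ["qxzj", "zqkv", "jqxw"]  # hypothetical placeholder values
pattern = build_regex(never_seen)

for name in ["aqxzjb.exe", "setup.exe"]:
    print(name, looks_random(name, pattern))
```

With a few hundred thousand alternatives a single alternation gets heavy; splitting the list into several smaller regexes, or using a plain set lookup over the name's own 4-grams, may be easier to tune.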

Can this be used in ML/AI research?

Yes, by all means, but the set of file names used as a base should be a lot larger and collected in a more meaningful way. One can argue that, f.ex., temporary files created by installers could be excluded. We could also exclude file names that follow certain patterns (f.ex. starting with a dollar ‘$’ or a tilde ‘~’, or conforming to the pattern ‘<GUID>.exe’). We could reduce the corpus by understanding versioned file names (f.ex. ‘FirefoxSetup63.exe’, ‘FirefoxSetup64.0.2.exe’, etc.). We could ignore non-English file names (‘Менеджер BIM Сервера GRAPHISOFT 19.exe’, ‘联系汉化作者.exe’, etc.) or artificially created file names that are used by many ‘download/update’ managers (‘ICReinstall_’ as in ‘ICReinstall_any_video_converter.exe’, ‘ICReinstall_driver identifier.exe’, etc.). Or… we could focus entirely on signed installers, or on files compiled within a certain timeframe (f.ex. the last decade).
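Some of these clean-up ideas are easy to prototype. The sketch below is only one possible reading of them: the GUID pattern (with optional braces), the use of an ASCII check as a crude stand-in for ‘English’, and the ‘#’ placeholder for version numbers are all assumptions made for illustration.

```python
import re

# <GUID>.exe, with or without surrounding braces (assumed format).
GUID_EXE = re.compile(
    r"^\{?[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\}?\.exe$", re.IGNORECASE
)
# Any run of digits, optionally dotted, e.g. '63' or '64.0.2'.
VERSION = re.compile(r"\d+(?:\.\d+)*")


def keep(name: str) -> bool:
    """Decide whether a file name stays in the research corpus."""
    if name.startswith(("$", "~")):
        return False
    if GUID_EXE.match(name):
        return False
    if name.lower().startswith("icreinstall_"):
        return False
    if not name.isascii():  # crude proxy for 'non-English'
        return False
    return True


def collapse_version(name: str) -> str:
    """Replace version-like digit runs in the stem with '#'."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = name, ""
    return VERSION.sub("#", stem) + dot + ext


names = [
    "FirefoxSetup63.exe",
    "FirefoxSetup64.0.2.exe",
    "~tmp.exe",
    "$inst.exe",
    "{12345678-1234-1234-1234-123456789abc}.exe",
    "ICReinstall_any_video_converter.exe",
    "联系汉化作者.exe",
]
reduced = sorted({collapse_version(n) for n in names if keep(n)})
print(reduced)
```

Signature and compile-time filters need the files themselves, not just their names, so they are left out here.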

As I said… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

And I will now answer the ‘why’:

– just to understand how hopeless the whole file name-matching idea is!

The world of partially downloaded files…

Update

Don’t read the old post. It’s a result of an experiment and the experiment failed 🙂

Thanks to _BradleyVX, who pointed out the hallucinations that sneaked into the list below.

Don’t do AI at home, kids!

If you need a list of temporary file names and extensions, have a look at this great list, or other file extension lists.

Old Post

Anytime you download a file via a browser, instant messenger, or other apps… it is first saved to a temporary file…

These temporary files are saved with some particular extensions:

For Browsers:

  • Chrome – .crdownload
  • Firefox – .download (hallucination! should be .part)
  • Opera – .opdownload
  • Safari – .part (hallucination! should be .download)
  • Microsoft Edge – .temp (hallucination! should be .partial)
  • Brave – .brave-download

For email clients (most are hallucinations):

  • Microsoft Outlook – .tmp
  • Mozilla Thunderbird – .part
  • Apple Mail – .download
  • Gmail Web – .tmp
  • Yahoo Mail – .tmp
  • ProtonMail – .tmp
  • Tutanota – .tmp

For Instant Messengers (most are hallucinations):

  • WhatsApp – .temp
  • Telegram – .temp
  • Discord – .dat
  • Signal – .tmp
  • Skype – .part
  • Facebook Messenger – .download
  • Google Chat – .temp

Are there any others?

Now, in the interest of full disclosure… I have not written much of this post. In the past I would manually download these programs, and then would spend hours testing their file-saving capabilities. For this post, I simply asked Google Bard. It’s terrifying and amazing at the same time.