How to become the best Malware Analyst E-V-E-R

Update

BIG thank you to all the reversers who provided the feedback so far; I have reviewed and updated the content based on this feedback, and hope this will be now more useful to all new guys; reversing community has a long history of being awesome and it shows. AGAIN, BIG THANK YOU!

btw. the title of this post is tongue-in-cheek 🙂

Old Post

There are a couple of reasons for me to write this post:

  • I have been doing software analysis for a long time – 20 years, at least
  • For many years I have been doing it wrong
  • For many years I had absolutely no clue I was doing it wrong
  • In the last few days a number of things happened
    • I see ‘malware analysts’ doing more and more Bat Deductions (this is from reading various posts from the security vendors, I won’t name them)
    • I have spoken to people who lied to me about their reverse engineering skills and I am trying to understand why + what stops them from actually learning the skills
    • I have publicly criticized GREM – not in a malicious way, but more as an observation that certifications != skills (apologies to Lenny Zeltser, who does a lot of the GREM work and whose course I might have sounded critical of; my point was merely that a few days of even the most detailed course won’t replace hours of … well… messing up your analysis and often going nowhere)
    • I was actually approached by a number of GREM-certified people to help them with some reverse engineering/explanation of what they do/see
    • Malware analysis is fun, but only if you know what you are doing; I was lucky to learn from many smart people, and at a time when the internet was not so prevalent, so I was recently asking myself a question – what steps would I take today, knowing what I know, to learn RCE skills in the most efficient way possible?

Let me start with an old boring anecdote. One of my first experiences as a reverse engineer was looking at the code of demoscene productions. Picking that as a target was an incredibly stupid idea, but I was young and, well, stupid. And yes, I didn’t know better. What I remember is that I got obsessed with the demoscene tricks and I really wanted to understand how they were done – in this case, how a particular full-screen scrolling effect was created.

I eventually learned how they pulled it off (that was a special mode of the VGA card – a variation of mode X IIRC – that allowed writing 16 pixels with one stosd operation, if you must know), but not from my reverse engineering efforts. Those ended after a few days, when I realized that I had spent them staring at heavily packed code and data which I had disassembled with a pirated disassembler I found online (FWIW it was not IDA). The code was very misleading to me, because the disassembler added a lot of ‘useful’ comments to it; these comments sent me down the rabbit hole pretty quickly, e.g. they hinted that some of the code was using DMA channels, so I hypothesized that perhaps the quick data transfer I was observing on the screen was due to DMA usage. Oh boy… So… Wrong. Wrong. Wrong. It was that special VGA mode, cleverly used. What I was looking at was nonsensical gibberish that looked like code.

I wasted a lot of time.

I have repeated the very same mistake MANY times. But… there is no other way to learn reverse engineering, a core skill for malware analysts. You just need to look at a lot of listings and files/databases with disassembled or decompiled code. Over the years I was cracking software, adding immortality to games, adding shortcuts to pass from one level to another, analyzing malware and pretty much always finding something new to look at every day. Yes, no matter how good you are, how many hours you have spent with IDA, OllyDbg, Xdbg, and WinDbg, you are going to come across new stuff all the time. That’s actually the most attractive part of being a reverse engineer. You break stuff apart, and this stuff can be a PE file, a Word document, an exploit inside Adobe Flash, a COM trickery, an API trickery, a new functionality abused for malicious purposes, a closed-source project that can be only found on ATMs, a Stuxnet-like nested doll, a clever PHP backdoor injected inside a legitimate HTML page, a JavaScript snippet activated when the user presses the Submit button, an AutoCAD program, a jailbreak for iPhone, a PowerShell snippet, a VBS badness, an HTA deliverable, a good old Office macro, a tricky URL, a tricky protocol, a side-loading attack, a supply chain attack, an insider threat modifying the affiliate ID, an old game that you want to port to a new platform, a vaporware where you want to add some missing functionality or bypass some annoying bug or disable a message box … Basically, whatever can be changed for good or bad, abused or misused, and sometimes just understood, for the sole purpose of ‘knowing it better’, and all of it on a software level, falls into this ‘software/malware analysis’ category… and discovering it, analyzing it, breaking it apart and understanding the motivation of the guys behind the code is a source of great intellectual fun, and fulfillment, even if most of the time, we are half-guessing…

So… without further ado… this is my list of rules… if you want to become a good malware analyst, I’d suggest following some of them. Don’t be shy. Ask for help. Describe what you did. Where you got stuck. There is always someone who will be happy to help. And again, and most importantly, remember that most of the reversing work is never finished. Yup. We start, we abandon it pretty quickly. We come back to it. I have some projects I opened 10 years ago that I sometimes come back to when I am in the mood. In many cases further analysis is only worth doing if you are paid proper money. And the latter is what drives what I like to call ROI-driven malware analysis (or, in general terms, software analysis that has an actual end & deliverable, and that keeps your man hours low, the client satisfied, and everyone aware that we are just scratching the surface, but in an educated way).

So… again, without further ado:

Rule #0:

  • Waste a lot of time
  • There is absolutely no way to take a shortcut in reverse engineering
  • Yes, I am sorry, but it’s like with everything else… the 10,000-hour rule holds true. BUT, like with everything in life, you may try to beat this number down!

Rule #1:

  • Don’t trust your tools; they will mislead you, they will betray you
    • Note: on a day-to-day basis you can trust tools most of the time, but only if you made an effort in the past to understand what it is that they are showing you, so that you can spot, or at least suspect, bugs when you see something unexpected or wrong
  • In particular, don’t trust any automation (automatic comments, sandboxing, IOC extraction, assessment, etc.); use these ONLY after you learn manual analysis
  • The automation (emphasis ‘at the beginning of learning’) is for the lazy; you can only use it when you know the caveats
    • To clarify, if you know how a sandbox works, and if you are just after behavioral stuff (IOCs), then sandbox analysis is often the best way to go; the point I am making is that if you don’t understand the caveats of automatic analysis, you will not be able to fully trust the sandbox output (or the output of any other tool, really); also, depending on the case, you may want to confirm the sandbox findings manually (ok, for the caveats: anti-* routines, various execution paths depending on time, command line arguments, detected OS, presence of the internet, C2 address still existing, presence of targeted software, etc. – yup, you may get more than one report from the sandbox for a given sample, depending on the circumstances)

Rule #2:

  • Data is code, code is data
  • If you don’t know what you are looking at, assume you are looking at data.
  • Any attempts to disassemble it are pointless unless you know the context, or have a hint it is indeed a piece of code.
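
To make the ‘data is code, code is data’ point concrete, here is a minimal sketch. In 32-bit x86, every byte from 0x40 to 0x5F is a complete one-byte instruction (inc, dec, push or pop on a register), and the very same range is ASCII ‘@’, the uppercase letters and a few punctuation characters. The helper name decode_one_byte_ops is made up for this illustration; it is a toy that only knows this one opcode range, not a real disassembler.

```python
# Illustration only: a toy decoder for the one-byte x86 (32-bit mode)
# opcodes 0x40-0x5F. Everything outside that range is reported as data.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
OPS = ["inc", "dec", "push", "pop"]  # 0x40-47, 0x48-4F, 0x50-57, 0x58-5F


def decode_one_byte_ops(data: bytes) -> list:
    """Decode bytes in 0x40-0x5F as instructions; mark the rest as raw bytes."""
    out = []
    for b in data:
        if 0x40 <= b <= 0x5F:
            # The opcode group picks the operation, the low 3 bits pick the register.
            out.append(f"{OPS[(b - 0x40) // 8]} {REGS[b % 8]}")
        else:
            out.append(f"db 0x{b:02x}")
    return out


if __name__ == "__main__":
    blob = b"PYTHON"  # plain text... or six valid instructions?
    print(blob.decode("ascii"))
    for line in decode_one_byte_ops(blob):
        print(line)
```

The word PYTHON ‘disassembles’ cleanly, yet it is obviously not code. Without context, a clean-looking listing proves nothing.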

Rule #3:

  • There is no such thing as static analysis
  • Let me explain…
  • You calculated hashes, you ‘extracted’ strings, carved embedded files, you googled it, you checked VirusTotal, reviewed the sandbox report and you think you have done well.
  • Maybe you even used some PE Viewer/Editor….
  • Well, you have done what any automated system can do.
  • Try harder.
  • What many people refer to as static analysis*** is primarily FILE Analysis, not PROGRAM analysis
  • This is a TREMENDOUS difference
  • You can craft a file to look like Notepad yet still deliver a malicious payload
  • Files lie, code doesn’t
  • Seriously… whatever you see in strings, PE tools, it’s just a wrapper; always assume it’s a wrapper that will fool you; if you trust it, there are always chances you will be taken for a ride
  • ***Now, I need to explain one thing – a number of respected reversers pointed out that they do lots of static code analysis; I think this is the key here – they mention ‘code’, and the reason I am talking about the non-existence of static analysis is that there is that TREMENDOUS difference between analyzing a file and analyzing the code that this file stores; static code analysis (SCA) exists of course, and is actually a big industry (ever heard of Fortify?); so, anyone reading _code_ w/o executing it is doing SCA… as a reverser you will do lots of static code analysis
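
For reference, this is roughly all that the ‘file analysis’ described above amounts to. It is a hedged sketch: the function name file_triage and the sample bytes are invented for illustration, and the string extraction is a simplified take on what a ‘strings’ tool does (printable ASCII runs only).

```python
import hashlib
import re


def file_triage(data: bytes, min_len: int = 4) -> dict:
    """Hash the bytes and pull printable ASCII runs: what the file claims, not what the code does."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "strings": [s.decode("ascii") for s in re.findall(pattern, data)],
    }


if __name__ == "__main__":
    # Hypothetical bytes: anyone can plant a friendly-looking string in a file.
    sample = b"\x00\x01notepad.exe\x00\xff\xfehttp://example.invalid/x\x00ab\x00"
    report = file_triage(sample)
    print(report["sha256"])
    print(report["strings"])
```

Twenty lines, no understanding required, and every string in the output could have been put there on purpose. That is the wrapper; the program is still unread.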

Rule #4:

  • You must look at the code and must think like the coder who wrote it. You are building the image in your head, code blocks take some shape, shapes create a pattern, they align or not, you grasp it, or you don’t. You are basically writing or rewriting this code in your head.
  • When you look at the code, you emulate it, you guess, you connect to it (from the basic instructions to code blocks)
    • As you do analysis e.g. in IDA, please make it a rule to label everything you see in the code as you walk through it. It saves a lot of time to know you have visited a certain path, routine, etc. Even if you don’t know what it does, make a guess and name it in an appropriate way e.g. ‘unknown_possibly_reading_file_xyz’, or ‘nothing1’ – it’s always better than ‘sub_xyz’
    • You will be surprised how these labeled functions ‘add up’ to your understanding of the code e.g. naming the memory functions, both for allocation and for disposal, often ‘fills in’ the listings of the many functions that rely on them quite well
  • You may not like it, but if you are not a programmer, you can’t reverse engineer well.
  • A programmer runs the program or idea in their head, many times, and this is not called static programming, oh no… it’s a daunting technical and algorithmic challenge. It’s an obsession. Reverse engineering is even more painful as you need to go into that programmer’s head and understand _why_ they did certain things the way they did + you have to deal with what compilers produce. And it’s a code stripped of lots of information. And sometimes purposefully obfuscated.  Most of the time this is a painful exercise. This is why when you ‘get it’, you get that awesome ‘I cracked the puzzle’ feeling. I-t  i-s  p-a-i-n-f-u-l, but there is a reward.

Rule #5:

  • Do not use tools unless you used pen and paper at least once.
  • Read PE file specification (same applies to other file formats, but let’s start with PE)
  • Look at a random PE file (calc.exe, or whatever), print out the first page of the hex dump, and use a pen to highlight structures you _manually_ recognize based on the PE format documentation. Don’t be lazy. Do it. Then use tools to confirm. If you make a mistake, analyze where you made that mistake. Step back, re-assess. Ask others.
  • Don’t watch videos on reverse engineering unless you know the basics. Videos are for the lazy. You need to get your hands dirty. Asap.
  • I will repeat it ad nauseam. Don’t rely on tools from day one. It will cost you long-term. Use a pen. Write stuff down. Make mistakes. Correct them. Nowadays I often use an editor, but that’s because we have virtual machines and you can make mistakes inside the VMs that won’t crash the host system; in the past, when a mistake meant a system crash or hang, I had a lot of paper around me with notes, and lots of hexadecimal stuff written on it
  • If you have done the manual analysis at least once, you will be:
    • able to understand what you see or look it up
    • able to spot tricks (in a file format, in code)
    • able to write tools
      YAY! if you can write tools you will be always ahead of the curve… the best example is given here by @hasherezade – she is producing lots of Proof Of Concept code and she often writes it for TESTING purposes (to see if she understands the technique properly and to test other tools)
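
Once you have done the pen-and-paper pass, a first ‘tool of your own’ can be as small as the sketch below. It reads the same fields you would have highlighted by hand: the MZ signature, e_lfanew at offset 0x3C, the PE signature, the COFF file header and the optional header magic. The names parse_pe_headers and tiny_sample are made up for this example, and the sample is a synthetic header stub built in memory, not a real executable. Use it to check your manual notes, not to replace them.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Read the header fields you would highlight on the first page of a hex dump."""
    if data[:2] != b"MZ":
        raise ValueError("no MZ signature at offset 0")
    # e_lfanew: 4-byte little-endian file offset of the PE signature, stored at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("no PE signature at e_lfanew")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)
    magic = None
    if opt_size >= 2:
        # Optional header magic: 0x10B = PE32, 0x20B = PE32+.
        (magic,) = struct.unpack_from("<H", data, e_lfanew + 24)
    return {
        "e_lfanew": e_lfanew,
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
        "optional_magic": magic,
    }


def tiny_sample() -> bytes:
    """Build a synthetic header stub (hypothetical values) to exercise the parser."""
    dos = bytearray(0x40)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE header right after the DOS header
    coff = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)  # i386, 3 sections
    optional = struct.pack("<H", 0x10B) + bytes(0xE0 - 2)  # PE32 magic + zero padding
    return bytes(dos) + b"PE\x00\x00" + coff + optional


if __name__ == "__main__":
    for name, value in parse_pe_headers(tiny_sample()).items():
        print(f"{name}: {value:#x}")
```

To run it against a real file, read the file in binary mode and pass the bytes in; then compare every value with what you wrote down on paper, and with what your favourite PE viewer shows.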

Rule #6:

  • The best reversers are not necessarily malware reversers
  • The best reversers often started by poking around in other people’s software
  • In the past they didn’t have tools that are available today; they spent hours, days, weeks, months staring at some code, trying to break it
  • Some of them (e.g. Rolf Rolles) can do what some refer to as ‘Zen reversing’ – they look at the code, and can instantly recognize what it does; I actually have a ‘personal’ list of people that I almost religiously respect for the reversing magic they can do (e.g. vulnerability researchers, but also coders and reversers who often understand advanced math and apply it to build better tools)
  • Like anything in the world, it requires lots of hours spent on training; it’s gonna hurt

And on a practical note…

Reverse Code Engineering (RCE) is getting really popular and is really needed. It is helpful in malware analysis, debugging your own apps, solving crackmes, fixing bugs in abandonware, and it can be handy in localization. It makes you a better programmer as well. Of course, it also helps to steal and plagiarize code, bypass software protections, discover vulnerabilities, write shellcodes and jailbreaks, reproduce stuxnets, rootkits and make people’s lives miserable and/or interesting in many other creative ways. All of it is either coding, or reading other people’s code, and repurposing it. Again, you better be a programmer to do reversing efficiently. The code blocks in malware are the exact code blocks used in legitimate software. The old posts on e.g. CodeGuru and CodeProject are often leveraged in malware creations. Often, pretty much in a copypasta way. These posts are often very detailed. Read stuff that is still there. StackOverflow can wait.

So, back to the original question: how to learn RCE and/or malware analysis quickly?

There are many answers online and they vary a lot. Many people suggest books, tutorials, ebooks… on IDA, on assembly, on Reverse Engineering in general, some suggest doing courses and certificates (including GREM), others watching youtube videos and some advise new adepts of RCE to simply stop wasting their time.

I would like to provide you with my own version making it as minimalistic and practical at the same time as possible. Yes, it is not full, yes it is far from being perfect, yes you are not going to analyze rootkits just yet (and yes – it is Windows oriented).

But…

If you read the stuff I point to and really focus on spending a few hours/week on actually making tons of mistakes, plus avoid claiming victories easily achieved by using automation and tools developed by others, you are going to get there before you even realize:

  • Decide if it is for you; seriously… reversing is a terribly mundane process… it’s actually like forensics, and probably no one will tell you how boring it can be, but it’s HOURS and DAYS spent on staring at the same screen, browsing items and trying to understand it… sometimes you hit a jackpot and crack it in 5mins, sometimes even 2 weeks of intensive reversing won’t bring ANYTHING useful… so… you have been warned
  • You need to learn about programming in general and actually start coding. I repeat: you can’t reverse engineer if you don’t program. How can you understand what you see if you don’t know what a loop is, what recursion is, or what statically or dynamically linked code is? It is as simple as that. If you programmed before, move on to the next point. If you didn’t – don’t buy heavy C++, C#, Java, Python reference books just yet. Buy a book with silly, but practical examples of simple programs explaining the fundamental architecture of Windows. Try this classic book from Charles Petzold. Read it inside out, and take your time to actually _type_ the code listings. Yes, you heard that right. It’s mundane, it’s error-prone, yet this is how learning to program works. The only way is through a keyboard, so get ready to invest quite a lot of time – you will be fixing typos and compiler errors, getting completely unpredictable results, and will encounter a lot of pain and stress as you go along.
    In any case, DO NOT START WITH JAVA OR .NET. Sweat a bit with C, even scripts in VBS, bash, and powershell.
  • Read other people’s code. Skim through it, and if you find something interesting, read more thoroughly and ‘get it’. Again, no need to understand everything, but if you want to understand, google around until you do. No, do not start reading Linux code just yet. Start with short code snippets on educational web sites. Look at the source code of some small, but interesting and potentially malware-related projects. Just see how people do stuff, try to figure it out. This is actually the most crucial part of reverse engineering – it is not only about reading the code, browsing through listings, spotting known APIs, running ‘strings’ on a file, or playing around with ‘Procmon’, ‘Dependency Walker’ and ‘GMER’. It is trying to wear the author’s shoes for a moment. If you can figure out his or her thought process that led to this and that implementation, you will be making huge progress very quickly. Bonus: when you notice some code blocks, they will stick with you, so next time you see similar code, you can make an educated guess. Didn’t I tell you it’s all about an educated guess? Yes… ROI is important. You don’t want to disassemble the whole 1.5 MB binary.
  • Learn a small subset of x86 assembly language.
  • Choose easy targets first. Look at compiled sources from Iczelion projects, compare them to the source code, look at programs for Windows XP (things are not as complicated on this system as they are on Vista/W7+), look at old software; nowadays, programs are pretty complicated from a reverse engineering perspective (lots of RAD tools that use lots of wrappers that hide the actual program’s code, and analysing it manually is a pain in the butt)
  • Refer to MSDN often. Anytime you come across a new function name, either google or MSDN it. Make sure you read the concepts associated with the function (usually functions are associated with some ‘high level’ topic e.g. CreateFile with File Management). Seriously. Read the full description, don’t be lazy, and it’s okay if you don’t ‘get’ everything in one go. Bits you pick up as you read stuff will provide you with invaluable insight in the future.
  • Only now start googling for tutorials on how to reverse/crack/debug applications or buy books that will expand your knowledge. Yes, reversing requires a solid foundation from many aspects of IT; if you don’t know these basics, you will continue to be a tool user and no youtube video or book on IDA can help you here…
  • And the good news… read about forensics. When I started it was almost a non-existing branch of IT (Sec); now it provides a crazy amount of information with regards to artifacts and what is ‘good to know’ about the systems you look at. It classifies and solidifies knowledge about what it is that malware does and what parts of the system it affects. Instead of knowing everything, you can focus on areas that are the most exposed, malware-wise. The Art of Memory Forensics is gold. Read it.
  • The Art of Memory Forensics will bring you closer to the OS internals; the OS internals and the system architecture are a must-understand bit when you do malware analysis; you need to learn about objects, files, Registry, but also mutexes, semaphores, memory layout, the difference between processes, threads and fibers, process environment, process environment block, structured and vectored exceptions, and so on and so forth… Read the Windows Internals book
  • When you get a bit more experienced, read through the corkami repo… Ange is one of the best reversers I know personally and he is the magician behind many tricks affecting a huge number of file formats; apart from Rolf Rolles he is one of my ‘zen reversers’ that I follow pretty much religiously
  • So.. yeah… join social media, groups and just start following reversers… there are lots of very good reversers on Twitter – find them!
  • Get used to the fact you will be re-learning reverse engineering often. Your tricks and your tools expire. New file formats, modifications to existing file formats, new programming frameworks, new obfuscators, new tools; it’s new stuff all the time (okay, maybe with the exception of macros in Office 😉)

So… How to become the best Malware Analyst E-V-E-R?

You can’t cheat here. You need to do hours. Many hours.

btw if you are interested in SOC basics you may also try How to become the best SOC Analyst E-V-E-R