How would you analyse 200,000 executables?

Warning: This is a bit of a long one.

A few years ago while I was an Incident Responder, a supervisor asked me the question:

If you were given a few hundred executables and you had to determine which are malicious, how would you do it?

Well, a few weeks ago, I stumbled across a dataset from Michael Lester at Practical Security Analytics which contained around 200,000 executables. Roughly 110,000 were malicious and 80,000 benign. I thought this would be a fun challenge to test what I’ve learnt in Machine Learning over the last two months. For this challenge, I set a few constraints:

  • No using information provided in the dataset, except the label (malicious or benign)

  • No third party providers (e.g. Virus Total).

  • Only static analysis (No Cuckoo or other sand boxes).

  • No more than a week's worth of work (I’ve got too many other projects I want to work on).

  • Binary analysis must be relatively quick (No point if it takes a minute per binary).

I also wanted to use this project to expand my knowledge of Machine Learning. It’s all fun and games to use the same model for every project, but I feel like that won’t teach you the finer intricacies of machine learning.

So, join me through the process of teaching a Machine Learning algorithm how to detect malicious executables! If you think I should have done anything differently, leave a comment or let me know on twitter!


Collecting the Data

With over 100,000 malicious samples, the last thing I wanted to do was infect my Windows desktop. To prevent this, I moved the binaries into an Ubuntu environment where I would write a python script to extract all the relevant data from the binaries.

I decided to extract the following data:

  • File Header / Magic

  • File Size

  • File Entropy

  • List of Imports

  • List of Exports

Note: I would have also liked to grab the cert info and the file strings, but I ruled them out of scope due to how long they were taking to process.

From the PE Header, I also grabbed:

  • Size of Code

  • Size of Image

  • Size of Stack Reserve

  • Size of Stack Commit

  • Size of Heap Reserve

  • Size of Heap Commit

  • Section Names

  • Sections Raw Size / Virtual Size

  • Whether each section is marked as Code, Executable and Writable

The big problem I encountered was that the whole process was slow, probably because I was pulling some unnecessary features: only about 15 files were being scanned a minute. If I’d left the script like that, it would have taken at least 9 days to complete.

So, I built the script around two queues to try and combat the speed:

  • The first queue contained a list of all of the files and whether or not they were malicious.

  • The second queue contained the processed results that needed to be written to the csv.

This allows us to have multiple threads processing files, while having a single thread write the results to a csv.

So, I needed to start by populating the list of files. I decided to go with a supervised model for this project, so the first thing I did was iterate through the samples.csv provided within the dataset to pull out the filename and label for each file. This tells us whether each file is malicious or benign, so while we’re training our model it’ll learn which files are which. I discarded all other information within the file to keep in the spirit of the challenge.

from queue import Queue

# Queue of files to process and queue of processed results to write out.
sampleQueue = Queue()
writeQueue = Queue()

with open('/home/dev/data/pe-machine-learning-dataset/samples.csv') as f:
    for line in f:
        try:
            # Split the csv, and keep the filename (index) and a binary label
            l = line.split(',')
            sampleQueue.put([l[0].strip('"'), 1 if 'Blacklist' in l[6] else 0])
        except:
            continue # The header line breaks the parsing

Each thread would then take a file off the queue, collect all the relevant information and put the results onto the next queue. I won’t go into the full code for how I process each file, but you can see the code on GitHub.
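For a rough idea of what parse_pe returns, here’s a minimal sketch built on the pefile library. This is my own reconstruction, not the author’s exact code, which lives on GitHub.

import pefile

def parse_pe(filepath):
    '''Sketch of a PE parser returning the fields used below.'''
    pe = pefile.PE(filepath)

    # Imported and exported function names (ordinal-only entries skipped).
    imports, exports = [], []
    if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
        imports = [imp.name.decode(errors='ignore')
                   for entry in pe.DIRECTORY_ENTRY_IMPORT
                   for imp in entry.imports if imp.name]
    if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT'):
        exports = [exp.name.decode(errors='ignore')
                   for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols if exp.name]

    # Per-section details: name, raw/virtual sizes and characteristic flags.
    sections = [[
        s.Name.decode(errors='ignore').rstrip('\x00'),
        s.SizeOfRawData,
        s.Misc_VirtualSize,
        bool(s.Characteristics & 0x00000020),  # IMAGE_SCN_CNT_CODE
        bool(s.Characteristics & 0x20000000),  # IMAGE_SCN_MEM_EXECUTE
        bool(s.Characteristics & 0x80000000),  # IMAGE_SCN_MEM_WRITE
    ] for s in pe.sections]

    oh = pe.OPTIONAL_HEADER
    return (imports, exports, pe.FILE_HEADER.NumberOfSections,
            oh.SizeOfCode, oh.SizeOfImage,
            oh.SizeOfStackReserve, oh.SizeOfStackCommit,
            oh.SizeOfHeapReserve, oh.SizeOfHeapCommit, sections)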

# Define what each thread will do. 
def ThreadJob(sampleQueue, writeQueue):
    while not sampleQueue.empty():
        # Take a sample off the queue.
        sample = sampleQueue.get()

        # Get a bunch of variables from the PE. 
        imports, exports, NumberOfSections, SizeOfCode, \
        SizeOfImage, SizeOfStackReserve, SizeOfStackCommit, \
        SizeOfHeapReserve, SizeOfHeapCommit, sections = parse_pe(path+sample[0])

        # Put the processed results onto the write queue. 
        writeQueue.put([
            int(sample[0]),                     # File name/index
            sample[1],                          # Label 
            magic.from_file(path+sample[0]),    # Magic/File Header
            os.stat(path+sample[0]).st_size,    # File Size
            entropy(path+sample[0]),            # Shannon Entropy
            imports,                            # File imports
            exports,                            # File exports
            SizeOfCode,                         # PE Size of Code
            SizeOfImage,                        # PE Size of Image
            SizeOfStackReserve,
            SizeOfStackCommit,
            SizeOfHeapReserve,
            SizeOfHeapCommit,
            NumberOfSections,
            sections                            # List of section details.
        ])

The output thread then takes entries off the writeQueue and dumps them into the csv. I decided to write to a separate file so I could work on the rest of the project while the files were still being processed.

# Start the processing threads
for i in range(numThreads):
    worker = Thread(target=ThreadJob, args=(sampleQueue, writeQueue))
    worker.daemon = True  # daemon threads won't block the program from exiting
    worker.start()

# Open the csv. 
with open('samples.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    
    # Write the header.
    writer.writerow([
        'index','label','file_type','file_size','file_entropy',
        'imports','exports','size_of_image','size_of_code',
        'size_of_stack_reserve','size_of_stack_commit','size_of_heap_reserve',
        'size_of_heap_commit','number_of_sections','sections',        
    ])

    # Keep looping, writing results as they arrive. get() blocks until an
    # entry is available, which avoids busy-waiting on an empty queue.
    while True:
        writer.writerow(writeQueue.get())

Visualising the Data

Visualising the data before you start working with it can help you determine which features are going to be useful versus which might actually hurt your model. It can also be a really useful way to explain your dataset to people, so here we go!

Note: I’m only going to go through a couple of features so you can get an understanding of the process, otherwise this blog would drag on way too long.

Entropy

Entropy provides a scale for how random the contents of a file are, from 0 (not random at all) to 8 (completely random). Higher entropy usually indicates that a file is compressed or encrypted. This is not inherently a malicious trait, but as seen in the figure below, it is more common among the malicious files than the benign files provided.
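The entropy() helper used in the collection script isn’t shown above; a minimal Shannon entropy calculation over a file’s byte histogram might look like this (a sketch, not necessarily the exact implementation used):

import math
from collections import Counter

def entropy(filepath):
    '''Shannon entropy of a file's bytes, on a 0-8 scale.'''
    with open(filepath, 'rb') as f:
        data = f.read()
    if not data:
        return 0.0
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data))
                for c in counts.values())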

File Header

The file header represents the first several bytes of a file, which can be used to determine what type of file it is. In this case all the binaries are PE executables, but as you can see in the figure below, there is a huge variety of file headers and not many distinct headers that are used regularly.

As we can see from this initial view, a one hot encoding or embedding approach to this column would not be very useful. There’s just too much variety and not enough density. So what else can we do with it?

I decided to go with a bag of words approach. I split each of the file signatures on spaces, and collected a count for each word that was seen and how many times it was seen in each label. Pandas was a real pain in the butt with this step, so instead I converted the lists into a dictionary for good and bad, then converted it into a Pandas DataFrame.

word_count = {'good': {}, 'bad': {}}
file_type = df.groupby(['file_type', 'label']).size().reset_index(name='count')
for _, row in file_type.iterrows():
    for word in row['file_type'].split():
        if row['label'] == 0:
            if word in word_count['good']:
                word_count['good'][word] += row['count']
            else:
                word_count['good'][word] = row['count']
        else:
            if word in word_count['bad']:
                word_count['bad'][word] += row['count']
            else:
                word_count['bad'][word] = row['count']

words = pd.DataFrame(word_count)
words.head()

The resulting DataFrame lets us see the frequency of words between malicious and benign binaries. For example, if I search for the word ‘UPX’ (a common compression library), you can see a huge increase in malicious binaries using this word compared to benign.
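For example, a quick lookup of packer-related rows in that DataFrame might look like this (a hypothetical query against the words frame built above):

# Rows whose word mentions UPX, a common executable packer.
words[words.index.str.contains('UPX', case=False, na=False)]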

File Size

File size was a bit of a surprise for me, because I expected a larger difference between the typical sizes of malicious and benign files. As you can see in the figure below, there was no real difference: malicious files were spread fairly evenly across the size range, while benign files had the occasional peak.

Imports / Exports

For each file, I grabbed a list of all the functions that the executable imports and exports. Those values were returned as lists and written to the csv. When the data is loaded back in, Pandas interprets each list as a string, so we need to convert it back to a list.

# Strip wouldn't remove ' or whitespace for some reason. 
df['imports'] = df['imports'].apply(lambda i: i.strip('[]').replace("'", "").replace(" ", "").split(','))
df['exports'] = df['exports'].apply(lambda i: i.strip('[]').replace("'", "").replace(" ", "").split(','))

Looking at the imports, from a sample of ~93,000 files:

  • There were only ~34,000 unique import lists.

  • The most common import list was just [_CorDllMain].

  • The largest import list is 24,654 imports long.

  • The smallest is 0.

count             92898
unique            33751
top       [_CorDllMain]
freq               9395
Name: imports, dtype: object
Maximum length of list:  24654
Minimum length of list: 0

The exports column turned out to be completely empty for every sample, so we can drop it from the table.
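The summary above comes straight from Pandas; roughly, checks along these lines produce it (a sketch against the df loaded earlier):

# Summary statistics for the imports column.
print(df['imports'].describe())
print('Maximum length of list: ', df['imports'].apply(len).max())
print('Minimum length of list:', df['imports'].apply(len).min())

# Every exports list is empty, so the column carries no signal.
df = df.drop(columns=['exports'])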


Data Preparation

‘Put shit in, and you’ll get shit out!’

Data Preparation is potentially one of the most important parts of any ML model. It’s not enough to just feed the data to the model; you need to shape and prepare it in a way that lets the model get the most out of it. This includes:

  • Turning Strings into Numbers: ML models need every feature to be numeric!

  • Normalizing Values: The smaller the numbers and the smaller the gaps between values, the easier it is for the model to train. If the ranges are too large, the big values dominate training rather than letting the weights learn evenly.

  • Determining Categorical vs Numeric Data: With everything being numeric, it’s important to state whether the distance between values is meaningful or whether each value is a distinct category. (More details in my last blog).

Feature Preparation

During our visualisation phase, we already noted some features that we can remove. There are also quite a few features that aren’t particularly useful on their own, but when combined with other features they can give a clear indication of a binary’s intent.

An example of this: if you subtract a section’s raw size from its virtual size and a large value is left over, that can indicate file compression. With this in mind, I created three new features:

  • Size Difference: Size of Image - File Size

  • Stack Difference: Size of Stack Reserve - Size of Stack Commit

  • Heap Difference: Size of Heap Reserve - Size of Heap Commit

I couldn’t find a way to keep file type and imports within the same model; every method I saw for handling strings required its own model. So I stripped unwanted characters from file type and imports and joined each into a single space-separated string, since the model I use for them splits strings on spaces.

df = pd.read_csv('D:\\malware\\output.csv')

df['size_dif'] = df['size_of_image'] - df['file_size']
df['stack_dif'] = df['size_of_stack_reserve'] - df['size_of_stack_commit']
df['heap_dif'] = df['size_of_heap_reserve'] - df['size_of_heap_commit']

df['imports'] = df['imports'].apply(lambda i: ' '.join(i.strip('[]').replace("'", "").replace(" ", "").split(',')))

Sections were a particularly difficult feature. While they were a list of lists, similar to file type and imports, they were not limited to strings: each section contained two integer values and three binary flags. I cleaned up each list, removing unwanted characters, then split them out into their own DataFrame.

I won’t post the full code here (see my GitHub, plus the rough sketch after this list), but essentially I:

  1. Interpreted the string list as a literal list

  2. Iterated through the lists, getting unique section names and counting their occurrences

  3. Copied the top 40 section names into a new dataframe, filling empty values with 0.
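A minimal sketch of those three steps (the column naming here is my own assumption; the real code on GitHub differs):

import ast
from collections import Counter

import pandas as pd

# 1. Interpret the string representation as an actual list of lists.
df['sections'] = df['sections'].apply(ast.literal_eval)

# 2. Count how often each section name appears across all samples.
name_counts = Counter(
    section[0] for sections in df['sections'] for section in sections
)
top_names = [name for name, _ in name_counts.most_common(40)]

# 3. Build one row per sample with five columns per top-40 section name,
#    filling missing sections with 0.
rows = []
for sections in df['sections']:
    by_name = {s[0]: s[1:] for s in sections}
    row = {}
    for name in top_names:
        raw, virtual, code, execute, write = by_name.get(name, (0, 0, 0, 0, 0))
        row.update({
            f'{name}_raw': raw, f'{name}_virtual': virtual,
            f'{name}_code': code, f'{name}_execute': execute,
            f'{name}_write': write,
        })
    rows.append(row)
section_df = pd.DataFrame(rows).fillna(0)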

This resulted in a DataFrame with 200 columns (the top forty section names, each with a column for raw size, virtual size, code, executable and writable). That looked like this:

For the rest of this blog, I will refer to the four models as:

  1. Main Model

  2. Import Model

  3. File Type Model

  4. Section Model

Normalization

Main Model

Numeric Data is then normalized to reduce the burden on our model, scaling values to between 0 and 1.

# Help ensure our datasets are within the same range by rescaling them
# all between 0 and 1. 
def Normalize(df):
    return (df-df.min())/(df.max()-df.min())

normalize_candidates = [
    'file_size', 'size_of_image', 'size_of_code', 'size_of_stack_reserve',
    'size_of_stack_commit', 'size_of_heap_reserve', 'size_of_heap_commit',
    'size_dif', 'stack_dif', 'heap_dif', 'file_entropy', 'number_of_sections' 
]
for candidate in normalize_candidates:
    df[candidate] = Normalize(df[candidate])

After normalisation, the difference between our original data (fig1) can be seen against our new values (fig2):

Fig1: Actual values

Fig2: Normalised values

Note: The value count is low here, because I started processing the data before the feature collection script had finished.

Train, Validate, Test

I split off 20% for testing and then a further 20% of the remaining data for validation, for all four models.

# Split the data into training, validation and testing. 
train, test = train_test_split(df, test_size=0.2)
train, valid = train_test_split(train, test_size=0.2)

Feature Columns

Main Model

Surprisingly, all of the remaining features for the main model are numeric values! All the complex data was taken out when we moved the Imports, File Type and Section data into their own models.

I just appended each column as a numeric column, resulting in the following code.

from tensorflow import feature_column

feature_columns = []

col = [
    'file_size','file_entropy','size_of_image','size_of_code','size_of_stack_reserve',
    'size_of_stack_commit','size_of_heap_reserve','size_of_heap_commit',
    'number_of_sections','size_dif','stack_dif','heap_dif'
]
for i in col:
    feature_columns.append(
        feature_column.numeric_column(i)
    )

File Types and Imports

I used the same feature-layer preparation for file types and imports. I imported a pre-trained model from TensorFlow Hub that was designed for space-separated strings, such as movie reviews. This layer embeds each string, minimising memory usage and helping the model handle large vocabularies more easily. Below is an example of the imports layer.
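Something along these lines, sketched with the gnews-swivel embedding used in the TensorFlow text-classification tutorials; the exact Hub module used in the original may differ:

import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained 20-dimensional text embedding for space-separated English
# text. It maps each whole string to a single vector.
embedding = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string,
                           trainable=True)

imports_model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer='l2'),
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer='l2'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])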

Sections

Sections ended up being particularly difficult to feature-encode, and I still don’t entirely understand why.

For the raw and virtual sizes, I ran a Normalization layer to scale down the values, similar to (but not the same as) the normalisation we talked about earlier.

For the binary data (code, executable and writable), I applied a category encoding.

I had to build this model through the TensorFlow functional API. For some reason I just couldn’t get it working using the previous method, and I eventually found a guide for a similar dataset.
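For a rough idea of what that preprocessing looks like with the functional API (a sketch under my own column-naming assumptions; depending on the TensorFlow version these layers may live under tf.keras.layers.experimental.preprocessing):

import tensorflow as tf

def section_feature_inputs(name, train_df):
    '''Preprocess one section's columns: normalised sizes, one-hot flags.'''
    inputs, encoded = [], []

    # Raw and virtual sizes: scale with a Normalization layer adapted to
    # the training data.
    for col in (f'{name}_raw', f'{name}_virtual'):
        inp = tf.keras.Input(shape=(1,), name=col)
        norm = tf.keras.layers.Normalization(axis=None)
        norm.adapt(train_df[col].to_numpy())
        inputs.append(inp)
        encoded.append(norm(inp))

    # Code / executable / writable flags: one-hot encode the 0/1 values.
    for col in (f'{name}_code', f'{name}_execute', f'{name}_write'):
        inp = tf.keras.Input(shape=(1,), name=col, dtype='int64')
        onehot = tf.keras.layers.CategoryEncoding(num_tokens=2,
                                                  output_mode='one_hot')
        inputs.append(inp)
        encoded.append(onehot(inp))

    return inputs, encoded

# The encoded tensors for every section are then concatenated and fed into
# the dense layers of the sections model.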

Creating and Training the Models

I used Keras Sequential models for all four models. It’s the model type I’m most familiar with and the simplest to work with, and I don’t believe there’s anything special enough about this data to require a different architecture.

Each model has five layers:

  • Feature Layer

  • Hidden 1: 64, relu, L2

  • Hidden 2: 32, relu, L2

  • Dropout: 10%

  • Output: 1, Sigmoid

For the sections model, I had to add an additional hidden layer, with the neurons now starting at 128. This extra layer increased the accuracy by almost 20%.

The sections model was also built with the functional api, you can see my GitHub for more details.

This resulted in the following code:

import tensorflow as tf
from tensorflow.keras import layers

def Build_Model(learning_rate, feature_layer, metrics):
    model = tf.keras.models.Sequential()
    model.add(feature_layer)
    model.add(layers.Dense(
        64,
        activation='relu',
        kernel_regularizer='l2',
        name='Hidden_1'
    ))
    model.add(layers.Dense(
        32,
        activation='relu',
        kernel_regularizer='l2',
        name='Hidden_2'
    ))
    model.add(layers.Dropout(0.1, name='Dropout'))
    model.add(layers.Dense(
        1,
        activation='sigmoid',
        name='Output'
    ))

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        # The output layer already applies a sigmoid, so the loss should take
        # probabilities rather than logits.
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
        metrics=metrics
    )

    return model

To train the models, I just copied and pasted the code from my last blog. Code as follows:

import numpy as np
import pandas as pd

def Train_Model(
    model, dataset, epochs,
    label_name, valid,
    batch_size=None, 
    shuffle=True
):
    '''Feed dataset and label, then train model. '''

    features = {name:np.array(value) for name, value in dataset.items()}
    label = np.array(features.pop(label_name))
    
    validation_set = {name:np.array(value) for name, value in valid.items()}
    validation_label = np.array(validation_set.pop(label_name))

    history = model.fit(
        x=features, 
        y=label, 
        batch_size=batch_size,
        epochs=epochs,
        shuffle=shuffle,
        validation_data=(validation_set, validation_label)
    )

    epochs = history.epoch

    hist = pd.DataFrame(history.history)

    return epochs, hist 
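Pulling it together, the main model can then be built and trained roughly like this. The learning rate, batch size and metric set here are my assumptions; the 0.55 threshold and 5 epochs come from later in the post:

# Wrap the numeric feature columns defined earlier in a DenseFeatures layer.
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

metrics = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=0.55),
    tf.keras.metrics.Precision(name='precision', thresholds=0.55),
    tf.keras.metrics.Recall(name='recall', thresholds=0.55),
]

main_model = Build_Model(learning_rate=0.001, feature_layer=feature_layer,
                         metrics=metrics)

epochs, hist = Train_Model(
    main_model, train, epochs=5, label_name='label',
    valid=valid, batch_size=100
)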

Training and Results!

Main Model

The main model took the most refining out of all of the models. Playing around with the sigmoid threshold, I would often swing from 0% recall and 99% precision to 99% recall and 0% precision. The best results I got were with a threshold of 0.55, which you can see below. These results are barely better than guessing, so there’s clearly a problem.

File Type Model

Running the File Type model, we get ~87% accuracy, with precision at ~90% and recall at ~86%. This model could likely be refined to produce better results, but for a one-week project I’m pretty happy with these numbers.

Imports Model

File imports showed very similar results to the file type model, but with slightly higher success. This was expected, given the large number of unique function imports within the malicious files.

Sections Model

This model performed worse than the Imports and File Type models, but not by much. There are definitely improvements that could be made to the features to increase the accuracy.


Combining the Results

So, comparing the main model to the other models, it did not perform anywhere near as well. This isn’t overly unexpected given the results of the data visualisation phase: there’s no real noticeable difference between malicious and benign binaries when it comes to size. Based on these results, I’m going to drop all of its fields except file entropy.

We can also look at improving the structure of the current setup. We’ve got four different models, each outputting its own score for how suspicious it believes the file to be. One of the major flaws with this structure is that we can’t see the correlation between the different models. For example, a file’s imports or sections might not look suspicious on their own, but put together they might be enough to trigger an alert.

To fix this, I take the float value that each model’s sigmoid activation returns and pass the four values into a final layer that combines them.

We can combine the four fields using a concat operation.
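One way this could look (a sketch; the feature sets, labels and model handles below are placeholders, and this may not match the original implementation exactly):

import pandas as pd
import tensorflow as tf

# Stack each model's sigmoid output for every sample into one frame.
# main_features, import_features, type_features, section_features and
# labels are placeholders for the per-model inputs and ground-truth labels.
scores = pd.concat([
    pd.Series(main_model.predict(main_features).ravel(), name='entropy'),
    pd.Series(imports_model.predict(import_features).ravel(), name='imports'),
    pd.Series(filetype_model.predict(type_features).ravel(), name='type'),
    pd.Series(sections_model.predict(section_features).ravel(), name='sections'),
], axis=1)

# A small final network that learns how to weigh the four scores.
final_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
final_model.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])
final_model.fit(scores.to_numpy(), labels, epochs=5)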

The initial results of this model have a lower accuracy than most of the models individually. I’ve been tweaking it a lot, and the results keep getting better, but these are the final results I got:

The interesting part of these results is that the precision is quite high: when we say a file is malicious, it often is. The problem is the recall; we’re missing a lot of malicious files.

Second Thoughts

Thinking about the final model more, a more appropriate approach might be to apply a simple count over the results of the four models: if the sum of their scores is higher than a threshold, the file may be worth investigating. It’s understandable why the final model was not as successful when you look at the data.

Some files would spike in one column but drop off in the others. For example, consider the following entry:

index   label   entropy    imports    type       sections
29638   1       0.790723   0.993574   0.984425   0.039087

In the above entry, the imports and type scores are both close to 1 (almost certainly malicious), while the sections score sits at around 0.04. The normalised entropy of the file also wouldn’t have suggested the data was compressed. This makes detection extremely difficult when half the indicators are saying malicious and the other half are saying the file is fine.

If we had a final count model with a threshold of, say, 1.8, then imports, type and sections alone would have summed to roughly 2, flagging the file for further investigation. The final neural network layer, on the other hand, only gave it a suspiciousness score of 37%.
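As a quick sketch of that idea, applied to the scores DataFrame from the earlier sketch (the 1.8 threshold is just the figure from the example above):

# Sum the per-model scores and flag anything that crosses the threshold.
THRESHOLD = 1.8

scores['total'] = scores[['entropy', 'imports', 'type', 'sections']].sum(axis=1)
scores['needs_investigation'] = scores['total'] > THRESHOLD

print(scores.loc[scores['needs_investigation']].head())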


Final Thoughts

For a one-week project, I’m pretty happy with the initial results. There are clear avenues for improvement, and the model could definitely be refined a lot more (by someone with a bit more experience in ML😅). Based on these findings, I have a few thoughts:

  • I think a lot of the models could be tweaked and improved by a large amount. A neural network probably isn’t the best solution for this, so it might be time to start looking at other classification models.

  • The input data could be improved. Strings and certificate info are the main two missing pieces, but it’d also be worth spending some time making the data collection quicker. Ideally, I’d like less than a second per binary, but that would probably take a lot of refinement.

  • All the models were run on 5 epochs, which is probably nowhere near enough. That’s definitely one of the improvements I’d like to make.

  • You need more than just static data for this to be fully effective. There are some useful signals in static data, but dynamic analysis or third-party resources are going to provide a much better dataset to work with.

Another massive thanks to Michael Lester for providing this dataset.

But, that concludes one week of looking at binaries. I’ll probably look at continuing this project in the future but with multiple big projects coming along this month, I think it’s time to move onto the next one. As always, thanks for reading. Leave a comment or hit me up on twitter if you have any questions or suggestions.
