Searching for Exotic Particles in High Energy Physics

Posted by Christopher Mertin on June 09, 2017 in Project • 8 min read

Introduction

This data set comes from the UCI Machine Learning Repository and is composed of simulated particle interactions. The entire set consists of 5 million data points, each containing 18 features.

The first 8 features are “low-level” features which come directly from the (simulated) experiment, while the remaining 10 are “high-level” features derived from the first 8. All of the features in the set are listed below:

  1. lepton 1 pT
  2. lepton 1 eta
  3. lepton 1 phi
  4. lepton 2 pT
  5. lepton 2 eta
  6. lepton 2 phi
  7. missing energy magnitude
  8. missing energy phi
  9. MET_rel
  10. axial MET
  11. M_R
  12. M_TR_2
  13. R
  14. MT2
  15. S_R
  16. M_Delta_R
  17. dPhi_r_b
  18. cos(theta_r1)

I’m not going to delve into what each of these features is, but they essentially characterize each interaction. Each data point also carries a label in {0, 1}, where 0 means background noise and 1 means signal. Basic exploratory analysis shows the two classes are roughly balanced, so techniques such as upsampling are not required. The distributions of the feature values also look roughly similar.
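As a minimal sketch of that exploratory check (the file name particle_data.csv and the label-in-the-first-column layout are assumptions here, not something specified in this post):

import pandas as pd

# Hypothetical file name/layout: class label in column 0, the 18 features after it
df = pd.read_csv("particle_data.csv", header=None)
labels = df.iloc[:, 0].values   # 0 = background, 1 = signal
data = df.iloc[:, 1:].values    # the 18 features listed above

# Class balance check: the two labels occur at roughly similar rates,
# so no upsampling/downsampling is needed
print(pd.Series(labels).value_counts(normalize=True))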

Therefore, the point of this project was to try to devise a method that could identify particle interactions in large amounts of data by training a classifier. The classifiers used are (in order): Random Forests, Naïve Bayes, Multi-Layer Perceptron, and a Deep Neural Network.

Random Forests

I first detailed the use of Random Forests in the post where I predicted Titanic Survival Rates. Random forests are really great as they’re essentially just an ensemble of decision trees.

One downside is that they use a lot of memory, since a random forest is an agglomeration of many decision trees, each trained on a subset of the data. That agglomeration is also what helps prevent the overfitting that individual decision trees are known for. I chose random forests as they are one of the most robust off-the-shelf classifiers available today.

The UCI site says to leave the last 500,000 items as the “test set,” so that’s what I did for each of these tests. After splitting the data, we can set up a random forest classifier with scikit-learn. First we import the necessary modules as follows

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.metrics import classification_report

The above code simply imports all of the necessary modules. The first is used for cross validation in choosing the right parameters, the second is the actual random forest model, while the third will help us summarize the results.
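Before setting anything up, here is a minimal sketch of the train/test split described above, assuming the data and labels arrays from the loading sketch in the introduction:

# Per the UCI instructions, hold out the last 500,000 rows as the test set
X, y = data[:-500000], labels[:-500000]
test_X, test_y = data[-500000:], labels[-500000:]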

We can set up all of the options for cross validation as such

clf = RandomForest(n_jobs=-1, oob_score=True)
param_grid = {
  "n_estimators":[250,300],
  "criterion":["gini","entropy"],
  "max_features":[3,5],
  "max_depth":[10,20],
  "min_samples_split":[2,4]
}

The first line defines our classifier, while the dictionary param_grid lists all of the parameter values we want to test. We can now run the grid search to find the best combination:

grid_search = GridSearchCV(clf, param_grid, n_jobs=-1, cv=10)
grid_search.fit(X, y)
print(grid_search.best_params_)

Here X is the data we’re training on and y is the labels. This performs 10-fold cross-validation for every combination in param_grid to determine the best parameters, which lets us tune the model without ever touching the test set.

Finally, after running cross validation, we can take those parameters and plug them into a new random forest classifier model.

# Plug the best parameters found by the grid search into a fresh classifier
random_forest = RandomForest(**grid_search.best_params_, n_jobs=-1)
random_forest.fit(X, y)
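Alternatively, since GridSearchCV refits the winning parameter combination on the full training data by default (refit=True), the tuned model can be pulled straight out of the grid search object:

# Equivalent shortcut: the already-refit best model from the grid search
random_forest = grid_search.best_estimator_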

We can then take our test set and predict its labels to get our accuracy.

predict_y = random_forest.predict(test_X)
print(classification_report(test_y, predict_y))
             precision    recall  f1-score   support

          0       0.76      0.87      0.82    270571
          1       0.82      0.68      0.74    229429

avg / total       0.79      0.79      0.78    500000

The table above shows how the model performed on the test set: an average precision of 79% across the two labels. This is a relatively good classifier, given how noisy the world of high energy physics is.

Gaussian Naïve Bayes

One benefit of Gaussian Naïve Bayes is that it is very fast compared to other algorithms. Something like this could be used, for example, on post-surgery patient surveys to predict whether a patient will need readmission based on how they answer.

Naïve Bayes starts with Bayes’ rule, which states

$$P(y|x_{1}, \ldots, x_{n}) = \frac{P(y)P(x_{1},\ldots,x_{n}|y)}{P(x_{1},\ldots,x_{n})}$$

where we make the naïve assumption that all of the variables are conditionally independent given the class, meaning

$$P(x_{i}|y,x_{1},\ldots,x_{i-1},x_{i+1},\ldots,x_{n}) = P(x_{i}|y)$$

Applying this independence assumption for all \(i\) in the first equation, we get

$$P(y|x_{1},\ldots,x_{n}) = \frac{P(y)\prod_{i=1}^{n}P(x_{i}|y)}{P(x_{1},\ldots,x_{n})}$$

Therefore, since \(P(x_{1},\ldots,x_{n})\) is constant for a given input, we arrive at the classification rule

$$P(y|x_{1},\ldots,x_{n}) \propto P(y)\prod_{i=1}^{n}P(x_{i}|y)$$
$$\hat{y} = \text{arg}\max_{y}P(y)\prod_{i=1}^{n}P(x_{i}|y)$$
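Since this is the Gaussian variant, each per-feature likelihood \(P(x_{i}|y)\) is modeled as a normal distribution whose mean \(\mu_{y,i}\) and variance \(\sigma_{y,i}^{2}\) are estimated from the training examples of class \(y\):

$$P(x_{i}|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}\exp\left(-\frac{(x_{i}-\mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)$$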

In this data, the first 8 features come from the interaction itself, while the last 10 are derived from the first 8. Therefore, to better satisfy the independence assumption and make this classifier more accurate, we should remove the last 10 features from the data.
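A minimal sketch of that truncation (note that the later MLP and DNN sections use all 18 features again, so this re-slicing applies to this section only):

# Keep only the 8 low-level features; drop the 10 derived ones
X, test_X = X[:, :8], test_X[:, :8]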

We can define our model by

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X, y)

We can then test it against our test set, giving

predict_y = gnb.predict(test_X)
print(classification_report(test_y, predict_y))
             precision    recall  f1-score   support

          0       0.70      0.92      0.80    270571
          1       0.85      0.53      0.65    229429

avg / total       0.77      0.74      0.73    500000

Just as a sanity check, I also ran this without truncating the last 10 features and got an average precision of 0.75. This confirms that removing those features to improve independence was a good idea, as it brought the average precision up by 2 percentage points.

A 77% average precision is surprisingly close to the random forest classifier, and it finished much faster.
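To put a rough number on that speed difference, the fit could be timed with something like the following (purely illustrative; the actual numbers depend on hardware):

import time

# Time a fresh GaussianNB fit on the training data
start = time.perf_counter()
GaussianNB().fit(X, y)
print("GaussianNB fit: %.1f s" % (time.perf_counter() - start))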

Multi-Layer Perceptron

A Multi-Layer Perceptron (MLP) is essentially a very basic neural network, the precursor to deep architectures. I had never used one, since I jumped straight into deep learning, so this gave me a good excuse to compare the performance of the two on this data set.

Scikit-learn actually has an MLP implemented in its library, so I got to play with it. To keep things simple, I used a single hidden layer of size 300.

from sklearn.neural_network import MLPClassifier

hidden_layer_sizes = (300,)
mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, activation="relu", solver="adam")
mlp.fit(X, y)

predict_y = mlp.predict(test_X)
print(classification_report(test_y, predict_y))
             precision    recall  f1-score   support

          0       0.79      0.87      0.83    270571
          1       0.83      0.72      0.77    229429

avg / total       0.81      0.80      0.80    500000

As this shows, the MLP performed better than both the random forest and the Naïve Bayes classifiers, coming in at 81%.

I chose relu as the activation function and adam as the solver, as they are currently among the best-performing choices in deep learning. Both are described in more detail in my post about my Masters project.
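For reference, ReLU is simply

$$\text{ReLU}(x) = \max(0, x)$$

while Adam is an adaptive variant of stochastic gradient descent that keeps running estimates of the first and second moments of the gradient.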

Deep Neural Network

Finally, we can delve into the world of using a Deep Neural Network (DNN) as a classifier on this data. To implement the DNN, I used Keras, a high-level library that runs on top of both Theano and TensorFlow, the two leading deep learning backends.

In order to build a DNN, we need to import quite a few modules. Those are

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

We can then start building our model. I opted for a network with 5 hidden layers of [100, 256, 512, 1024, 2048] units. I also added Dropout after the first three hidden layers, so that \(1/5\) of the units are randomly dropped at each training iteration, in order to prevent overfitting.

model = Sequential([
    Dense(100, input_shape=(18,)),
    Activation("relu"),
    Dropout(.2),
    Dense(256),
    Activation("relu"),
    Dropout(.2),
    Dense(512),
    Activation("relu"),
    Dropout(.2),
    Dense(1024),
    Activation("relu"),
    Dense(2048),
    Activation("relu"),
    Dense(1),
    Activation("sigmoid")
])

Finally, we have to compile the model

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

This loss function, metric, and the sigmoid output activation are the standard choices for binary classification in Keras. I would have liked to perform cross validation to determine the optimal optimizer settings (for example, the learning rate, \(\beta\), and \(\epsilon\)), but my computer is not powerful enough to run this, and my GPU is not supported by cuDNN. However, after seeing the results I do not think it matters too much.
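For reference, those optimizer settings could be made explicit in Keras along these lines; the values below are just the library's documented defaults, not tuned ones:

from keras.optimizers import Adam

# Adam's default hyperparameters -- the knobs that cross validation would tune
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])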

We can then fit our model against the training set with 10 epochs by running

model.fit(X, y, batch_size=256, epochs=10)

I opted to only post the results from the 10th epoch, shown below:

Epoch 10/10
4000000/4000000 [==============================] - 3575s - loss: 0.4313 - acc: 0.8010

This means it reached 80.1% accuracy on the training set after 10 epochs. We can evaluate the model on the test set by running

model.evaluate(test_X, test_y)

This returned a loss of 0.4266 and an accuracy of 0.8061. Finally, we can look at the ROC curve and its AUC to get an idea of how the model performs; the further the AUC is above 0.5, the better the model is at classifying. We can compute it with scikit-learn:

from sklearn.metrics import roc_curve, auc

y_pred = model.predict_proba(test_X)
fpr, tpr, _ = roc_curve(test_y, y_pred)
roc_auc = auc(fpr, tpr)

We can then plot it, which is shown below.

[Figure: ROC curve for the DNN classifier (AUC = 0.88)]

As this shows, the Area Under the Curve (AUC) is 0.88, which is much better than “plain guessing” of 0.50. This shows that the model is a good performer in classifying the data, along with the 81% accuracy it achieved on the test set. A perfect classifier would have an area of 1.0.
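For completeness, a plot like the one above can be produced from the fpr, tpr, and roc_auc values computed earlier with a short matplotlib sketch:

import matplotlib.pyplot as plt

# ROC curve for the DNN, with the diagonal "plain guessing" baseline for reference
plt.plot(fpr, tpr, label="DNN (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing (AUC = 0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()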

Conclusion

In conclusion, the best models were the Multi-Layer Perceptron and the Deep Neural Network, which both scored 81% accuracy on the test set. This was surprising, but DNNs aren’t the best for all types of data.

Following this, the random forest obtained 79% accuracy, followed by Naïve Bayes with 77% accuracy.

Overall, the best model in terms of time-to-accuracy was the MLP, as it did not take too long to run. The DNN overcomplicated the situation, taking much longer to run on my CPU (hours instead of minutes) while producing about the same accuracy as the MLP. With more time and processing power, I would have liked to attempt cross validation on the DNN to see if I could increase the accuracy; the same goes for the random forest.