\ifdefined\isstandalone
\title{6.047/6.878 Lecture 14: \\ Gene Regulation 2: Classification {\tmabbr{}}}
\author{Fulton Wang (fultonw@mit.edu) \\
Salil Desai \\
David Charlton \\
Kevin Modzelewski \\
Robert Toscano
}
\maketitle
\chead{6.047/6.878 Lecture 14: \\ Gene Regulation 2: Classification }
\lhead{}
\rhead{}
{\pagebreak}
{\tableofcontents}
{\pagebreak}
{\listoffigures}
\else
% \cleardoublepage
\chapter{Gene Regulation 2 -- Classification}
\author{Arvind Thiagarajan (athiagar@mit.edu) \\
Fulton Wang (fultonw@mit.edu) \\
Salil Desai \\
David Charlton \\
Kevin Modzelewski \\
Robert Toscano
}
\chead{6.047/6.878 Lecture 14: \\ Gene Regulation 2: Classification }
\lhead{}
\rhead{}
\minilof
\fi
\pagestyle{fancy}
\ifdefined\isstandalone
\def\@mydir{images}
\fi
\ifdefined\ischapter
\def\@mydir{../Lecture14_GeneExpressionClassification/images}
\fi
\ifdefined\ismaster
\def\@mydir{Lecture14_GeneExpressionClassification/images}
\fi
% #########################################################################
\section{Introduction}
In the previous chapter we looked at \textit{clustering}, which provides a tool for analyzing data without any prior knowledge of the underlying structure. As we mentioned before, this is an example of
``unsupervised'' learning. This chapter deals with supervised learning, in which we are able to use pre-classified data to construct a model by which to classify more datapoints. In this way, we will
use existing, known structure to develop rules for identifying and grouping further information.
There are two broad approaches to classification, analogous to the two ways
in which we performed motif discovery: generative models, such as HMMs, which
describe the probability of a particular designation being valid, and
discriminative methods, such as CRFs, which directly distinguish between
objects in a specific context. This dichotomy between generative and
discriminative approaches recurs throughout this chapter: we will use a
generative Bayesian approach to classify mitochondrial proteins, and a
discriminative SVM to classify tumor samples.
In this lecture we will look at two new algorithms: a generative
classifier, Naïve Bayes, and a discriminative classifier, Support
Vector Machines (SVMs). We will discuss biological applications of
each of these models, specifically the use of Naïve Bayes
classifiers to predict mitochondrial proteins across the genome and
the use of SVMs for the classification of cancer based on gene
expression monitoring by DNA microarrays. The salient features of both
techniques and caveats of using each technique will also be discussed.
As with clustering, classification (and supervised learning more generally) arose from efforts in Artificial Intelligence and Machine Learning. In fact, much of the mathematical machinery underlying classification had already been developed by probability theorists before the advent of either field.
\section{Classification - Bayesian Techniques}
Consider the problem of identifying mitochondrial proteins. If we look
at the human genome, how do we determine which proteins are involved
in mitochondrial processes, or more generally which proteins are
targeted to the mitochondria?\footnote{The mitochondrion is the
  energy-producing machinery of the cell. Very early in the evolution of
  life, the ancestor of the mitochondrion was engulfed by the predecessor
  of modern eukaryotes, which is why our cells have distinct
  compartments. The mitochondrion retains its own genome, but it is
  greatly depleted relative to its ancestral genome: only about a dozen
  genes remain. Hundreds of other genes are needed to make the
  mitochondria work; these are encoded by genes transcribed in the
  nucleus, and their protein products are then transported to the
  mitochondria. The goal, then, is to figure out which proteins encoded
  in the nuclear genome are targeted to the mitochondria. This is
  important because many diseases and processes, including aging, are
  associated with the mitochondria.} This is particularly useful because if
we know the mitochondrial proteins, we can study how these proteins
mediate disease processes and metabolic functions. The classification
method we will examine considers seven features for all human proteins:
\begin{enumerate}
\item targeting signal
\item protein domains
\item co-expression
\item mass spectrometry
\item sequence homology
\item induction
\item motifs
\end{enumerate}
Our overall approach will be to determine how these features are
distributed for both mitochondrial and non-mitochondrial
proteins. Then, given a new protein, we can apply probabilistic
analysis to these seven features to decide which class it most likely
falls into.
\subsection{Single Features and Bayes’ Rule}
Let us focus on a single feature at first. We first assume that each
class has its own class-dependent distribution for the feature, which we
must derive from real data. The second thing we need is the a priori
chance of drawing a sample of a particular class before looking at the
data; this prior is simply the relative size of the class. Once we have
these probabilities, we can use Bayes’ rule to get the probability that
a sample is in a particular class given the data (this is called the
posterior). In other words, we have forward generative probabilities,
and use Bayes’ rule to perform the backwards inference. Note that it is
not enough to consider only the probability that the feature was drawn
from each class-dependent distribution: if we knew a priori that one
class (say class A) is much more common than the other, then it should
take overwhelming evidence that the feature was drawn from class B's
distribution for us to believe the feature was indeed from class B.
The correct way to combine the evidence with our prior
knowledge is Bayes’ Rule:
$\displaystyle P(\text{Class}|\text{feature}) = \left(\frac{P(\text{feature}|\text{Class})P(\text{Class})}{P(\text{feature})} \right)$
\begin{itemize}
\item \textbf{Posterior} : $P(\text{Class}|\text{feature})$
\item \textbf{Prior} : $P(\text{Class})$
\item \textbf{Likelihood} : $P(\text{feature}| \text{Class})$
\end{itemize}
This formula gives us exactly the connection we need to flip known
feature probabilities into class probabilities for our classifying
algorithm. It lets us integrate both the likelihood we derive from our
observations and our prior knowledge about how common something is. In
the case of mtDNA, for example, we can estimate that mitochondrial DNA
makes up something less than 10\% of the human genome. Therefore,
applying Bayes’ rule, our classifier should only classify a gene as
mitochondrial if there is a very strong likelihood based on the
observed features, since the prior probability that any gene is
mitochondrial is so low.
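As a toy numeric illustration of this point (the numbers here are invented for the sketch, not taken from the mitochondrial study):

```python
# Toy illustration of Bayes' rule (invented numbers, not from the study).
# Suppose 10% of proteins are mitochondrial (prior), and some feature
# occurs in 80% of mitochondrial proteins but only 5% of
# non-mitochondrial ones (the class-conditional likelihoods).

p_mito = 0.10                      # P(Class = mito)
p_not = 1.0 - p_mito               # P(Class = not-mito)
p_feat_given_mito = 0.80           # P(feature | mito)
p_feat_given_not = 0.05            # P(feature | not-mito)

# Marginal P(feature), by the law of total probability
p_feat = p_feat_given_mito * p_mito + p_feat_given_not * p_not

# Posterior P(mito | feature), by Bayes' rule
posterior = p_feat_given_mito * p_mito / p_feat
print(round(posterior, 3))  # -> 0.64
```

Even though the feature is 16 times more likely under the mitochondrial class, the low prior pulls the posterior down to 0.64, which is exactly the effect described above.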
With this rule, we can now form a maximum likelihood rule for
predicting an object’s class based on an observed feature. We want to
choose the class that has the highest probability given the observed
feature, so we will choose Class1 instead of Class2 if:
$\left(\frac{P(\text{feature}|\text{Class1})P(\text{Class1})}{P(\text{feature})} \right) >
\left(\frac{P(\text{feature}|\text{Class2})P(\text{Class2})}{P(\text{feature})} \right) $
Notice that $P(\text{feature})$ appears on both sides, so we can cancel that
out entirely, and simply choose the class with the highest value of
$P(\text{feature} | \text{Class})P(\text{Class})$.
Another way of looking at this is as a discriminant function: By
rearranging the formulas above and taking the logarithm, we should
select Class1 instead of Class2 precisely when
$ \log{ \left(\frac{P(\text{X}|\text{Class1})P(\text{Class1})}{P(\text{X}|\text{Class2})P(\text{Class2})}
\right)} > 0 $
Here the use of logarithms provides distinct advantages:
\begin{enumerate}
\item Numerical stability
\item Easier math (it’s easier to add the expanded terms than multiply
them)
\item Monotonically increasing discriminators.
\end{enumerate}
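The log-odds discriminant can be sketched as follows (the probability values are illustrative, and the helper name is ours):

```python
import math

def discriminant(likelihood1, prior1, likelihood2, prior2):
    """Log-odds score: positive -> choose Class1, negative -> Class2."""
    # Working in log space turns products of many small probabilities
    # into sums of logs, avoiding numerical underflow.
    return (math.log(likelihood1) + math.log(prior1)
            - math.log(likelihood2) - math.log(prior2))

g = discriminant(0.80, 0.10, 0.05, 0.90)
print(g > 0)   # the strong evidence outweighs the skewed prior here
```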
This discriminant function does not capture the penalties associated
with misclassification (in other words, whether one kind of
misclassification is more detrimental than the other). In this case, we
are minimizing the total number of misclassifications we make, without
assigning different penalties to different kinds of mistakes. From
examples discussed in class and in the problem set: if we are trying to
classify a patient as having cancer or not, it could be argued that it
is far more harmful to misclassify a patient as healthy when they have
cancer than to misclassify a patient as having cancer when they are
healthy. In the first case, the patient will not be treated and would be
more likely to die, whereas the second mistake involves emotional grief
but no greater chance of loss of life. To formalize the penalty of
misclassification we define a loss function, $L_{kj}$, which assigns a
loss to the misclassification of an object as class $j$ when the true
class is class $k$ (a specific example of a loss function was seen in
Problem Set 2).
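With a loss function in hand, the decision rule picks the class with the smallest expected loss rather than the highest posterior. A minimal sketch, with made-up loss values and a made-up posterior:

```python
# Sketch of a loss-sensitive decision rule (hypothetical loss values).
# L[(k, j)] = loss of predicting class j when the true class is k.
L = {("cancer", "healthy"): 100.0,   # missing a cancer is very costly
     ("healthy", "cancer"): 1.0,
     ("cancer", "cancer"): 0.0,
     ("healthy", "healthy"): 0.0}

def expected_loss(prediction, posterior):
    """posterior: dict mapping true class -> P(class | data)."""
    return sum(posterior[k] * L[(k, prediction)] for k in posterior)

post = {"cancer": 0.05, "healthy": 0.95}   # illustrative posterior
best = min(["cancer", "healthy"], key=lambda j: expected_loss(j, post))
print(best)  # even at 5% posterior, the asymmetric loss favors "cancer"
```

Note that with a symmetric (0/1) loss this rule reduces to the maximum-posterior rule above.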
\subsection{Collecting Data}
The preceding tells us how to handle predictions if we already know
the exact probabilities corresponding to each class. If we want to
classify mitochondrial proteins based on feature $X$, we still need
ways of determining the probabilities $P(\text{mito})$, $P(\neg\text{mito})$, $P(X |
\text{mito})$ and $P(X | \neg\text{mito})$. To do this, we need a training set: a set
of data that is already classified that our algorithm can use to
“learn” the distributions corresponding to each class. A high-quality
training set (one that is both large and unbiased) is the most
important part of any classifier. An important question at this point
is, how much data do we need about known genes in order to build a
good classifier for unknown genes? This is a hard question whose
answer is not fully known. However, there are some simple methods that
can give us a good estimate: when we have a fixed set of training
data, we can keep a holdout set that we don’t use for our algorithm,
and instead use those (known) data points to test the accuracy of our
algorithm when we try to classify them. By trying different sizes of
training versus \textit{holdout} set, we can check the accuracy curve
of our algorithm. Generally speaking, we have “enough” training data
when we see the accuracy curve flatten out as we increase the amount
of training data (this indicates that additional data is likely to
give only a slight marginal improvement). The holdout set is also
called the test set, because it allows us to test the generalization
power of our classifier.
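The holdout procedure can be sketched as follows. The data and the "model" here are deliberately trivial (a majority-class predictor over invented labels) just to show the plumbing of training on a growing subset and scoring on a fixed holdout:

```python
# Sketch of holdout evaluation on deterministic toy data: item i is
# "mito" when i is divisible by 3 (a stand-in for a real labeled set).
data = [(i, "mito" if i % 3 == 0 else "not") for i in range(60)]
holdout, pool = data[:20], data[20:]   # fixed test set, training pool

def train_majority(train):
    """Trivial 'classifier': always predict the majority training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

for n in (5, 20, 40):                  # growing training-set sizes
    model = train_majority(pool[:n])
    acc = sum(model == y for _, y in holdout) / len(holdout)
    print(n, round(acc, 2))
```

When the printed accuracy stops improving as the training size grows (here it is flat from the start, since the toy model is so simple), additional training data is likely to give only a slight marginal improvement.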
Supposing we have already collected our training data, however, how
should we model $P(X | Class)$? There are many possibilities. One is
to use the same approach we did with clustering in the last lecture
and model the feature as a Gaussian – then we can follow the maximum
likelihood principle to find the best center and variance. The one
used in the mitochondrial study is a simple density estimate: for each
feature, divide the range of possibilities into a set of bins (say,
five bins per feature). Then we use the given data to estimate the
probability of a feature falling into each bin for a given class. The
principle behind this is again maximum likelihood, but for a
multinomial distribution rather than a Gaussian. We may choose to
discretize an otherwise continuous distribution because estimating a
continuous distribution can be complex.
There is one issue with this strategy: what if one of the bins has
zero samples in it? A probability of zero will override everything
else in our formulas, so that instead of thinking this bin is merely
unlikely, our classifier will believe it is \textit{impossible}. There
are many possible solutions, but the one taken here is to apply the
\textit{Laplace Correction}: add some small amount (say, one element)
into each bin, to draw probability estimates slightly towards uniform
and account for the fact that (in most cases) none of the bins are
truly impossible. Another way to avoid having to apply the correction
is to choose bins that are not too small so that bins will not have
zero samples in them in practice. If you have very many points, you
can afford more bins, but you then run the risk of overfitting your
training data.
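A minimal sketch of the binned (multinomial) density estimate with a Laplace correction; the bin count and pseudocount here are illustrative choices, not the ones from the study:

```python
def binned_estimate(values, lo, hi, n_bins=5, pseudocount=1):
    """Estimate P(bin | class) from one feature's training values,
    adding `pseudocount` to every bin so no bin has probability zero."""
    counts = [pseudocount] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the top edge
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Feature values observed for one class; bins 4 and 5 receive no
# samples, yet their estimated probabilities stay small but non-zero.
probs = binned_estimate([0.1, 0.15, 0.3, 0.5, 0.55], lo=0.0, hi=1.0)
print(probs)  # -> [0.3, 0.2, 0.3, 0.1, 0.1]
```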
\subsection{Estimating Priors}
We now have a method for approximating the feature distribution for a
given class, but we still need to know the relative probability of the
classes themselves. There are three general approaches:
\begin{enumerate}
\item Estimate the priors by counting the relative frequency of each
class in the training data. This is prone to bias, however, since
data available is often skewed disproportionately towards less
common classes (since those are often targeted for special
study). If we have a high-quality (representative) sample for our
training data, however, this works very well.
\item Estimate from expert knowledge---there may be previous estimates
obtained by other methods independent of our training data, which we
can then use as a first approximation in our own predictions. In
other words, you might ask experts what the percentage of
mitochondrial proteins are.
\item Assume all classes are equally likely – we would typically do
this if we have no information at all about the true
frequencies. This is effectively what we do when we use the maximum
likelihood principle: our clustering algorithm was essentially using
Bayesian analysis under the assumption that all priors are equal.
This is actually a strong assumption, but when you have no other
data, this is the best you can do.
\end{enumerate}
For classifying mitochondrial DNA, we use method (2), since some
estimates on the proportions of mtDNA were already known. But there
is a complication: there is more than one feature.
\subsection{Multiple features and Naïve Bayes}
In classifying mitochondrial DNA, we were looking at 7 features and
not just one. In order to use the preceding methods with multiple
features, we would need not just one bin for each individual feature
range, but one for each combination of features – if we look at two
features with five ranges each, that’s already 25 bins. All seven
features give us almost 80,000 bins ($5^7 = 78{,}125$) – and we can expect that most of
those bins will be empty simply because we don’t have enough training
data to fill them all. This would cause problems, because empty bins
yield zero probability estimates that override everything else in our
formulas. Clearly
this approach won’t scale well as we add more features, so we need to
estimate combined probabilities in a better way.
The solution we will use is to assume the features are independent,
that is, that once we know the class, the probability distribution of
any feature is unaffected by the values of the other features. This is
the Naïve Bayes Assumption, and it is almost always false, but it is
often used anyway for the combined reasons that it is very easy to
manipulate mathematically and it is often close enough to the truth
that it gives a reasonable approximation. (Note that this assumption
does not say that all features are independent: if we look at the
overall model, there can be strong connections between different
features, but the assumption says that those connections are divided
by the different classes, and that within each individual class there
are no further dependencies.) Also, if you know that some features
are coupled, you could learn the joint distribution in only some pairs
of the features.
Once we assume independence, the probability of combined features is
simply the product of the individual probabilities associated with
each feature. So we now have:
$P(f_1,f_2,\ldots,f_N|\text{Class})=P(f_1|\text{Class})P(f_2|\text{Class})\cdots P(f_N|\text{Class})$
where $f_i$ represents feature $i$. Similarly, the discriminant function
becomes a log-ratio of products over the features:
$G(f_1,f_2,\ldots,f_N)= \log{\left(\frac{P(\text{Class1})\prod_i P(f_i|\text{Class1})}{P(\text{Class2})\prod_i P(f_i|\text{Class2})} \right)}$
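Putting the binned likelihoods, the prior, and the independence assumption together gives a complete (if tiny) classifier. This is a minimal sketch with invented two-feature, two-bin probability tables, not MAESTRO's actual parameters:

```python
import math

# Minimal naive Bayes sketch (toy numbers, not MAESTRO's real tables).
# For each class: a prior, and per-feature lists of per-bin probabilities.
model = {
    "mito":     {"prior": 0.1, "bins": [[0.05, 0.95], [0.9, 0.1]]},
    "not_mito": {"prior": 0.9, "bins": [[0.90, 0.10], [0.3, 0.7]]},
}

def log_score(cls, feature_bins):
    """log P(Class) + sum_i log P(f_i | Class), assuming independence."""
    params = model[cls]
    return math.log(params["prior"]) + sum(
        math.log(params["bins"][i][b]) for i, b in enumerate(feature_bins))

def classify(feature_bins):
    return max(model, key=lambda c: log_score(c, feature_bins))

print(classify([1, 0]))   # strong evidence overcomes the skewed prior
```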
\subsection{Testing a classifier}
A classifier should always be tested on data not contained in its
training set. We can imagine in the worst case an algorithm that just
memorized its training data and behaved randomly on anything else – a
classifier that did this would perform perfectly on its training data,
but that indicates nothing about its real performance on new
inputs. This is why it’s important to use a test, or holdout, set as
mentioned earlier. However, a simple error rate doesn’t encapsulate
all of the possible consequences of an error. For a simple binary
classifier (an object is either in or not in a single target class),
there are four possible outcomes:
\begin{enumerate}
\item True positive (TP)
\item True negative (TN)
\item False positive (FP)
\item False negative (FN)
\end{enumerate}
The frequencies of these outcomes are summarized in two standard
performance metrics:
\begin{enumerate}
\item \textit{Sensitivity} – what fraction of objects that are in a
class are correctly labeled as that class? That is, what fraction
have true positive results? High sensitivity means that elements of
a class are very likely to be labeled as that class. Low sensitivity
means there are too many false negatives.
\item \textit{Specificity} – what fraction of objects not in a class
are correctly labeled as not being in that class? That is, what
fraction have true negative results? High specificity means that
elements labeled as belonging to a class are very likely to actually
belong to it. Low specificity means there are too many false
positives.
\end{enumerate}
In most algorithms there is a tradeoff between sensitivity and
specificity. For example, we can reach a sensitivity of 100\% by
labeling everything as belonging to the target class, but we will have
a specificity of 0\%, so this is not useful. Generally, most
algorithms have some probability cutoff they use to decide whether to
label an object as belonging to a class (for example, our discriminant
function above). Raising that threshold increases the specificity but
decreases the sensitivity, and decreasing the threshold does the
reverse. The MAESTRO algorithm for classifying mitochondrial proteins
(described in this lecture) achieves 99\% specificity and 71\%
sensitivity.
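Both metrics are simple ratios of the confusion counts; the counts below are invented so that the rates happen to mirror MAESTRO's reported 71%/99% figures:

```python
def sensitivity(tp, fn):
    """Fraction of true class members labeled correctly: TP/(TP+FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-members labeled correctly: TN/(TN+FP)."""
    return tn / (tn + fp)

# Hypothetical confusion counts (chosen to give 71% / 99% rates):
tp, fn, tn, fp = 71, 29, 99, 1
print(sensitivity(tp, fn), specificity(tn, fp))  # -> 0.71 0.99
```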
\subsection{MAESTRO – Mitochondrial Protein Classification}
Calvo et al.~\cite{Calvo} find a class-dependent distribution for each
feature by creating several bins and evaluating the proportion of
mitochondrial and non-mitochondrial proteins in each bin. This lets us
evaluate the usefulness of each feature for classification. Each feature
alone yields only a medium-strength classifier, but combining them can
yield a much stronger one. The authors sought to construct
high-quality predictions of human proteins localized to the
mitochondrion by generating and integrating data sets that provide
complementary clues about mitochondrial localization. Specifically,
for each human gene product $p$, they assign a score $s_i(p)$, using
each of the following seven genome-scale data sets – targeting signal
score, protein domain score, cis-motif score, yeast homology score,
ancestry score, coexpression score, and induction score (details of
each of the meaning and content of each of these data sets can be
found in the manuscript). Each of these scores $s_1$--$s_7$ can be used
individually as a weak genome-wide predictor of mitochondrial
localization. Each method’s performance was assessed using large ``gold
standard'' curated training sets: 654 mitochondrial proteins
$T_{\text{mito}}$ maintained by the MitoP2 database and 2,847
nonmitochondrial proteins $T_{\neg\text{mito}}$ annotated to localize to other
cellular compartments. To improve prediction accuracy, the authors
integrated these approaches using a naïve Bayes classifier that
was implemented as a program called MAESTRO. In this way, several
weak classifiers are combined into a stronger classifier.
When MAESTRO was applied across the human proteome, 1451 proteins were
predicted as mitochondrial proteins and 450 novel proteins predictions
were made. As mentioned in the previous section, MAESTRO
achieves 99\% specificity and 71\% sensitivity for the
classification of mitochondrial proteins, suggesting that even with
the assumption of feature independence, naïve Bayes classification
techniques can prove extremely powerful for large-scale
(i.e., genome-wide) classification.
\section{Classification – Support Vector Machines}
The previous section looked at probabilistic (generative) models for
classification; this section looks at discriminative techniques – in
essence, can we run our data through a function that directly determines
its class? Such discriminative techniques avoid the inherent cost of
generative models, which may require more information than is actually
necessary for the classification task.
Support vector machine techniques essentially involve drawing a vector
that is perpendicular to the line (more generally, hyperplane) separating the training
data. The approach is to use the training data to find a
separating hyperplane so that two classes of data lie on different
sides of the hyperplane. There are, in general, many hyperplanes that
can separate the data, so we want to draw the hyperplane that
separates the data the most - we wish to choose the line that
maximizes the distance from the hyperplane to any data point. In
other words, the SVM is a maximum margin classifier. You can think of
the hyperplane being surrounded with margins of equal size on each
side of the line, with no data points inside the margin on either
side. We want to draw the line that allows us to draw the largest
margin. Note that once the separating line and margin are determined,
some data points will be right on the boundary of the margin. These
are the data points that keep us from expanding the margin any
further, and thus determine the line/margin. Such points are called
the support vectors. If we add new data points outside the margin or
remove points that are not support vectors, we will not change the
maximum margin we can achieve with any hyperplane.
Suppose that the vector perpendicular to the hyperplane is $w$, so that
the hyperplane consists of the points $x$ with $w \cdot x = b$ (it lies
at distance $b/\|w\|$ from the origin along $w$). Then a point $x$ is
classified as being in the positive class if $w \cdot x$ is greater than
$b$, and negative otherwise. It can be shown that the optimal $w$, that
is, the one that achieves the maximum margin, can be written as a linear
combination of the data vectors, $w = \sum_i a_i x_i$. Then, to classify
a new data point $x$, we take the dot product of $w$ with $x$ to arrive
at a scalar. Notice that this scalar, $\sum_i a_i (x_i \cdot x)$,
depends only on the dot products between $x$ and the training vectors
$x_i$. Furthermore, it can be shown that finding the maximum-margin
hyperplane for a set of (training) points amounts to solving an
optimization problem whose objective function depends only on the dot
products of the training points with each other. This is good because it
tells us that the complexity of solving that optimization problem is
independent of the dimension of the data points: if we precompute the
pairwise dot products of the training vectors, the dimensionality of the
data makes no difference to the running time.
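The fact that classification depends only on dot products can be sketched directly. The coefficients $a_i$ and offset $b$ below are assumed to have already been found by the training optimization; the values here are made up for illustration:

```python
# Sketch of an SVM decision function expressed purely via dot products.
# The coefficients a_i (which absorb the class labels) and the offset b
# are assumed given; here they are invented illustrative values.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decide(x, support_vectors, coeffs, b):
    """Sign of sum_i a_i * (x_i . x) - b, returned as a +1 / -1 label."""
    score = sum(a * dot(xi, x) for a, xi in zip(coeffs, support_vectors)) - b
    return 1 if score > 0 else -1

svs = [(1.0, 1.0), (-1.0, -1.0)]   # hypothetical support vectors
coeffs = [0.5, -0.5]
print(decide((2.0, 2.0), svs, coeffs, b=0.0))   # -> 1
```

Because `decide` only ever calls `dot`, replacing that one function with a kernel is all that is needed to move to the kernelized setting of the next section.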
\subsection{Kernels}
We see that SVMs are dependent only on the dot product of the
vectors. So, if we call our transformation $\phi (v)$, for two vectors
we only care about the value of $\phi(v_1)\cdot \phi(v_2)$. The trick
to using kernels is to realize that for certain transformations
$\phi$, there exists a function $K(v_1,v_2)$, such that:
$K(v_1,v_2) = \phi(v_1) \cdot \phi(v_2)$
In the above relation, the right-hand side is a dot product of vectors
in a (possibly very high-dimensional) transformed space, but the
left-hand side is a function of two vectors in the original,
lower-dimensional space. For example, for the mapping
$\phi: x \rightarrow (x, x^2)$, we get
$K(x_1,x_2) = (x_1, x_1^2) \cdot (x_2, x_2^2) = x_1x_2 + (x_1x_2)^2$
Note that we never actually apply the transformation $\phi$: we can do
all our calculations in the lower-dimensional space, yet get all the
power of using a higher dimension.
Example kernels are the following:
\begin{enumerate}
\item Linear kernel: $K(v_1,v_2) = v_1 \cdot v_2$ which represents the
trivial mapping of $\phi(x) = x$
\item Polynomial kernel: $K(v_1,v_2)=(1+v_1 \cdot v_2)^n$ which was
used in the previous example with $n=2$.
\item Radial basis kernel: $K(v_1,v_2) = \exp(-\beta|v_1-v_2|^2)$ This
transformation is actually from a point $v_1$ to a function (which
can be thought of as being a point in Hilbert space) in an
infinite-dimensional space. So what we’re actually doing is
transforming our training set into functions, and combining them to
get a decision boundary. The functions are Gaussians centered at the
input points.
\item Sigmoid kernel: $K(v_1,v_2) = \tanh[\beta (v_1^Tv_2+r)]$ Sigmoid
kernels have been popular for use in SVMs due to their origin in
neural networks (e.g. sigmoid kernel functions are equivalent to
two-level, perceptron neural networks). It has been pointed out in
previous work (Vapnik 1995) that the kernel matrix may not be
positive semi-definite for certain values of the parameters $\beta$
and $r$. The sigmoid kernel has nevertheless been used in practical
applications \cite{Scholkopf}.
\end{enumerate}
Here is a specific example of a kernel function. Consider the two
classes of one-dimensional data:
$\{ -5, -4, -3, 3, 4, 5 \}$ and $\{ -2, -1, 0, 1, 2 \}$
This data is clearly not linearly separable, and the best separation
boundary we can find might be $x > -2.5$. Now consider applying the
transformation $x \rightarrow (x, x^2)$. The data can now be written as new pairs,
$ \{ -5,-4,-3,3,4,5 \} \rightarrow \{ (-5,25),(-4,16),(-3,9),(3,9),(4,16),(5,25) \} $
and
$\{-2,-1,0,1,2\} \rightarrow \{(-2,4),(-1,1),(0,0),(1,1),(2,4)\}$
This data is separable by the rule $y > 6.5$, and in general the more
dimensions we transform data to the more separable it becomes.
An alternate way of thinking of this problem is to transform the
classifier back in to the original low-dimensional space. In this
particular example, we would get the rule $x^2 < 6.5$ , which would
bisect the number line at two points. In general, the higher
dimensionality of the space that we transform to, the more complicated
a classifier we get when we transform back to the original space.
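The one-dimensional example above can be checked directly:

```python
# Check the worked 1-D example: the map phi(x) = (x, x^2) makes the two
# classes linearly separable by the rule y > 6.5 (equivalently x^2 > 6.5).
class_a = [-5, -4, -3, 3, 4, 5]
class_b = [-2, -1, 0, 1, 2]

phi = lambda x: (x, x * x)

a_separated = all(phi(x)[1] > 6.5 for x in class_a)
b_separated = all(phi(x)[1] < 6.5 for x in class_b)
print(a_separated, b_separated)   # -> True True
```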
One of the caveats of transforming the input data using a kernel is
the risk of overfitting (or over-classifying) the data. More
generally, the SVM may generate so many feature vector dimensions that
it does not generalize well to other data. To avoid overfitting,
cross-validation is typically used to evaluate the fitting provided by
each parameter set tried during the grid or pattern search process. In
the radial-basis kernel, you can essentially increase the value of
$\beta$ until each point is within its own classification region
(thereby defeating the classification process altogether). SVMs
generally avoid this problem of over-fitting due to the fact that they
maximize margins between data points.
When using difficult-to-separate training sets, SVMs can incorporate a
cost parameter $C$, to allow some flexibility in separating the
categories. This parameter controls the trade-off between allowing
training errors and forcing rigid margins. It can thereby create a
\textit{soft} margin that permits some misclassifications. Increasing
the value of $C$ increases the cost of misclassifying points and
forces the creation of a more accurate model that may not generalize
well.
Can we use just any function as our kernel? The answer to this is
provided by Mercer’s Condition which provides us an analytical
criterion for choosing an acceptable kernel. Mercer’s Condition states
that a kernel $K(x,y)$ is a valid kernel if and only if the following
holds: for any $g(x)$
such that $\int g(x)^2\,dx$ is finite, we have:
$\displaystyle \int\!\!\int K(x,y)\,g(x)\,g(y)\,dx\,dy \ge 0$ \cite{Burgess}
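A practical finite-sample consequence of Mercer's condition is that the Gram matrix $K_{ij} = K(x_i, x_j)$ of any set of points must be positive semi-definite. For two points this is easy to check by hand, as in this sketch (the $2\times 2$ criterion used is the standard one for symmetric matrices):

```python
import math

# For a 2x2 symmetric matrix [[a, b], [b, c]], positive semi-definiteness
# is equivalent to a >= 0, c >= 0, and a*c - b^2 >= 0.

def rbf(x, y, beta=1.0):
    """Radial basis kernel on scalars."""
    return math.exp(-beta * (x - y) ** 2)

def psd_2x2(a, b, c):
    return a >= 0 and c >= 0 and a * c - b * b >= 0

x1, x2 = 0.0, 1.0
a, b, c = rbf(x1, x1), rbf(x1, x2), rbf(x2, x2)
print(psd_2x2(a, b, c))   # -> True: consistent with a valid kernel
```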
In all, we have defined SVM discriminators and shown how appropriate
kernel functions allow us to perform the computations in a
lower-dimensional space while still capturing all the information
available in the higher-dimensional one. The next section
describes the application of SVMs to the classification of tumors for
cancer diagnostics.
\section{ Tumor Classification with SVMs}
A generic approach for classifying two types of acute leukemias –
acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL) was
presented by Golub et al. \cite{Golub}. This
approach centered on effectively addressing three main issues:
\begin{enumerate}
\item Whether there were genes whose expression patterns were strongly
  correlated with the class distinction to be predicted (i.e., can ALL
  and AML be distinguished?)
\item How to use a collection of known samples to create a ``class
predictor'' capable of assigning a new sample to one of two classes
\item How to test the validity of their class predictors
\end{enumerate}
They addressed (1) by using a ``neighbourhood analysis'' technique to
establish whether the observed correlations were stronger than would
be expected by chance. This analysis showed that roughly 1100 genes
were more highly correlated with the AML-ALL class distinction than
would be expected by chance. To address (2) they developed a procedure
that uses a fixed subset of ``informative genes'' (chosen based on their
correlation with the class distinction of AML and ALL) and makes a
prediction based on the expression level of these genes in a new
sample. Each informative gene casts a ``weighted vote'' for one of the
classes, with the weight of each vote dependent on the expression
level in the new sample and the degree of that gene’s correlation with
the class distinction. The votes are summed to determine the winning
class. To address (3), they tested their predictor first by
cross-validation on the initial data set and then by assessing
its accuracy on an independent set of samples. Based on their tests,
they were able to identify 36 of the 38 samples (which were part of
their training set!) and all 36 predictions were clinically
correct. On the independent test set 29 of 34 samples were strongly
predicted with 100\% accuracy and 5 were not predicted.
A SVM approach to this same classification problem was implemented by
Mukherjee et al.\cite{Mukherjee}. The output of
classical SVM is a binary class designation. In this particular
application it is particularly important to be able to reject points
for which the classifier is not confident enough. Therefore, the
authors introduced a confidence interval on the output of the SVM that
allows for rejection of points with low confidence values. As in the
case of Golub et al.\cite{Golub} it was important for
the authors to infer which genes are important for the
classification. The SVM was trained on the 38 samples in the training
set and tested on the 34 samples in the independent test set (exactly
as in the case of Golub et al.). The authors’ results are summarized in
the following table (where $|d|$ corresponds to the cutoff for
rejection).
\begin{table}[h]
\centering
\begin{tabular}{|p{3 cm}|p{3 cm}|p{3 cm}|p{4 cm}|p{2 cm}|}
\hline
\textbf{Genes}&\textbf{Rejects}&\textbf{Errors}&\textbf{Confidence level}&\textbf{$|d|$}\\
\hline
7129&3&0&$\sim$93\%&0.1\\
\hline
40&0&0&$\sim$93\%&0.1\\
\hline
5&3&0&$\sim$92\%&0.1\\
\hline
\end{tabular}
\caption{SVM classification results of Mukherjee et al.\ for varying numbers of genes, where $|d|$ is the cutoff for rejection.}
\label{table1}
\end{table}
These results represent a significant improvement over previously
reported techniques, suggesting that SVMs can play an important role in
the classification of large data sets (such as those generated by DNA
microarray experiments).
\section{Semi-Supervised Learning}
In some scenarios we have a data set with only a few labeled data
points, a large number of unlabeled data points, and inherent structure
in the data. In this type of scenario, neither pure clustering nor pure
classification performs well, and a hybrid approach is required. Such a
semi-supervised approach could involve clustering the data first,
followed by classification of the generated clusters.
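One very simple instance of this hybrid idea is to propagate labels from the few labeled points to nearby unlabeled ones. The sketch below uses nearest-labeled-point assignment as a stand-in for first clustering and then labeling whole clusters; all data and names here are invented:

```python
# Hedged sketch of label propagation: assign each unlabeled point the
# label of its nearest labeled point (a crude stand-in for clustering
# followed by cluster-level classification). Data are invented.

labeled = {(0.0, 0.0): "A", (10.0, 10.0): "B"}
unlabeled = [(1.0, 0.5), (9.0, 9.5), (0.5, 1.5)]

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def propagate(point):
    nearest = min(labeled, key=lambda p: dist2(p, point))
    return labeled[nearest]

print([propagate(p) for p in unlabeled])   # -> ['A', 'B', 'A']
```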
\subsection{Open Problems}
\section{Current Research Directions}
\section{Further Reading}
\begin{itemize}
\item Richard O. Duda, Peter E. Hart, David G. Stork (2001) Pattern classification (2nd edition), Wiley, New York
\item See previous chapter for more books and articles.
\end{itemize}
\section{Resources}
\begin{itemize}
\item Statistical Pattern Recognition Toolbox for Matlab.
\item See previous chapter for more tools
\end{itemize}
\nocite{*}
\bibliographystyle{plain}
\bibliography{Lecture14_GeneExpressionClassification}