Analyzing our DNA can give us insights into how our bodies work, powering new drug discoveries and informing future medical care. Advancements in sequencing technology have allowed us to sequence genomes cheaply and efficiently, so DNA sequence data is now growing at an exponential rate. Even in asymptomatic patients, genomic sequencing can offer information on genetic variations that potentially cause disease or increase the risk of developing one. Gene sequences are the next “big data” and a perfect opportunity to mine patterns using AI and ML algorithms.
Some History and Future of Gene Data
The first entire genome we sequenced was that of the virus Phage Phi X174 in 1977, and it was done in the most daunting and painstaking way. Thankfully, new and better ways to sequence DNA have been created since. Researchers built on previous discoveries, much like the development of the Atomic Theory (for those who remember their Grade 9 science), which accelerated the speed of innovation in this field. The first generation of sequencing is represented by the chain termination method (commonly known as the Sanger sequencing method), developed by Sanger and Coulson in 1975, and the chain degradation method, developed by Allan Maxam and Walter Gilbert in 1977. These first-generation technologies had very high costs and low throughput, but they set off further innovation to make sequencing less painstaking.
After a lot of improvement, we had machines with higher throughput and lower costs, able to run billions of sequencing reactions at the same time and even sequence single DNA molecules directly. According to the NHGRI Genome Sequencing Program, the cost of sequencing one million base pairs came down from over $5,000 in 2001 to around a cent by October 2015. So we’ve got an exponentially rising amount of DNA data we can analyze for more insights. Can you guess what this means? 😏 DNA is becoming the next Big Data, making the perfect opportunity for some Machine Learning.
How a machine will be able to read DNA
DNA is a massive molecule with a double helix structure made up of 4 basic nucleotides: A, which pairs with T, and C, which pairs with G. How these nucleotides are put together decides everything from the color of your eyes to why your fingers have different lengths. But how in the world does a computer understand these letters well enough to figure out patterns or associations?
This is how traditional data (which computers can understand easily) differs from DNA:
- DNA data is made up of characters (A, T, C, G), not numeric values
- DNA sequence lengths vary widely: some sequences are very short, while others are extremely long
- DNA data has biological significance, so some ML techniques may not work well
For a computer to understand DNA sequence data, we need to convert the non-numeric “letters” into a numeric form that can serve as a matrix input for training a model. This encoding into numeric values can be done in three main ways:
- One-hot encoding:
Each categorical value is converted into a new categorical column and given a binary value of 1 or 0. This is well suited for CNNs and deep learning algorithms in general.
- Sequential encoding:
Each categorical value is converted into a specific numeric value. For instance, A, T, G, C might be encoded as 0, 2.5, 5.0, and 7.5. However, this strategy creates issues when training with gradient descent, because the arbitrary ordering implies a false numerical relationship between the bases.
- K-mer encoding:
K-mer refers to all of a sequence’s subsequences of length k. The longer sequence is decomposed into overlapping fragments of length k.
With 1 or a combination of these methods, we can use ML techniques to mine data from DNA sequences.
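As a quick sketch, the three encodings above might look like this in Python (a minimal illustration; the sequential values follow the A, T, G, C → 0, 2.5, 5.0, 7.5 example, and the function names are my own):

```python
# one-hot: each base becomes a 4-element binary vector
ONE_HOT = {"A": [1, 0, 0, 0], "T": [0, 1, 0, 0], "G": [0, 0, 1, 0], "C": [0, 0, 0, 1]}

def one_hot_encode(seq):
    return [ONE_HOT[base] for base in seq]

# sequential: each base becomes a single number (A, T, G, C -> 0, 2.5, 5.0, 7.5)
SEQUENTIAL = {"A": 0.0, "T": 2.5, "G": 5.0, "C": 7.5}

def sequential_encode(seq):
    return [SEQUENTIAL[base] for base in seq]

# k-mer: decompose the sequence into overlapping fragments of length k
def kmer_encode(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

For example, `kmer_encode("ACGAT")` yields the three overlapping words ACG, CGA, GAT.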
Machine Learning Algorithms for Data Mining
Data Mining is another word for discovering hidden and potentially useful information; it is used to find important patterns that were previously unknown. The four applications of ML for mining DNA sequence data are sequence alignment, sequence classification, sequence clustering, and sequence pattern mining.
Aligning the Sequences
DNA Sequence Alignment derives information by aligning the base nucleotides of DNA. Information is uncovered by finding similarities between a sequence with an unknown function (the query sequence) and a sequence whose function we know. If the degree of similarity between the two gene sequences exceeds a certain threshold (usually 30%), the sequences are considered to have a common ancestor. With this strategy, the structure, function, and evolution of the DNA can be predicted. Aligned sequences of nucleotides are usually represented as rows within a matrix, as shown below.
The comparison of two sequences is known as pairwise sequence alignment (PSA), and the comparison of more than two sequences is known as multiple sequence alignment (MSA). However, the more sequences you align, the harder and longer the mining becomes, requiring more data storage and computing power for the higher complexity. Today we have plenty of free, web-based tools to help us align sequences, like Clustal, T-Coffee, and MUSCLE.
However, finding a sequence that exceeds the threshold is tough. This is a big problem in this field of study because only around 2% of all our DNA has a known functional role. This means that aligning whole sequences, or looking for global similarities, isn’t always effective: global similarity search only suits sequences with a high degree of similarity as a whole. Thankfully, we have a solution. If there are no similarities at the global level, there can still be similarities at a local level. Local similarity search compares parts of sequences, which can prove much more effective.
The Needleman-Wunsch algorithm can be used for deriving global sequence similarities, and the Smith-Waterman algorithm can be used for deriving local sequence similarities. However, the Smith-Waterman algorithm tends to use a lot of computing time.
Super easy to understand terms for sequence alignment
Match: Does match
Mismatch: Doesn’t match
Gap: Oh no! There is a gap!
Sequence alignment with both the Needleman-Wunsch and Smith-Waterman algorithms comes in three steps: initialization, matrix filling, and traceback, and is dictated by how scores and penalties are added along the way. If there is a match, a score of +1 is added. If there is a mismatch, a penalty of -1 is added. Finally, if there is a gap, a penalty of -2 is added. How this works will make more sense soon.
Let’s align two sequences, ACGAT and AGCT, by doing what a computer would do using the Needleman-Wunsch algorithm. The first step, initialization, consists of creating the matrix and preparing it for matrix filling. We create an m + 1 by n + 1 matrix, where m and n are the lengths of the two sequences. Then we fill the first row and first column with a progressing gap penalty, because the more we shift one sequence, the larger the gap becomes. With a gap penalty of -2, the cell values in the first row and column decrease by 2 at each step. For any vertical or horizontal movement in the matrix, a gap penalty of -2 is added.
Now we move on to the next step, which is matrix filling. To fill one cell of the matrix we consider three things:
- The value from its left with the gap penalty added
- The value from its top cell with gap penalty added
- The value from its top-left diagonal with a match or mismatch penalty added (if both letters are the same match penalty is added, otherwise mismatch penalty is added)
The highest of these three values becomes the value of the cell. In this way, we fill the entire matrix left to right, top to bottom. You may start to notice other patterns in the filling process.
The final step is traceback. We begin at the bottom-right corner and work back to the upper-left corner, where the filling process started. If the letters match, we move diagonally to the top-left of the cell. If the letters do not match, we move to the adjacent cell with the highest value.
To derive the alignment, we look at the traceback path created, or the direction of the arrows. When two letters are paired (match or mismatch), the arrow points diagonally to the top-left: if it points to a smaller-valued cell, there is a match; otherwise, there is a mismatch. If the arrow points directly left or up, there is a gap.
Voila! Two sequences have been aligned globally. Of course, this is a very small example.
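The three steps above can be sketched in plain Python (a minimal illustration using the same +1/-1/-2 scores; real work would use an optimized library such as Biopython):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    # initialization: first row and column accumulate gap penalties
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    # matrix filling: each cell takes the best of the diagonal, top, and left moves
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # traceback: walk from the bottom-right corner back to the top-left
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        cell = score[i][j]
        if i > 0 and j > 0 and cell == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cell == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[m][n]
```

For ACGAT and AGCT, this yields ACGAT aligned against A-GCT with a final score of 0.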
The Smith-Waterman algorithm is similar to the Needleman-Wunsch algorithm. The only technical difference is that all negative values calculated are converted into zeroes right away. Also, the traceback method is far simpler: we begin at the highest value(s) in the matrix greater than zero, then move diagonally towards zero. The arrows are counted to get a local similarity. Below, I have aligned the sequences ACGTT and CACG using the Smith-Waterman algorithm:
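As a sketch, the Smith-Waterman matrix-filling step differs from Needleman-Wunsch only in flooring every cell at zero and remembering the best-scoring cell, where traceback would begin (a minimal illustration; the function name is my own):

```python
def smith_waterman_matrix(a, b, match=1, mismatch=-1, gap=-2):
    # identical to Needleman-Wunsch filling, except scores are floored at zero
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:  # local traceback starts from this cell
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos
```

For ACGTT and CACG, the best local score is 3, from the shared ACG fragment.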
After some sequence similarity analysis of the aligned sequences, we can predict the properties of the sequence(s) we want further details of based on the sequence whose properties we do know.
DNA Sequence Classification
Classification is used to predict the category of items with unknown labels, based on what was learned from a training set. It works well with discrete variables and categorical data, making it a natural fit for analyzing DNA. Sequence classification can be used to analyze DNA sequences for sequence similarity (of structure or function) and to predict a category or class (in terms of function and relationship) for other sequences, and it can assist in gene identification in DNA molecules. Convolutional Neural Network classifiers are a great method for DNA sequence classification: they are behind text problems like Gmail’s spam detection and Grammarly’s sentiment classification, and they are great at extracting features from a raw dataset.
However, raw text cannot be given as an input to a CNN for feature extraction and class prediction. Inputs have to be converted into a numerical representation first. For normal text, we can build a dictionary mapping each word to a specific number, then one-hot encode the text against that dictionary (which works well for CNNs). Unlike normal text, though, a DNA sequence has no words; it is one very long “word” of hundreds of letters without spaces. To solve this, we can k-mer encode the sequence into words and then one-hot encode them against a k-mer dictionary (with a 3-mer format, the dictionary has 64 different words and a one-hot vector of size 64). In this manner, any text classification algorithm can be used to classify DNA.
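As a sketch, the 64-word 3-mer dictionary and the corresponding one-hot vectors described above can be built like this (a minimal illustration; libraries like scikit-learn offer equivalent vectorizers):

```python
from itertools import product

# build the dictionary of all 3-mers over A, C, G, T (4^3 = 64 words)
VOCAB = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=3))}

def encode_3mers(seq):
    # split the sequence into overlapping 3-mer "words",
    # then one-hot encode each word against the 64-entry vocabulary
    words = [seq[i:i + 3] for i in range(len(seq) - 2)]
    vectors = []
    for w in words:
        v = [0] * len(VOCAB)
        v[VOCAB[w]] = 1
        vectors.append(v)
    return vectors
```

Each position in the sequence thus becomes a 64-dimensional one-hot vector, ready to stack into a matrix for a neural network.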
Once the DNA sequence is preprocessed into numeric values, it can be fed into a CNN model for training. The CNN architecture has convolutional layers, which extract specific features from their input to create feature maps, and max-pooling layers, which reduce the dimensions while keeping only the strongest information. Hyperparameters of a CNN include the number and size of the filters in each convolutional layer. Here is a quick 5 min intro to the CNN architecture by What’s AI.
Layers in a typical CNN algorithm for text classification.
- Embedding: The first layer is the embedding layer. It converts text into a vector space model, which is an algebraic model for representing text as vectors. (When training the CNN, this layer learns weights for embedding the training examples and generalizes that knowledge to any new example.)
- Convolutional and Max-pooling Layers: These layers extract features. Convolutional layers with specific filter sizes and non-linear activation functions (examples: ReLU, Leaky ReLU) extract features, creating a feature map, and a max-pooling layer reduces the dimensions of the feature map while holding on to the important information. There can be multiple convolutional and max-pooling pairs in a CNN.
- Flattening Layer: This layer flattens feature maps from convolutional and max-pooling Layers into a vector.
- Dense layers for classification: Here, the vector is passed into dense layers of neurons for classification (other classifiers, like Naive Bayes or Decision Trees, could be used on the flattened features instead). The output is fed into a softmax activation function, which computes the probability of each predefined class (in our case, classes of DNA in terms of function, structure, and evolution).
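To make the convolution and max-pooling steps concrete, here is a toy, framework-free sketch of a single convolutional filter scanning a one-hot encoded sequence (a real model would use a deep learning library; the “AC-detector” filter here is just an illustrative weight choice):

```python
BASES = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def one_hot(seq):
    return [BASES[b] for b in seq]

def conv1d(x, kernel):
    # slide the kernel over the sequence; each window gives one feature value
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(kernel[t][c] * x[i + t][c]
                for t in range(k) for c in range(len(x[0])))
        out.append(max(0, s))  # ReLU non-linearity
    return out

def max_pool(x, size=2):
    # keep only the strongest value in each window
    return [max(x[i:i + size]) for i in range(0, len(x), size)]

# a filter whose weights match "A" then "C" fires on the motif AC
ac_filter = [BASES["A"], BASES["C"]]
feature_map = conv1d(one_hot("ACGA"), ac_filter)
```

Here `feature_map` peaks where the motif AC occurs, and max-pooling then shrinks the map while preserving that peak; in a trained CNN, the filter weights are learned rather than hand-picked.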
DNA Sequence Clustering
Clustering is one of the most common tools for pattern recognition and exploratory data mining (like classification), but it doesn't need any labels to train on, nor any predefined classes to group into. It clusters, or groups, data on the spot based on the similarity of the given information. DNA sequence clustering can be used to group DNA with similar characteristics using sequence similarity research. To do so, we can use the k-means algorithm.
Before jumping into how the algorithm works, we need to perform a feature extraction process on each DNA sequence: we determine the frequency of specific substrings of nucleotides, and these frequency levels act as identifiers for a string group. Because the frequency values vary a lot, the data is normalized using the min-max normalization technique. The dataset might look something like this.
The K-means Clustering
The clustering process starts by determining the number of clusters (k) to group the data set into (perhaps why the algorithm is called k-means). The algorithm begins by placing k arbitrary cluster centers on a high-dimensional vector plane. Then it repeats the following two steps until convergence:
- For each data point, the distance to every cluster center is calculated, and the point is assigned to the cluster with the minimum distance. In each iteration, data points may be grouped differently based on these distances.
- The position of each cluster center is recomputed by averaging all the data points assigned to it in the previous step. The assignment and recomputation steps repeat until none of the cluster assignments change.
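The two steps above can be sketched in plain Python (a minimal, unoptimized illustration; in practice you would use something like scikit-learn's KMeans):

```python
import random

def kmeans(points, k, iters=100):
    # points: equal-length feature vectors (e.g. normalized k-mer frequencies)
    centers = random.sample(points, k)  # arbitrary initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # update step: recompute each center as the mean of its cluster
        new_centers = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence: assignments stopped changing
            break
        centers = new_centers
    return centers, clusters
```

On a toy dataset of four points forming two tight pairs, the algorithm recovers the two groups regardless of which points are drawn as initial centers.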
The k-means clustering algorithm requires you to provide the number of clusters, while a classification algorithm requires you to provide labels for the algorithm to train on to predict future DNA sequences. After clustering the dataset, we are going to get different groups of DNA based on similarities.
Today, there are still a lot of challenges with DNA sequence data mining from efficiency to ML’s “black box” nature, which limits the potential of machine learning to a certain extent. However, with innovations, this field of study can solve many problems in the world. ML strategies like Classification, Clustering, and Association Rule mining can accelerate how we derive information from DNA. They are being used to boost the drug discovery process from a couple of decades to a couple of years (that’s exactly what the company Deep Genomics is up to) or maybe even tell you if you might have a specific disease in the future! Machine Learning with DNA is doing great things today and will do extraordinary things in the future. 😎
You now have a pretty good understanding of ML with DNA sequences. Hopefully, you have learned a lot, and are inspired to learn more! If this article provided you with some insight, clap it, share it, and connect with me on Linkedin! If you have cool stuff related to AI, post it in the comments! And finally, subscribe to my monthly newsletter to follow along my journey through emerging technology!