Predicting Palmitoylation Sites in Protein Sequences

Aditya M
9 min read · Jul 14, 2024


Hello everyone! In this article I will give you a quick rundown of my machine learning model that predicts S-palmitoylation sites in protein sequences.
Here is a link to the project: https://spalmitoylationsitepredictor.streamlit.app/ (you can enter either a protein accession code or its raw sequence to get a list of possible palmitoylation sites in the protein's sequence).

Palmitoylation is a post-translational protein modification that often occurs in the endoplasmic reticulum or Golgi apparatus. It is the addition of a long fatty acid chain, commonly palmitate (hexadecanoic acid), to cysteine residues of proteins. The palmitoyl group acts like a sticky tag that directs proteins to specific locations within the cell, regulating their function. There are two types of palmitoylation: S-palmitoylation, which occurs on cysteine residues within a peptide, and N-palmitoylation, which occurs on a cysteine residue at the N-terminus of a peptide. S-palmitoylation is researched much more, and for this project I collected mostly samples of this variant.

Palmitoyl groups are added to proteins by enzymes called palmitoyl acyltransferases and removed by acyl protein thioesterases, so palmitoylation is not a permanent modification. In stable palmitoylation, palmitoyl groups are rarely removed; in cyclic palmitoylation, palmitoyl groups are constantly added and removed to regulate a protein's function. The S-palmitoylating enzymes belong to the DHHC family (23 proteins), part of the larger group of protein acyl transferases, or PATs, which carry out post-translational lipid modifications including the attachment of palmitoyl groups. DHHC is a short amino acid sequence conserved in all 23 S-palmitoylating enzymes. Because fatty acids are hydrophobic, the palmitoyl "sticky tag" these enzymes add tethers proteins to cell and organelle membranes. By adding or removing this tag, palmitoylating and depalmitoylating enzymes influence where proteins move within the cell, regulating their function. This is why palmitoylation is involved in signaling cascades; for example, it governs molecular processes underlying synaptic activity in neurons.

Aberrant Palmitoylation

Aberrant palmitoylation is related to many diseases of the brain. Brain function depends on precise regulation of synaptic activity, which is governed by post-translational modifications like S-palmitoylation. The axon and axon initial segment (AIS) are important structures for action potential initiation and propagation. Their function relies on compartmentalisation, a process in which specific proteins are tethered to specific locations in neuronal membranes. One mechanism that regulates this protein tethering, and therefore neuronal action potentials, is S-palmitoylation. Palmitoylation is highly prevalent among neuronal proteins and is related to many neurological diseases:

  1. Disrupted palmitoylation caused by mutations in the ZDHHC9 gene is associated with X-linked intellectual disability and epilepsy. ZDHHC9-deficient mice also showed seizure activity and altered learning and memory.
  2. Similarly, mutations in ZDHHC5 and ZDHHC8 are associated with schizophrenia and bipolar disorder. ZDHHC8-deficient mice also showed synaptic changes and schizophrenia-like behavior.

Many studies involving mouse models and human-derived cells also show that disrupted or unregulated palmitoylation is related to various neurodegenerative diseases (those involving loss of neurons):

  1. Disrupted depalmitoylation has also been linked to lysosomal-storage neurodegenerative disease and Parkinson’s disease.
  2. Huntington’s disease is a neurodegenerative disorder caused by an expanded CAG repeat in the HTT gene, which produces a malfunctioning polyglutamine tract in the huntingtin (HTT) protein. HTT is normally palmitoylated at cysteine 214, but mutant HTT is less palmitoylated. Increasing palmitoylation by inhibiting depalmitoylating enzymes corrected HTT protein tethering and produced healthier behavior in affected mice.
  3. Alzheimer’s disease is caused by the formation of toxic β-amyloid plaques in neurons, which gradually slow down neuron function or kill the cells. These plaques form over time due to improper cleavage of the amyloid precursor protein (APP) by the β-secretase and γ-secretase enzymes. APP and the γ-secretase subunits nicastrin and APH-1 are palmitoylated. Over-palmitoylation of APP enhances the disease-causing cleavage that creates β-amyloid plaques, and mice expressing palmitoylation-deficient nicastrin and APH-1 showed less β-amyloid plaque formation.

The Project

I collected large amounts of data from the SwissPalm and UniProtKB databases and trained a CNN-BiLSTM neural network on it. I was inspired to do this project by Dr. Shaun Sanders of the NeuroPalm Lab @ the University of Guelph, who researches the connection between palmitoylation and neurological disease. Dr. Sanders stated that the “Precise control of neuronal excitability, or whether or not a neuron fires a nerve impulse to its neighbors, is essential for normal behavior and cognition, while aberrant excitability is a hallmark of many neurological diseases, including epilepsy, bipolar disorder, Schizophrenia, and episodic ataxias.” Since S-palmitoylation plays a critical role in a variety of biological processes and is involved in several human diseases, identifying specific sites of this modification is crucial for understanding its functional consequences in pathology. S-palmitoylation site predictors could help synthetic biologists design novel proteins with engineered properties (e.g., palmitoyl groups engineered to control where Cas9 proteins localize). A site predictor could also be used to design novel treatments for diseases such as Alzheimer’s disease: gene-editing systems or drugs could be engineered to target palmitoylation on proteins and regulate their function. Finally, and most importantly, a site predictor can cut the time required in the research process by an order of magnitude: a computational prediction of S-palmitoylation sites can efficiently generate promising candidates for further experiments.

Neural Networks

Neural networks are inspired by the brain’s structure and function. They do not have actual brain cells; instead they use networks of interconnected nodes that process information like a simplified model of neurons. These nodes are arranged in layers, and the connections between them carry weights (parameters) that determine how they model the input data. By adjusting these weights through a process called backpropagation, neural networks learn to model the datasets they are trained on. This makes them powerful tools for tasks like image recognition and speech understanding.

My model consists of dense layers, Bi-LSTM layers, and CNN layers. Initially, random parameters are assigned (LSTM gates, CNN kernels, and node weights), so at the beginning the model performs horribly. Every training round, a cost function (binary cross-entropy) measures the model’s error by comparing its output to the desired outcome. This difference, called the loss, is propagated backward through the network, and an optimizer (Adam) fine-tunes the parameters to minimize it, effectively asking: what small change in each weight would lead to a smaller loss? This cycle is repeated, gradually fine-tuning the network to produce more accurate outputs.
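To make the cost function concrete, here is binary cross-entropy computed by hand in NumPy (a minimal sketch for illustration; the actual model relies on Keras’ built-in loss):

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-7):
        """Mean binary cross-entropy between true labels and predicted probabilities."""
        y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1, 0, 1, 1])          # 1 = palmitoylated, 0 = not
    y_pred = np.array([0.9, 0.2, 0.6, 0.4])  # model's predicted probabilities
    print(binary_cross_entropy(y_true, y_pred))  # lower is better; perfect predictions -> ~0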

  • Convolutional Neural Network
    CNNs are best known for breaking images down into progressively more complex features; here they do the same for sequences. Kernels, small sets of learned parameters, slide over the one-dimensional sequence data, generating feature maps. Max pooling then reduces the size of the input and helps the network focus on the most important parts. Together these two operations extract the most important local patterns from the input, which are then fed into subsequent layers (in this project, the Bi-LSTM layers).
  • Bidirectional Long Short-Term Memory (Bi-LSTM)
    LSTMs read a sequence from left to right and remember long-term and short-term details, like key tokens or earlier context, to make predictions or extract features. They struggle, however, to consider future tokens when interpreting the current one. Bidirectional LSTMs instead read forwards and backwards simultaneously with two “minds”. The Bi-LSTM layer uses gates to evaluate what is “important” and should be stored in long-term memory: important information is assigned a value close to 1, and information to be forgotten is assigned a value close to 0.
  • Dense layers
    Fully connected layers consist of “neurons”. Each neuron receives inputs from multiple other nodes, performs a simple calculation, and sends its output onward. Each input has a corresponding weight, representing the strength of the connection between the two nodes. The neuron multiplies each input by its weight, sums the products, and passes the result through an activation function, which introduces non-linearity and allows the network to model complex relationships between inputs and outputs. These weights are the parameters fine-tuned during training.
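To make that dense-layer arithmetic concrete, here is a single neuron computed in NumPy (an illustrative sketch with made-up numbers, not code from my model):

    import numpy as np

    def dense_neuron(inputs, weights, bias):
        """One neuron: weighted sum of inputs plus bias, passed through ReLU."""
        z = np.dot(inputs, weights) + bias  # multiply each input by its weight and sum
        return max(0.0, z)                  # ReLU activation introduces non-linearity

    x = np.array([0.5, -1.2, 3.0])  # outputs from the previous layer
    w = np.array([0.8, 0.1, -0.4])  # learned connection strengths
    print(dense_neuron(x, w, bias=0.2))  # 0.4 - 0.12 - 1.2 + 0.2 = -0.72 -> ReLU -> 0.0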

Data Collection

In order to predict palmitoylation sites, I needed to create a large dataset to train my model. I built a dataset of 8,545 samples split roughly equally between positive samples (the sequence is S-palmitoylated) and negative samples (the sequence is not S-palmitoylated). The SwissPalm database provided over 7,600 palmitoylated proteins along with the positions in their amino acid sequences at which they are palmitoylated. After removing inaccuracies in the data (mistakes in the position of the cysteine residue and uncertain positions), I was left with 4,272 positive samples. I sourced non-palmitoylated samples from UniProtKB, which holds over 500,000 sequenced proteins: I randomly selected proteins, parsed their sequences, and picked random cysteine residues, using the UniProt API to return each protein's sequence. This API-dependent step was limited by internet speed and took over 24 hours to complete. Once I had my sequences, I truncated each one to 100 amino acids on either side of the cysteine residue, then used one-hot encoding to convert the letters into a numerical form my model could understand. The final shape of my dataset is (8545, 201, 21).
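Here is a rough sketch of the windowing and one-hot encoding step. The helper below is my own illustration: I am assuming the 21 channels cover the 20 standard amino acids plus one padding/unknown symbol, which matches the (8545, 201, 21) shape:

    import numpy as np

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard amino acids + '-' for padding/unknown
    AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

    def encode_window(sequence, cys_pos, flank=100):
        """Cut a window of `flank` residues on each side of a cysteine and one-hot encode it."""
        start = cys_pos - flank
        window = sequence[max(0, start):cys_pos + flank + 1]
        window = "-" * max(0, -start) + window                  # pad if cysteine is near the start
        window = window + "-" * (2 * flank + 1 - len(window))   # pad if near the end
        one_hot = np.zeros((2 * flank + 1, len(ALPHABET)), dtype=np.float32)
        for i, aa in enumerate(window):
            one_hot[i, AA_INDEX.get(aa, AA_INDEX["-"])] = 1.0   # nonstandard residues -> '-'
        return one_hot

    seq = "MKT" + "A" * 120 + "C" + "G" * 120  # toy protein sequence
    print(encode_window(seq, seq.index("C")).shape)  # (201, 21)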

Model Structure

In order to predict whether an input sequence is palmitoylated or not, the neural network needs to understand the “meanings” of different amino acids and capture the contextual features of amino acid sequences. After reading “An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences” by Siquan Hu et al., I learned about various neural network layers that can extract insights from amino acid sequences. Inspired by their model structure, I designed the architecture shown below:


+---------------+--------------+---------+
| Layer (type)  | Output Shape | Param # |
+---------------+--------------+---------+
| Input         | (201, 21)    | 0       |
| Conv1D        | (192, 256)   | 27,008  |
| MaxPooling1D  | (96, 256)    | 0       |
| Dropout       | (96, 256)    | 0       |
| Conv1D        | (92, 256)    | 82,048  |
| MaxPooling1D  | (46, 256)    | 0       |
| Dropout       | (46, 256)    | 0       |
| Bidirectional | (46, 256)    | 263,168 |
| Bidirectional | (256,)       | 394,240 |
| Dense         | (32,)        | 1,056   |
| Dense         | (1,)         | 33      |
+---------------+--------------+---------+
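For readers who want to reproduce the architecture, here is roughly how it looks in Keras. The filter counts, kernel sizes, and dropout rates are my reconstruction from the output shapes in the table above, not the exact values of the saved model:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_model(window=201, channels=21):
        """CNN + Bi-LSTM binary classifier for S-palmitoylation site prediction."""
        model = keras.Sequential([
            keras.Input(shape=(window, channels)),
            layers.Conv1D(128, kernel_size=10, activation="relu"),  # local sequence motifs
            layers.MaxPooling1D(pool_size=2),
            layers.Dropout(0.3),
            layers.Conv1D(128, kernel_size=5, activation="relu"),   # higher-level features
            layers.MaxPooling1D(pool_size=2),
            layers.Dropout(0.3),
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),  # context in both directions
            layers.Bidirectional(layers.LSTM(128)),
            layers.Dense(32, activation="relu"),
            layers.Dense(1, activation="sigmoid"),  # probability the cysteine is palmitoylated
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    build_model().summary()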

Results

After training my model, I found that it was able to extract vital information from sequences and predict palmitoylation sites on unseen sequences with an accuracy of 83%, exceeding my expectations. I faced multiple limitations, from computational power to the quantity of available data. A major problem was early overfitting, which is when the model fits the training data so closely that it can no longer generalize and make accurate predictions on unseen data. To fix this issue, I tried dropout layers (which randomly set input elements to zero so that no individual weight grows too dominant) and weight regularization (which adds a penalty term to the loss function during training, discouraging the model from having very large weights). I also tried reducing the complexity of my model (the number of parameters), but saw no improvement (accuracy stayed low, around 68-69%).

The root cause was that the protein sequences used for training had significant variance, which kept my predictor from generalizing accurately. The fix was to increase the window size to around 201 amino acids, which I believe helps because of the complex folding patterns of proteins. As seen in the training-accuracy curve over time, the model starts overfitting around epoch 10, where the validation accuracy stops increasing. This is a limitation of my model, potentially due to not having enough data (there is still large variance in the dataset). As I have exhausted all my available resources, the next step will be reaching out to university labs for access to more data.
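For reference, this is roughly how the dropout, weight regularization, and stop-before-overfitting mitigations look in Keras. The rates and patience values here are illustrative, not the exact ones I tried; EarlyStopping is the standard Keras callback for halting training once validation accuracy plateaus:

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    # L2 weight regularization adds a penalty on large weights to the loss
    regularized_dense = layers.Dense(32, activation="relu",
                                     kernel_regularizer=regularizers.l2(1e-4))

    # Dropout randomly zeroes a fraction of its inputs during training
    dropout = layers.Dropout(0.3)

    # Early stopping halts training when validation accuracy stops improving
    early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=3,
                                               restore_best_weights=True)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           epochs=30, callbacks=[early_stop])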

Results on the testing dataset:

  • Accuracy: 83.18% (what percent of all predictions were correct)
  • Precision: 82.59% (what percent of positive predictions were correct)
  • Recall: 87.55% (what percent of positive samples were predicted correctly)
  • Specificity: 83.98% (what percent of negative samples were predicted correctly)
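All four metrics come from the confusion matrix on the testing dataset; this short sketch (with toy labels, not my real test set) shows how they are computed:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # toy ground-truth labels
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # toy model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)   # also called sensitivity
    specificity = tn / (tn + fp)
    print(accuracy, precision, recall, specificity)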

Thank you for reading about my project! Reach out to me on LinkedIn if you would like to collaborate with me and make this project better and more useful.

https://ca.linkedin.com/in/aditya-mahes
