Deep machine learning uses the language of proteins to predict their properties

(ORDO NEWS) — Deep learning models have proven themselves well when working with texts and speech. However, they are also effective for solving problems in molecular biology and biomedicine, including the prediction of the functional properties of proteins based on their amino acid sequence.

Over the years, bioinformaticians, geneticists, neurophysiologists, and other specialists in the life sciences have continued to elucidate the biological functions of genes and their products, proteins.

To do this, they have to use large and sometimes complex data that simply cannot be handled without the help of machine learning and data analysis.

Recall that proteins are large biological molecules with a complex structure. They are long chains (polymers) consisting of many linked amino acid units (monomers).

Proteins can perform a wide variety of very specific functions, from forming the “cellular skeleton” to catalyzing chemical reactions, acting as “molecular machines” and regulating various biological processes.

This is possible due to their special three-dimensional structure, which, in turn, is determined precisely by the amino acid sequence of the protein.

At the same time, establishing a relationship between the amino acid sequence, the structure of a protein, and its functions is a difficult and far from solved problem.

Therefore, researchers from three different universities in Turkey published a paper in the journal Nature Machine Intelligence , in which they evaluated the possibility of using deep learning models, originally intended for linguistic analysis.

Deep learning is a type of machine learning based on neural networks. It is called deep because the structure of its networks consists of several input, output and hidden layers of neurons located between them. The authors of the new publication considered both the strengths of this approach and its disadvantages.

“The data obtained using molecular biology can be represented in the form of a language (in fact, the language of genes / proteins) in such a way that the sequence of a gene or protein turns out to be something like a sentence in a natural language that has a certain meaning,” said one of the authors, Tuncha Dogan (Tunca Dogan).

He believes that the meaning of such a “language of proteins” comes down to the special biological, physical and chemical properties of these biomolecules.

“In accordance with this, the goal of the work was to build machine learning models that use a vector representation in multidimensional space borrowed from language models (high dimensional numerical embeddings.) for proteins as input data and which accurately predict their functional properties”.

To successfully evaluate protein language models and their performance metrics, researchers first had to prepare large, robust datasets. Each of these sets has a certain “level of difficulty”.

With this method, the Turkish scientists were able to evaluate the suitability of different “language modeling” architectures (including BERT, T5, XLNet, and ELMO) for identifying hidden patterns in protein sequences.

The researchers believe that these seemingly invisible properties of the sequences provide valuable information about the functional characteristics of proteins.

“Perhaps the most remarkable result was that these deep learning models were able to successfully establish the functional properties of proteins based solely on the amino acid sequence, although this is quite a difficult task.

It is also in good agreement with the results of other recent structure prediction studies (eg Deepmind’s AlphaFold2 and Baker’s RoseTTAFold), which used the sequence as input,” added Dogan.

The new approach and techniques similar to it may have many practical applications, including the development of personalized therapies.


Contact us: [email protected]

Our Standards, Terms of Use: Standard Terms And Conditions.