Mr Shivashankar, people speak of a data explosion, meaning that the amount of data stored in computer systems has been doubling about every two years. What is the reason for the data explosion in your field?
G.V. Shivashankar: One reason is that scientists started to generate large datasets using single-cell sequencing methods. In the past, the information from many millions of cells was combined into an average value – for example, when examining a blood sample. But such averaged data is no longer adequate for modern personalised medicine: you want to read every single cell to understand how they behave. In the last ten years, technologies were developed to generate all types of data at the single-cell level: imaging data, sequencing data, proteomics data, and so on.
Why is it necessary to have such data for each individual cell?
Because there is large heterogeneity in cells, even if they are the same type of cell in the same tissue. The way the genome is expressed depends heavily on the microenvironment of a cell.
Why is that? Doesn’t the genome expression of a cell mainly depend on its genome and its genes?
Yes, that is what we had assumed for a long time. But in the last ten to fifteen years it has become clear that the way you pack the DNA within the nucleus is also critical in dictating how the genome is expressed. It also influences how diseases in different tissues develop.
How can that be?
Our DNA is about one metre long. In each one of our cells, it gets packed into a small nucleus, which is about ten microns in size, one hundredth of a millimetre. The stiffness of a tissue, tension, or other microenvironmental characteristics alter the way in which the DNA gets packed. Many age-related diseases are related to exactly this. Abnormal packing even plays a role in the development of neurodegenerative diseases and cancer.
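To put those two numbers side by side, here is a minimal back-of-the-envelope sketch. It uses only the round figures quoted above (about one metre of DNA, a nucleus about ten microns across); the function name is invented for illustration, and the result is a rough linear ratio, not a measured packing density.

```python
# Lengths in micrometres, taken from the round figures in the interview.
DNA_LENGTH_UM = 1_000_000   # about one metre of DNA per cell
NUCLEUS_DIAMETER_UM = 10    # nucleus roughly ten microns across


def linear_compaction(dna_length_um, nucleus_diameter_um):
    """Return how many nucleus diameters the stretched-out DNA would span."""
    return dna_length_um // nucleus_diameter_um


ratio = linear_compaction(DNA_LENGTH_UM, NUCLEUS_DIAMETER_UM)
print(f"The DNA is about {ratio:,} times longer than the nucleus is wide")
```

In other words, the DNA has to be folded by roughly five orders of magnitude to fit, which leaves a great deal of room for the packing to vary from cell to cell.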
How can you “measure” the way a DNA molecule is packed?
That’s where it gets interesting. A few years ago, we developed the hypothesis that by understanding how the DNA is packed, we could predict how the cell behaves and which genes are expressed. To investigate this hypothesis, we use an image-based approach. We take images of cells in their native environment, for example with light microscopy, and from these images, we try to infer how the DNA is packed and then link that to its function.
I assume you are not personally sifting through thousands of images.
Right, we use machine learning in collaboration with Caroline Uhler’s lab at MIT, one of the leading groups in this field. This way, we extract important information from the images of cells and tissues in different functional stages. The main question is: What are the features that differ between the different stages? Thus, we hope to be able to distinguish between a normal and an abnormal state of a cell.
How is that done in practice?
Each image of the cell nucleus is characterised by thousands of features: texture, brightness and intensity in different regions, and geometric features – for example, elongated or rounded structures. This gives us clues on how the DNA is packed, and using machine learning, we can extract and understand these features. To be able to use the information inside these images, we also need to represent them in a simplified form; we call it “bringing them into a lower dimension.”
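As a rough illustration of what it means to describe an image by a handful of features, the sketch below reduces two toy "nucleus" images to three numbers each: area, mean brightness and elongation. Everything in it is an assumption made for illustration: the pixel values are invented, the feature definitions are deliberately simple, and the function name is hypothetical. It is not the method used in the lab, which relies on machine learning over thousands of features.

```python
# Toy grey-scale "nucleus" images: 0 = background, higher = brighter stain.
# Both images are invented for illustration only.
ROUND = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 6, 5, 0, 0],
    [0, 6, 9, 6, 0, 0],
    [0, 5, 6, 5, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
ELONGATED = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [3, 4, 4, 4, 4, 3],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def nucleus_features(image):
    """Summarise an image by area, mean brightness and elongation."""
    # Collect (row, column, value) for every non-background pixel.
    pixels = [
        (r, c, value)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value > 0
    ]
    rows = [r for r, _, _ in pixels]
    cols = [c for _, c, _ in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {
        "area": len(pixels),
        "mean_brightness": sum(v for _, _, v in pixels) / len(pixels),
        # Bounding-box aspect ratio: 1.0 for a square-ish shape, larger if stretched.
        "elongation": max(height, width) / min(height, width),
    }


for name, image in (("round", ROUND), ("elongated", ELONGATED)):
    print(name, nucleus_features(image))
```

Each 30-pixel image is now represented by three values, a very crude version of bringing the data into a lower dimension. The elongation value alone already separates the two toy shapes.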
Were you actually able to infer the cell function from the images?
Yes, by linking them to gene expression data. In particular, we developed a machine learning method for multi-domain data translation. This allows us to translate between data of different types, such as images and sequencing data, which cannot yet be measured experimentally in the same cell. In this case, we took images of thousands of T cells, immune cells in the blood. We used our machine learning approach to link how the DNA is packed in those cells with single-cell expression data from the same population of cells. We wanted to know: If part of the DNA is packed more compactly, does it reflect a group of genes that are being shut off? Our hypothesis was that these regions won’t be transcribed because the information cannot be retrieved as efficiently.
So, were you right?
Yes, and in the end, we were able to predict which genes will be expressed just from the way the DNA is packed. Without even measuring the genes. This is a big deal in the field because single-cell sequencing costs are quite high, and single-cell images can often be obtained more easily. But above all, these images of cells are made in their native microenvironment in the tissue.
What are the medical applications of mechano-genomics and machine learning methods?
In another study, we recently showed that they can be combined to find new drugs. Or more precisely: to find out which active substances that are already on the market can help against other diseases. Diseases that were not even considered when the drugs were approved. For example, we looked for already known active substances that could help against Covid-19. We strongly suspect that an infection with Sars-CoV-2 hits older people so hard because their cells are older and stiffer. That’s why the virus is particularly good at interfering with the signalling pathways of the cell and can multiply better in these cells. So we wanted to know: Are there already active substances on the market that could restore an older infected cell to its normal state before infection?
How did you go about finding potential agents against Covid-19?
We looked at how the gene expression changes when cells are infected with Sars-CoV-2. We also used CMap, which is a database of thousands of chemical compounds showing how they change the transcription of genes in cells. Bringing all this information together, we identified two candidate groups of compounds that could roll back the effect of Sars-CoV-2. These active substances block certain enzymes in the cell and thus should help older Covid-19 patients.
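The idea of rolling back an infection signature can be sketched as a simple ranking: score each compound by how strongly its effect on gene expression opposes the change caused by infection. The sketch below is a toy under stated assumptions. The gene names, compound names and numbers are all invented, the scoring rule (a negated dot product) is a stand-in chosen for simplicity, and nothing here uses the real CMap data or its interface.

```python
# Hypothetical expression changes (positive = gene more active, negative = less).
# All gene names, compound names and values are invented for illustration.
INFECTION_SIGNATURE = {"GENE_A": 2.0, "GENE_B": -1.5, "GENE_C": 1.0}

COMPOUND_SIGNATURES = {
    "compound_X": {"GENE_A": -1.8, "GENE_B": 1.2, "GENE_C": -0.9},
    "compound_Y": {"GENE_A": 1.0, "GENE_B": -0.5, "GENE_C": 0.2},
    "compound_Z": {"GENE_A": -0.2, "GENE_B": 0.1, "GENE_C": 0.3},
}


def reversal_score(disease, compound):
    """Higher score = the compound pushes genes opposite to the disease."""
    return -sum(disease[gene] * compound.get(gene, 0.0) for gene in disease)


def rank_compounds(disease, compounds):
    """Return (name, score) pairs, best candidate for reversal first."""
    scored = [
        (name, reversal_score(disease, signature))
        for name, signature in compounds.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_compounds(INFECTION_SIGNATURE, COMPOUND_SIGNATURES)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy data set, compound_X ranks first because it lowers the genes that infection raises and raises the one that infection lowers, while compound_Y ranks last because it mimics the infection. A real analysis would cover thousands of genes and compounds and would need statistical controls that this sketch leaves out.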
Will you test these substances in the lab or on patients?
Surprisingly, we found out that clinical trials on one of the groups of drugs are already under way with Covid-19 patients. The pharmaceutical industry may have arrived at the same hypothesis by a different route than we did. So, we will soon know whether these drugs actually help.
How can such novel approaches in data science advance personalised medicine?
One example: In collaboration with the Centre for Proton Therapy at PSI, we are starting a project to develop a biomarker that measures the efficacy of proton therapy. Our hypothesis is that the blood cells circulating in the body of a cancer patient receive signals from tumour cells. As a result, they change their DNA structure to express different genes. We will examine blood samples of patients coming to PSI for proton therapy. We will then use DNA packing as a biomarker to evaluate the treatment success in each patient. This might offer a way to tune the therapy more accurately depending on the results.
What is the next big step with regard to data science in your field?
Ultimately, we would like to understand how diseases start at the single-cell level within the microenvironment of a tissue. We are still far from this, but such an understanding is critical for early therapeutic interventions. Looking at all proteins inside a cell, though, is like opening Google Earth: There is too much information at once. To comprehensively understand what is going on, data science approaches could take us to the next level.
Interview: Paul Scherrer Institute/Brigitte Osterath