Why Genome Sciences?

Life, such a wonderful and elegant form of existence on Earth, brings endless wonder and surprise to this enormous universe!

How did life originate, how is it inherited, and how does it evolve? What contributes to the incredible biodiversity we see today? What gives human beings their intelligence? How do living things interact with each other and with their surroundings? Where do diseases come from? What does our future hold? Thousands of questions about life remain to be answered.

The mystery of life is so captivating that biologists have spent more than a century chasing the knowledge behind this miracle of nature, from Mendel's first observations of inheritance to the identification and confirmation of DNA as the main genetic material. Once people had found that phenotypes are inherited, scientists, who are usually also materialists, hypothesized that some material substance must carry phenotypes from one generation to the next, and researchers set out to identify those molecules. The pioneering scientists, many of whom were actually chemists and physicists, concluded through elegant experiments that DNA, rather than proteins or other biological macromolecules, is the molecule that transmits information to descendants. Later, people worked out the composition of DNA along with its structure, which is regarded as the beginning of modern biology, or molecular biology. I would say that understanding the composition of DNA is more important than understanding its structure because, in terms of information, it tells us that DNA is a polymer of nucleotides carrying four types of nitrogenous bases. It is the information in the combinations of these bases that gives rise to the complexity and diversity of life.

Next, people tried to figure out how DNA passes on our phenotypes. Scientists discovered that disrupting parts of a gene produces phenotypic alterations. Based on genetic and biochemical studies, people summarized the central dogma of genetic information flow: DNA ultimately passes its genetic information, via RNA, to proteins, the functional products, through the processes called transcription and translation. People were excited to establish that DNA encodes the information for proteins and is the center that commands a cell's behavior. Those DNA components are what we call genes.
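To make the information flow concrete, here is a minimal sketch in Python of transcription and translation. The function names and the toy sequences are my own illustration; the codon table is the standard genetic code. It assumes the input is the coding strand of the DNA and ignores everything a real cell does in between (splicing, regulation and so on).

```python
# Build the standard genetic code as an RNA codon -> amino acid table.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES),
        AMINO_ACIDS,
    )
}


def transcribe(coding_dna):
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene (hypothetical sequence) and a single-base change that
# introduces a premature stop codon, shortening the protein.
normal_gene = "ATGGCCAAATGA"
mutant_gene = "ATGGCCTAATGA"
normal_protein = translate(transcribe(normal_gene))
mutant_protein = translate(transcribe(mutant_gene))
```

The mutant differs from the normal gene by one base, yet its protein is one amino acid shorter, a small picture of how disrupting part of a gene can alter the product.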

Over those decades, people tried their best to understand genes and accumulated a vast body of knowledge about cellular function, transcription, disease and so forth. To gain a panoramic view of all the information inside our DNA, scientists launched the Human Genome Project (HGP), aiming to sequence the human genome and obtain the script necessary to form a human being.

But here comes another question. The cells of one individual all contain the same genome script, so what makes them different: neurons, stem cells, immune cells, blood cells, cancer cells, fibroblasts? As the study went deeper, people learned that it is not the number of genes but the gene regulation programs that contribute to the complexity of life. Genes encode the transcription machinery, such as RNA polymerase, which transcribes DNA. Transcription factors bind sequence-specific regions and activate gene expression under certain conditions. Genes are therefore regulated by the interaction of cis elements in the genome with trans-acting factors (transcription factors), which orchestrates a huge transcriptional network and establishes the identity of different cell types. Because we understood only the less than 1.5% of the genome that encodes proteins, the remaining non-coding portion was largely unknown, especially the parts related to gene regulation. It therefore became important to understand the cis-regulatory elements along with transcription factor expression. But how can we find the location and distribution of the cis-regulatory elements? Perhaps we could run genetic screens or perform genome-wide motif analysis to help identify them. But in the early days, it was difficult to perform perturbations in mammalian cells. That is why we have the ENCODE project.
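As a rough illustration of what a motif analysis does, the sketch below slides a short motif along a sequence and reports matches on both strands. The function names, the toy sequence and the motif are all made up for illustration, and real analyses usually score position weight matrices rather than counting mismatches against a single string.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_motif(seq, motif, max_mismatches=0):
    """Return (position, strand) for every window that matches the motif.

    Positions are 0-based and refer to the forward strand. A "-" hit means
    the window matches the reverse complement of the motif.
    """
    hits = []
    k = len(motif)
    patterns = (("+", motif), ("-", reverse_complement(motif)))
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, pattern in patterns:
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.append((i, strand))
    return hits


# Toy example: a hypothetical 6-base motif that occurs once on each strand.
toy_sequence = "AAGATAAGCCTTATCTT"
toy_hits = scan_motif(toy_sequence, "GATAAG")
```

A match found this way only says that a factor could bind there. Whether the site is actually used in a given cell type is a separate question that sequence alone cannot answer.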

The ENCODE project aims to understand and annotate the function of the unknown regions of the genome. The way to make sense of this double Dutch is through the epigenetic marks on DNA and histones. Labs around the world had previously found that some DNA modifications and histone marks correlate with the locations of cis-regulatory elements such as promoters and enhancers. Other histone marks help us identify transcriptional activity and chromatin states. Other groups developed chromatin accessibility assays such as DNase-Seq, MNase-Seq, and ATAC-Seq, treating open chromatin regions as regions of active elements and active genes. Systematic studies using RNA-Seq also reveal the transcriptomic profiles of cells, from which the activity of their regulatory elements can be inferred. With what we infer from these epigenomic signatures and landscapes, we can understand life better.

The story of understanding the genome becomes more and more exciting as we approach its real physical state. Job Dekker is a biologist interested in chromatin conformation. Together, his lab and colleagues developed a high-throughput method, Hi-C, which captures genome-wide chromatin contacts to map the physical structure of our genome and those of other organisms. Researchers performed Hi-C on cells and identified the structure of our chromatin. To our surprise, chromatin is organized into discrete topological domains that are largely shared across cell types. These domains are consistent with the long-hypothesized insulator elements, which block gene regulation by forming DNA loops. CTCF and cohesin are regarded as chromatin structure regulators that help form these loops. Some studies suggest that disrupting an insulator or a topological domain boundary sequence causes dramatic phenotypic changes and ultimately contributes to cancer and developmental disorders.

And now here comes CRISPR. Everybody talks about it because it is such a revolutionary and powerful tool for manipulating mammalian genomes. Using CRISPR, we can finally look into our genome and perform classic genetic screens to understand its elements functionally. With more precise techniques, people are now also digging into sequence variants such as SNPs and CNVs, and their relationship with eQTLs, TF binding preferences, and variation in DNA looping. Besides, to obtain higher-resolution transcription profiles, especially in heterogeneous populations like cancer tissues, stem cells, immune cells and brain structures, people have developed single-cell genomics, aiming to dissect cell populations, map cell-to-cell communication and finally bring genome sciences to the highest resolution, the single-cell level.

It is very exciting, and it will be even more exciting in the future, because we are able not only to manipulate the genome but also to synthesize it. We are aiming to understand the transcription network systematically, trace the developmental lineages and transcriptome cascades of cellular systems at the single-cell level, understand the causal function of DNA elements along with the epigenome (the multiple layers of information lying on top of our linear sequence), synthesize larger-scale transcription modules and design sophisticated artificial parts for our needs. That will be the moment when we can finally fully hack and unlock our genome and start a new era of biology. A new era for knowledge and for all mankind.