Brief Summary of the First Half of My Cancer Biology Study

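I am now more than halfway through my cancer biology study, and I can clearly feel that I have entered deep water. The main signs: earlier material keeps reappearing, widely and usefully, in later chapters; the amount of information keeps growing; and the complexity of biology is starting to show in full.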

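Judging by the physical structure of the textbook, I am past the halfway point. The textbook first introduces the basic features of tumors, mostly early tumor classification and some history. This part also covers how certain carcinogens induce cancer, which is an important first hint of what comes later.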

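Next, the discovery of tumor viruses is used to introduce oncogenes: first the genetics of oncogenes (how they were found), then an accessible account of the role oncogenes play in growth and the harm done when they are mutated. The discovery of tumor viruses and their ability to transform cells was a true milestone. Although tumor viruses are relatively rare in humans, the discovery showed that they can be used for model studies, which in turn identified a large number of oncogenes. Most important of all, it told us that some viral oncogenes very likely originated from the cell itself! This is a very important finding, and I think it implies that virus-induced cancer and cancer generated by the cell itself are essentially the same thing! This is just brilliant! It is fair to say that my interest in cancer biology was ignited right here!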

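So the book starts from tumor viruses to introduce oncogene genetics and how to identify an oncogene. The following part then spends a lot of time on oncoproteins and their biochemistry. What excited me most was the identification of abnormal RTK function and the discovery of what the Ras protein does. The biochemistry of Src and Ras is a truly classic demonstration of how terrible a signaling disorder can be. After that, the book goes into detail on other signaling pathways beyond RTK-Ras and its Ral, PI3K and Raf-MAPK branches: Jak-STAT, Wnt-beta-catenin, Hedgehog/Gli, ECM signaling, TGF-beta, Notch, GPCR, NF-κB and so on. A fun fact I learned is that Gli takes its name from glioblastoma, the tumor in which the gene was first found. The later TSG part also introduces IDH, which links to what I already know about glioblastoma: IDH mutation and CTCF looping, the stemness of cancer stem cells, and single-cell RNA-Seq. Very exciting. Growth signaling is as complex as a network, yet so far the book has only gone as far as signaling pathway disorder. The consequences, that is, transcription disorder, have not been introduced yet. Of course, once we get into TSGs an enormous storyline will appear, and then I am done for.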

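Next come the tumor suppressor genes, and this is where the deep water begins. The growth signaling pathways already made me feel the information load increasing; adding tumor suppressor genes raises biological complexity by another level (positive and negative, yin and yang).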

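Tumor suppressor genes are also introduced starting from genetics: how they were discovered and how they were mapped. Then come three very good examples: NF1 as a signaling inhibitor, Apc as a cytoplasmic regulator of transcription, and pVHL as a metabolic factor. Together they cover three major features and aspects of cancer: tumor signaling pathways, gene expression and metabolism. Very good examples.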

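I think identifying a TSG is much harder than identifying an oncogene, because an oncogene gives an obvious phenotype while a TSG really does not. Simply comparing the gene expression profile of cancer tissue against WT cannot establish causality. One needs to look at LOH, and one also needs functional validation: does reintroducing the gene recover the phenotype? The water is deep here. What about cases where the gene expression level itself matters, such as haploinsufficiency (mast cells, NF1)?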

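Next, the tumor suppressor part focuses on Rb, the first tumor suppressor gene to be discovered and the one with the most detailed explanation. I say the story becomes more and more complex from Rb onward because the format changes from one story per section to one story per chapter: this whole chapter is a wide-ranging discussion built around Rb, the R-point and the cell cycle. It draws heavily on oncogenes and mitogenic signaling, so if you did not learn the earlier material well, this part will feel like hard work. It also introduces a new and very complex system of protein machinery: mainly the cyclin-CDK complexes, CKIs, cell cycle regulators (most importantly Myc), and the regulators of the cell cycle regulators (Myc-associated proteins). It covers the protein systems that pRb mainly acts on, including E2F, as well as the factors that interfere with them, namely some important oncoproteins and the related signaling. I have to say, the scale and complexity of this story left me in awe! Too good to be true! (learned from Peiwei)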

The main lesson so far is this: a disordered cell program is based on signaling disorder, which results in transcription disorder. But it all basically comes from mutation and genome instability. I suppose this is also the distinction between the two classes of tumor suppressor genes, gatekeepers and caretakers.

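So what exactly is cancer, and what exactly is its etiology? Does everything come from genome instability? Then how about the epigenome? Is chaos in the epigenome also caused by genome instability? One of the biggest questions is why, across so many tumors, the types of mutation are all different. One question can be partially answered: why different mutations lead to the same outcome. But why does the same mutation often produce a phenotype in some cells and not in others? Here a global characterization of the genome landscape is really necessary, because without systematic study it is hard to identify which of the many changes is the true driver. Of course, if one only runs some simple systematic analysis without carefully untangling the threads, it is hard to call that good science either.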

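I am very happy that I finally understand what Src, Ras, PTEN, Myc, Rb, p53 and the other names I used to hear so often actually are. The next part will carry on the complexity of Rb and the cell cycle, starting from p53 and apoptosis and continuing to be led by big stories. It will then go further into how tumors arise and progress, which is also where complex processes (angiogenesis, metastasis) begin to be explained on a molecular basis. And of course there is the most important part, genome instability, which tells us about the nature of ourselves, the nature of evolution, mutation and change, and sometimes cancer. Finally there will be a brief introduction to therapy. On that, I believe that once the root cause is found, there will always be a way to treat it. And when talking about therapy one has to mention cancer immunology, which will also be touched on briefly!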

Well, this is brilliant and very exciting. I look forward to finishing soon!

Why Genome Sciences?

Life, what a wonderful and elegant form of existence on Earth, bringing endless fantasy and surprise to the enormous universe!

How did life originate, how is it inherited, and how does it evolve? What contributes to the incredible biodiversity we see today? What created human beings with intelligence? How do living things interact with each other and with their surroundings? Where do diseases come from? Where does our future lie? Thousands of questions about life remain to be answered.

The mystery of life is so captivating that biologists have spent centuries chasing the knowledge behind this miracle of nature, from Mendel's first observations of inheritance to the identification and confirmation of DNA as the main genetic material. Once people had found that phenotypes are inherited, scientists, who are usually also materialists, hypothesized that some material must carry phenotypes from one generation to the next. Researchers therefore went looking for the identity of those molecules. The pioneers, many of whom were actually chemists and physicists, concluded from elegant experiments that DNA, rather than protein or other biomacromolecules, is the molecule that transmits information to descendants. Later, people worked out the composition of DNA along with its structure, which is regarded as the beginning of modern biology, or molecular biology. I would say that understanding the composition of DNA is more important than understanding its structure because, in terms of information, it tells us that DNA is a polymer of nucleotides carrying four types of nitrogenous bases. It is the information in the combinations of these bases that accounts for the complexity and diversity of life.

Later, people tried to figure out how DNA passes on our phenotypes. Scientists discovered that disrupting parts of a gene produces phenotype alterations. Based on genetic and biochemical studies, people summarized the central dogma of genetic information flow: DNA ultimately transfers its genetic code to proteins, the functional products, through the processes called transcription and translation. People were excited to establish that DNA encodes the information for proteins and is the center that commands a cell's behavior. Those components of DNA are what we call genes.
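To make the information flow concrete, here is a minimal sketch of transcription and translation in Python. It is only an illustration under simplifying assumptions: the input is taken to be the coding strand of DNA, translation starts at the first AUG, and splicing and regulation are ignored. The function names `transcribe` and `translate` and the toy sequences are my own, not from any library.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # '*' marks a stop codon


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (assumption: T is simply read as U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the message."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # the table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy example: a single base change (G to T) swaps glycine for valine.
wild_type = "ATGGGCTAA"
mutant = "ATGGTCTAA"
print(translate(transcribe(wild_type)))  # MG
print(translate(transcribe(mutant)))     # MV
```

The toy mutation shows in miniature why disrupting part of a gene can alter a phenotype: one changed base gives a different amino acid in the protein.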

Through those decades, people tried their best to understand genes and accumulated a vast body of knowledge about cellular function, transcription, disease and so forth. To get a panoramic view of all the information inside our DNA, scientists launched the Human Genome Project (HGP), aiming to sequence the human genome and obtain the script necessary to form a human being.

But here comes another question. All the cells of one individual carry the same genome script, so what makes them different? Neurons, stem cells, immune cells, blood cells, cancer cells, fibroblasts? As studies went deeper, people learned that it is not the number of genes but the gene regulation programs that contribute to the complexity of life. Genes encode the transcription machinery, such as the RNA polymerase that transcribes DNA. Transcription factors bind sequence-specific regions and activate gene expression under particular conditions. A gene is therefore regulated by the interaction of cis elements in the genome with trans elements (transcription factors), which together orchestrate a huge transcription network and establish the identity of different cell types. Because less than 1.5% of the genome codes for protein, the function of the remaining non-coding portion was largely unknown, especially the parts related to gene regulation. It therefore became important to understand the cis-regulatory elements along with transcription factor expression. But how can we learn the location and distribution of the cis-regulatory elements? Perhaps we could run genetic screens or perform genome-wide motif analysis to help identify where they are. But at the beginning it was difficult to do perturbation in mammalian cells. That is why we have the ENCODE project.
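As a rough illustration of what genome-wide motif analysis means at its simplest, the sketch below scans a sequence for a consensus motif on both strands. It is a toy under stated assumptions: real analyses usually score position weight matrices rather than exact consensus matches, the helper names are hypothetical, and the E-box consensus CANNTG is used only as a familiar example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (start, strand, matched_site) hits for a consensus motif.

    Start positions are 0-based and always given on the forward strand.
    """
    # A lookahead group lets overlapping sites be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in motif) + "))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        forward_start = len(seq) - m.start() - len(motif)
        hits.append((forward_start, "-", m.group(1)))
    return sorted(hits)


# Toy example: the palindromic site CACGTG matches CANNTG on both strands.
for hit in find_motif("TTCACGTGAA", "CANNTG"):
    print(hit)
```

A sequence match alone says nothing about whether the site is actually bound or functional in a given cell type, which is exactly why perturbation and epigenomic data are needed on top of it.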

The ENCODE project aims to understand and annotate the function of the unknown regions of the genome. Our way into this double Dutch is through the epigenetic marks on DNA and histones. Labs around the world had previously found that some DNA modifications and histone marks correlate with the location of cis-regulatory elements such as promoters and enhancers. Other histone marks help us identify transcription activity and chromatin states. Other groups developed chromatin accessibility assays such as DNase-Seq, MNase-Seq and ATAC-Seq, which treat open chromatin as a sign of active elements and active genes. Systematic studies using RNA-Seq also reveal the transcriptomic profiles of cells, from which the activity of their regulatory elements can be inferred. With what we infer from these epigenomic signatures and landscapes, we can understand life better.

The story of understanding the genome becomes more and more exciting as we approach the real state of our genome. Job Dekker is a biologist interested in chromatin conformation. Through the contributions of his lab and his colleagues, a high-throughput method was developed that captures genome-wide chromatin contacts to map the physical structure of the genome, ours and that of other species. Researchers performed Hi-C on cells and identified the structure of our chromatin. To our surprise, chromatin is organized into discrete topological domains that are largely shared across the cell types of an organism. These domains mirror the hypothesized insulator elements, which block gene regulation by forming DNA loops. CTCF and cohesin are regarded as chromatin structure regulators that help form DNA loops. Some studies suggest that disrupting an insulator or a topological domain boundary sequence causes dramatic phenotypic disruption and ultimately contributes to cancer and developmental disorders.
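To give a feel for how domain boundaries show up in contact data, here is a toy sketch of an insulation-style score on a made-up contact matrix. It is a simplified illustration, not the pipeline used in the Hi-C studies mentioned above: the matrix values, the window size and the function names are all assumptions chosen for clarity.

```python
def toy_contact_matrix():
    """Made-up 8-bin contact map: two 4-bin domains, strong contacts within each."""
    domain_of_bin = [0, 0, 0, 0, 1, 1, 1, 1]
    n = len(domain_of_bin)
    return [
        [10 if domain_of_bin[a] == domain_of_bin[b] else 1 for b in range(n)]
        for a in range(n)
    ]


def insulation_scores(contacts, window):
    """Mean contact frequency across each candidate boundary.

    The boundary at position i sits between bin i-1 and bin i. The score averages
    contacts between the `window` bins upstream and the `window` bins downstream.
    A low score means few contacts cross the boundary.
    """
    n = len(contacts)
    scores = {}
    for i in range(window, n - window + 1):
        total = sum(
            contacts[a][b]
            for a in range(i - window, i)
            for b in range(i, i + window)
        )
        scores[i] = total / (window * window)
    return scores


scores = insulation_scores(toy_contact_matrix(), window=2)
boundary = min(scores, key=scores.get)
print(scores)
print("strongest boundary before bin", boundary)
```

In this toy map the score dips between the fourth and fifth bins, where the two domains meet. That dip is the kind of signal a boundary caller would look for in real data.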

And now here comes CRISPR. Everybody talks about it because it is such a revolutionary and powerful tool for manipulating mammalian genomes. Using CRISPR, we can finally look into our genome and perform classic genetic screens to understand its elements functionally. With more precise techniques, people are now also digging into sequence variants, such as SNPs and CNVs, and their relationship with eQTLs, TF binding preference and variation in DNA looping. Besides, to get a higher-resolution transcription profile, especially in heterogeneous populations such as cancer tissues, stem cells, immune cells and brain structures, people are developing single-cell genomics, aiming to dissect cell populations, map cell-to-cell communication, and finally bring genome sciences to the highest resolution: the single-cell level.

It is very exciting, and it will be even more exciting in the future, because we are able not only to manipulate the genome but also to synthesize it. We are aiming to understand the transcription network systematically, to trace the developmental lineage and transcriptome cascade of cellular systems at the single-cell level, to understand the causality of DNA element function along with the epigenome (the multiple layers of information lying within our linear sequence), to synthesize larger-scale transcription modules, and to design sophisticated artificial parts for our needs. That will be the moment when we can finally fully hack and unlock our genome and start a new era of biology. A new era for knowledge and for all mankind.