Bioinformatics and Genomics

Sunday, October 24, 2010

A few thoughts on TEs

1) Perhaps ecological risk assessment could be performed by looking at the transcriptome or the proteome of a population / sample. If the transcriptome has many non-coding RNAs that resemble TEs, perhaps the organism is under stress. Similarly, if an organism is intensively expressing enzymes specific to retrotransposition, that may be an indicator of stress.

Side note: I believe the human body is highly resilient; it is a very intelligent machine. Yes, I am anthropomorphizing, and even though it was made through "tinkering" as Dr. King Jordan would say, I still believe that we live in the "best of all possible worlds."

2) Perhaps (and I know that I have suggested this before), the natural history of an organism could be reconstructed by looking at the type I TEs to see how many of each kind exist, and how long each of the TEs has been a part of the genome. This approximate age could be determined by counting the number of mutations that separate each copy in its present form from a functional TE. However, it must be taken into account that mutations are almost certainly NOT random. Which brings me to a third idea...
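A rough sketch of the age estimate in idea 2, under assumptions that are mine and not from any paper: the TE copy is already aligned to a functional (or consensus) copy, substitutions accumulate at one constant neutral rate, and the Jukes-Cantor correction is good enough for multiple hits. The function names and the example rate are hypothetical. Given the caveat that mutations are probably not random, a single uniform rate is exactly the simplification that would need to be revisited.

```python
import math


def jukes_cantor_distance(copy_seq: str, reference_seq: str) -> float:
    """Estimated substitutions per site between two pre-aligned sequences.

    Alignment columns containing a gap ('-') in either sequence are skipped.
    """
    compared = 0
    mismatches = 0
    for a, b in zip(copy_seq.upper(), reference_seq.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    p = mismatches / compared  # observed fraction of differing sites
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def estimate_te_age_years(copy_seq: str, reference_seq: str,
                          subs_per_site_per_year: float) -> float:
    """Age = distance from the functional copy / assumed substitution rate."""
    if subs_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    distance = jukes_cantor_distance(copy_seq, reference_seq)
    return distance / subs_per_site_per_year


# Hypothetical example: one mismatch in ten aligned bases and an assumed
# rate of 1e-8 substitutions per site per year.
print(estimate_te_age_years("ACGTACGTAC", "ACGTACGTAT", 1e-8))
```

The distance is divided by the rate once (not twice) because the functional copy is treated as a stand-in for the ancestor, so only one lineage has been accumulating changes.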

3) Investigating why and how mutations are not random. Does it have to do with the 3-dimensional structure of the DNA (folds and everything)? Does the organism (cell) control if, when, and/or where mutagens can act?

4) Investigate the relative importance of the different means by which organisms deal with stress.

Excerpt from Madlung and Comai, 2004:

Stress, in any form, exerts strong evolutionary pressure on
all organisms. To survive, any organism must develop tolerance,
resistance or avoidance mechanisms. Tolerance
allows the organism to withstand the assault unharmed.
Resistance involves active countermeasures, while avoidance
prevents exposure to the stress.

I believe that it is easy to understand that plants would probably rely much more heavily upon tolerance and resistance whereas motile organisms like animals can take much greater advantage of avoidance. However, if people (humans) engage in behavior that causes constant and continuous stress, tolerance might have to work overtime. Furthermore, if tolerance (namely the detoxifying mechanisms in the body, e.g. the liver) can no longer prevent the degradation of the integrity of the (genomic) individual, then resistance (or other forms of tolerance) might be induced. As a side note, I think that it is very comical that humans consider themselves the smartest of all creation, and yet we are one of the only organisms that (across the board) find the most harmful chemicals, toxicants, etc. and make them habits and lifestyle choices. Most other organisms would "listen" to their bodies and realize that the action that they are taking is harmful and should be discontinued. With that being said, of course the people who put their bodies under greater stress will have greater numbers of transpositional events occurring in their DNA.

4b) With this in mind, I note that twins can be identical, even in their DNA, but have vastly different expression patterns for certain (very important) genes. (As a side note for myself, I have found it helpful to think of identical twins when thinking about gene expression and TE and stressors and DNA's 3-D structure.) Even if a scientist had the economic resources to perform a complete sequencing of the two genomes, it may be found that not a single mutation occurs in a gene-coding region. And still, the gene expression would be vastly different; perhaps because of DNA's 3-D structure; epigenetic control.

5) Furthermore, I feel as though I have just had an epiphany! Oh happy day! I have just synthesized a postulate that flies in the face of modern genetics. Genetics 101 claims that inheritance of acquired traits is utter nonsense and only happens in the rarest of cases. But what if inheritance of acquired traits is essential and very beneficial? I have an idea of how it could happen, but suffice it to say that expression of most (if not all) genes is affected by the structural formation taken by the DNA. The DNA takes its 3-D, tightly packaged shape from many smaller structures whose location and structure are intimately associated with the precise base pair sequence that the DNA has. What if stressful events modified the DNA (a mutation occurs)? As a side note, the mutation could be exogenously (biotic or abiotic) induced or there may be some internal controls within the cell that 'cause,' 'select for,' or 'allow for' mutations to occur in a very specific place. The said mutation occurs not within any coded gene; rather, the change in nucleotide leads to conformational changes of the 3-D structure of the DNA, leading to epigenetic up- (or down-) regulation of specified genes.

Just a rant - there needs to be a faster and cheaper way to extract the entire transcriptome from a cell. Why don't you work on that, Kevin?

Comment on TEs

From Madlung and Comai:

To summarize, abiotic stress can result not only in well-programmed
physiological stress responses but also in
genome-wide changes. Stress-induced genomic responses
include transposon activation, transposition, and structural
genome changes. Like other stress responses transposon-mediated
alterations in transcriptional activity of affected
genes might lead to avoidance or tolerance of the stress.
Unlike many other stress responses, however, transpositional
activation appears to be a reaction not directly targeting
an evolutionarily developed physiological pathway but
is a hit-or-miss approach to finding an appropriate way of
handling an unusual challenge.

This is more or less what I was trying to say. It seems like it has already been said and much more eloquently. However, I disagree with their conclusion. They say that "transpositional activation appears to be a reaction not directly targeting..." Whereas I believe that while transposition could be a 'random' and disorganized process, perhaps it has evolved a complex regulatory pathway, and that certain parts of the 'junk' DNA (all 98% of it) somehow contain the information necessary to create an appropriate way of handling an unusual challenge. Considering that human life has been around for many hundreds of thousands of years, not to mention all the inherited genetic history, I believe that the human body has faced many of the same challenges that it is faced with today (even in the face of such new, xenobiotic chemicals and anthropogenic pollution). With this "experience" in our genetic "memory", the cellular machinery "knows" how to respond in a way that is at least a little better than hit-or-miss.

Here's a story: metabolites are accumulating in the cell because the protein that is supposed to process them is more affected than average by the stress. The increasing levels of the metabolite increase the expression of the affected protein (by some feedback loop). The high concentration of the metabolite in the cell and/or the stressor molecule induces the activation of DNA-affecting systems (e.g., TEs), which target the regions where high levels of transcription are taking place. RTs have been shown to use the machinery of cellular division to insert themselves into the genome. Would it be too far of a leap to assume that they (or mutating, inserting elements) could utilize the machinery of transcription, as well?

Hints of hidden heritability in GWAS

Gibson, 2010

Although susceptibility loci identified through genome-wide association studies (GWAS) typically explain only a small proportion of the heritability, a classical quantitative genetic analysis now argues that considering together all common SNPs can explain a large proportion of the heritability of these complex traits. A related study provides recommendations for the sample sizes needed in future GWAS to identify additional susceptibility loci.

While GWAS have helped us identify genetic variants associated with many different types of diseases, these associations only explain a few percent of the heritability of complex disease. As I have addressed before, there are a few different reasons that SNPs in GWAS are not capturing heritability of disease: 1) we are improperly estimating heritability of the disease (or phenotype), or 2) the common variants of GWAS are not capable of (statistically significantly) capturing the genetic heritability of these phenotypes. It is important to note that these two explanations are not mutually exclusive.

If the former is true, we need to go back and refine our protocol and understanding of the problem of inheritance and more accurately estimate the proportion of phenotypic variation explained by inheritance.

If the latter is true, it would be an interesting question to examine why it might be true. Many researchers have proposed many different hypotheses, and generally they are not mutually exclusive. Rare variants, epistasis, epigenetics, and genotype-environment interactions are listed by Greg Gibson as potential sources of heritability. Also noted is the possibility that complex traits emerge from the interaction of thousands of (common) variants with small effects.

So let the great debate begin: what is the reason that GWAS do not identify causal variants (in most cases)? Is it that 1) a few rare variants of high impact, or 2) many common variants with small impact are affecting phenotypes? Both of these scenarios would lead to a situation of low statistical significance. The former because if a variant is rare it will only be present in a few people in the population. For instance, in a study of 5000 people, if the allele only has a frequency of 0.2% in the population, no subjects would be expected to be homozygous and ~20 subjects are expected to be heterozygous. The stronger the effect, the more likely it would be for a statistical analysis to pick up the association. But just a few phenotypic outliers could dramatically alter the p-value for the association between the rare variant and the phenotype.
The latter is hard to decipher (i.e., to find statistically significant associations) because if each common variant explains only a small percentage of the variance, a few outliers could also change the results. Another problem is that if there are 5000 people in the study and 1000 common variants are affecting the trait, there is the potential problem of multicollinearity caused by not having enough data to fit all of the parameters (for the 1000 variants).
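The rare-variant numbers above (5000 people, a 0.2% allele) can be checked with a few lines. This is only a sketch that assumes Hardy-Weinberg proportions, i.e. random mating and no selection on the allele; the function name is my own.

```python
def expected_genotype_counts(n_subjects: int, allele_freq: float) -> dict:
    """Expected genotype counts under Hardy-Weinberg proportions.

    allele_freq is the frequency p of the rare allele; q = 1 - p.
    """
    if n_subjects < 0:
        raise ValueError("n_subjects must be non-negative")
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    p = allele_freq
    q = 1.0 - p
    return {
        "homozygous_rare": n_subjects * p * p,        # p^2
        "heterozygous": n_subjects * 2.0 * p * q,     # 2pq
        "homozygous_common": n_subjects * q * q,      # q^2
    }


counts = expected_genotype_counts(5000, 0.002)
for genotype, expected in counts.items():
    print(f"{genotype}: {expected:.2f}")
```

With roughly 20 carriers out of 5000, the association test for this variant rests on a very small group, which is why a handful of phenotypic outliers can swing the p-value.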

Gibson states, "It is unlikely that GWAS will ever be sufficiently powered to uncover even the majority of the heritability" of complex disease. My reply to that is, what then should we be doing to increase our explanatory power (that is, what tests should we be running to elucidate and assign heritability to genes, regulatory elements or networks of these component parts)?

"This [paper by Yang et al] presents an elegant argument that most of the heritability [of height] is hidden rather than missing and hence, that there is no pressing need to invoke more complex genetic mechanisms to explain height."
If that is the case, then I want to know how to uncover the hidden heritability because that is going to yield causal variants which will lead to molecular mechanisms for the condition, be it height or a complex disease like T2D or CVD.

Friday, September 17, 2010

Methylation

Coming to a greater understanding of methylation through two papers: "Principles and challenges of genome-wide DNA methylation analysis" and "Establishing, maintaining and modifying DNA methylation patterns in plants and animals."

The methylation of cytosine bases in DNA is one more mechanism of controlling how a genome (and the organism for which it codes) responds to environmental perturbation from equilibrium. Methylation has the potential to not only modify the rate of expression of various genes but also the ability to more permanently remodel the three-dimensional structure of the DNA by changing chromatin from one form to another (e.g. from euchromatin to heterochromatin).

Methylation marks that control gene expression are generally more stable than histone modifications. One reason for this is that methylation is generally conserved even through mitotic cellular division with the help of maintenance methyltransferases, whereas histones are not covalently attached to the DNA and only maintain their attachment through non-covalent interactions such as electrostatic attraction and hydrogen bonding. This weak level of bonding for some histone components allows them to shift from place to place and not stably affect gene expression. Methylation marks may actually direct the histone modification locations.

In bacteria and archaea, methylated DNA bases assist in mismatch repair systems by helping the cell determine the copy and the template strands (i.e. new and old strands, respectively).

The amplification of DNA using PCR does not preserve the methylation marks because all PCR does is make copies of the sequence of nucleobases in the DNA template. No other features (like 3-D DNA structure or non-nucleobase modifications like methylation) are preserved in the process. Techniques have, however, been developed to quantify and localize the sites of DNA methylation.
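One such technique is bisulfite sequencing: bisulfite treatment converts unmethylated cytosines to uracil (read as T after amplification) while methylated cytosines are protected, so the mark is turned into a sequence difference that PCR can copy. The toy model below is my own sketch and assumes a single strand, complete conversion, and no sequencing errors; the function names and the example sequence are hypothetical.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T; methylated C (0-based index listed in
    methylated_positions) stays C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylated_positions(reference: str, converted_read: str) -> set:
    """Positions where the reference has C and the converted read still has C."""
    return {
        i
        for i, (ref_base, read_base) in enumerate(
            zip(reference.upper(), converted_read.upper())
        )
        if ref_base == "C" and read_base == "C"
    }


reference = "ACGTCCGA"
true_marks = {1, 5}  # hypothetical methylated cytosines

# Plain PCR copies only the bases, so the read is identical to the reference
# and every C looks the same whether or not it carried a methyl group.
plain_pcr_read = reference

treated_read = bisulfite_convert(reference, true_marks)
print(treated_read)
print(call_methylated_positions(reference, treated_read))
```

Comparing the treated read against the untreated reference is what recovers the positions; the plain PCR copy alone carries no trace of them.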

As Law and Jacobsen's review states, the pathways that lead to the removal of methylation are less well characterized. These pathways are essential to developmental biology.
In animals, DNA methylation is prevalent throughout the genome except in CpG islands. This review makes the claim that DNMT3A and DNMT3B establish the methylation patterns of DNA in early embryogenesis (at around the time of implantation). What has not yet been directly stated is when the DNA became unmethylated. Is the DNA of gametes always unmethylated? Are germ cells that give rise to gametes unmethylated or do they become unmethylated in the course of their production? Are they completely demethylated or are only most parts of the DNA demethylated (which would lead to the possibility of imprinting)?

Wow, if I had had the patience to read the following paragraph of the review, I would have seen the answers to some of my questions.

Following a wave of demethylation that is required to erase DNA methylation imprints established in the previous generation, DNA methylation patterns are re-established at imprinted loci and transposable elements (TE) during gametogenesis by DNMT3A and a non-catalytic paralogue, DNMT3-like.

Interactions between unmethylated H3K4 and DNMT3L have been linked to gene imprinting.

Interesting questions emerge from our growing knowledge of gene expression from DNA to post-translational modification. Are there other modifications of DNA or histone proteins that change gene expression? Fructose has been shown to be 10 times more efficacious than glucose at generating non-enzymatic glycosylation species (also called advanced glycation end products (AGEs)). Is the high level of sugars, especially fructose, that the American public is consuming leading to more AGEs and consequently a dysregulation of gene expression? In that same vein, what if high sugar levels (in the blood and in the cell) are leading to AGEs of proteins that lead to non-functional proteins, which in turn requires the up-regulation of that transcript? What are the possible consequences of abnormally high levels of transcription? One recent seminar I attended suggested that an increase in transcription of a gene (that contained a tandem repeat) could lead to slippage and a consequent 2-5 nucleotide deletion, often resulting in a frame-shift mutation in the gene. What if high levels of sugar consumption were leading to these mutations?

What if sugars (especially fructose) were creating glycation products with proteins like DNMT1 and subsequently preventing hemimethylated DNA from being restored to its fully-methylated state? Then, upon cellular division, there would be a cell that was missing methylation in the correct location, which could lead to the dysregulation of the expression of that gene. If that gene was necessary for tissue integrity, its dysregulation could lead to cancer.
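To get a feel for how quickly an impaired DNMT1 could erase marks, here is a toy dilution model. It is my own simplification, not something from either review: follow one cell lineage, assume each replication pairs an old strand (which keeps its marks) with a new strand, and assume the new strand receives a fraction `maintenance_efficiency` of the old strand's marks. The expected methylated fraction is then multiplied by (1 + efficiency) / 2 at every division.

```python
def methylated_fraction(initial_fraction: float,
                        maintenance_efficiency: float,
                        divisions: int) -> float:
    """Expected fraction of methylated strand-sites after a number of divisions.

    maintenance_efficiency = 1.0 means hemimethylated sites are always
    restored; 0.0 means no maintenance at all (pure passive dilution).
    """
    if not 0.0 <= initial_fraction <= 1.0:
        raise ValueError("initial_fraction must be between 0 and 1")
    if not 0.0 <= maintenance_efficiency <= 1.0:
        raise ValueError("maintenance_efficiency must be between 0 and 1")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    retained_per_division = (1.0 + maintenance_efficiency) / 2.0
    return initial_fraction * retained_per_division ** divisions


# Hypothetical efficiencies, purely to compare the trends.
for efficiency in (1.0, 0.9, 0.0):
    remaining = methylated_fraction(1.0, efficiency, 10)
    print(f"efficiency {efficiency}: {remaining:.4f} remaining after 10 divisions")
```

Even a modest drop in maintenance efficiency compounds over divisions in this model, which is the point of the question above; real cells also have de novo methylation and active demethylation, which are ignored here.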

Are there intercellular communication networks with which sugar interferes through the production of AGEs? Does sugar exist in a free state in the blood and/or in the cell, or instead is the sugar bound to specific carrier molecules that transport it to where it is needed? If carrier molecules are important in glucose localization, what happens when the system is overwhelmed?

Sunday, August 29, 2010

From Mendel to Sanger and beyond

Mendel introduced the idea of inheritance through alleles, discrete units of heritability that were later shown to reside on chromosomes. Linkage studies of family units furthered the understanding of inheritance and genetics. Three decades ago, Frederick Sanger introduced a method for elucidating the precise nucleotide sequence of DNA. He would later win his second Nobel Prize for his pioneering method, and the human genome would be sequenced using this same method over twenty years later. Recent developments in sequencing technology, dubbed "second-generation" sequencing, massively parallelize sequencing but are still surpassed by the Sanger method in reliability and read length.

The rapid sequencing of numerous genomes has led to many important discoveries. Finding correlations between genetic variation and (complex) human disease, however, has left much to be desired. With a few exceptions, the common disease / common variant hypothesis has not found support in GWAS of single nucleotide polymorphisms (SNPs). In one recent GWAS, over 100,000 individuals were genotyped at ~2.6 million SNP locations (either directly genotyped or imputed), which were then used to find correlations with four factors associated with coronary artery disease: total cholesterol (TC), LDL-C, HDL-C, and triglycerides (TG). While the study resulted in the identification of new loci associated with these risk factors, the combined power of all of these explanatory loci only accounts for ~25-30% of the genetic variation of the population.

What lessons are learned from this massive GWAS? Perhaps GWAS by themselves, no matter the size of the population studied or the number of SNPs, will never be able to explain all or even most of the genetic variation of complex diseases like cardiovascular disease (CVD) or diabetes and their risk factors. Even considering this important limitation, GWAS genotyping studies give clues as to possible causative pathways in complex disease. The use of these fast, high-throughput technologies still must be balanced with other methods of verification in wet labs. Unfortunately, as prices of sequencing go down and speed goes up, the vast mound of "great ideas" for a potential target gene will continue to grow and outstrip the hard, detailed science that confirms a causative role in (and a potential fix for) the disease.

Putting the pessimism aside, an integration of new sequencing technologies will yield new insight into the possible molecular mechanisms for heritable, non-communicable disease. So will creating and integrating high-throughput, parallelized technologies for other aspects of the genome, like DNA methylation and histone spacing, packing, and modification with either methyl or acetyl groups.

While the generation of "great ideas" almost always outstrips the time and resources required to test them, the discrepancy between these two aspects of research grows like an ever-widening chasm. The essential ability to design and perform experiments that provide the greatest insight (the most bang for the buck) will become increasingly desirable. Intuition about data sets will still be important, but thorough training and experience with bioinformatics tool-sets will be essential. A new breed of biologist is becoming increasingly valuable: a person who has a deep understanding of the mechanisms underlying the biological phenomena, who can translate that mechanism into data and analyze it using computational power, and who can interpret the results and translate them into real-world solutions. These three skills have always been important for scientists, but with the growing amount of data and increasing computational power to process it, the complexity of the problems requires much more thought. Let the challenge begin!