Our results provide a global overview of the gene expression pattern of undifferentiated human ES cells and allow comparisons with other data sets. These results suggest the hESC are an actively dividing population of cells that exhibit high metabolic activity. Our analysis detected expression of approximately 10,600 unique transcripts, a figure that about a third of the total number of mapped genes. Unlike other cell types, however, a much larger fraction of unknown or novel genes was present. This high ratio likely represents the paucity of information available in existing libraries on this relatively newly characterized cell population rather than the possibility that ES cells use radically different pathways for self-renewal, survival, proliferation and differentiation.
Our results confirm the reported differences between rodent and human ES cells. We confirm the absence of expression of ERAS, Ehox and the orthologs PEPP1 and 2. The apparent lack of LIF requirement of hESC is reflected by the absence or low tpm levels for genes of the LIF pathway and high tpm for suppressors of LIF mediated signalling (see supporting information). The high level of expression of genes in the FGF pathway likely reflects the requirement of hESC for bFGF. The high level of FGFR1 expression suggests that FGFR1 is an important signal transducer and that FGF's other than FGF4 are important in hESC self-renewal. The high tpm of the fibronectin receptor also suggest that fibronectin or vitronectin are likely useful substitutes for matrigel and that activation of ras mediated signalling is likely critical, as has been described in the rodent ES cell analysis .
Comparing data from the MPSS analysis with microarray, SAGE and EST scan analyses suggest that MPSS is a powerful alternative to these techniques. MPSS identified virtually all of the genes highlighted as genes common between six different human ES cell lines surveyed by microarray. We noted that most genes detected by microarray were expressed at high tpm indicating that MPSS is more sensitive than microarray analysis. MPSS however appeared to be able to identify genes detected by microarray. Analysing an additional 400 markers detected by MPSS using focused microarray or RT-PCR confirmed their expression , (data not shown). Likewise, MPSS analysis showed good concordance with the EST scan data at a fraction of the price. In contrast to the EST scan, tpm levels determined by MPSS are highly correlated to the mRNA levels present in the cells, even at low tpm values , and (Lynx unpublished results). Due to the low sampling number of most EST scans, this is not true for relatively low number of EST's found for a particular gene, and can be used only as a rough estimate of gene expression. Unlike other in depth analyses, the absence of markers in MPSS runs is also a powerful control provided that the marker possesses a GATC site. The chromosomal distribution of the genes expressed in hESC did not reveal any bias for a particular chromosome or chromosomal region. While a couple of "hotspots" and several "cold spots" were identified, in no case was any region comprised of all transcribed or all silent genes.
Another important conclusion from our analysis is that selection of input RNA is critical. In our case we tested samples repeatedly to assess their purity and made considerable efforts to establish subclones that did not require feeder cells that could be potentially contribute transcripts to the analysis. Given the range of tpm of biologically relevant molecules (5 to 32,000 in this experiment) we predict that even a 5% contamination can confound results or detailed comparisons across different laboratories.
We note also that gene transcription from both the X and Y chromosome is observed indicating that at least subtle differences will exist between male and female lines even in the undifferentiated state. Sex-based gene expression, along with MHC gene expression and ratio of expression of imprinted genes could serve to distinguish between different ES cell populations. The present results further suggest that analysing embryoid bodies that differentiate stochastically or analysing tissue samples (with variable proportions of cells) by MPSS will prove more difficult and that results will be variable. We suggest that variability can be reduced by pooling samples, normalizing by careful testing for known markers of differentiation, by semi quantitative PCR, or by focused microarray analysis.
While MPSS is cost-effective and sensitive, it is by no means perfect. MPSS is limited by the requirement that DpnII sites (GATC) be present in a gene and be present in a unique locus such that the signature obtained is unique. For example, SNRF expression could not be assessed directly, as no GATC site is present. The signatures for ZFP42 are ambiguous and map to multiple transcripts. Although MPSS can distinguish between alternate transcript termination sites, MPSS cannot distinguish between alternative splicing events and possible incomplete digestion during the sample preparation process. Signature lengths are relatively short and it is possible to have to select between multiple genome hits (reviewed in . Sequencing is performed four bases at a time and transcripts that contain palindromic sequences (in particular double palindromes) are often undetected because of self-hybridization of single DNA strands on the bead. A survey of the genome suggests that this is a rare event (approximately 3% of all virtual signatures in human MGC database have double palindromes). The NODAL gene is an example for such an event, where the class 1 signature was lost and NODAL expression is detected only by a signature resulting from incomplete digestion during library construction (see results). The success of MPSS analyses also depends to a large extent on the quality of genomic information available and, in our opinion, currently is best utilized to analyse human cells. Furthermore, MPSS itself may not be the best method for routine, lower throughput analyses, given price per sample, sample processing time and the large amount of data generated, which requires considerable analysis. However, the database, once developed, is extremely valuable provided it is freely available to make comparisons and to select subsets of genes for further analysis. MPSS information can be effectively utilized by establishing a common database of markers expressed at a defined stage in the differentiation of cells. Additional data sets from sampling of cells at well-controlled stages of differentiation that can be readily accessed and compared to existing datasets will provide the most information while still being cost effective. The genome database is an example of such sharing that has proven to be an invaluable resource for our experiments. Such a strategy requires cooperative pooling of information and free sharing such that individual results can be readily compared against validated datasets. Our future experiments will be directed and developing additional data sets of ES cell differentiation, which can be shared in a manner similar to the present set.