Research article | Open | Published:
MPSS profiling of human embryonic stem cells
BMC Developmental Biologyvolume 4, Article number: 10 (2004)
Pooled human embryonic stem cells (hESC) cell lines were profiled to obtain a comprehensive list of genes common to undifferentiated human embryonic stem cells.
Pooled hESC lines were profiled to obtain a comprehensive list of genes common to human ES cells. Massively parallel signature sequencing (MPSS) of approximately three million signature tags (signatures) identified close to eleven thousand unique transcripts, of which approximately 25% were uncharacterised or novel genes. Expression of previously identified ES cell markers was confirmed and multiple genes not known to be expressed by ES cells were identified by comparing with public SAGE databases, EST libraries and parallel analysis by microarray and RT-PCR. Chromosomal mapping of expressed genes failed to identify major hotspots and confirmed expression of genes that map to the X and Y chromosome. Comparison with published data sets confirmed the validity of the analysis and the depth and power of MPSS.
Overall, our analysis provides a molecular signature of genes expressed by undifferentiated ES cells that can be used to monitor the state of ES cells isolated by different laboratories using independent methods and maintained under differing culture conditions
Multiple large-scale analytical techniques to assess gene expression in defined cell populations have been developed. These include microarray analysis, EST enumeration, SAGE and MPSS. Each of these techniques offers unique advantages and disadvantages. Technique selection largely depends on the expertise of the investigator, the cost, the availability of the techniques, the amount of RNA/DNA that is available, and the existence of the genome databases. The human genome dataset is the best annotated one available [1, 2]- making large scale gene expression analysis of human tissues and cells uniquely fruitful for investigators due to the increased ability to identify full length transcripts with predicted gene function instead of EST's.
Human ES cells have been isolated relatively recently and ES cell genes are underrepresented in current databases. More importantly, recent evidence has suggested that mouse ES and human ES cells differ significantly in their fundamental biology [3, 4] and one cannot readily extrapolate from one species to another. However, comparing results between species may provide unique insights. Given the wealth of SAGE and microarray data available from rodent ES cells examining human ES cells with similar techniques as has been done recently by several investigators [3–11] should be very useful in furthering our understanding of this special stem cell population. Until recently however, it has been difficult to obtain RNA from a homogenous population of undifferentiated hESC for such an analysis as cells could not be grown without feeders and few unambiguous ES cell markers had been described. However, we and others have now described markers that will clearly assess the state of ES cells using a combination of immunocytochemistry and RT-PCR [3, 12, 13] In addition, techniques of harvesting ES cells away from feeder layers have been developed and verified (our unpublished results) and methods of growing ES cells without feeders have been described . These techniques, have allowed us (and others) to obtain large amounts of validated RNA/cDNA samples for comparison by microarray [3–11], SAGE  or EST enumeration .
We selected MPSS for this analysis as it offers some unique advantages over other methods including SAGE [15, 16]. MPSS offers sufficient depth of coverage when over one million transcripts are sequenced  and is efficient, as the numbers of sequences obtained are an order of magnitude larger than with shotgun sequencing or SAGE. It is relatively rapid with a turnaround of a six to ten weeks, and if done with human tissues, more than 80% of transcripts can be mapped to the human genome with current tools. Further, independent analysis has suggested that expression at greater than 3 tpm (transcripts per million) is predictive of detectable, reliable expression, equivalent to roughly one transcript per cell – a sensitivity that is unparalleled when compared to other large-scale analysis techniques . Finally, MPSS libraries can be translated into SAGE libraries and compared to existing SAGE library sets using freely available tools such as digital differential display, allowing ready comparisons to existing SAGE/MPSS libraries of mouse ES cells. It is important to note that we found 14 base pair SAGE tags are generally not as specific as 17 base MPSS signatures and that SAGE sampling depth is usually insufficient. Newer technologies such as extended sequencing to 20 base pairs in MPSS, 24 base pairs in SAGE or cheaper bead alternatives such as those described by Illumina may offer additional depth of coverage and a cheaper price but these at present remain limited in availability.
We have utilized MPSS using a pooled sample of three human ES cell lines grown in feeder-free culture conditions over multiple passages [17, 18] to assess the overall state of undifferentiated ES cells. Our rationale for using pooled sample rather than individual samples was based on the fact that no standardized medium and culture conditions have been established for growing and propagating ES cell lines. Variation observed by sampling single lines may be due to culture conditions rather than intrinsic differences. We reasoned therefore that a need existed to establish a reference baseline using pooled samples to enhance the similarities and provide evidence for candidate genes that should be examined for differences such as expression of HLA genes, Y chromosome and X chromosome genes, imprinted genes and genes regulating the methylation state. Our results show that MPSS provides a greater depth of coverage than EST scan or microarray and provides a comprehensive expression profile for this stem cell type. The data set generated allows us and others to identify multiple genes that were not previously known to be expressed in this population, including novel gene as well as obtain a global overview of pathways that are active during the process of self-renewal.
MPSSS analysis of pooled samples
A pooled sample of undifferentiated human ES cell lines H1, H7, and H9 grown in feeder-cell free conditions  was used for the preparation of mRNA as previously described . Growth without feeders avoids complication from feeder contamination, which even with good harvesting techniques [14, 21] ranges between 1–3% (unpublished data) and is sufficient to be detected by MPSS (Dr. B. Lim-Harvard University personal communication). Under these conditions, 80–95% of the cells express SSEA-4, 91–94% express TRA-1-60, and 88–93% express TRA-1-81, previously described markers for undifferentiated hESC . Microarray analysis of 2802 genes suggests that these cells are remarkably similar in their gene expression profiles, with only 5 genes being more than 2-fold different between the three cell lines [17, 18] (and data not shown). The undifferentiated state of the cells was also assessed by RT-PCR of known markers of undifferentiated hESC on mRNA of the pooled hESC sample (Figure 1). In addition absence of early markers of differentiation was assessed. No expression of GATA, Sox-1, nestin, Pdx-1 or markers of trophoectoerm were detected in samples used (Supplementary table 3a, see also 3)
Pooled mRNA of the three hESC lines was subjected to MPSS analysis at Lynx Therapeutics (Hayward, CA), generating 22,136 distinct and significant signature sequences from a total of 2,786,765 sequences (see Methods and additional file 1). Each signature was ranked, as outlined in Methods (Table 1), based on its position and orientation within the transcript, and the presence of a polyadenylation signal and polyA in the transcript sequence. 16,675 signatures (75%) mapped to UniGene transcripts; 40 signatures (0.2%) mapped to mitochondrial transcripts; 3,818 signatures (17%) matched genomic sequences but did not map to a UniGene cluster; 927 (4%) signatures matched sequences present at more than 100 genome locations (class 0, representing transcripts containing repetitive elements in their 3' UTR). 676 (3%) signatures did not match to genome or UniGene sequences. Some UniGene clusters contain multiple signatures. These signatures likely represent either transcripts of alternative termination sites, or artefacts of MPSS library construction. Signature classification helps to distinguish artifactual signatures from signatures representing expressed transcripts. For example, signatures of class 1 to 3 are 3'most signatures in mRNA or EST sequences with poly (A) signal and/or polyA tail and most likely represent transcripts with multiple polyadenylation sites. Artifactual signatures constituted 1–3% of the tpm count of the "real" signature, although occasionally close counts were observed (data not shown; see supplementary data tables, additional files 2, 3). To simplify the MPSS data analysis and pair-wise comparison of ES cell data from this study to other datasets, multiple signatures mapping to the same Unigene ID (Hs build 169) were combined into one tpm count as the sum of tpm for signatures of class 1, 2, 3, 22, 23 if any found. These are 3'most signatures close to polyA signal and/or polyA tail, most probably representing true transcripts with alternative termination. If no signatures of above classes were found, then sum of class 4 (3'most, no polyA features) was used. If none the above, the sum of class 5 signatures was used for the tpm calculation per unigene cluster. Resulting table containing data for 8679 unigene clusters, 11 mitochondrial genes, and including 1991 signatures that did not map to unigene but uniquely matched genomic sequences (potential novel transcripts), is presented in supplementary table (additional file 4) and available for download from Lynx .
The frequency distribution of the signatures shows that the 200 most abundant signatures represent 99% of the total number of signature counts obtained from the hESC (Figure 1). Most of top 200 genes (unigene clusters, additional file 5) represent ribosomal genes and genes involved in protein and nucleic acid synthesis and are consistent with results obtained by EST scan and other analyses (data not shown, and [5, 8, 9]). We note that several ribosomal genes were identified as being overexpressed by microarray, SAGE and EST scan as well (see additional files 16, 17, 18). Comparison of the pattern of gene expression with other cell types showed a very similar expression profile with housekeeping genes being the predominant population of sequences in all cell types examined (data not shown). Only three known ES cell specific genes were present in the top 200 genes (additional file 5 and Figure 1). These included SOX-2, DNMT3β, and Oct-4. As in other cells cell type specific genes, transcription factors and cytokines were present at much lower abundance (<50 tpm on average). These low tpm level genes were often not detected by other methods (discussed below). The expression level of cell surface receptors for fibronectin are high (ITGB1 – 578 tpm) and their presence was confirmed by immunocytochemistry and RT-PCR, suggesting that feeder-free clones may grow well on this substrate (data not shown, see also Figure 2 and [14, 21]). The major signaling pathways represented in the top 200 most abundant genes are the FGF signaling pathway, with FGFR1 being most abundant (673 tpm, Figure 2), and the ras activated pathway, with two members of the ras family (NRAS-related and ran) being present in the top 200. This is consistent with data that E-Ras is critical for rodent ES cell self-renewal . No transcripts for HRASP (Homologue of ERAS pseudogene) were detected however (Figure 2), suggesting that these other ras family members may subserve this critical role of self-renewal . The absence of E-Ras was confirmed by RT-PCR (data not shown), as was the presence of FGFR1 (Figure 2, , and data not shown).
Major pathways present at detectable levels by MPSS
To gain a broad overview of the properties of hESC, we mapped the genes found in the hESC cells to the human genome to get an overview of the chromosomal distribution of genes expressed in hESC (Figure 3 and additional files 6, 7, 8, 9, 10, 11). Overall, MPSS detected gene expression in most of the previously identified zones of transcriptional activity within chromosomes. Two chromosomal regions contained more genes expressed in hESC – than expected, and several regions where fewer genes were expressed, compared to the total number of genes located within a particular chromosomal region. No bias to chromosome 17, 12 or X was seen either in overall gene expression or in a particular cytoband. The failure to detect a bias was confirmed by mapping EST scan data  as well. The overall distribution patterns were similar and did not show any bias at this level of resolution. Interestingly, gene expression from both X and Y chromosomes was observed. Unlike rodent ES lines both male and female ES lines have been obtained with roughly equal frequency  suggesting that when individual cell lines are examined differences between levels of expression between male and female will be present and detectable.
Likewise, MPSS detected expression of several MHC Class I and II genes, suggesting that MPSS can identify differences between ES cell samples when HLA gene expression is used to type cells [17, 18]. We also note that both H19 and Igf2 were expressed at detectable levels. H19 and Igf2 are located adjacent to each other on chromosome 11p15.5 and are reciprocally regulated by imprinting, H19 being paternally imprinted, and IGF2 being maternally imprinted [23, 24]. It is therefore likely that their ratio of expression is likely to differ between cell populations and may represent a simple assessment of the imprinting status of cells.
We classified genes expressed into ECM related, homeobox containing, zinc finger proteins, novel genes as well as genes which could assigned to major signaling pathways such as wnt, BMP/TGFβ, LIF, receptors, etc. This data is provided in excel files in the supplementary information provided (additional files 12, 13). Overall certain general themes emerged when genes were classified into such a fashion. We find that: A) hESC express markers characteristic of ES cells in general and few markers characteristic of differentiated cells confirming the initial purity of ES cells used for this analysis and the fidelity of the analysis B) Ribosomal protein transcripts, and mitochondrial genes are highly expressed in ES cells (relative to other transcripts) and constitute more than 50% of the total transcripts analyzed (Figure 1, additional files 5, 16, 17, 18). And this is similar to other samples analyzed [3–11], (Lynx Inc. data not shown) C) Positive regulators of the cell cycle, TERT and antisenescence related genes and DNA repair pathway regulators are expressed at high levels while proapototic genes, Rb and p53 pathways regulators are expressed at low levels (see table 2 for an example of TERT related gene expression, see supplementary tables (additional files 12, 13) for cell cycle, apoptosis and other pathways) D) The number of novel genes or genes of unknown function is high (2600/11,000) and constitutes approximately 25% of the unique signatures (see additional file 13 for a listing of genes of unknown function, their chromosomal mapping, and UniGene identity). Comparison with other samples suggest that the number of novel genes or genes of unknown function seen are higher in ES cells (25% versus 20%). E) Components of most major signaling pathways are present but so are negative regulators (including zinc finger proteins), suggesting that inhibition plays an important role in maintaining cells in an undifferentiated state (see additional file 13).
Examination of signaling pathways suggest that wnt, TGFβ and FGF signaling pathways are likely important in regulating the ES cell state while LIF/gp130 signaling is not as important. These conclusions are based on examining the expression of the positive and negative regulators of a particular pathway by MPSS, and EST scan. When critical components are low or absent we have tentatively assumed that the pathway is unlikely to be active. An example of the Igf/PTEN pathway is shown to illustrate the logic (Table 3) and other pathways along with verification with EST scan are summarized in the supplementary tables (additional files 12, 13). Note the high levels of soluble frizzled receptors and the expression of E-cadherin (negatively regulating β-catenin translocation). The expression of cadherin and β-catenin was confirmed by immunocytochemistry (Figure 2). The relatively fidelity of the conclusion was confirmed by examining the expression of E-cadherin by immunocytochemistry and localizing β-catenin expression.
We compared the signature sequences detected in the hESC to an MPSS database of 36 human tissues and cell lines to look for genes that are unique to, or highly overexpressed in hESC. A list of several hundred was generated when a cutoff of 30 tpm or higher (ten fold above detection level) that were elevated in ES cells when compared to neural stem cells examined in a similar manner was used. This list is provided in supplementary materials (additional file 14). A list of 13 highly enriched genes of unknown function is shown in Table 4, and the tpm values for the corresponding signatures in each of 36 tissues or cell lines is provided in the supporting information (additional file 15). The expression in ES cells, of these 13 genes was confirmed by designing PCR primers to different regions and examining gene expression (Figure 2). Several of these genes are highly expressed in hESC and absent in most other tissues tested (Table 4, additional file 15, and data not shown), are downregulated as ES cells differentiate (Figure 2), and are good novel, candidate markers for undifferentiated hESC.
Comparing with other data sets
Recently we and others have begun examining hESC with EST scan  and microarray analysis to develop a characteristic profile of this unique population [3–10]. We used this data to compare the sensitivity of MPSS with EST scan and microarray analysis. We have previously reported a set of 90 genes reported common to 6 different hESC lines . Of these, eighty-five were detected by MPSS showing a high degree of concordance (>90%). Of the five genes missing from the MPSS hESC data set, four of the genes had valid MPSS signatures (Table 5) and were readily detected in other human samples (data not shown). One gene (SNRPF) lacked a DpnII (GATC) site making it non-detectable by MPSS. GDF3 was detected at non-significant level in the hESC, though was detected by MPSS at higher level (10–30 tpm) in other ES cells tested (Dr. B. Lim-Harvard University personal communication, and additional file 17). Sperger et al., also used microarray to examine gene expression in undifferentiated cell lines . They compared expression in undifferentiated cells with expression in EC carcinoma lines and with microarray data from several other cell lines. They have identified 895 genes (GenBank accession numbers) which reduce to 718 number of unigene identities when mapped to the unigene build Hs161. We have compared this data with the MPSS data and see that MPSS identified the large majority of these genes as well (additional file 16). Similar results were obtained when data was compared with that reported by Sato et al., and Abetya et al., [6, 7] and a similar concordance in gene expression was observed (data not shown). Thus, MPSS provides an independent verification of the microarray results and in addition identifies other genes that may not be present on the arrays or detectable by current microarray techniques.
Comparison with an EST scan analysis of 37,081 EST sequenced from a similar pooled sample of hESC [9, 10] also showed a high degree of concordance. The EST scan analysis detected 8,801 distinct UniGene clusters in hESC versus 9,996 distinct UniGene clusters expressed at 4 tpm or higher in the MPSS dataset. Of the 8,801 UniGene clusters identified by the EST scan, 1,139 are singletons, i.e. identified by only one EST out of the 37,081 total EST's. 5,286 UniGene clusters have 5 or more ESTs as evidence, and only 118 UniGene clusters have more than 100 EST's as evidence. In contrast, all 9,996 UniGene clusters identified by MPSS were detected at 4 or more tpm and identified in multiple sequencing runs. More than 8,000 have at least 10 tpm, and over 1,000 have more than 100 tpm. Thus, although the EST's are longer in length and thus easier to assign to a particular gene, MPSS appears more sensitive than EST scan. MPSS for example identified almost twice as many genes as EST scan consistent with the difference in the depth of analysis (No of sequences MPSS/EST).
Richards et al  have used SAGE analysis to two ES cell lines. Their analysis revealed expression of approximately four thousand genes which was significantly fewer than that identified by MPSS consistent with the fewer number of gene tags sequenced. Comparison of the data sets however showed good concordance particularly for genes expressed at higher tpm levels. The entire comparison is presented in supplementary table (additional file 18) and is available for download from Lynx . Overall MPSS could identify genes that other methods identified with an average concordance rate of 70%. The depth of analysis with MPSS at 2.4 million signatures however was significantly greater. MPSS in general identified many more genes than microarray or EST scan or SAGE (see above). The most direct comparison is with EST scan or SAGE, which do not rely on comparative gene expression to establish significance of gene expression. Overall our comparison suggests that MPSS results provide a complementary global overview of the transcriptome of the ES cell. The data supplement and extend the microarray, SAGE and EST scan data sets and provide an independent verification of the same. MPSS in addition identifies additional genes expressed particularly at lower tpm, that are either not present on microarrays or not detected with a lower resolution analysis.
Our results provide a global overview of the gene expression pattern of undifferentiated human ES cells and allow comparisons with other data sets. These results suggest the hESC are an actively dividing population of cells that exhibit high metabolic activity. Our analysis detected expression of approximately 10,600 unique transcripts, a figure that about a third of the total number of mapped genes. Unlike other cell types, however, a much larger fraction of unknown or novel genes was present. This high ratio likely represents the paucity of information available in existing libraries on this relatively newly characterized cell population rather than the possibility that ES cells use radically different pathways for self-renewal, survival, proliferation and differentiation.
Our results confirm the reported differences between rodent and human ES cells. We confirm the absence of expression of ERAS, Ehox and the orthologs PEPP1 and 2. The apparent lack of LIF requirement of hESC is reflected by the absence or low tpm levels for genes of the LIF pathway and high tpm for suppressors of LIF mediated signalling (see supporting information). The high level of expression of genes in the FGF pathway likely reflects the requirement of hESC for bFGF. The high level of FGFR1 expression suggests that FGFR1 is an important signal transducer and that FGF's other than FGF4 are important in hESC self-renewal. The high tpm of the fibronectin receptor also suggest that fibronectin or vitronectin are likely useful substitutes for matrigel and that activation of ras mediated signalling is likely critical, as has been described in the rodent ES cell analysis .
Comparing data from the MPSS analysis with microarray, SAGE and EST scan analyses suggest that MPSS is a powerful alternative to these techniques. MPSS identified virtually all of the genes highlighted as genes common between six different human ES cell lines surveyed by microarray. We noted that most genes detected by microarray were expressed at high tpm indicating that MPSS is more sensitive than microarray analysis. MPSS however appeared to be able to identify genes detected by microarray. Analysing an additional 400 markers detected by MPSS using focused microarray or RT-PCR confirmed their expression , (data not shown). Likewise, MPSS analysis showed good concordance with the EST scan data at a fraction of the price. In contrast to the EST scan, tpm levels determined by MPSS are highly correlated to the mRNA levels present in the cells, even at low tpm values , and (Lynx unpublished results). Due to the low sampling number of most EST scans, this is not true for relatively low number of EST's found for a particular gene, and can be used only as a rough estimate of gene expression. Unlike other in depth analyses, the absence of markers in MPSS runs is also a powerful control provided that the marker possesses a GATC site. The chromosomal distribution of the genes expressed in hESC did not reveal any bias for a particular chromosome or chromosomal region. While a couple of "hotspots" and several "cold spots" were identified, in no case was any region comprised of all transcribed or all silent genes.
Another important conclusion from our analysis is that selection of input RNA is critical. In our case we tested samples repeatedly to assess their purity and made considerable efforts to establish subclones that did not require feeder cells that could be potentially contribute transcripts to the analysis. Given the range of tpm of biologically relevant molecules (5 to 32,000 in this experiment) we predict that even a 5% contamination can confound results or detailed comparisons across different laboratories.
We note also that gene transcription from both the X and Y chromosome is observed indicating that at least subtle differences will exist between male and female lines even in the undifferentiated state. Sex-based gene expression, along with MHC gene expression and ratio of expression of imprinted genes could serve to distinguish between different ES cell populations. The present results further suggest that analysing embryoid bodies that differentiate stochastically or analysing tissue samples (with variable proportions of cells) by MPSS will prove more difficult and that results will be variable. We suggest that variability can be reduced by pooling samples, normalizing by careful testing for known markers of differentiation, by semi quantitative PCR, or by focused microarray analysis.
While MPSS is cost-effective and sensitive, it is by no means perfect. MPSS is limited by the requirement that DpnII sites (GATC) be present in a gene and be present in a unique locus such that the signature obtained is unique. For example, SNRF expression could not be assessed directly, as no GATC site is present. The signatures for ZFP42 are ambiguous and map to multiple transcripts. Although MPSS can distinguish between alternate transcript termination sites, MPSS cannot distinguish between alternative splicing events and possible incomplete digestion during the sample preparation process. Signature lengths are relatively short and it is possible to have to select between multiple genome hits (reviewed in . Sequencing is performed four bases at a time and transcripts that contain palindromic sequences (in particular double palindromes) are often undetected because of self-hybridization of single DNA strands on the bead. A survey of the genome suggests that this is a rare event (approximately 3% of all virtual signatures in human MGC database have double palindromes). The NODAL gene is an example for such an event, where the class 1 signature was lost and NODAL expression is detected only by a signature resulting from incomplete digestion during library construction (see results). The success of MPSS analyses also depends to a large extent on the quality of genomic information available and, in our opinion, currently is best utilized to analyse human cells. Furthermore, MPSS itself may not be the best method for routine, lower throughput analyses, given price per sample, sample processing time and the large amount of data generated, which requires considerable analysis. However, the database, once developed, is extremely valuable provided it is freely available to make comparisons and to select subsets of genes for further analysis. MPSS information can be effectively utilized by establishing a common database of markers expressed at a defined stage in the differentiation of cells. Additional data sets from sampling of cells at well-controlled stages of differentiation that can be readily accessed and compared to existing datasets will provide the most information while still being cost effective. The genome database is an example of such sharing that has proven to be an invaluable resource for our experiments. Such a strategy requires cooperative pooling of information and free sharing such that individual results can be readily compared against validated datasets. Our future experiments will be directed and developing additional data sets of ES cell differentiation, which can be shared in a manner similar to the present set.
Our results provide a comprehensive data set that can be effectively utilized to analyse expression patterns of known and unknown genes. Comparison with other data sets provides independent confirmation of results and shows a high level of concordance. The caveats to all such large-scale comparisons are discussed and the importance of pooling data and comparing across multiple data sets is demonstrated.
MPSS was performed using RNA from three pooled ES cell lines (H1, H7, and H9) that had been maintained in feeder free culture conditions and evaluated for the presence of ES cell markers and absence of markers of differentiation. The mRNA was converted to cDNA and digested with DpnII. The last DpnII site and the downstream 16 bases were cloned into Megaclone vectors and their sequences determined according to the MPSS protocol [15, 16, 25]. A total of 2.786.765 sequences were read from four different runs and 48,388 unique signatures were identified. The abundance for each signature was converted to transcripts per million (tpm) for the purpose of comparison between samples. Signatures at an abundance of less than 4 tpm or those that were not detected in at least two runs were removed and a total of 22,136 sequences were analyzed further. All data is available for download from Lynx 
MPSS signature classification and annotation
To generate a complete, annotated human signature database, we extracted all the possible signatures ("virtual signatures") from the human genome sequence, the human Unigene sequences, and human mitochondrion. Each virtual signature was ranked, as outlined in the table 1a, based on its position and orientation in the original sequence. Unigene, genomic, and mitochondrial hits were combined and grouped by signature. The annotation was then assigned to the signature in following order of preference: repeat warnings (signature hits more than 100 genome locations); mitochondrial hits; Unigene hits; genome hits (if no transcript match found). If a signature matched only one Unigene cluster, the MPSS signature class is the lowest class of the member sequences of the cluster. If a signature hits multiple Unigene clusters, the best cluster hit is selected based on the lowest MPSS signature class or the largest number of member sequences. The resulting signature database was used to annotate the data from the experiments Initially the signatures were annotated using genome version hg15 (April 2003, Golden Path, UCSC,) and Unigene build #161 (additional file 2). Recently we re-annotated all signatures using genome version hg16 (July 2003, Golden Path, UCSC) and Unigene build #169 (additional file 3). Both annotations are available for download in supplemental tables .
Analysis was performed as described in Bhattacharya et al.,  using six different samples. These included two lines from Bresagen (01 and 02), the pooled sample from Geron comprising feeder free subclones of (H1, H7, H9), H1, grown in our laboratory on feeders and H9 and I6 from Dr. Itskovitz-Eldor grown following their published protocols.
Chromosomal mapping of MPSS signatures and UniGene clusters to the human genome
MPSS signatures with a hit to a UniGene cluster were mapped to the Giemsa staining cytobands of the hg15 release of the human genome (April 10, 2003 freeze, ). By this method, 7731 MPSS signatures were mapped to the cytobands of the human genome. Similar mapping was done for all UniGene clusters for which the chromosomal mapping is known. In order to achieve a gene-based rather than a transcript (i.e. splice variant) based distribution of genes splice variants the UniGene clusters were filtered using LocusLink data , since LocusLink captures all characterized splice variants of a particular gene. 23,828 UniGene clusters were identified by this method and mapped to the cytobands of the human genome. To discover differences in the number of genes mapped to each cytoband, the number of genes mapped to each cytoband was compared to the total number of genes analyzed, for both the MPSS signatures as well as for the UniGene clusters. The Fisher test  was used to determine the statistical significance, using a p-value = 0.05 as cutoff.
Gene detection by RT-PCR
Total RNA was isolated from cell pellets using RNAeasy Qiagen mini protocol and kit. cDNA was synthesized using 100 ng of total RNA in a 20-μl reaction. Superscript II (Gibco-BRL), a modified Maloney murine leukemia virus RT, and Oligo (dT)12–18 primers were used according to the manufacturer's instructions (Gibco-BRL). The list of primers used for RT-PCR and annealing conditions are described previously ).
Gadiraju S, Vyhlidal CA, Leeder JS, Rogan PK: Genome-wide prediction, display and refinement of binding sites with information theory-based models. BMC Bioinformatics. 2003, 4: 38-10.1186/1471-2105-4-38.
Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R, et al: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology Genome-wide prediction, display and refinement of binding sites with information theory-based models. Nucleic Acids Res. 2004, 4: D262-6. 10.1093/nar/gkh021.
Ginis I, Luo Y, Miura T, Thies S, Brandenberger R, Gerecht-Nir S, Amit M, Hoke A, Carpenter MK, Itskovitz-Eldor J, et al: Differences between human and mouse embryonic stem cells. Dev Biol. 2004, 269: 360-80. 10.1016/j.ydbio.2003.12.034.
Pennacchio LA: Insights from human/mouse genome comparisons. Mamm Genome. 2003, 14: 429-36. 10.1007/s00335-002-4001-1.
Loring JF, Porter JG, Seilhammer J, Kaser MR, Wesselschmidt R: A gene expression profile of embryonic stem cells and embryonic stem cell-derived neurons. Restor Neurol Neurosci. 2001, 18 (2–3): 81-8.
Sato N, Sanjuan IM, Heke M, Uchida M, Naef F, Brivanlou AH: Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol. 2003, 260: 404-13. 10.1016/S0012-1606(03)00256-2.
Abeyta MJ, Clark AT, Rodriguez RT, Bodnar MS, Pera RA, Firpo MT: Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum Mol Genet. 2004, 13: 601-8. 10.1093/hmg/ddh068.
Richards M, Tan SP, Tan JH, Chan WK, Bongso A: The transcriptome profile of human embryonic stem cells as defined by SAGE. Stem Cells. 2004, 22: 51-64.
Brandenberger R, Wei H, Zhang S, Lei S, Murage J, Fisk GJ, Li Y, Xu C, Fang R, Guegler K, et al: Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol. 2004, 22: 707-16. 10.1038/nbt971.
Bhattacharya B, Miura T, Brandenberger R, Mejido J, Luo Y, Yang AX, Joshi BH, Ginis I, Thies RS, Amit M, et al: Gene expression in human embryonic stem cell lines: unique molecular signature. Blood. 2004, 103: 2956-64. 10.1182/blood-2003-09-3314.
Sperger JM, Chen X, Draper JS, Antosiewicz JE, Chon CH, Jones SB, Brooks JD, Andrews PW, Brown PO, Thomson JA: Gene expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc Natl Acad Sci U S A. 2003, 100: 13350-5. 10.1073/pnas.2235735100.
Zeng X, Miura T, Luo Y, Bhattacharya B, Condie B, Chen J, Ginis I, Lyons I, Mejido J, Puri RK, et al: Properties of pluripotent human embryonic stem cells BG01 and BG02. Stem Cells. 2004, 22: 292-312.
Pera MF, Reubinoff B, Trounson A: Human embryonic stem cells. J Cell Sci. 2000, 113: 5-10.
Amit M, Margulets V, Segev H, Shariki K, Laevsky I, Coleman R, Itskovitz-Eldor J: Human feeder layers for human embryonic stem cells. Biol Reprod. 2003, 68: 2150-6.
Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, et al: Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol. 2000, 18: 630-4. 10.1038/76469.
Jongeneel CV, Iseli C, Stevenson BJ, Riggins GJ, Lal A, Mackay A, RA Harris, O'Hare MJ, Neville AM, Simpson AJ, et al: Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci U S A. 2003, 100: 4702-5. 10.1073/pnas.0831040100.
Carpenter MK, Rosler ES, Fisk GJ, Brandenberger R, Ares X, Miura T, Lucero M, Rao MS: Properties of four human embryonic stem cell lines maintained in a feeder-free culture system. Dev Dyn. 2004, 229: 243-58. 10.1002/dvdy.10431.
Rosler ES, Fisk GJ, Ares X, Irving J, Miura T, Rao MS, Carpenter MK: Long-term culture of human embryonic stem cells in feeder-free conditions. Dev Dyn. 2004, 229: 259-74. 10.1002/dvdy.10430.
Xu C, Inokuma MS, Denham J, Golds K, Kundu P, Gold JD, Carpenter MK: Feeder-free growth of undifferentiated human embryonic stem cells. Nat Biotechnol. 2001, 19: 971-4. 10.1038/nbt1001-971.
Carpenter MK, Rosler E, Rao MS: Characterization and differentiation of human embryonic stem cells. Cloning Stem Cells. 2003, 5: 79-88. 10.1089/153623003321512193.
Amit M, Shariki C, Margulets V, Itskovitz-Eldor J: Feeder layer- and serum-free culture of human embryonic stem cells. Biol Reprod. 2004, 70: 837-45.
Takahashi K, Mitsui K, Yamanaka S: Role of ERas in promoting tumour-like properties in mouse embryonic stem cells. Nature. 2003, 423: 541-5. 10.1038/nature01646.
Zemel S, Bartolomei MS, Tilghman SM: Physical linkage of two mammalian imprinted genes, H19 and insulin-like growth factor 2. Nat Genet. 1992, 2: 61-5. 10.1038/ng0992-61.
Leighton PA, Saam JR, Ingram RS, Tilghman SM: Genomic imprinting in mice: its function and mechanism. Biol Reprod. 1996, 54: 273-8.
Brenner S, Williams SR, Vermaas EH, Storck T, Moon K, McCollum C, Mao JI, Luo S, Kirchner JJ, Eletr S, et al: In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. Proc Natl Acad Sci U S A. 2000, 97: 1665-70. 10.1073/pnas.97.4.1665.
Siegel S, Castellan N: Nonparametric Statistics for the Behavioral Sciences. 1988, 2
This work was supported by the NIA and an ALS center grant. JC, TM, MR were supported by the NIA. RP was supported by the FDA. ST, RB, JL are employees of Geron Inc. IK and TV are employees of Lynx Inc. We thank Drs. Ginis and Limke for careful manuscript reading and all members of our laboratories for constant stimulating discussions. MSR acknowledges the contributions of Dr. S. Rao that made undertaking this project possible.
RB, IK and MR were primarily responsible for the data analysis and writing and editing the manuscript. ST generated the ES cell samples and verified their quality, TM performed the RT-PCR and JC performed the immunocytochemistry. RP generated the microarray data and TV and JL provided support and supervision for RB, IK, ST.
Ralph Brandenberger, Irina Khrebtukova contributed equally to this work.