Genome wide profiling of human embryonic stem cells (hESCs), their derivatives and embryonal carcinoma cells to develop base profiles of U.S. Federal government approved hESC lines

Background In order to compare the gene expression profiles of human embryonic stem cell (hESC) lines and their differentiated progeny and to monitor feeder contaminations, we have examined gene expression in seven hESC lines and human fibroblast feeder cells using Illumina® bead arrays that contain probes for 24,131 transcript probes. Results A total of 48 different samples (including duplicates) grown in multiple laboratories under different conditions were analyzed and pairwise comparisons were performed in all groups. Hierarchical clustering showed that blinded duplicates were correctly identified as the closest related samples. hESC lines clustered together irrespective of the laboratory in which they were maintained. hESCs could be readily distinguished from embryoid bodies (EB) differentiated from them and the karyotypically abnormal hESC line BG01V. The embryonal carcinoma (EC) line NTera2 is a useful model for evaluating characteristics of hESCs. Expression of subsets of individual genes was validated by comparing with published databases, MPSS (Massively Parallel Signature Sequencing) libraries, and parallel analysis by microarray and RT-PCR. Conclusion we show that Illumina's bead array platform is a reliable, reproducible and robust method for developing base global profiles of cells and identifying similarities and differences in large number of samples.


Background
Embryonic stem cells (ESCs), derived from the inner cell mass of pre-implantation embryos, have been recognized as the most pluripotent stem cell population. Human ES cells (hESCs) can be maintained and propagated on mouse or human fibroblast feeders for extended periods in media containing basic fibroblast growth factor (bFGF) [1][2][3][4] while retaining the ability to differentiate into ectoderm, endoderm and mesoderm as well as trophoectoderm and germ cells. Gene expression in hESC has been investigated by a variety of techniques including massively parallel signature sequencing (MPSS), serial analysis of gene expression (SAGE), expressed sequence tag (EST) scan, large scale microarrays, focused cDNA microarrays, and immunocytochemistry [5][6][7]. Markers for hESCs that may also contribute to the "stemness" phenotype have been established and markers that distinguish ESCs from embryoid bodies (EB) have been developed. Novel stage-specific genes that distinguish between hESCs and EBs have been identified and allelic differences between ESC have begun to be recognized [8][9][10].
As the potential of hESCs and their derivatives for regenerative medicine is being evaluated, it has become clear that the overall state of the cells, degree of contamination and comparisons of the more than a hundred different newly derived lines will need to be performed. It will be necessary to develop methods to monitor and assess hESC and their derivatives on a routine basis. Since differentiated cells are often scattered within or at the edge of colonies [11] and the differentiation is so subtle that morphological characteristics and even immunohistochemistry are insufficient to detect it, larger scale methods of analysis need to be developed.
Our strategy was to compare a variety of different hESC lines that were derived and expanded by three different institutions (WiCell Research Institute, BresaGen, Inc., and Technion-Israel Institute of Technology), and cultured in two separate laboratories (Burnham Institute and NIA) to a baseline set of data against which cell samples can be compared. By using cells grown in different conditions we expected to be able to identify core commonalities and by comparing feeders and embryoid bodies (EB) with hESC identify measures of contamination and early markers of differentiation. Further, by comparing embryonal carcinoma cell (EC) and karyotypically variant lines with hESC, we would be able to directly assess their utility as surrogates (for quality control purposes) for hESC.
We employed a pre-commercial prototype of the Illumina HumanRef-8 BeadChip [12], a genome-scale bead based array technology that combines the sensitivity and low cost of a focused array with the coverage of a large scale array, while requiring much smaller sample sizes than MPSS, EST scan or SAGE. We show that the Illumina bead based array correctly identified blinded duplicates as the closest related samples and readily distinguished between hESC lines, as well as between ESCs and EBs derived from them. This array allowed us to estimate the degree of feeder contamination present in the cultures. Similarities and differences between EC line NTera2 and hESC lines could be determined and verified, and the database comparisons allowed us to identify core self-renewal pathways that regulate hESC propagation.

Multiple hESC lines can be assessed by Illumina bead array
Forty-eight samples were selected from multiple laboratories and gene expression profiles were examined using a bead array containing 24,131 transcripts derived from the Human RefSeq database that included full length and splice variants. Each gene was represented by sequences containing an average of thirty beads to provide an internal measure of reliability. Samples included 7 hESC lines BG01, BG02, BG03, I6, H1, H7 and H9, EBs that were differentiated from hESCs of the three BG lines, human fibroblast feeder HS27 (ATCC), hESC-derived fibroblasts, karyotypically abnormal hESC line BG01 Variant (BG01V) [13] and EC line NTera2 [14]. Samples were blinded and biological and technical repeats were examined at the same time. A single slide contained eight replicates and six such slides were used for the present set of samples. Results were normalized to average following Illumina Beadstudio manual and the quality of each sample was assessed by immunocytochemitsry and RT-PCR prior to subjecting them for analysis (data not shown). Results from the entire sample set are available for download as an excel spreadsheet (Additional file 1) and a CD of the results is available upon request. The total number of genes identified as expressed at >0.99 confidence is summarized in Table 1. Intensity results are reported in arbitrary units and ranged from 10 to 20,000 (a two thousand fold range). Although the sensitivity of the array has been reported to be high, in the present report we have restricted our analysis to expression of at least 100 units in any one sample. Using this cutoff, on average cells expressed approximately 8,000 transcripts (Table 1, 2), a number similar to the number detected by SAGE, MPSS and EST analysis [5][6][7]10,15,16]. As with other analysis, genes with the highest abundance were housekeeping genes, ribosomal genes and structural genes ( Table 2 and Additional file 1). These genes were similar in most samples though relative levels varied.
One of the advantages of the Illumina arrays is the ability of running multiple samples simultaneously thus allowing multiple pairwise comparisons to be performed readily. To show the similarity of relative gene expression between samples, we have used Illumina Beadstudio and clustering software packages Pcluster [17] and TreeView [18] to generate a heat-map ( Figure 1) and a dendrogram ( Figure 2). Based on their properties, we classified some of our samples into four groups, (A) undifferentiated hESCs (including a sample from karyotypically abnormal variant, designated as "ES", n = 11); (B) differentiated ES cells and EBs (designated as "EB", n = 6); (C) hESC derived neural cells (designated as "NS", n = 3); and (D) hESC derived mesenchyme and human fibroblast feeder cells (designated as "FB", n = 5) and these groups were shown in the heat-map. Comparing the overall pattern of expression, we made several important observations: 1) Duplicates clustered close to each other and were more related to each other than to any other sample; 2) ESCs appeared more similar to each other than to EBs; 3) NTera2 cells appeared more similar to ESCs while differentiated NTera2 and EBs can be readily distinguished from their parent populations ( Figure 2); 4) BG01V appeared similar to undifferentiated BG01 cells; 5) In general ESC lines grown in one laboratory appeared more similar than samples grown in other laboratories, suggesting that culture conditions affected gene expression but that this effect was much smaller than the effect of differentiation.
The global analysis suggested that the bead arrays used were sufficiently sensitive such that individual subsets of genes could be analyzed, different populations of cells could be readily distinguished and that a subset of candidate genes could be sufficient to distinguish between groups of cells. The comparison across multiple samples will allow a set of core stem cell markers to be identified. In subsequent sections we have performed such analysis. Readers are urged to analyze the expression of desired genes directly as it is impossible to test every gene given the large body of data generated.

Comparison between MPSS and Illumina bead array results
We have previously used EST scan and MPSS to analyze pooled samples of ESCs and EBs from three different WiCell lines (H1, H7 and H9) [5]. Comparison between the two methodologies indicated that while there is good concordance for genes expressed at high levels, this does not hold for genes expressed at lower levels. As a test of the quality of the data generated in these experiments and to evaluate whether comparisons can be made across different methodologies, we re-ran the identical samples on the bead array platform. The complete comparison of gene Unsupervised two-way hierarchical cluster analysis of differ-entially expressed genes illustrated in a heat-map Figure 1 Unsupervised two-way hierarchical cluster analysis of differentially expressed genes illustrated in a heat-map. Each row represents the relative levels of expression of a single gene. Each column represents a sample. The samples include four groups of cells, ES designates 11 samples of hESCs, EB contains 6 samples of differentiated ESCs and EBs, NS consists of 3 hESC derived neural cells and FB is a collection of hESC derived mesenchyme and fibroblasts. High expressions relative to mean are colored red. Low expressions are colored green. Black represents no significant change in expression level between mean and sample. Samples cluster closer within their own group than samples from other groups. Table 1: Correlation coefficients of paired samples in this bead array In order to test the reproducibility and reliability of the bead array, duplicate samples of hESC lines H9, I6, and EC line NTera2 and human fibroblast feeders (HS27) were run at the same time and correlation coefficients (R 2 ) of duplicates were generated using the entire data of all genes with expression level >0 ( §), or genes with detection confidence >0.99 (*), or genes with detection confidence >0.99 and expression level > 100 arbitrary units (#). Note that the correlation coefficients are in the range of 0.9382-0.9761 and the number of genes was in the range of 10,000-14,000.

Duplicate Samples
No. of all genes (expr.>0) § expression is shown in Additional file 2 and is summarized in Tables 3 and Table 4. Overall, concordance in Illumina array was better than that evident between EST scan and MPSS datasets [9], but clearly showed much wider differences than that seen with running duplicates in the same assay format. Nevertheless, this comparison provides an independent verification of the data and suggests that if a sample is detected in more than one large-scale analysis, the reliability of the gene expression detection is high, which also reduces the number of individual genes needed to be verified. Caution should be observed in comparing different samples run on different platforms, especially when there has not been rigorous bioinformatic matching of the source sequences used to identify genes in the platforms. Often genes called by the same symbol originate from different database records, which may originate from different splice variants or contain sequence differences due to polymorphisms or outright error [19].

Human feeders and hESCs can be readily distinguished and contamination can be readily assessed
For all samples, we conducted an unsupervised one-way hierarchical clustering analysis. The clustering analysis was based on the average linkage and Euclidean distances as the similarity metric using differentially expressed genes identified by ANOVA (P < 0.05). The analysis revealed the underlying features and variation patterns of gene expression in each cell types. Figure 2 shows results of the cluster analysis of relative gene expression in selected samples. As one of our purposes of this study was to distinguish between human fibroblast feeders cells and hESCs and hEBs, wishing to readily detect feeder contamination in hESCs, we included one of the human feeder cells HS27 (ATCC) in this study. We have been using HS27 as feeder cells for H9 hESCs for more than two years and all hESCs grown on HS27 had normal karyotype, expressed all undifferentiated markers, and made teratomas with all germ layers (data not show). The global pairwise comparison clearly showed that human feeders were far more dissimilar to hESCs than hESCs grown in different laboratories, hESCs compared to their differentiated EBs that contained mesodermal tissue, and hESCs compared to the karyotypically variant hESC line BG01V. Pairwise comparisons of human feeders with hESCs resulted in a correlation coefficient of 0.66, which was less than the correlation coefficient of 0.71-0.74 observed between hESCs and their corresponding EBs. The large difference between human feeders and hESCs suggested that it would be possible to identify markers that were robust and reliable in distinguishing the two populations, and these markers would be sufficiently sensitive in detecting contamination of feeders. We examined the data to Dendrogram of unsupervised one-way hierarchical clustering analysis of relative expression of genes in selected samples Figure 2 Dendrogram of unsupervised one-way hierarchical clustering analysis of relative expression of genes in selected samples. The clustering analysis was based on the average linkage and Euclidean distances as the similarity metric using differentially expressed genes identified by ANOVA (p < 0.05). hESCs clustered together and BG lines cultured in the same laboratory shared the largest similarities. EBs were separated from hESCs from which they were derived. EC line NTera2 and feeder cells can be distinguished from hESCs respectively.
develop a list of genes that had high levels of expression in human feeder cells maintained in hESC medium but whose expression was low or absent in either ESCs or EBs. The absence of expression in EBs was used as a control for spontaneous differentiation of ESC colonies (including mesodermal differentiation) which may occur and the markers selected should be able to distinguish between these two events. A complete list of genes expressed at least ten-fold higher in human feeders is provided in Figure 3. Quantitative RT-PCR (qPCR) was used to verify the fold change of the expression of 4 genes, including THBS1, MMP3, TNFRSF11B and KRTHA4 ( Figure 3C). Further confirmation can also be done using immunocytochemistry, as antibodies against these genes are commercially available.
Thus this comparison allowed us to distinguish between hESCs and human feeders and identify candidate markers that could detect feeder cell contamination should human feeders be used in the propagation of hESCs.

hESCs and EBs can be distinguished from each other
Illumina bead array analysis confirmed that hESCs could be readily distinguished from EBs by global analysis. This raised the possibility that specific subsets of markers could be identified. We and others have used MPSS and EST scan and generated array data to make lists of hESC-specific genes [5,9,10,20]. As discussed above, most hESC markers identified by MPSS have been detected in the present bead array analysis (Table 3), confirming the utility of these previously identified markers for use in assessing undifferentiated status of hESCs. In addition, we have generated a list of genes differentially expressed at higher level in EBs than in hESCs, a subset of which is shown in Table 5. These markers were common to all EB samples tested and included genes known to be expressed in ectoderm, endoderm and mesoderm. The entire set of differentially expressed genes is provided in Additional file 3. Thus, the bead array format, which allows multiple pairwise comparisons, can be used to identify genes that are expressed by all differentiating EB samples in the present study. Our data suggested that a core set of limited markers might be sufficient to monitor the process of differentiation. By suitable selection of different germ cell layer specific markers one may also assess the overall quality of differentiation toward germ cells.

Smaller but distinct differences among undifferentiated hESC lines
Our cluster analysis indicated that BG01, BG02 and BG03 cell lines were overall more similar to each other than to other lines ( Figure 1 and 2), but nevertheless showed additional differences than technical or biological repeats of the same sample. This raised the possibility that this microarray strategy may be sufficiently sensitive to identify relatively cell type specific candidate genes that could be used to distinguish one hESC population from another or to identify differences that were due to varied isolation and growth conditions. As a test we looked for differences between BG01, BG02 and BG03, which were grown in the same laboratory under the same conditions. Lists of candidate genes are shown in Figure 4A, C and 4E and the comparison of these three lines are shown in scatter plots in Figure 4B, D and 4F.
We reasoned as well that such a global comparison should allow us to distinguish between male and female lines if genes present on the Y chromosome were expressed at high levels in the undifferentiated state and were detected by the bead array. Several such candidate genes were identified. The most robust were RPS4Y, RPS4Y2, and EIF1AY ( Figure 5). To confirm that these were useful markers, we designed RT-PCR primers and tested their expression in a male (BG01) and a female (BG03) line ( Figure 5B). We noted that several of these continued to be expressed at high levels as ESCs differentiated to form EBs and upon further differentiation (data not shown), suggesting that these markers might be used in adult stem cell and germ cell populations as well.
In summary, our data suggest that the bead array format is sufficiently sensitive and global that it can distinguish one cell line from another even if those two cell lines are grown in the same laboratory under virtually identical conditions. Bead array can also be used to distinguish between male and female lines.

Comparison of diploid pluripotent cells with NTera2 and BG01 variant
Our previous results have suggested that EC lines share many of the properties of hESCs and can be used as a useful model for initial testing of biological questions [21]. More recently we have identified BG01V as a karyotypically abnormal variant that behaves much like its normal counterpart BG01, but is not subject to the same constraints of use as karyotypically normal hESCs [13]. Given the sensitivity of the bead array analysis, we tested its ability to detect the overall similarities and differences between NTera2 and a pooled ESC sample or between the karyotypically abnormal BG01V and its normal parent line ( Figure 6).
Our results showed that, while NTera2 shared a high similarity with hESCs [21], it did have important differences with hESC lines. Examining these differences (summarized in Figure 6C and 6E), we noted that some reflected the origin of the tumor cells from which this line was derived [14]. Several germ cell markers such as GAGE2, GAGE7 and GAGE8 were highly expressed in NTera2 but were absent (or present at low levels) in any of the hESC lines examined (See Figure 6C and Additional file 1. Note that the GAGE genes are highly similar in sequence, making it difficult to distinguish one family member from another through hybridization; thus, while all of these GAGE gene probes gave positive signal, it is difficult to say if the signal came from the specific gene itself or from cross-hybridization from one of the other family members). None of these were present in BG01V, indicating that the karyotypically abnormal variant is not the equivalent of a teratocarcinoma line such as NTera2. In addition to the expression of germ cell markers, we noticed a significant difference in the expression of genes in the TGFβ pathway, such as GDF3 ( Figure 6C), TGFBI, CDKN1A, IGFBP7, IGFBP3, NODAL, CER1 and BMP2 ( Figure 6E). This is consistent with the postulated role of this pathway in germ cell differentiation [22,23] and suggests that TGFβ pathway cannot be reliably tested using NTera2 as a model for hESC.
The BG01V showed clear differences from its normal counterpart and some major changes are summarized in Figure 6D and 6F. Early markers of differentiation appeared to be present at higher levels in BG01V as compared to any of the hESC lines examined, although hESC specific genes continued to be expressed at high levels (see Additional file 4). In particular, the Wnt pathway and the TGFβ signaling pathway ( Figure 6D), both of which involved in the early process of differentiation [24,25], appeared to be activated (Additional file 4), suggesting that the role of growth factors and signaling in these early events cannot be readily studied in this cell line.
In summary, the analysis highlighted the utility of the potential reference standards NTera2 and BG01V, demonstrated their general similarity and provided detail on potential caveats to their application.

Global arrays provide a snapshot of the state of the cells and identify core self-renewal pathways
We have utilized a small fraction of the data to demonstrate the overall utility of this approach and its sensitivity in identifying small differences in cell populations. An additional potential application of such an analysis is the ability to examine the general state of a particular signaling pathway and determine whether it is active. By comparing across many samples, a procedure previously expensive and difficult in terms of the RNA and replicate To test whether we could use such multiple pairwise comparisons to elucidate the major regulatory pathways that may be required for hESC self-renewal, we examined several metabolic pathways. The results of the analysis of the insulin/insulin-like growth factor (IGF) signaling pathway are shown in Figure 7. Using the same 4 groups of samples as in Figure 1, we conducted PAM (Prediction Analysis of Microarray) [26], in search for biomarkers used in diagnostic identification of these four groups, ES, EB, NS, and FB. In PAM, a list of significant IGF pathway genes whose expression characterizes each diagnostic class was obtained. The average gene expression level in each class was divided by the within-class standard deviation.
The nearest centroid classification computed took the gene expression profile from a new sample and compared it to each of these class centroids. For cross-validation of prediction results, multiple classification processes were performed on two data sets randomly constructed each time from the entire gene expression dataset. The first dataset, consisting of 70% of the total data, was used as the training dataset, and the other dataset, containing the remaining 30% of data, was used for the data prediction and verification process. The final biomarkers were determined in such a way that the misclassification error rate was minimal. The resulting graph (Figure 7) showed the shrunken class centroids for genes that had at least one nonzero difference in each diagnostic class. The genes with nonzero components in each class were almost mutually exclusive and represented candidate biomarkers

Figure 3
Human fibroblast feeder cells can be distinguished from hESCs and EBs. Bead array identified lists of genes that were uniquely expressed in human fibroblast feeders as compared to hESCs (A) and hEBs (B). The four genes whose expression was confirmed by qPCR (C) were in bold. In the graph (C), gene expression of each gene in feeder cells was designated as 1 fold and the bars represented fold decrease for each gene.
for the diagnosis of each class. All data analyses were performed using the bioconductor package [17].

Discussion
Undifferentiated hESCs have been analyzed by EST scan, MPSS, SAGE and microarray [5,10,16]. The goal of these experiments including our own is to develop a low cost reliable method to assess multiple samples to generate a global database of markers and to provide a method of identifying core measures of similarities and differences across multiple laboratories. We and others have pro-posed three alternative methods of assessment: Quantitative RT-PCR [9,20], focused arrays [27] or a large scale array with bioinformatics tools being utilized to focus on appropriate subsets of genes [5,7,15,16,28]. Each of these methods has its advantages and disadvantages. The present results suggest that the global Illumina bead array retains the advantages of low cost per sample associated with focused arrays yet still has the strength of the global attributes of MPSS or EST scan while requiring much less RNA and turnaround time. To test this array format we examined samples from a variety of laboratories in a blinded fashion to determine whether the array was sufficiently sensitive and rapid for routine assessment. Duplicates using 100 ng of RNA were run and results obtained forty-eight hours later. The resolution was sufficient that ESC samples could be distinguished from one another and a variant karyotypically abnormal subclone could be distinguished from the parent population (correlation coefficient = 0.9043).
Aliquots of the pooled ES and pooled EB samples, which we had prepared for MPSS, were included in this run to compare these two methods directly. The current analysis confirms that comparison across platforms is difficult and BG lines show small but distinct differences as assessed by bead array Figure 4 BG lines show small but distinct differences as assessed by bead array. These three hESC lines share high similarities as shown by the scatterplots of BG01 vs BG02 (B), BG01 vs BG03 (D) and BG02 vs BG03 (F). Comparisons of all three lines were made and lists of selected genes that were specifically expressed in BG01 (A), BG02 (C) and BG03 (F) are shown. Correlation coefficients (R 2 ) were generated using all genes with expression level >0 (black and blue dots), or all genes with detection confidence >0.99 (blue dots). Genes outside the two thin red lines were detected at >2.5-fold difference.
that only positive results can be treated with any reliability. The absence of expression cannot be readily interpreted. In particular, genes expressed at low levels (greater than 70% of all genes detected) should not be assessed in cross platform comparisons. The limited concordance at low levels raises a question as to how many genes are actually expressed by any one cell line and whether the cutoff of 3 tpm used for MPSS or 100 intensity units for bead arrays is a reasonable cutoff. We used 100 units for our analysis and we would suggest that readers exercise similar caution.
Nevertheless even at this higher cutoff the arrays were remarkably sensitive and allowed us to readily distinguish between samples including cells grown in the same laboratory. The basis of the sensitivity could be attributed to a limited set of genes and those genes could be identified for future use. For example BG01V, while much more similar to BG01 than to any other cell type, could still be distinguished from a biological replicate of BG01 by the expression of a particular subset of differentiation markers ( Figure 6). EC cells such as NTera2 could be distinguished from hESCs by the expression of germ cell markers and the presence of a partially inactivated TGFβ (BMP) signaling pathway ( Figure 6).
Distinguishing ESCs from EBs was relatively straightforward. We have confirmed the utility of previously identified markers for use in this platform as well as identified an additional set of markers that can serve as biomarkers to distinguish between the hESC and EB states. A subset of these markers have been used to develop a qPCR assay that shows such a high sensitivity that changes in cell behavior can be detected after as little as twenty-four hours and the development of EBs can be reliably staged [10,20].
During the identification of ES and EB specific markers, we have noticed that some known hESC markers, such as Nanog, was not detected in all populations of hESCs that were included in this analysis. Several ESC-specific gene, including Lin41, Sox2 and FoxD3, were not detected in the array either (Table 3). We believe that the problem with Lin 41, Sox2 and FoxD3 is a technical one as we were able to confirm expression using alternate methods. We are in progress of redesigning appropriate probes for these genes. In the case of the gene Nanog, there are several pseudo genes in the genome for Nanog and it has been a major technical challenge designing primers or probes that are specific and sensitive. We believe that a partial explanation for the variability in Nanog expression is due to the lack of sensitivity to this gene. However, immunocytochemistry while not strictly quantitative shows similar variability when used to assess Nanog expression in different cell lines [9,27,28].
This large comparison between samples allowed us to identify markers that distinguish human feeder cells from hESC. While we have listed 19 potential markers ( Figure  3) and identified several hundred potential markers as shown in Additional file 5, we suggest that as few as 3-4 genes may be sufficient. Previously we found that as few as four were satisfactory to distinguish between hESCs and hEBs, which are two much more closely related samples [9]. In this study we have confirmed by qPCR the differential expression of four genes, THBS1, MMP3, TNFRSF11B and KRTH4, to separate human fibroblast feeders and hESCs ( Figure 3). Several markers such as MMP3 and Male and female hESC lines can be distinguished by genes identified by bead array Figure 5 Male and female hESC lines can be distinguished by genes identified by bead array. Five potential genes RPS4Y, RPS4Y2, EIF1AY, VCY, and AMELY are located in the Y chromosome. By comparing the expression level of these genes in all hESC lines, we have found that 3 out of 5 were specifically expressed in male hESC lines I6, BG01 and BG02 (A) and this was verified by RT-PCR in male line BG01 and female line BG03 (B). G3PDH was used as an internal control. *: represents the gene expression level is detected at <0.99 confidence.
Diploid pluripotent EC cell line NTera2 and karyotypically abnormal hESC line BG01V can be distinguished from normal hESCs using Illumina array Figure 6 Diploid pluripotent EC cell line NTera2 and karyotypically abnormal hESC line BG01V can be distinguished from normal hESCs using Illumina array. Comparison of NTera2 and pooled hESC sample resulted a correlation coefficient of 0.8997. Two lists of genes, which were specifically expressed in NTera2 (C) or in hESCs (E) were identified. Likewise, while sharing similarities with BG01 (B, correlation coefficient= 0.9043), BG01V was different from BG01 in expression for many genes, particularly genes from the TGFβ pathway (D, F). Black dots represent genes that were detected at >0 expression level, blue dots represent genes that were detected both at > 0 expression level and at >0.99 confidence. Genes plotted outside the two thin red lines were detected at >2.5-fold difference.
TNFRSF11B have commercially available antibodies (R&D systems) that may be used to further confirm contamination of feeder cells by immunocytochemistry. Efforts to identify other useful antibodies based on these results continue [29].
While we have focused on the immediate utility of the Illumina array platform, it is important to remember that this array provides a global snapshot of cell state and the data obtained can be readily compared in order to determine key signaling pathways. The ability to compare multiple samples in one run enhances data selectivity and reliability. To make such analysis more readily available, we utilized several software tools including the software package available through Illumina. The BeadStudio software provided with the BeadLab and BeadStudio genetic analysis systems for use with the bead array datasets pro-vides a useful set of analytical and presentation tools that allow straightforward comparisons, which are sufficient for average users. For detailed analysis we recommend using more specific commercial tools or software packages developed by NCBI.

Conclusion
In summary, the Illumina bead array has several key strengths including high throughput, low cost and high sensitivity. By using this array, we can readily detect contaminating feeders and spontaneous differentiation, differentiate male and female lines and distinguish between one undifferentiated population and another. Such a global analysis allows us to assess context dependent signaling and identify biomarkers of particular states of cells.
Our future efforts will focus on data mining and developing better cross platform comparison tools and generating focused high throughput arrays for quality control in clinical and research settings. Different hESC lines were grown in slightly different culture conditions as described above. H lines were grown on Matrigel coated dishes, while BG lines on fibronectin treated dishes. These coating substrata supported the growth of hESCs similarly, as evaluated by colony mor-

Figure 7
Identification of diagnostic markers by PAM. The shrunken class centroids for genes which have at least one nonzero difference are shown. The genes with nonzero components in each class were almost mutually exclusive and were the candidate molecular markers for the diagnosis of the four groups of cell populations, including, (from left to right) hESC derived mesenchyme and human fibroblast feeder cells ("FB", n = 5), undifferentiated hESCs ("ES", n = 11), hESC derived neural cells ("NS", n = 3), and differentiated ES cells and EB, ("EB", n = 6). The identified biomarkers can be used to distinguish the four groups of cell populations.
phology, immunocytochemistry and proliferation rate (data not shown).
Embryoid bodies (EBs) were prepared from BG lines as described in [5]. Cells were aggregated and cultured on non-adherent substrata for fourteen days.
Other cells NTera2 cells were purchased from ATCC and cultured in parallel with hESCssamples using protocols described previously [21]. HS27 embryonic human newborn foreskin cells (ATCC CRL-1634) were grown in DMEM with 10%FBS.
All samples included in this study can be found in Additional file 6.
Bead array gene expression analysis RNA was isolated from cultured cells using the Qiagen RNEasy kit (Qiagen, Inc, Valencia, CA). Sample amplification was performed using 100 ng of total RNA as input material by the method of Van Gelder et al [30]. Amplified RNA synthesized from limited quantities of heterogenous cDNA [30] was performed using the Illumina RNA Amplification kit (Ambion, Inc., Austin, TX) following the Manufacturer instructions. Labeling was achieved by use of the incorporation of biotin-16-UTP (Perkin Elmer Life and Analytical Sciences, Boston, MA) present at a ratio of 1:1 with unlabeled UTP. Labeled, amplified material (700 ng per array) was hybridized to a pilot version of the Illumina HumanRef-8 BeadChip according to the Manufacturer's instructions (Illumina, Inc., San Diego, CA). Amersham fluorolink streptavidin-Cy3 (GE Healthcare Bio-Sciences, Little Chalfont, UK) following the BeadChip manual. Arrays were scanned with an Illumina Bead array Reader confocal scanner according to the Manufacturer's instructions. Array data processing and analysis was performed using Illumina BeadStudio software.

Identification of differentially expressed genes and clustering analysis
Differentially expressed genes between ES and EB were identified by ANOVA at p value 0.05 using bioconductor [17]. Unsupervised hierarchical clustering analysis and principal component analysis (PCA) were conducted using software Pcluster [31] and TreeView [18].
Identification of diagnostic markers PAM (prediction analysis of microarray) was employed for the identification of diagnostic markers from insulin pathway genes by using the software package bioconductor [17]. PAM is a class prediction method for expression data mining. It can provide a list of significant genes whose expression characterizes each diagnostic class. The average gene expression level in multiple classes, such as ES, EB, NS, and FB, was divided by the within-class stand-ard deviation for that gene. The nearest centroid classification computed by PAM takes the protein expression profile from a new sample, and compares it to each of these class centroids [26].

Authors' contributions
YL contributed significantly in array data analysis and validation and drafted the manuscript. SS provided hESC samples and validated array results. XZ provided a large number of the hESC samples for this report and participated in manuscript drafting. HL performed bioinformatics study under the supervision of MZ who also participated in all discussions of the manuscript. RG and FJM isolated total RNA from the human ES cell line WA09 (H9) cell line grown on feeder (human HS27 fibroblast) and feeder-free conditions for > 20 passages at the Burnham Institute for Medical Research. These RNA samples were analyzed by Illuminas Bead Array under the supervision of JFL. CMS provided NTera2 sample, participated in analysis of this EC line and helped draft the manuscript. HX performed validation of the array results. SCB, EC, DLB, TKM and SO participated in the array design and analysis. MPM monitored the whole project. MSR conceived of the study, designed the whole project, coordinated the collaboration, supervised YL, SS, HX and CMZ, and helped draft the manuscript.