The cell line transcriptome

The term transcriptome refers to the full set of RNA molecules transcribed within a cell at a given time point. In contrast to the genome, which is largely stable across the different cells of an organism, the transcriptome varies greatly. This plasticity makes the transcriptome appealing to study, because it can serve as a proxy for cellular identity and diversity. In the Cell Atlas, all 19613 protein-coding human genes are classified according to their expression across a large number of in vitro cultured cell lines (Figure 1) (Uhlén M et al, 2015). The cell lines were harvested during the log phase of growth, and the extracted high-quality mRNA was used as input material for library construction and subsequent sequencing. The expression level of gene-specific transcripts is given as Transcripts Per Million (TPM) values. Genes with a TPM value ≥1 are considered detected. Altogether, the transcriptomes of 64 cell lines have been analyzed to form the basis of the different expression categories.
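To make the TPM unit and the detection cutoff concrete, the following is a minimal sketch using the standard definition of TPM (length-normalised read rates rescaled to sum to one million per sample). The processing pipeline behind the Cell Atlas is not described in this text, so the function names, gene names, counts and lengths below are all hypothetical.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM.

    counts: dict gene -> number of reads assigned to the gene
    lengths_kb: dict gene -> effective transcript length in kilobases
    """
    # Step 1: length-normalise, giving reads per kilobase for each gene.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Step 2: rescale so that the values in one sample sum to one million.
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


def detected_genes(tpm, threshold=1.0):
    """Return the set of genes at or above the detection cutoff (TPM >= 1)."""
    return {gene for gene, value in tpm.items() if value >= threshold}


# Hypothetical toy sample with three genes (made-up counts and lengths).
toy_counts = {"GENE_A": 500, "GENE_B": 0, "GENE_C": 1500}
toy_lengths_kb = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 3.0}
toy_tpm = counts_to_tpm(toy_counts, toy_lengths_kb)
print(toy_tpm)
print(sorted(detected_genes(toy_tpm)))
```

Because TPM values sum to the same total in every sample, a fixed cutoff such as TPM ≥1 can be applied uniformly across cell lines.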

Approximately one third of all protein-coding genes (n=6693) were expressed in all cell lines, consistent with a "housekeeping" function for the corresponding proteins. A further 11% (n=2069) of all genes were not detected in any of the analyzed cell lines, suggesting that the corresponding proteins are only expressed in highly specialized cell types, during specific developmental stages or under specific conditions such as cell stress. Another 43% (n=8455) of the protein-coding genes show a more restricted pattern of expression across the analyzed cell lines, with some expressed in only a few or even a single cell line. Table 1 shows the specific expression profile for each analyzed cell line, with clickable numbers for total detected genes, cell line enriched genes, group enriched genes and cell line enhanced genes.
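The grouping described above can be illustrated with a small sketch that counts, for each gene, the number of cell lines in which it passes the TPM ≥1 cutoff. This is a simplification: the three-way split used here (all, none, some) is coarser than the atlas categories, whose exact criteria are not given in this text. The function name, cell line labels and TPM values are hypothetical.

```python
def classify_by_detection(tpm_by_cell_line, threshold=1.0):
    """Group genes by the number of cell lines in which they are detected.

    tpm_by_cell_line: dict cell line -> dict gene -> TPM value
    Returns dict gene -> "all", "none" or "some".
    """
    cell_lines = list(tpm_by_cell_line)
    genes = set()
    for profile in tpm_by_cell_line.values():
        genes.update(profile)

    labels = {}
    for gene in genes:
        # A gene missing from a profile is treated as not detected.
        n_detected = sum(
            1 for line in cell_lines
            if tpm_by_cell_line[line].get(gene, 0.0) >= threshold
        )
        if n_detected == len(cell_lines):
            labels[gene] = "all"
        elif n_detected == 0:
            labels[gene] = "none"
        else:
            labels[gene] = "some"
    return labels


# Hypothetical toy data: two cell lines, three genes, made-up TPM values.
toy_tpm = {
    "LINE_1": {"G_ALL": 10.0, "G_NONE": 0.2, "G_SOME": 5.0},
    "LINE_2": {"G_ALL": 3.0, "G_NONE": 0.0, "G_SOME": 0.5},
}
toy_labels = classify_by_detection(toy_tpm)
print(toy_labels)
```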

  • 255 genes found only in cell lines and not tissues
  • 1220 genes found only in tissues and not cell lines

The cell line transcriptome was compared with the transcriptome of 37 different normal tissues and organs. 255 genes were only expressed in cell lines and not in any of the analyzed normal tissue types. These genes serve as an interesting starting point for studying the function and role of the corresponding proteins in human biology. Furthermore, 1220 genes were only found to be expressed in normal human tissues and not in any of the analyzed cell lines. Several of the proteins corresponding to these genes have functions associated with differentiated cells in specialized tissues or subcompartments of tissues, exemplified by ACR (acrosin), the major proteinase present in the acrosome of mature spermatozoa in normal testis, and ABCB11, the major canalicular bile salt export pump in normal liver.
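The comparison between the two datasets amounts to set differences between the genes detected in any cell line and the genes detected in any tissue. A minimal sketch follows, assuming both inputs are already reduced to sets of detected gene names. The function name and the genes GENE_X and GENE_Y are hypothetical placeholders, while ACR and ABCB11 are the tissue-only examples named above.

```python
def compare_detection(cell_line_genes, tissue_genes):
    """Split detected genes into cell-line-only, tissue-only and shared sets."""
    cell_line_only = cell_line_genes - tissue_genes
    tissue_only = tissue_genes - cell_line_genes
    shared = cell_line_genes & tissue_genes
    return cell_line_only, tissue_only, shared


# Toy input: GENE_X and GENE_Y are made-up names for illustration only.
detected_in_cell_lines = {"GENE_X", "GENE_Y"}
detected_in_tissues = {"GENE_Y", "ACR", "ABCB11"}

cell_line_only, tissue_only, shared = compare_detection(
    detected_in_cell_lines, detected_in_tissues
)
print(sorted(cell_line_only), sorted(tissue_only), sorted(shared))
```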

Figure 1. Pie chart showing the number of genes in the different RNA-based categories of gene expression in the panel of cell lines.

Table 1. Table showing the number of detected genes per cell line based on RNA sequencing (TPM ≥1), and the number of genes in the enriched and enhanced categories.

Cell line | Detectable genes | Enriched genes | Group enriched genes | Enhanced genes
A-431 | 11451 | 8 | 33 | 164
A549 | 11875 | 9 | 32 | 207
AF22 | 12011 | 22 | 83 | 405
AN3-CA | 11275 | 12 | 28 | 229
ASC diff | 11243 | 22 | 48 | 262
ASC TERT1 | 11240 | 1 | 29 | 205
BEWO | 11783 | 54 | 79 | 436
BJ | 11658 | 0 | 15 | 168
BJ hTERT+ | 11536 | 0 | 0 | 0
BJ hTERT+ SV40 Large T+ | 11502 | 0 | 0 | 0
BJ hTERT+ SV40 Large T+ RasG12V | 11583 | 0 | 0 | 0
CACO-2 | 11515 | 19 | 73 | 251
CAPAN-2 | 11896 | 12 | 55 | 361
Daudi | 10093 | 6 | 59 | 199
EFO-21 | 12367 | 14 | 60 | 296
fHDF/TERT166 | 11392 | 5 | 21 | 229
HaCaT | 11746 | 18 | 59 | 324
HAP1 | 11638 | 6 | 35 | 199
HBEC3-KT | 11108 | 4 | 14 | 131
HBF TERT88 | 11248 | 0 | 2 | 57
HDLM-2 | 11180 | 71 | 72 | 392
HEK 293 | 12072 | 9 | 24 | 223
HEL | 11238 | 40 | 121 | 328
HeLa | 11504 | 10 | 18 | 142
Hep G2 | 11132 | 85 | 104 | 266
HHSteC | 11413 | 5 | 29 | 176
HL-60 | 10266 | 1 | 26 | 121
HMC-1 | 11522 | 50 | 94 | 437
HSkMC | 11684 | 11 | 57 | 242
hTCEpi | 11248 | 13 | 37 | 189
hTEC/SVTERT24-B | 11543 | 2 | 11 | 131
hTERT-HME1 | 10880 | 1 | 12 | 97
HUVEC TERT2 | 11138 | 13 | 61 | 225
K-562 | 10917 | 14 | 73 | 206
Karpas-707 | 10712 | 20 | 61 | 296
LHCN-M2 | 11311 | 9 | 19 | 158
MCF7 | 11288 | 7 | 14 | 227
MOLT-4 | 10516 | 30 | 45 | 172
NB-4 | 10874 | 11 | 51 | 200
NTERA-2 | 12573 | 40 | 111 | 399
PC-3 | 11843 | 5 | 28 | 191
REH | 11010 | 15 | 48 | 218
RH-30 | 11347 | 31 | 40 | 260
RPMI-8226 | 10968 | 19 | 64 | 240
RPTEC TERT1 | 11670 | 30 | 53 | 278
RT4 | 11570 | 32 | 54 | 335
SCLC-21H | 12773 | 90 | 180 | 753
SH-SY5Y | 12351 | 46 | 141 | 516
SiHa | 11669 | 5 | 21 | 208
SK-BR-3 | 11146 | 26 | 42 | 255
SK-MEL-30 | 11379 | 23 | 36 | 193
T-47d | 11868 | 19 | 46 | 357
THP-1 | 11338 | 21 | 54 | 232
TIME | 11310 | 2 | 53 | 273
U-138 MG | 11521 | 5 | 13 | 185
U-2 OS | 12751 | 21 | 68 | 300
U-2197 | 11418 | 16 | 34 | 218
U-251 MG | 11215 | 1 | 7 | 73
U-266/70 | 11409 | 19 | 90 | 385
U-266/84 | 10922 | 18 | 78 | 231
U-698 | 10117 | 16 | 55 | 185
U-87 MG | 11877 | 13 | 36 | 289
U-937 | 10845 | 15 | 61 | 238
WM-115 | 11898 | 15 | 42 | 273

A diversity of cell lines

The 64 different cell lines used in the Human Protein Atlas have been selected to represent various cell populations in different tissue types and organs of the human body. The vast majority of the selected cell lines have been derived from human cancers and are thus best described as human cancer cell lines with limited resemblance to normal cell types. Cell lines are in general adapted to cultivation in vitro and can only approximate the lives of normal cells that perform their function in a complex tissue context. As cancer is a composite tissue with heterogeneous cancer cell populations in addition to the stromal component, it is not surprising that several features of the normal cell corresponding to the putative progenitor cell are lacking in the corresponding cancer-derived cell line.

Despite the evident differences between primary cells in tissue and in vitro cultured cell lines, an unbiased hierarchical clustering analysis (Figure 2) shows that cell lines do cluster as expected from similarities in origin and phenotype of the cancer cells from which the respective cell line was derived. This can be exemplified by the derivatives of the isogenic BJ fibroblast model, which mimics the four stages of malignant transformation (normal, immortalized, transformed and metastasizing) by cumulative addition of defined genetic elements (Hahn WC et al, 1999). At the highest level of separation, cell lines that grow in suspension and represent hematopoietic and lymphoid cell systems cluster together and separate into two major clusters depending on myeloid or lymphoid origin/phenotype. Moreover, several related cell lines cluster together, such as the versions of immortalized and transformed fibroblastic cell lines (BJ derivatives), glioma (U-138 MG and U-251 MG), melanoma (WM-115 and SK-MEL-30), breast cancer (SK-BR-3, MCF7 and T-47d) and endothelial cell lines (TIME and HUVEC TERT2).

The selection of human cancer cell lines for the Cell Atlas was intended to match the origin and phenotype of the solid cancer types represented in the Pathology Atlas of the Human Protein Atlas. Special emphasis has been placed on representing cells of the hematopoietic and immune system, as the corresponding tumor types are more scarcely represented in the Pathology Atlas. Data from 7 and 8 cell lines representing different stages of myeloid and lymphoid differentiation, respectively, have been generated and analyzed. In addition to cancer-derived cell lines, there are also a number of cell lines that have been generated through in vitro protocols for immortalization of growing cells as well as stem cells. Details regarding the different cell lines can be found here.

Figure 2. Hierarchical clustering based on RNA sequencing data for the 64 cell lines. The color of the cell line name represents its origin: red - myeloid, yellow - lymphoid, brown - lung, periwinkle - brain, turquoise - renal, urinary and male reproductive system, green - breast and female reproductive system, pink - sarcoma, purple - fibroblast, dark blue - abdominal, black - miscellaneous. Cells immortalized by the introduction of telomerase are indicated by an asterisk (*).
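As an illustration of the kind of analysis summarised in Figure 2, the sketch below performs agglomerative hierarchical clustering on expression profiles. The distance measure and linkage used for the actual figure are not stated in this text, so the choices here (log2(TPM + 1) transform, 1 minus Pearson correlation as distance, average linkage) are assumptions, and the sample names and TPM values are made up.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they occur.

    profiles: dict sample name -> list of TPM values (same gene order).
    """
    # Assumption: log2(TPM + 1) transform and 1 - Pearson r as distance.
    logged = {name: [math.log2(v + 1) for v in values]
              for name, values in profiles.items()}
    dist = {frozenset((a, b)): 1 - pearson(logged[a], logged[b])
            for a, b in combinations(logged, 2)}

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in logged]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical profiles over five genes: two similar pairs of samples.
toy_profiles = {
    "SUSP_1": [100, 80, 0, 1, 5],
    "SUSP_2": [90, 85, 1, 0, 6],
    "ADH_1": [0, 2, 120, 60, 5],
    "ADH_2": [1, 0, 100, 70, 4],
}
merges = average_linkage(toy_profiles)
for left, right in merges:
    print(left, "+", right)
```

The order of the merges corresponds to reading a dendrogram from the leaves upward: the most similar samples are joined first, and the final merge is the highest level of separation.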

Cell line enriched genes

A majority of the cell line enriched genes also belong to the tissue elevated gene expression categories (tissue enriched, group enriched and tissue enhanced). The expression pattern in normal tissues and the function of these proteins relate to the specific traits and functions of the corresponding normal tissue type and organ. Examples are presented in Figure 3 and include:

  • The secreted proteins AHSG and ALB, which are only expressed in normal liver and the liver-derived cell line Hep G2, where immunofluorescent analysis shows localization to the Golgi apparatus and vesicles, respectively.
  • The transcription factor HOXB13, which is only expressed in the nuclei of prostate, colon and rectum tissue as well as in the prostate-derived cell line PC-3.
  • The adhesion glycoprotein CDH15, which is enriched in skeletal muscle tissue and in the sarcoma cell line RH-30.
  • The enzyme TYR, which is exclusively expressed in skin and in the melanoma-derived cell line SK-MEL-30.
  • The epidermal growth factor receptor EGFR, which is enriched in female tissues and skin, and in the skin-derived cell line A-431.

The RNA-seq data for all 64 cell lines, which together express 89% (n=17544) of all protein-coding human genes, are presented in the Cell Atlas. The data can be used as a tool for selecting suitable cell lines for an experiment involving a particular gene or pathway, or for further studies on the transcriptome of established human cell lines.
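Selecting a cell line for a gene of interest can be sketched as a filter and sort over per-cell-line TPM values. The example below assumes the expression data have been loaded into a nested dictionary; the function name, the cell line labels, the gene name and the TPM values are hypothetical and are not taken from the Cell Atlas.

```python
def rank_cell_lines(tpm_by_cell_line, gene, threshold=1.0):
    """Return (cell line, TPM) pairs where the gene is detected, highest first.

    tpm_by_cell_line: dict cell line -> dict gene -> TPM value
    """
    hits = [
        (line, profile.get(gene, 0.0))
        for line, profile in tpm_by_cell_line.items()
        if profile.get(gene, 0.0) >= threshold
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical expression data for one gene in three cell lines.
toy_expression = {
    "LINE_A": {"GENE_X": 0.4},
    "LINE_B": {"GENE_X": 25.0},
    "LINE_C": {"GENE_X": 3.0},
}
ranking = rank_cell_lines(toy_expression, "GENE_X")
print(ranking)
```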

AHSG - Hep G2
ALB - Hep G2
HOXB13 - PC-3

CDH15 - RH-30
TYR - SK-MEL-30
EGFR - A-431

Figure 3. Examples of proteins with enriched expression in a cell line and the corresponding tissue of origin. The proteins are AHSG, ALB, HOXB13, CDH15, TYR, and EGFR. The immunohistochemical (IHC) staining shows the protein expression pattern in tissue in brown. The immunofluorescent (IF) staining shows the protein subcellular expression pattern in cell lines in green. The nucleus and microtubules are shown in blue and red respectively in the IF images.

Relevant links and publications

Hahn WC et al, 1999. Creation of human tumour cells with defined genetic elements. Nature.
PubMed: 10440377 DOI: 10.1038/22780

Uhlén M et al, 2015. Tissue-based map of the human proteome. Science.
PubMed: 25613900 DOI: 10.1126/science.1260419