Downloading TCGA-LUAD RNA-seq and Clinical Data

This document describes how to use R and TCGAbiolinks to download the RNA-seq STAR Counts data and the clinical data for the TCGA-LUAD cohort.

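1. Data overview

TCGA-LUAD is the lung adenocarcinoma dataset of the TCGA project.

This workflow downloads the following data:

  • RNA-seq expression data
    • Project: TCGA-LUAD
    • Data Category: Transcriptome Profiling
    • Data Type: Gene Expression Quantification
    • Workflow Type: STAR - Counts
  • Clinical data
    • The clinical table, downloaded with GDCquery_clinic()

According to the GDC query results, TCGA-LUAD STAR Counts contains: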

text
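Total files: 601
Primary Tumor (sample type 01): 540
Solid Tissue Normal (sample type 11): 59
Patients with both tumor and normal samples: 58

The remaining 2 files (601 - 540 - 59) carry a sample type code other than 01 or 11.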

2. Install the R packages

If TCGAbiolinks is not installed yet, run:

r
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c(
  "TCGAbiolinks",
  "SummarizedExperiment"
))

Load the packages:

r
library(TCGAbiolinks)
library(SummarizedExperiment)

3. Create the download directory

Create a dedicated directory for the TCGA-LUAD data:

bash
mkdir -p tcga_luad

All files produced in the following steps are saved under:

text
tcga_luad/

4. Query the RNA-seq data

r
library(TCGAbiolinks)
library(SummarizedExperiment)

project <- "TCGA-LUAD"

query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

Inspect the query results:

r
res <- getResults(query)
dim(res)
head(res)

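5. Count samples by type

Characters 14-15 of a TCGA barcode encode the sample type, and characters 1-12 identify the patient:

  • 01: Primary Tumor
  • 11: Solid Tissue Normal, i.e. tumor-adjacent normal tissue

Counting code: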

r
res <- getResults(query)

barcodes <- res$cases
# Characters 14-15 hold the sample type code; characters 1-12 identify the patient.
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)

cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")

# For each patient, check whether at least one tumor (01) and one normal (11) file exists.
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))

cat("Patients with both tumor and normal:", length(paired_patients), "\n")
cat("Paired tumor files:", sum(patient %in% paired_patients & sample_type_code == "01"), "\n")
cat("Paired normal files:", sum(patient %in% paired_patients & sample_type_code == "11"), "\n")

Results of this query:

text
Total files: 601
Primary Tumor 01: 540
Solid Tissue Normal 11: 59
Patients with both tumor and normal: 58
Paired tumor files: 70
Paired normal files: 58
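
The paired tumor file count (70) can exceed the number of paired patients (58) because one patient may contribute more than one tumor file. The sketch below reruns the same counting logic on a handful of made-up barcodes (the IDs are hypothetical and do not correspond to real TCGA-LUAD cases) so that each count can be checked by hand:

```r
# Hypothetical barcodes for illustration only; they are not real TCGA-LUAD cases.
barcodes <- c(
  "TCGA-AA-0001-01A-11R-0000-07",  # patient 0001, tumor, vial A
  "TCGA-AA-0001-01B-11R-0000-07",  # patient 0001, tumor, vial B
  "TCGA-AA-0001-11A-11R-0000-07",  # patient 0001, normal
  "TCGA-AA-0002-01A-11R-0000-07",  # patient 0002, tumor only
  "TCGA-AA-0003-02A-11R-0000-07"   # patient 0003, sample type code 02
)

sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)

# Per patient: is there at least one tumor file, and at least one normal file?
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))

n_paired_tumor <- sum(patient %in% paired_patients & sample_type_code == "01")
n_paired_normal <- sum(patient %in% paired_patients & sample_type_code == "11")
n_other <- sum(!sample_type_code %in% c("01", "11"))

print(table(sample_type_code))
cat("Paired patients:", length(paired_patients), "\n")
cat("Paired tumor files:", n_paired_tumor, "\n")
cat("Paired normal files:", n_paired_normal, "\n")
cat("Files with other sample type codes:", n_other, "\n")
```

In this toy set only one patient is paired, yet that patient accounts for two tumor files, which is the same pattern seen in the real counts.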

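6. Download the RNA-seq data

The TCGA-LUAD STAR Counts data total about 2.5 GB. Use a small chunk size when downloading, so that a network interruption is less likely to leave an incomplete tar archive.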

r
setwd("tcga_luad")

GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

If the download fails, the error typically looks like this:

text
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive

This usually means that the tar archive for one chunk was not downloaded completely. Delete the temporary .tar.gz files and download again:

bash
rm -f tcga_luad/*.tar.gz

Then rerun:

r
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)
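
Rerunning by hand can be replaced with a small retry loop. The sketch below is only an illustration: `retry()` is a hypothetical helper written here, not part of TCGAbiolinks, and it assumes the failed download surfaces as an R error that `tryCatch()` can intercept. A stand-in function that fails twice plays the role of the download so the example runs without network access:

```r
# Hypothetical helper (not part of TCGAbiolinks): call `f` until it succeeds
# or `max_tries` attempts have failed, pausing `wait` seconds between attempts.
retry <- function(f, max_tries = 3, wait = 0) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(result$value)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result$error))
    if (i < max_tries) Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed")
}

# Stand-in for a flaky download: fails twice, then succeeds.
attempts <- 0
flaky_download <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("unexpected end of file")
  "download complete"
}

status <- retry(flaky_download, max_tries = 5)
print(status)
```

With the real query, the call would look like `retry(function() GDCdownload(query, method = "api", files.per.chunk = 10), wait = 30)`. Incomplete .tar.gz files may still need to be removed between attempts, as described above.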

7. Prepare the RNA-seq data

After the download completes, use GDCprepare() to assemble the files into a SummarizedExperiment object:

r
se <- GDCprepare(query)

List the assay names:

r
assayNames(se)

Common assays include:

text
unstranded
stranded_first
stranded_second
tpm_unstrand
fpkm_unstrand
fpkm_uq_unstrand

Export counts, TPM, and FPKM:

r
counts <- assay(se, "unstranded")
write.csv(counts, "TCGA_LUAD_STAR_counts_unstranded.csv")

if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}

if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}
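
When these CSV files are read back, `read.csv()` by default converts column names into syntactic R names, which replaces the hyphens in TCGA barcodes with dots. The sketch below shows the effect on a toy matrix (gene IDs and barcodes are made up) and how `check.names = FALSE` keeps the barcodes intact:

```r
# Toy count matrix with hypothetical gene IDs and barcodes (not real data).
counts <- matrix(
  c(10L, 0L, 25L, 3L),
  nrow = 2,
  dimnames = list(
    c("ENSG00000000001.1", "ENSG00000000002.1"),
    c("TCGA-AA-0001-01A-11R-0000-07", "TCGA-AA-0001-11A-11R-0000-07")
  )
)

csv_file <- tempfile(fileext = ".csv")
write.csv(counts, csv_file)

# Default settings rewrite column names into syntactic R names.
mangled <- read.csv(csv_file, row.names = 1)

# check.names = FALSE keeps the barcodes exactly as written.
restored <- as.matrix(read.csv(csv_file, row.names = 1, check.names = FALSE))

print(colnames(mangled))
print(colnames(restored))
```

Keeping the original barcodes matters later, because the sample type and patient ID are extracted from fixed character positions of the barcode.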

Export the sample information:

r
sample_info <- as.data.frame(colData(se))
write.csv(sample_info, "TCGA_LUAD_sample_info.csv")

Save the R object:

r
saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

8. Download the clinical data

r
clinical <- GDCquery_clinic(
  project = "TCGA-LUAD",
  type = "clinical"
)

write.csv(
  clinical,
  "TCGA_LUAD_clinical.csv",
  row.names = FALSE
)
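
The clinical table has one row per patient, while the expression data has one column per sample, so the two are usually linked through the 12-character patient ID. The sketch below uses toy data frames with made-up values. The column names `barcode` and `submitter_id` are assumptions about the real tables and should be checked with `colnames()` before use:

```r
# Toy stand-ins with hypothetical values; real tables have many more columns.
sample_info <- data.frame(
  barcode = c(
    "TCGA-AA-0001-01A-11R-0000-07",
    "TCGA-AA-0001-11A-11R-0000-07",
    "TCGA-AA-0002-01A-11R-0000-07"
  ),
  stringsAsFactors = FALSE
)
clinical <- data.frame(
  submitter_id = c("TCGA-AA-0001", "TCGA-AA-0002"),  # assumed patient ID column
  vital_status = c("Alive", "Dead"),
  stringsAsFactors = FALSE
)

# Derive the patient ID and sample type code from each sample barcode.
sample_info$patient <- substr(sample_info$barcode, 1, 12)
sample_info$sample_type_code <- substr(sample_info$barcode, 14, 15)

# Left join: keep every sample and attach the matching patient's clinical row.
annotated <- merge(
  sample_info, clinical,
  by.x = "patient", by.y = "submitter_id",
  all.x = TRUE, sort = FALSE
)
print(annotated)
```

A left join (`all.x = TRUE`) is used so that samples without a matching clinical row are kept with missing values rather than dropped silently.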

9. Complete download script

Save the following as download_tcga_luad.R:

r
library(TCGAbiolinks)
library(SummarizedExperiment)

outdir <- "tcga_luad"
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
setwd(outdir)

project <- "TCGA-LUAD"

cat("Query RNA-seq STAR Counts...\n")
query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

cat("Count sample types...\n")
res <- getResults(query)
barcodes <- res$cases
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))

cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")
cat("Patients with both tumor and normal:", length(paired_patients), "\n")

cat("Download RNA-seq data...\n")
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

cat("Prepare SummarizedExperiment...\n")
se <- GDCprepare(query)

cat("Export expression matrices...\n")
write.csv(assay(se, "unstranded"), "TCGA_LUAD_STAR_counts_unstranded.csv")

if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}

if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}

write.csv(as.data.frame(colData(se)), "TCGA_LUAD_sample_info.csv")
saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

cat("Download clinical data...\n")
clinical <- GDCquery_clinic(project = project, type = "clinical")
write.csv(clinical, "TCGA_LUAD_clinical.csv", row.names = FALSE)

cat("Done.\n")

Run it:

bash
Rscript download_tcga_luad.R

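10. Output files

After download and preparation, the directory should contain the following exported files (the raw files fetched by GDCdownload() are stored separately, by default under GDCdata/):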

text
tcga_luad/
├── TCGA_LUAD_STAR_counts_unstranded.csv
├── TCGA_LUAD_STAR_tpm_unstranded.csv
├── TCGA_LUAD_STAR_fpkm_unstranded.csv
├── TCGA_LUAD_sample_info.csv
├── TCGA_LUAD_clinical.csv
└── TCGA_LUAD_STAR_counts_se.rds

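Where:

  • TCGA_LUAD_STAR_counts_unstranded.csv: raw counts, suitable for differential expression analysis
  • TCGA_LUAD_STAR_tpm_unstranded.csv: TPM expression matrix, suitable for visualization and some correlation analyses
  • TCGA_LUAD_STAR_fpkm_unstranded.csv: FPKM expression matrix
  • TCGA_LUAD_sample_info.csv: sample annotation
  • TCGA_LUAD_clinical.csv: clinical information
  • TCGA_LUAD_STAR_counts_se.rds: the complete SummarizedExperiment object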
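
The expected files listed above can be checked programmatically after a run. The following is a minimal sketch: `missing_outputs()` is a hypothetical helper defined here, and the demonstration uses a temporary directory with empty placeholder files rather than real downloads:

```r
expected_files <- c(
  "TCGA_LUAD_STAR_counts_unstranded.csv",
  "TCGA_LUAD_STAR_tpm_unstranded.csv",
  "TCGA_LUAD_STAR_fpkm_unstranded.csv",
  "TCGA_LUAD_sample_info.csv",
  "TCGA_LUAD_clinical.csv",
  "TCGA_LUAD_STAR_counts_se.rds"
)

# Hypothetical helper: return the expected files that are absent from `dir`.
missing_outputs <- function(dir, files = expected_files) {
  files[!file.exists(file.path(dir, files))]
}

# Demonstration in a temporary directory with only two of the files present.
demo_dir <- file.path(tempdir(), "tcga_luad_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, expected_files[1:2])))

still_missing <- missing_outputs(demo_dir)
print(still_missing)
```

For a real run, `missing_outputs("tcga_luad")` should return an empty character vector, provided the TPM and FPKM assays were present and therefore exported.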