Downloading TCGA-LUAD RNA-seq and Clinical Data
This document describes how to use R and TCGAbiolinks to download the RNA-seq STAR Counts data and the clinical data for the TCGA-LUAD cohort.

1. Data Description
TCGA-LUAD is the lung adenocarcinoma dataset of the TCGA project.
This workflow downloads:
- RNA-seq expression data
  - Project: TCGA-LUAD
  - Data Category: Transcriptome Profiling
  - Data Type: Gene Expression Quantification
  - Workflow Type: STAR - Counts
- Clinical data
  - The clinical table, downloaded with GDCquery_clinic()
According to the GDC query results, the TCGA-LUAD STAR Counts data contains:
text
Total files: 601
Primary Tumor (sample type 01): 540
Solid Tissue Normal (sample type 11): 59
Patients with both tumor and normal samples: 58

2. Install R Packages
If TCGAbiolinks is not installed yet, run:
r
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c(
  "TCGAbiolinks",
  "SummarizedExperiment"
))

Load the packages:
r
library(TCGAbiolinks)
library(SummarizedExperiment)

3. Create the Download Directory
Create a dedicated directory for the TCGA-LUAD data:
bash
mkdir -p tcga_luad

All subsequent files are saved under:
text
tcga_luad/

4. Query the RNA-seq Data
r
library(TCGAbiolinks)
library(SummarizedExperiment)
project <- "TCGA-LUAD"
query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

Inspect the query results:
r
res <- getResults(query)
dim(res)
head(res)

5. Count Samples
Characters 14-15 of a TCGA barcode encode the sample type:
- 01: Primary Tumor
- 11: Solid Tissue Normal (adjacent normal tissue)
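
To make the barcode layout concrete, here is a minimal sketch that decodes the patient ID and the sample type from a few barcodes. The barcodes are made up for illustration and do not refer to real TCGA samples; the `code_labels` lookup is a hypothetical helper that covers only the two codes discussed here.

```r
# Hypothetical barcodes, made up for illustration only.
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-BB-0002-01A-11R-A000-07"
)

# Characters 1-12 identify the patient; characters 14-15 hold the sample type code.
patient <- substr(barcodes, 1, 12)
sample_type_code <- substr(barcodes, 14, 15)

# Map the two codes used in this document to readable labels.
# Any other code is left as NA by this lookup.
code_labels <- c("01" = "Primary Tumor", "11" = "Solid Tissue Normal")
sample_type <- unname(code_labels[sample_type_code])

decoded <- data.frame(barcode = barcodes, patient, sample_type_code, sample_type)
print(decoded)
```

The first two barcodes share the same 12-character prefix, which is how a tumor file and a normal file are recognized as coming from the same patient.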
The following code tallies the sample types in the query results:
r
res <- getResults(query)
barcodes <- res$cases
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)
cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))
cat("Patients with both tumor and normal:", length(paired_patients), "\n")
cat("Paired tumor files:", sum(patient %in% paired_patients & sample_type_code == "01"), "\n")
cat("Paired normal files:", sum(patient %in% paired_patients & sample_type_code == "11"), "\n")

Results of this query:
text
Total files: 601
Primary Tumor 01: 540
Solid Tissue Normal 11: 59
Patients with both tumor and normal: 58
Paired tumor files: 70
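Paired normal files: 58

The 01 and 11 counts sum to 599, so the remaining 2 files carry other sample type codes. The 58 paired patients account for 70 tumor files, so some patients have more than one tumor file.

6. Download the RNA-seq Data
The TCGA-LUAD STAR Counts data is about 2.5 GB. Download it in small chunks so that a network interruption is less likely to leave an incomplete tar archive.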
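
Because interrupted downloads are the main failure mode here, a generic retry wrapper can be handy. The sketch below is a hypothetical helper written in base R, not part of TCGAbiolinks; `with_retry`, `flaky_download`, and their parameters are names invented for this example, and the flaky function only simulates a failing download.

```r
# Hypothetical helper (not part of TCGAbiolinks): call `fn` until it succeeds
# or `max_attempts` is reached, pausing `wait_seconds` between attempts.
with_retry <- function(fn, max_attempts = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(result$value)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result$error))
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  stop("All ", max_attempts, " attempts failed")
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "download complete"
}

status <- with_retry(flaky_download, max_attempts = 5)
print(status)
```

In practice one might pass `function() GDCdownload(query, method = "api", files.per.chunk = 10)` as `fn`. This only helps if an incomplete chunk surfaces as an R error, which is an assumption worth checking against the installed TCGAbiolinks version.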
r
setwd("tcga_luad")
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

If the download fails, typical errors look like:
text
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive

This usually means that the tar archive for one chunk was downloaded incompletely. Delete the temporary .tar.gz files, then download again:
bash
rm -f tcga_luad/*.tar.gz

Then rerun:
r
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

7. Prepare the RNA-seq Data
After the download finishes, use GDCprepare() to assemble the files into a SummarizedExperiment object:
r
se <- GDCprepare(query)

List the assay names:
r
assayNames(se)

Common assays include:
text
unstranded
stranded_first
stranded_second
tpm_unstrand
fpkm_unstrand
fpkm_uq_unstrand

Export counts, TPM, and FPKM:
r
counts <- assay(se, "unstranded")
write.csv(counts, "TCGA_LUAD_STAR_counts_unstranded.csv")
if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}
if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}

Export the sample information:
r
sample_info <- as.data.frame(colData(se))
# Note: write.csv() fails on list columns; if colData(se) has any, flatten them first.
write.csv(sample_info, "TCGA_LUAD_sample_info.csv")

Save the R object:
r
saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

8. Download the Clinical Data
r
clinical <- GDCquery_clinic(
  project = "TCGA-LUAD",
  type = "clinical"
)
write.csv(
  clinical,
  "TCGA_LUAD_clinical.csv",
  row.names = FALSE
)

9. Complete Download Script
Save the following as download_tcga_luad.R:
r
library(TCGAbiolinks)
library(SummarizedExperiment)

# Create the output directory and work inside it.
outdir <- "tcga_luad"
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
setwd(outdir)

project <- "TCGA-LUAD"

cat("Query RNA-seq STAR Counts...\n")
query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

cat("Count sample types...\n")
res <- getResults(query)
barcodes <- res$cases
# Characters 14-15 are the sample type code; characters 1-12 are the patient ID.
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))
cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")
cat("Patients with both tumor and normal:", length(paired_patients), "\n")

cat("Download RNA-seq data...\n")
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

cat("Prepare SummarizedExperiment...\n")
se <- GDCprepare(query)

cat("Export expression matrices...\n")
write.csv(assay(se, "unstranded"), "TCGA_LUAD_STAR_counts_unstranded.csv")
if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}
if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}

# Sample annotations and the full object for later reuse.
write.csv(as.data.frame(colData(se)), "TCGA_LUAD_sample_info.csv")
saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

cat("Download clinical data...\n")
clinical <- GDCquery_clinic(project = project, type = "clinical")
write.csv(clinical, "TCGA_LUAD_clinical.csv", row.names = FALSE)
cat("Done.\n")

Run it:
bash
Rscript download_tcga_luad.R

10. Output Files
After the data is downloaded and prepared, the directory should contain:
text
tcga_luad/
├── TCGA_LUAD_STAR_counts_unstranded.csv
├── TCGA_LUAD_STAR_tpm_unstranded.csv
├── TCGA_LUAD_STAR_fpkm_unstranded.csv
├── TCGA_LUAD_sample_info.csv
├── TCGA_LUAD_clinical.csv
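└── TCGA_LUAD_STAR_counts_se.rds

Where:
- TCGA_LUAD_STAR_counts_unstranded.csv: raw counts, suitable for differential expression analysis
- TCGA_LUAD_STAR_tpm_unstranded.csv: TPM expression matrix, suitable for visualization and some correlation analyses
- TCGA_LUAD_STAR_fpkm_unstranded.csv: FPKM expression matrix
- TCGA_LUAD_sample_info.csv: sample annotations
- TCGA_LUAD_clinical.csv: clinical information
- TCGA_LUAD_STAR_counts_se.rds: the complete SummarizedExperiment object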
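
When reading an exported matrix back into R, the default read.csv() settings rewrite column names, which turns the hyphens in TCGA barcodes into dots. The sketch below shows a round trip on a tiny synthetic matrix; the gene IDs, barcodes, and values are made up for illustration and merely stand in for the real exported file.

```r
# Tiny synthetic count matrix standing in for the exported file;
# gene IDs and barcodes are made up for illustration.
counts <- matrix(
  c(10L, 0L, 25L, 3L),
  nrow = 2,
  dimnames = list(
    c("GENE_A", "GENE_B"),
    c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A")
  )
)

csv_path <- tempfile(fileext = ".csv")
write.csv(counts, csv_path)

# row.names = 1 restores the gene IDs as row names; check.names = FALSE keeps
# the hyphens in the barcodes instead of converting them to dots.
counts_df <- read.csv(csv_path, row.names = 1, check.names = FALSE)
counts_back <- as.matrix(counts_df)

print(colnames(counts_back))
```

The same two arguments should apply to the TPM and FPKM files, since they are written the same way. The .rds file avoids the issue entirely, because readRDS() restores the SummarizedExperiment object as it was saved.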