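TCGA-LUAD RNA-seq and Clinical Data Download

This document describes how to download RNA-seq STAR Counts data and clinical data for the TCGA-LUAD cohort using R and TCGAbiolinks.

1. Data Description

TCGA-LUAD is the lung adenocarcinoma dataset of the TCGA project.

This workflow downloads the following data:

RNA-seq expression data
- Project: TCGA-LUAD
- Data Category: Transcriptome Profiling
- Data Type: Gene Expression Quantification
- Workflow Type: STAR - Counts

Clinical data
- The clinical table, downloaded with GDCquery_clinic()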

According to the GDC query results, TCGA-LUAD STAR Counts contains:

Total files: 601
Primary Tumor (sample type 01): 540
Solid Tissue Normal (sample type 11): 59
Patients with both tumor and normal samples: 58
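
The two sample-type counts do not add up to the total, so a few files must carry sample type codes other than 01 and 11. A minimal arithmetic check, using only the numbers reported above:

```r
# Counts copied from the GDC query summary above.
total_files   <- 601
primary_tumor <- 540   # sample type 01
solid_normal  <- 59    # sample type 11

# Files whose sample type code is neither 01 nor 11.
other_types <- total_files - primary_tumor - solid_normal
print(other_types)  # 2
```

These remaining files are worth keeping in mind when filtering samples later, since a filter on 01 and 11 alone will drop them.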

2. Installing R Packages

If TCGAbiolinks is not yet installed, run:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c(
  "TCGAbiolinks",
  "SummarizedExperiment"
))

Load the packages:

library(TCGAbiolinks)
library(SummarizedExperiment)
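
Before loading, it can help to confirm that both packages are actually installed. The sketch below uses only base R; the variable names are illustrative.

```r
required <- c("TCGAbiolinks", "SummarizedExperiment")

# requireNamespace() returns FALSE instead of raising an error
# when a package is missing.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```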

3. Creating the Download Directory

It is best to create a dedicated directory for the TCGA-LUAD data (shell command):

mkdir -p tcga_luad

All subsequent files are saved in:

tcga_luad/
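
The same directory can be created from within R, which keeps the whole workflow in one script. A minimal sketch; `ensure_dir` is a hypothetical helper name, not a TCGAbiolinks function:

```r
ensure_dir <- function(path) {
  # recursive = TRUE mirrors `mkdir -p`; showWarnings = FALSE keeps reruns quiet.
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  dir.exists(path)
}

# tempdir() is used here only so the example is harmless to run;
# in the real workflow the path would simply be "tcga_luad".
outdir <- file.path(tempdir(), "tcga_luad")
ensure_dir(outdir)
```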

4. Querying RNA-seq Data

library(TCGAbiolinks)
library(SummarizedExperiment)

project <- "TCGA-LUAD"

query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

Inspect the query results:

res <- getResults(query)
dim(res)
head(res)
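
A quick tabulation of the result table gives an early overview before any barcode parsing. The sketch below runs on a small toy data frame; the `cases` and `sample_type` column names are assumed to match what getResults() returns, and the barcodes are made up.

```r
# Toy stand-in for getResults(query); replace with the real `res`.
res_toy <- data.frame(
  cases = c(
    "TCGA-AA-0001-01A-11R-0001-07",
    "TCGA-AA-0001-11A-11R-0001-07",
    "TCGA-AA-0002-01A-11R-0001-07"
  ),
  sample_type = c("Primary Tumor", "Solid Tissue Normal", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Number of files per sample type.
type_counts <- table(res_toy$sample_type)
print(type_counts)

# Any barcode that appears more than once would indicate duplicated files.
dup_cases <- res_toy$cases[duplicated(res_toy$cases)]
```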

5. Counting Samples

Characters 14-15 of a TCGA barcode encode the sample type:

01: Primary Tumor
11: Solid Tissue Normal (adjacent normal tissue)
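
To make the positions concrete, the sketch below splits one barcode into its parts. The barcode is a made-up example in the standard TCGA layout, and `sample_type_labels` is an illustrative lookup covering only the two codes listed above.

```r
barcode <- "TCGA-AA-0001-11A-11R-0001-07"  # hypothetical barcode

patient_id       <- substr(barcode, 1, 12)   # "TCGA-AA-0001"
sample_type_code <- substr(barcode, 14, 15)  # "11"

sample_type_labels <- c("01" = "Primary Tumor", "11" = "Solid Tissue Normal")
sample_type <- unname(sample_type_labels[sample_type_code])

cat(patient_id, sample_type_code, sample_type, "\n")
```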

Counting code:

res <- getResults(query)

barcodes <- res$cases
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)

cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")

has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))

cat("Patients with both tumor and normal:", length(paired_patients), "\n")
cat("Paired tumor files:", sum(patient %in% paired_patients & sample_type_code == "01"), "\n")
cat("Paired normal files:", sum(patient %in% paired_patients & sample_type_code == "11"), "\n")

Results of this query:

Total files: 601
Primary Tumor 01: 540
Solid Tissue Normal 11: 59
Patients with both tumor and normal: 58
Paired tumor files: 70
Paired normal files: 58
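
The output shows 70 tumor files for 58 paired patients, so some patients contribute more than one tumor file. A strictly paired design needs one file per patient and sample type. A minimal sketch on made-up barcodes; keeping the alphabetically first barcode is an arbitrary rule chosen for illustration, not a recommendation from TCGAbiolinks:

```r
barcodes <- c(
  "TCGA-AA-0001-01A-12R-0001-07",  # second tumor file, same patient
  "TCGA-AA-0001-01A-11R-0001-07",
  "TCGA-AA-0001-11A-11R-0001-07",
  "TCGA-AA-0002-01A-11R-0001-07",  # tumor only
  "TCGA-AA-0003-11A-11R-0001-07"   # normal only
)

patient <- substr(barcodes, 1, 12)
code    <- substr(barcodes, 14, 15)

# Patients with at least one tumor (01) and one normal (11) file.
paired_patients <- intersect(patient[code == "01"], patient[code == "11"])

# Restrict to paired patients, then keep the first barcode per patient and type.
candidates <- sort(barcodes[patient %in% paired_patients & code %in% c("01", "11")])
key <- paste(substr(candidates, 1, 12), substr(candidates, 14, 15))
paired_barcodes <- candidates[!duplicated(key)]

print(paired_barcodes)
```

Which file to keep for a patient with several tumor files is a study-design decision; the rule should be chosen and documented before analysis.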

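6. Downloading RNA-seq Data

The TCGA-LUAD STAR Counts data total about 2.5 GB. Use a small chunk size when downloading, so that a network interruption is less likely to leave an incomplete tar archive.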

setwd("tcga_luad")

GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)
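
The chunk size determines how many separate chunks the download is split into. With the 601 files reported earlier and files.per.chunk = 10, the count works out as follows:

```r
total_files     <- 601
files_per_chunk <- 10

# The last chunk holds the remainder, so round up.
n_chunks <- ceiling(total_files / files_per_chunk)
last_chunk_size <- total_files - (n_chunks - 1) * files_per_chunk

cat("Chunks:", n_chunks, "- files in last chunk:", last_chunk_size, "\n")
```

Smaller chunks mean more requests, but a failed chunk costs less to repeat.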

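If the download fails, a typical error looks like:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive

This usually means that the tar archive for one chunk was downloaded incompletely. Delete the temporary .tar.gz files (shell command, run from the parent directory of tcga_luad) and download again:

rm -f tcga_luad/*.tar.gz

Then rerun: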

GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)
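
The delete-and-retry cycle can also be scripted. The sketch below is one possible approach using base R only: `remove_partial_archives` and `retry` are hypothetical helpers, and it assumes a failed download surfaces as an R error, which may not hold for every failure mode.

```r
remove_partial_archives <- function(dir = ".") {
  # Delete leftover .tar.gz files from interrupted chunk downloads.
  archives <- list.files(dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
  if (length(archives) > 0) file.remove(archives)
  invisible(archives)
}

retry <- function(f, max_attempts = 3, cleanup = function() NULL) {
  for (attempt in seq_len(max_attempts)) {
    # Capture an error as a value so the loop can continue.
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    cleanup()
  }
  stop("All ", max_attempts, " attempts failed")
}

# Intended use (not run here; requires network access and the `query` object):
# retry(
#   function() GDCdownload(query, method = "api", files.per.chunk = 10),
#   cleanup = remove_partial_archives
# )
```

Passing the cleanup step as a function keeps the retry logic independent of where the archives are stored.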

7. Preparing RNA-seq Data

After the download completes, use GDCprepare() to assemble the data into a SummarizedExperiment object:

se <- GDCprepare(query)

List the assay names:

assayNames(se)

Common assays include:

unstranded
stranded_first
stranded_second
tpm_unstrand
fpkm_unstrand
fpkm_uq_unstrand

Export counts, TPM, and FPKM:

counts <- assay(se, "unstranded")
write.csv(counts, "TCGA_LUAD_STAR_counts_unstranded.csv")

if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}

if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}
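
When these CSV files are read back later, read.csv() by default rewrites column names into syntactic R names, which turns the hyphens in TCGA barcodes into dots. A small round-trip sketch on a toy matrix (illustrative gene IDs and made-up barcodes) shows how check.names = FALSE avoids this:

```r
counts_toy <- matrix(
  c(10L, 0L, 25L, 3L),
  nrow = 2,
  dimnames = list(
    c("ENSG00000000003.15", "ENSG00000000005.6"),
    c("TCGA-AA-0001-01A-11R-0001-07", "TCGA-AA-0001-11A-11R-0001-07")
  )
)

path <- tempfile(fileext = ".csv")
write.csv(counts_toy, path)

# Default behaviour: hyphens in the barcodes become dots.
mangled <- colnames(read.csv(path, row.names = 1))

# check.names = FALSE keeps the barcodes exactly as written.
restored <- as.matrix(read.csv(path, row.names = 1, check.names = FALSE))

print(mangled)
print(colnames(restored))
```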

Export the sample information:

sample_info <- as.data.frame(colData(se))
write.csv(sample_info, "TCGA_LUAD_sample_info.csv")
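
Sample annotation tables sometimes contain list-type columns, which write.csv() cannot serialize. Whether colData(se) has such columns may depend on the TCGAbiolinks version, so the sketch below is a defensive option rather than a required step; `flatten_list_columns` is a hypothetical helper and the data frame is a toy example.

```r
flatten_list_columns <- function(df, sep = ";") {
  is_list <- vapply(df, is.list, logical(1))
  if (!any(is_list)) return(df)
  # Collapse each list element into a single delimited string.
  df[is_list] <- lapply(df[is_list], function(col) {
    vapply(col, function(x) paste(unlist(x), collapse = sep), character(1))
  })
  df
}

toy <- data.frame(barcode = c("A", "B"), stringsAsFactors = FALSE)
toy$sites <- list(c("lung", "bronchus"), "lung")  # a list column

flat <- flatten_list_columns(toy)
print(flat)

# Intended use with the real object (not run here):
# sample_info <- flatten_list_columns(as.data.frame(colData(se)))
```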

Save the R object:

saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

8. Downloading Clinical Data

clinical <- GDCquery_clinic(
  project = "TCGA-LUAD",
  type = "clinical"
)

write.csv(
  clinical,
  "TCGA_LUAD_clinical.csv",
  row.names = FALSE
)
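
Clinical rows are per patient, while expression columns are per sample, so the two tables are usually linked through the 12-character patient ID. A minimal sketch on toy tables; the `submitter_id` column name is assumed to be what GDCquery_clinic() returns, and all values are made up:

```r
sample_info_toy <- data.frame(
  barcode = c(
    "TCGA-AA-0001-01A-11R-0001-07",
    "TCGA-AA-0001-11A-11R-0001-07",
    "TCGA-AA-0002-01A-11R-0001-07"
  ),
  stringsAsFactors = FALSE
)
sample_info_toy$patient <- substr(sample_info_toy$barcode, 1, 12)

clinical_toy <- data.frame(
  submitter_id = c("TCGA-AA-0001", "TCGA-AA-0002"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)

# Left join: every sample keeps its row, clinical fields repeat per patient.
annotated <- merge(
  sample_info_toy, clinical_toy,
  by.x = "patient", by.y = "submitter_id",
  all.x = TRUE
)
print(annotated)
```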

9. Complete Download Script

The following can be saved as download_tcga_luad.R:

library(TCGAbiolinks)
library(SummarizedExperiment)

outdir <- "tcga_luad"
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
setwd(outdir)

project <- "TCGA-LUAD"

cat("Query RNA-seq STAR Counts...\n")
query <- GDCquery(
  project = project,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

cat("Count sample types...\n")
res <- getResults(query)
barcodes <- res$cases
sample_type_code <- substr(barcodes, 14, 15)
patient <- substr(barcodes, 1, 12)
has_tumor <- tapply(sample_type_code == "01", patient, any)
has_normal <- tapply(sample_type_code == "11", patient, any)
paired_patients <- names(which(has_tumor & has_normal))

cat("Total files:", length(barcodes), "\n")
cat("Primary Tumor 01:", sum(sample_type_code == "01"), "\n")
cat("Solid Tissue Normal 11:", sum(sample_type_code == "11"), "\n")
cat("Patients with both tumor and normal:", length(paired_patients), "\n")

cat("Download RNA-seq data...\n")
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 10
)

cat("Prepare SummarizedExperiment...\n")
se <- GDCprepare(query)

cat("Export expression matrices...\n")
write.csv(assay(se, "unstranded"), "TCGA_LUAD_STAR_counts_unstranded.csv")

if ("tpm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "tpm_unstrand"), "TCGA_LUAD_STAR_tpm_unstranded.csv")
}

if ("fpkm_unstrand" %in% assayNames(se)) {
  write.csv(assay(se, "fpkm_unstrand"), "TCGA_LUAD_STAR_fpkm_unstranded.csv")
}

write.csv(as.data.frame(colData(se)), "TCGA_LUAD_sample_info.csv")
saveRDS(se, "TCGA_LUAD_STAR_counts_se.rds")

cat("Download clinical data...\n")
clinical <- GDCquery_clinic(project = project, type = "clinical")
write.csv(clinical, "TCGA_LUAD_clinical.csv", row.names = FALSE)

cat("Done.\n")

Run it from the shell:

Rscript download_tcga_luad.R

10. Output Files

After downloading and preparation finish, the directory should contain:

tcga_luad/
├── TCGA_LUAD_STAR_counts_unstranded.csv
├── TCGA_LUAD_STAR_tpm_unstranded.csv
├── TCGA_LUAD_STAR_fpkm_unstranded.csv
├── TCGA_LUAD_sample_info.csv
├── TCGA_LUAD_clinical.csv
└── TCGA_LUAD_STAR_counts_se.rds

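Where:

TCGA_LUAD_STAR_counts_unstranded.csv: raw counts, suitable for differential expression analysis
TCGA_LUAD_STAR_tpm_unstranded.csv: TPM expression matrix, suitable for visualization and some correlation analyses
TCGA_LUAD_STAR_fpkm_unstranded.csv: FPKM expression matrix
TCGA_LUAD_sample_info.csv: sample annotation
TCGA_LUAD_clinical.csv: clinical information
TCGA_LUAD_STAR_counts_se.rds: the complete SummarizedExperiment object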
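
A short check can confirm that every expected file was written. A sketch using base R; `missing_outputs` is a hypothetical helper, and the file names are the ones listed above:

```r
expected_files <- c(
  "TCGA_LUAD_STAR_counts_unstranded.csv",
  "TCGA_LUAD_STAR_tpm_unstranded.csv",
  "TCGA_LUAD_STAR_fpkm_unstranded.csv",
  "TCGA_LUAD_sample_info.csv",
  "TCGA_LUAD_clinical.csv",
  "TCGA_LUAD_STAR_counts_se.rds"
)

missing_outputs <- function(dir, files = expected_files) {
  # Return the names of expected files that do not exist in `dir`.
  files[!file.exists(file.path(dir, files))]
}

not_found <- missing_outputs("tcga_luad")
if (length(not_found) > 0) {
  message("Missing: ", paste(not_found, collapse = ", "))
} else {
  message("All output files are present.")
}
```

The TPM and FPKM files are written only when the corresponding assays exist, so their absence does not necessarily indicate a failed run.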
