Chapter 7 Genomics
7.1 Single Cell Analysis
7.1.1 Minimising the size of a Seurat Object
Single cell analyses are often collated in Seurat objects, which can be huge. We reduced our Seurat object to less than 20% of its original size by pulling out only the relevant sections with the code below. This works because only the sparse assay matrix, plus the metadata and embeddings that are actually needed, are carried over into a fresh object. It is not a workable example as written, but the public PBMC dataset could be used instead if necessary.
library(Seurat)
library(dplyr)

# Load in the data
pathtoglobaldata <- "mac_shiny/data/Global_sc_object"
object_global <- readRDS(pathtoglobaldata)

# Get the sparse assay data and build a fresh, minimal object from it
sizetest_global <- GetAssayData(object = object_global, slot = "data") # counts don't work; scale.data breaks my PC with memory errors; probably doesn't matter much here
object_global_small <- CreateSeuratObject(sizetest_global)

# Add back the classifications and embeddings that are actually needed
object_global_small@reductions$tsne <- object_global@reductions$tsne # reduction embeddings for the tSNE graph
object_global_small$Phenotype <- object_global$Phenotype
object_global_small$final_classification <- object_global$final_classification # cell classifications
Idents(object = object_global_small) <- "final_classification" # set as the default identity

saveRDS(object_global_small, "mac_shiny/data/global_object_sml")
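To confirm the reduction described above (the slimmed object should come in well under 20% of the original), the two objects can be compared with object.size() from base R. This is a minimal sketch that assumes both objects are still in memory from the code above.
# Compare the in-memory footprint of the full and slimmed objects
full_size <- object.size(object_global)
small_size <- object.size(object_global_small)
format(full_size, units = "auto")   # human-readable size of the original object
format(small_size, units = "auto")  # human-readable size of the slimmed object
round(100 * as.numeric(small_size) / as.numeric(full_size), 1) # percentage of the original retained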
7.2 Nextflow
Nextflow enables scalable and reproducible scientific workflows using software containers. These can be run with Singularity (HPC and secure compute), Docker (insecure compute; requires root) or Conda environments. Singularity is the preferred option.
The core workflows used below are nf-core pipelines, which can be found at https://nf-co.re.
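The examples below all select the container back-end with Nextflow's -profile option. As an illustrative sketch only (the required --input and --genome options are omitted here and appear in the full examples below):
# Choose the container back-end with -profile
nextflow run nf-core/rnaseq -profile singularity --outdir results/   # preferred: HPC / secure compute
# nextflow run nf-core/rnaseq -profile docker --outdir results/      # needs root; insecure compute only
# nextflow run nf-core/rnaseq -profile conda --outdir results/       # fallback when no container runtime is available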
For a basic alignment job for RNASeq or DNASeq, you will need a machine with around 200 GB of memory and 24 cores. Fast scratch space is advantageous (NVMe disk is ideal): fast disk is needed both for the working directory (cached and staged files) and for the results storage, and at least three times the size of the input data is ideal.
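As a rough pre-flight check of the "three times the input size" rule, something along these lines can be run before submitting a job; the input directory path is a placeholder.
#!/bin/bash
# Rough check: is at least 3x the input size free on the scratch filesystem?
INPUT_DIR="/nfs/sharedscratch2/users/tdrake/fastq"   # placeholder: directory holding the input FASTQs
SCRATCH="/nfs/sharedscratch2/users/tdrake"           # scratch filesystem used for work and results
input_kb=$(du -sk "$INPUT_DIR" | cut -f1)
free_kb=$(df -Pk "$SCRATCH" | awk 'NR==2 {print $4}')
echo "Input: ${input_kb} kB; free on scratch: ${free_kb} kB"
if [ "$free_kb" -lt $((3 * input_kb)) ]; then
  echo "Warning: less than 3x the input size is free on $SCRATCH"
fi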
7.2.1 RNASeq example with sbatch (slurm)
#!/bin/bash --login
#
#SBATCH --job-name=single-node-multicore ### Replace with the name of your job
#SBATCH --mail-user=thomas.drake@glasgow.ac.uk ### Your email address
#SBATCH --mail-type=END,FAIL # Conditions when an email is sent
#SBATCH --output=%x.%j.out ### output file: <job-name.job-id.out>
#SBATCH --nodes=1 ### Keep the job on one computer/node.
#SBATCH --ntasks=1 ### single task
#SBATCH --cpus-per-task=64 ### each task uses up to 64 cpu cores (defaults to 1 if absent)
#SBATCH --time=10-00:00:00 ### WallTime (10 days)
#SBATCH --mem=480G
#SBATCH --partition=compute ### Partition Name
date
hostname
# Set OMP_NUM_THREADS to the same value as --cpus-per-task,
# with a fallback option in case it isn't set.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
module load nextflow-22.04.0
module load singularity/singularity-3.8.7
nextflow run nf-core/rnaseq -w "/nfs/sharedscratch2/users/tdrake/nf_work/" --input "/nfs/sharedscratch2/users/tdrake/sample_sheet_rna_hpc.csv" --genome GRCm38 --remove_ribo_rna -profile singularity --outdir "/nfs/sharedscratch2/users/tdrake/rna_seq_output/"
date
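For reference, the --input sample sheet for nf-core/rnaseq is a plain CSV. The exact columns depend on the pipeline version, but recent versions expect something like the following (the sample names and FASTQ paths here are invented):
sample,fastq_1,fastq_2,strandedness
sample_1,/nfs/sharedscratch2/users/tdrake/fastq/sample_1_R1.fastq.gz,/nfs/sharedscratch2/users/tdrake/fastq/sample_1_R2.fastq.gz,auto
sample_2,/nfs/sharedscratch2/users/tdrake/fastq/sample_2_R1.fastq.gz,/nfs/sharedscratch2/users/tdrake/fastq/sample_2_R2.fastq.gz,auto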
7.2.2 DNASeq example with sbatch (slurm)
#!/bin/bash --login
#
#SBATCH --job-name=single-node-multicore ### Replace with the name of your job
#SBATCH --mail-user=thomas.drake@glasgow.ac.uk ### Your email address
#SBATCH --mail-type=END,FAIL # Conditions when an email is sent
#SBATCH --output=%x.%j.out ### output file: <job-name.job-id.out>
#SBATCH --nodes=1 ### Keep the job on one computer/node.
#SBATCH --ntasks=1 ### single task
#SBATCH --cpus-per-task=64 ### each task uses up to 64 cpu cores (defaults to 1 if absent)
#SBATCH --time=10-00:00:00 ### WallTime (10 days)
#SBATCH --mem=480G
#SBATCH --partition=compute ### Partition Name
date
hostname
# Set OMP_NUM_THREADS to the same value as --cpus-per-task,
# with a fallback option in case it isn't set.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
module load nextflow-22.04.0
module load singularity/singularity-3.8.7
nextflow run nf-core/sarek -w "/nfs/sharedscratch2/users/tdrake/nf_work/" --input "/mnt/data/R09/seq_store/izy_22_2_r_wes_nextflow_sarek/izy_wes_samplesheet.csv" -profile docker --genome GRCm38 --outdir "/mnt/data/R09/seq_store/izy_22_2_r_wes_nextflow_sarek/" --skip_tools baserecalibrator --tools vep,snpeff,sarek,cnvkit
date
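If a run stops partway through (wall time, node failure, a bad sample), the same command can usually be restarted with Nextflow's -resume flag, which reuses cached results from the -w work directory rather than recomputing completed steps:
nextflow run nf-core/sarek -resume -w "/nfs/sharedscratch2/users/tdrake/nf_work/" --input "/mnt/data/R09/seq_store/izy_22_2_r_wes_nextflow_sarek/izy_wes_samplesheet.csv" -profile docker --genome GRCm38 --outdir "/mnt/data/R09/seq_store/izy_22_2_r_wes_nextflow_sarek/" --skip_tools baserecalibrator --tools vep,snpeff,sarek,cnvkit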
7.2.3 ATACSeq example with sbatch (slurm)
#!/bin/bash --login
#
#SBATCH --job-name=single-node-multicore ### Replace with the name of your job
#SBATCH --mail-user=thomas.drake@glasgow.ac.uk ### Your email address
#SBATCH --mail-type=END,FAIL # Conditions when an email is sent
#SBATCH --output=%x.%j.out ### output file: <job-name.job-id.out>
#SBATCH --nodes=1 ### Keep the job on one computer/node.
#SBATCH --ntasks=1 ### single task
#SBATCH --cpus-per-task=64 ### each task uses up to 64 cpu cores (defaults to 1 if absent)
#SBATCH --time=10-00:00:00 ### WallTime (10 days)
#SBATCH --mem=480G
#SBATCH --partition=compute ### Partition Name
date
hostname
# Set OMP_NUM_THREADS to the same value as --cpus-per-task,
# with a fallback option in case it isn't set.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
module load nextflow-22.04.0
module load singularity/singularity-3.8.7
nextflow run nf-core/atacseq -w "/nfs/sharedscratch2/users/tdrake/nf_work/" --input "/nfs/sharedscratch2/users/tdrake/atac_seq_ctnnb_p53_myc/sample_atac.csv" --genome GRCm38 -profile singularity --outdir "/nfs/sharedscratch2/users/tdrake/atac_seq_output_broad/"
date