Next-Generation Sequencing: Reduced-Representation Libraries
Next-generation sequencing (NGS) technologies advanced by companies such as Illumina are revolutionizing the life sciences, especially in the realms of population and comparative genomics. Although it is now easier and cheaper to obtain whole genomes than ever before, there are major cost and bioinformatics obstacles to scaling up to the population level or comparing whole genomes of highly divergent species, at least for researchers studying non-model organisms. However, these barriers are poised to disappear soon. For example, Illumina's HiSeq X Ten is leading the way to true population-level genomics by generating up to 18,000 whole human genomes per year, likely ushering in a new era of personalized medicine.

In the meantime, the typical comparative or population biologist, especially one studying non-model organisms, will benefit from reduced-representation library (RRL) approaches that harness the high throughput of NGS platforms to obtain a reasonable amount of data at a reasonable cost - at least for a few more years. Here, I briefly introduce the two RRL methods that I am most familiar with: restriction site-associated DNA sequencing (RADseq) and targeted enrichment. Each method has pros and cons, but the two are complementary when used together. For a case study comparing the utility of these methods for phylogenomics, please see our 2015 paper in Genome Biology and Evolution.

Restriction Site-Associated DNA Sequencing (RADseq)

This approach works by digesting the genome with restriction enzymes, ligating uniquely barcoded adapters onto the overhanging ends, size-selecting a narrow range of fragments, and using PCR to amplify these fragments with uniquely indexed Illumina primers. I have successfully used this method to genotype thousands of single nucleotide polymorphisms (SNPs) for up to 96 individuals at >10x coverage on a single lane of an Illumina HiSeq 2500 (100 bp reads). There are a number of similar methods, including genotyping-by-sequencing (GBS) and double-digest RADseq (ddRADseq). The most important considerations are your choice of enzymes and your size selection scheme, as these dictate the number of loci, the depth of coverage, and the number of individuals that can be multiplexed. I used the ddRADseq approach of Peterson et al. (2012) for most of my dissertation and postdoctoral research. The main advantage of RADseq is the low cost per sample - this is probably the cheapest way to get population genomic data. However, because restriction sites are constantly evolving, comparing deeply divergent lineages is more difficult: a mutation in a restriction site removes that locus from the library (locus drop-out). Also, the loci produced are anonymous unless BLASTed to a reference genome, and one cannot target specific regions of the genome.
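
To make the trade-off between enzymes, size selection, and multiplexing concrete, here is a minimal Python sketch of an in silico double digest followed by a back-of-the-envelope coverage estimate. It is an illustration only, not part of any published protocol: the function names are hypothetical, the enzymes (EcoRI, GAATTC; MspI, CCGG) are just example choices, the cut position is simplified to the start of each recognition site, and the read count used in the example is an assumed figure rather than a platform specification.

```python
import re


def double_digest_fragments(genome, site_a, site_b):
    """Return fragments flanked by one site_a cut and one site_b cut.

    Simplification: each enzyme is assumed to cut at the start of its
    recognition site, and only adjacent cut sites define a fragment.
    """
    cuts = [(m.start(), "A") for m in re.finditer(re.escape(site_a), genome)]
    cuts += [(m.start(), "B") for m in re.finditer(re.escape(site_b), genome)]
    cuts.sort()
    fragments = []
    for (pos1, enz1), (pos2, enz2) in zip(cuts, cuts[1:]):
        # ddRAD adapters only ligate to fragments with two different ends.
        if enz1 != enz2:
            fragments.append(genome[pos1:pos2])
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the selection window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def mean_depth(total_reads, n_individuals, n_loci):
    """Expected mean reads per locus per individual, assuming even sampling."""
    return total_reads / (n_individuals * n_loci)


if __name__ == "__main__":
    toy_genome = "AAAGAATTCTTTTTTCCGGAAAAGAATTCAAACCGGTT"
    frags = double_digest_fragments(toy_genome, "GAATTC", "CCGG")
    kept = size_select(frags, 9, 12)
    print(len(frags), "fragments;", len(kept), "inside the size window")
    # Hypothetical lane: 150 million reads, 96 individuals, 50,000 loci.
    print(mean_depth(150_000_000, 96, 50_000))
```

Under these assumptions, narrowing the size window or choosing rarer cutters lowers the number of loci, which raises depth per locus or frees up room to multiplex more individuals. Real libraries sample loci and individuals unevenly, so treat the depth figure as an upper-end planning estimate.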

There are two commonly used bioinformatics pipelines for quality control and de novo assembly/alignment of RADseq data: STACKS and pyRAD, the latter being my personal choice. The advantage of pyRAD is its use of an alignment-clustering method (vsearch) that accommodates indel variation, improving identification of homology across divergent lineages, while still working well at shallower (population) scales. pyRAD is an open-source package written in Python, so it is freely available and customizable. It can take advantage of parallel processing, which makes it ideal for a high-performance computing (HPC) Linux cluster, allowing large jobs to finish in a matter of hours or days. The parameters file is easy to use, and pyRAD can output alignments in a variety of popular formats (e.g. fasta, nexus, vcf, structure) for downstream analysis. For more information on how to run pyRAD, please see my tutorial in the Smithsonian's Peer-led Bioinformatics Series on Github.com (in the spring 2016 folder). For an example of the types of downstream analyses that RADseq data can be used for, please see our 2016 paper in Molecular Phylogenetics and Evolution.
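
To show why indel-aware clustering matters, here is a toy Python sketch of greedy, seed-based clustering in which similarity is computed from edit distance, so an insertion or deletion costs a single difference instead of shifting every downstream base out of register. This is not the pyRAD or vsearch implementation; the function names, the identity formula, and the 0.85 threshold are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # deletion from a
                curr[j - 1] + 1,                   # insertion into a
                prev[j - 1] + (char_a != char_b),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def identity(a, b):
    """Fraction of the longer sequence not accounted for by edits."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.85):
    """Assign each sequence to the first cluster whose seed is similar enough.

    The first member of each cluster acts as its seed; a sequence that
    matches no existing seed starts a new cluster.
    """
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    reads = [
        "ACGTACGTACGTACGTACGT",
        "ACGTACGTACTACGTACGT",   # same locus with a 1 bp deletion
        "TTTTGGGGCCCCAAAATTTT",  # unrelated locus
    ]
    for cluster in greedy_cluster(reads):
        print(len(cluster), cluster[0])
```

In this toy data set the read carrying the deletion still joins the first cluster, whereas a strict position-by-position comparison would count nearly every base after the deletion as a mismatch and split the locus in two.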

Targeted Enrichment

Unlike RADseq, targeted enrichment approaches allow the researcher to choose the loci to be sequenced. The method works by first preparing standard NGS libraries (e.g. Illumina TruSeq or Nextera libraries), then using a custom RNA probe set (e.g. MYcroarray MyBaits) to capture loci of interest. The biotinylated RNA probes hybridize to complementary library DNA, and the probe-target hybrids then bind to streptavidin-coated magnetic beads, allowing the researcher to wash away unwanted DNA before amplifying the captured DNA (hence, targeted enrichment). Two popular types of targets are ultra-conserved elements (UCEs) and protein-coding genes (introns and exons). UCEs are highly conserved regions of organismal genomes shared among evolutionarily distant taxa, and thus make ideal targets for non-model organisms with no reference genomes. However, the function of UCEs is unknown, as is the true mutation model. Thus, other researchers argue for the enrichment of protein-coding genes, but this may require a more customized probe set. Although either type of target is great for comparing divergent lineages, the downsides of targeted enrichment (compared to RADseq) are a higher cost per sample, more complex lab work, and generally low levels of variation at the population level. For more resources on designing your own UCE/target capture experiment, see ultraconserved.org and BadDNA.org. I also recommend reading Faircloth et al. (2012) and Li et al. (2013).

For bioinformatics, I am currently using a customized version of the phyluce software installed on the Smithsonian's Hydra HPC cluster to assemble raw read data into contigs, separate UCE loci from assembled contigs, generate and trim alignments, call SNPs, etc. I am working with paired-end data (150 bp reads) collected on the new Illumina NextSeq. For more information on how to run this pipeline please see Smithsonian's Targeted Enrichment Workshop on Github.com.
Copyright 2016 Andrew Gottscho. Last updated Dec 5 2016.