As Pooled sequencing gains popularity in the world of genomics, more and more programs are being introduced to specifically handle Pool-seq data. However, for students such as myself, it can be difficult to find the right program for each step of the bioinformatics process, and to know the correct data format and parameters used with each individual program.
Here is a short recap of the Pool-seq bioinformatic programs that I have gathered (in no particular order), including input and output files, and any additional notes I had on the program.
~
input: a samtools pileup file (Note: must be made by samtools version 0.1.18 and below, and cannot take mpileup files)
output: a pool-specific sync file, from which population specific diversity indices (such at Tajima’s pi & D, Watterson’s theta) can be calculated with the Variance-sliding.pl
input: a samtools mpileup file (Note: must be made by samtools version 0.1.18 and below)
output: a sync file including multiple pools, from which diversity metrics (FSTs) can be calculated. The sync file can also be converted into a genepop file for further analysis.
Notes of Popoolation/Polpoolation2:
The sync2GenePop conversion step requires custom scripts to be written so that it can be run on all loci at once, instead of each locus individually. Feel free to contact me if you wish to see perl scripts used to create a GenePop file for an entire pool.
input: a sync file from Popoolation2, VCF file, Baypass or Selestim input files
output: global and per-SNP FST values, a pairwise FST matrix, Baypass or Selestim input files
Notes: Remember that the poolsize list is haploid!
BayPass -
input: 1. pooled read count data, 2. poolsize file, 3. population name file (all of which can be output by poolfstat R package)
output: This program is used to find either FST-based or environmental-based outlier loci.
If you want FST-based outliers, you would run the core model. Then run the R code under section 5.1.1 of the manual. Then you will generate a POD genofile with the ‘simulate.baypass’ R script, which you will use as the input in another core model. Then you will compare the outputs of the 2 core model runs with the XTX calibration R scripts.
If you want environmental-based outliers, then you would run the auxiliary model. This will give you an ‘summary_betai’ output file. Within this file, if the BF(dB) column has a value > 20, then you that locus can be identified as an outlier, corresponding to the given environmental predictor in the first column.
If you need any help with BayPass, I am happy to share my scripts upon request.
input: BAM files & a reference fasta file
output: a pool VCF file
CRISP -
input: BAM files, pool sample sizes, reference fasta file (NB: requires at least 2 pools/2 BAM files)
output: a pool VCF file
SNVer -
input: BAM files & a text file with the BAM file names, number of haploids, number of samples, and filtering thresholds
output: a pool VCF file
Notes on FreeBayes, CRISP & SNVer:
Even though these programs make VCF files, I personally could not find a way to convert these VCF files to other formats such as genepop/genind/genlight with conversion tools such as PGDspider, vcf2genepop, or genomic_converter. However, it seems like it can be input into the R package poolfstat to calculate F-statistics, and be converted into files for BayPass and SelEstim.
input: a samtools pileup file (not mpileup)
output: allele frequency output files
npstat -
input: a samtools mpileup file + individual BAM files (per pool). Can also take GFF3 annotations to perform McDonald-Krietman tests.
output: population specific diversity indices (such at Tajima’s pi & D, Watterson’s theta), and population differentiation (FSTs)
SNPgenie -
input: a VCF file (for a single pool), a reference fasta file, and GTF annotation file
output: πN/πS, dN/dS, and other evolutionary parameters (has a great number output files)
Note: This program is only for model species with annotated genomes.
PCAdapt -
input: a frequency matrix with n rows and L columns (where n is the number of populations and L is the number of genetic markers)
output: several plots and list of potential outlier loci
Note: For my data, this program suggested quite a large number of putative outliers, especially compared to Bayescan, Loistan, and Popoolation.
SelEstim -
input: an allele/read count file
ouput: a ‘delta’ output file which is input in R to identify outlier loci
Note: I personally tried SelEstim with ~ 8,000 loci, and the program ran for about 4 weeks without finishing, so could not use due to time restraints.
Pool-hmm -
input: a samtools mpileup (Note: the mpileup command should not have -u or -g option)
output: allele frequency spectrum, plus probability of “Neutral”, “Intermediate” and “Selection” for each polymorphic site, with additional stats files
~
Additionally, here is a short list of some pipelines available for Pool-seq & non-model organisms:
Also, for complete bioinformatics beginners, Galaxy is a great way to understand various bioinformatic programs and their command lines.
~
Happy analysing!