wgsim, a tool for simulating NGS reads from reference sequences.

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [0, use the current time]
         -A FLOAT      discard if the fraction of ambiguous bases higher than FLOAT [0.05]
         -h            haplotype mode

以绵羊(Ovis aries)的24号染色体序列为例，模拟20×的 Illumina PE reads， read length=100，假设没有base error。 number_of_reads（N）= 20× length_of_ref/(read_length × 2) =4.2e6

wgsim -e 0 -1 100 -2 100 -S 44 -N 4200000 oar_chr24.fa test_F.fq test_R.fq

使用dsk计算K-mers的coverage distribution，取K-mer size=31，

# k-mers distribution of reference chromosome, oar_chr24
dsk -file oar_chr24.fa -kmer-size 31 -histo-max 200 -out oar_chr24
# output file: oar_chr24.h5

h5dump -y -d histogram/histogram oar_chr24.h5 | grep "^\ *[0-9]" |tr -d ", " | paste - - >oar_chr24_kmerSize31_coverage.txt

# k-mers distribution of simulated reads
dsk -file test_F.fq,test_R.fq -kmer-size 42 -histo-max 200 -out test
# output file: test.h5

h5dump -y -d histogram/histogram test.h5 | grep "^\ *[0-9]" |tr -d ", " | paste - - >test_kmerSize31_coverage.txt

接下来分别看kmer size=31时，chr24参考序列和模拟的reads的31-mers coverage distribution

1) chr24参考序列的31-mers分布情况

coverage	counts
1	38458117
2	441456
3	85126
4	37148
5	21574
6	13682
7	9055
8	6752
9	5197
10	3896

也就是说38458117个31-mers仅出现一次，441456个31-mers出现两次，85126个31-mers出现三次，多次出现的kmer可能是由于基因组上的重复序列。

2) simulation NGS reads （20×）的31-mers分布情况

kmers coverage distribution如下图所示： kmers_simulation

模拟的NGS数据的31-kmer的分布，peak出现在coverage=14的位置。因为模拟的reads长度为100nt，则每一条read最多可能有100-31+1=70个不同的31-kmers，模拟的是20×的NGS数据量，绝大多数的31-kmers在参考序列中是unique的，所以 simulation NGS reads的31-mers distribution的peak出现就在X轴（coverage）14（70×20/100=14）的位置。

如果模拟的NGS reads不是 error free，31-mers分布可能会出现偏移。