Newer
Older
The goal of this Nextflow pipeline is to align markers to a reference genome and provide statistics on the alignment.
One can start from a CSV or a FASTA file of the marker sequences. In the case only the position of markers on a genome version is known, a VCF or BED file can be used, along with the genome on which these coordinates apply, to reposition markers on a new genome version.

Baptiste Imbert
committed
## Input of markers in CSV format
The CSV file (with no header) is expected to have the first column being the markers id then the second one must be the sequences. The sequence of the markers can include the value of the polymorphism between square brackets ("[]"). The IUPAC code can be used or the possible bases separated by a "/".
```
Test_markers_01 ATGCATGCATGCATCGATCGATCGATCGATCGATCG[M]TCGATCGATCGATCGATCG
Test_markers_02 ATGCATGCATGCATCGATCGATCGATCGATCGATCG[A/C]TCGATCGATCGATCGATCG
```
The order of the bases between the square brackets does not matter.
Baptiste Imbert
committed
## Input of markers in FASTA format
If you already have the file in FASTA format with known polymorphism, it can be indicated similarly as to the CSV format.
For example:
```
>test_markers_01
GCCGTTCGGGTTGAAGAACGCAGGGGCAACTTATCAACGCGCCATGGTAGCTTTATTTCA
ATGATGTGATATGATACATCATGAGATTGAAGTTTA[C/A]GTAGACGATATGATAGCTC
CGGTCACAAA
>test_markers_02
GCCGTTCGGGTTGAAGAACGCAGGGGCAACTTATCAACGCGCCATGGTAGCTTTATTTCA
ATGATGTGATATGATACATCATGAGATTGAAGTTTA[M]GTAGACGATATGATAGCTC
CGGTCACAAA
```
Baptiste Imbert
committed
## Input of markers in VCF format
The pipeline can reposition variants described in a VCF file from one genome version to another.
In this case, the input VCF file must be supplemented with the genome on which the coordinates apply.
If the VCF is lacking (some or all) IDs, they can be generated based on the chromosome and the position of the SNP. Anyhow, the SNP IDs currently have to follow the format: `CHROM\_POS\_REF\_ALT`. Since IDs are mandatory for the pipeline, it is advised to keep the parameter add_ids_to_vcf to true.
Baptiste Imbert
committed
### Example
#### getfasta
With 2 nucleotides on each side in the bedtools slop gives:
```
>Lcu.2RBY.Chr1_58402::Lcu.2RBY.Chr1:58399-58404
CAACA
```
The first position in the header is 1-based (from the VCF). The next positions are 0-based (from the BED).
In IGV, the first C would be at position 58,400 and the last A at 58,404.
## Input of markers in BED format
Baptiste Imbert
committed
## Input of the genome
The genome input must in FASTA format.
All programs and libraries are installed with containers when possible or using Conda when several python packages are required.
## With the example datasets
CSV input
```
nextflow run main.nf -c conf/example_dataset_csv.config --outdir example_dataset_csv
```
VCF input
```
nextflow run main.nf -c conf/example_dataset_vcf.config --outdir example_dataset_vcf
```
BED input
nextflow run main.nf -c conf/example_dataset_bed.config --outdir example_dataset_bed
You can adapt the example config files to suit your needs and run `alignmarkers` as shown previously.
`--markers_csv` : markers in CSV format with the first column being the marker ID and the second the sequence
Baptiste Imbert
committed
or
`--markers_vcf` : markers in VCF format
or
`--markers_bed` : markers in BED format
`--genome` : the reference genome in FASTA format, on which the markers should be aligned
If the option `--vcf` or `--bed` is selected, also provide:
`--snp_genome` : the genome on which the SNP markers were originally identified
Baptiste Imbert
committed
Non-mandatory options :
`--slop_extend` : length of the extension in each direction from the provided locus for the sequence extraction
`--filter_evalue` : filter by evalue (default = false)
`--filter_length` : filter by length of alignment (default = false)
`--filter_indel` : filter by indel i.e. no insertion / deletion in the alignment (default = false)
`--filter_polymorphism` : filter by polymorphism (position of the polymorphism not included in the alignment, value of the polymorphism not correct compared to that of the genome)(default = false)
`--stats` : option to show stats (histogram, boxplot)
Baptiste Imbert
committed
Output options :
`--outdir` : directory where you will have all the results of the different processes (default = 'results')
- annotate/
- *.vcf annotated VCF with IDs for the SNPs
- vcf2bed/
- *.bed BED file extracted from the input VCF
- faidx/
- *.fai chromosome sizes required by bedtools slop
- slop/
- *.bed BED file with extended features
- getFasta/
- *.fa a FASTA file of the markers
- csv_markers/
- *.csv a CSV file of the markers
- alignment/
- alignment.paf paf file of the aligment of markers on the genome using minimap2
- filtered/
- filtered_paf.csv PAF filtered acording to users parameters
- alignment_query_subject_sequences.txt sequences of the query and reference for each marker
- position_marker.tsv summary table of the polymorphism found on the query and reference along with alignment statistics
- unaligned_markers.log list of all markers not found on the genome by minimap2
Baptiste Imbert
committed
- fastaPolymorphismPosition/
- markers_n.fa fasta file with the polymorphism replaced by "N"
- marker_position_value.tsv tsv file with the polymorphism position and value per marker (uses IUPAC for biallelic SNPs)
- stats/
- original_paf/ (before filtering)
- boxplot_filtered_pident.svg boxplot of percent identity alignments of markers on the genome
- boxplot_filtered_plenght.svg boxplot of percent alignment length of markers on the genome
- histogram_multimapping_filter.svg histogram of the number of alignment by markers
- marker_multialignment.txt number of markers aligned once, twice, etc.
- filtered_paf/ (after filtering)
- boxplot_filtered_pident.svg boxplot of percent identity alignments of markers on the genome
- boxplot_filtered_plenght.svg boxplot of percent alignment length of markers on the genome
- histogram_multimapping_filter.svg histogram of the number of alignment by markers
- marker_multialignment.txt number of markers aligned once, twice, etc.
- if there is not match in your .paf file it will causes a bug (I'm working on it)
- if the polymorphism is at the end (or the beginning) of the sequence it will not be aligned with the rest
The nextflow will by defaut without option check the polymorphism and the output files will be given without the alignement who didn't pass the filter polymorphism
Baptiste Imbert
committed
On the TSV file you have the differents columns :
| col | type |description |
|----------:|:-------------:|:-----------|
| sseqid | str | Target sequence name |
| sstart | str | Query start (0-based; BED-like; closed) |
| send | str | Query end coordinate (0-based; BED-like; open) |
| qseqid | str | Query sequence name |
| strand | str | ‘+’ if query/target on the same strand; ‘-’ if opposite |
| pos_poly_on_alignment | int | Position Polymorphism on the alignment |
| pos_poly_on_ref | int | Position Polymorphism on the reference(target) |
| base_gen | str | nucleotides on the genome that matches the nucleotide at position (pos_poly_on_alignment + sstart) |
| value_poly | str | polymorphism value for a sequence from fasta file |
| pident | float |percentage of identity which comes from the following calculation (Number of matching bases in the mapping / total number of sequence matches, mismatches and gaps in the mapping) * 100|
| plength | float | percentage of marker alignment that comes from the following calculation (end position - start position) / length of aligned marker * 100 |
If you have any problem or some doubts about the value of the polymorphism or anything about the alignment or the result there is a data marker generated in the filterAndStats process where there every information used to generate the tsv file. For the position polymorphism if the value is -1 it means that the position of the polymo
