Skip to content
Snippets Groups Projects
README.md 8.07 KiB
Newer Older
# Alignment Pipeline
The goal of this Nextflow pipeline is to align markers to a reference genome and provide statistics on the alignment.
One can start from a CSV or a FASTA file of the marker sequences. In the case only the position of markers on a genome version is known, a VCF or BED file can be used, along with the genome on which these coordinates apply, to reposition markers on a new genome version.
Baptiste Imbert's avatar
Baptiste Imbert committed

![alignmarkers metro](pictures/alignmarkers_pipeline.png)

# Requirements
Baptiste Imbert's avatar
Baptiste Imbert committed

The CSV file (with no header) is expected to have the first column being the markers id then the second one must be the sequences. The sequence of the markers can include the value of the polymorphism between square brackets ("[]"). The IUPAC code can be used or the possible bases separated by a "/".
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed

```
Test_markers_01 ATGCATGCATGCATCGATCGATCGATCGATCGATCG[M]TCGATCGATCGATCGATCG
Test_markers_02 ATGCATGCATGCATCGATCGATCGATCGATCGATCG[A/C]TCGATCGATCGATCGATCG
```

The order of the bases between the square brackets does not matter.
If you already have the file in FASTA format with known polymorphism, it can be indicated similarly as to the CSV format.

For example:

```
>test_markers_01
GCCGTTCGGGTTGAAGAACGCAGGGGCAACTTATCAACGCGCCATGGTAGCTTTATTTCA
ATGATGTGATATGATACATCATGAGATTGAAGTTTA[C/A]GTAGACGATATGATAGCTC
CGGTCACAAA
>test_markers_02
GCCGTTCGGGTTGAAGAACGCAGGGGCAACTTATCAACGCGCCATGGTAGCTTTATTTCA
ATGATGTGATATGATACATCATGAGATTGAAGTTTA[M]GTAGACGATATGATAGCTC
CGGTCACAAA
```

The pipeline can reposition variants described in a VCF file from one genome version to another.
In this case, the input VCF file must be supplemented with the genome on which the coordinates apply.
If the VCF is lacking (some or all) IDs, they can be generated based on the chromosome and the position of the SNP. Anyhow, the SNP IDs currently have to follow the format: `CHROM\_POS\_REF\_ALT`. Since IDs are mandatory for the pipeline, it is advised to keep the parameter add_ids_to_vcf to true.
### Example
#### getfasta
With 2 nucleotides on each side in the bedtools slop gives:
```
>Lcu.2RBY.Chr1_58402::Lcu.2RBY.Chr1:58399-58404
CAACA
```
The first position in the header is 1-based (from the VCF). The next positions are 0-based (from the BED).
In IGV, the first C would be at position 58,400 and the last A at 58,404.


## Input of markers in BED format


The genome input must in FASTA format.
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed

All programs and libraries are installed with containers when possible or using Conda when several python packages are required.
Baptiste Imbert's avatar
Baptiste Imbert committed

# Running the pipeline

nextflow run main.nf -c  conf/example_dataset_csv.config --outdir example_dataset_csv
nextflow run main.nf -c  conf/example_dataset_vcf.config --outdir example_dataset_vcf
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed
```
nextflow run main.nf -c  conf/example_dataset_bed.config --outdir example_dataset_bed
Baptiste Imbert's avatar
Baptiste Imbert committed
```

## With your dataset
You can adapt the example config files to suit your needs and run `alignmarkers` as shown previously.
Baptiste Imbert's avatar
Baptiste Imbert committed

# Pipeline parameters
Baptiste Imbert's avatar
Baptiste Imbert committed

`--markers_csv` : markers in CSV format with the first column being the marker ID and the second the sequence
`--markers_fasta` : markers in FASTA format
`--markers_vcf` : markers in VCF format
              or
`--markers_bed` : markers in BED format
Baptiste Imbert's avatar
Baptiste Imbert committed

`--genome` : the reference genome in FASTA format, on which the markers should be aligned
Baptiste Imbert's avatar
Baptiste Imbert committed

If the option `--vcf` or `--bed` is selected, also provide:
`--snp_genome` : the genome on which the SNP markers were originally identified
Baptiste Imbert's avatar
Baptiste Imbert committed

`--slop_extend` : length of the extension in each direction from the provided locus for the sequence extraction

`--filter_evalue` : filter by evalue (default = false)
Baptiste Imbert's avatar
Baptiste Imbert committed

`--filter_length` : filter by length of alignment (default = false)
Baptiste Imbert's avatar
Baptiste Imbert committed

`--filter_indel` : filter by indel i.e. no insertion / deletion in the alignment (default = false)
Baptiste Imbert's avatar
Baptiste Imbert committed

`--filter_polymorphism` : filter by polymorphism (position of the polymorphism not included in the alignment, value of the polymorphism not correct compared to that of the genome)(default = false)
Baptiste Imbert's avatar
Baptiste Imbert committed

`--stats` : option to show stats (histogram, boxplot)
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed

`--outdir` : directory where you will have all the results of the different processes (default = 'results')
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed

- annotate/
  - *.vcf annotated VCF with IDs for the SNPs
- vcf2bed/
  - *.bed  BED file extracted from the input VCF
- faidx/
  - *.fai chromosome sizes required by bedtools slop
- slop/
  - *.bed BED file with extended features
- getFasta/
  - *.fa a FASTA file of the markers
- csv_markers/
  - *.csv a CSV file of the markers

- alignment/
  - alignment.paf paf file of the aligment of markers on the genome using minimap2
- filtered/
  - filtered_paf.csv  PAF filtered acording to users parameters
  - alignment_query_subject_sequences.txt  sequences of the query and reference for each marker
  - position_marker.tsv summary table of the polymorphism found on the query and reference along with alignment statistics
  - unaligned_markers.log list of all markers not found on the genome by minimap2
Baptiste Imbert's avatar
Baptiste Imbert committed

  - markers_n.fa  fasta file with the polymorphism replaced by "N"
  - marker_position_value.tsv tsv file with the polymorphism position and value per marker (uses IUPAC for biallelic SNPs)

- stats/
  - original_paf/ (before filtering)
    - boxplot_filtered_pident.svg boxplot of percent identity alignments of markers on the genome
    - boxplot_filtered_plenght.svg  boxplot of percent alignment length of markers on the genome
    - histogram_multimapping_filter.svg histogram of the number of alignment by markers
    - marker_multialignment.txt number of markers aligned once, twice, etc.
  - filtered_paf/ (after filtering)
    - boxplot_filtered_pident.svg boxplot of percent identity alignments of markers on the genome
    - boxplot_filtered_plenght.svg  boxplot of percent alignment length of markers on the genome
    - histogram_multimapping_filter.svg histogram of the number of alignment by markers
    - marker_multialignment.txt number of markers aligned once, twice, etc.



Baptiste Imbert's avatar
Baptiste Imbert committed

# Bugs that may occur
Baptiste Imbert's avatar
Baptiste Imbert committed

- if there is not match in your .paf file it will causes a bug (I'm working on it)
Baptiste Imbert's avatar
Baptiste Imbert committed

- if the polymorphism is at the end (or the beginning) of the sequence it will not be aligned with the rest
Baptiste Imbert's avatar
Baptiste Imbert committed

# General information
Baptiste Imbert's avatar
Baptiste Imbert committed

The nextflow will by defaut without option check the polymorphism and the output files will be given without the alignement who didn't pass the filter polymorphism
Baptiste Imbert's avatar
Baptiste Imbert committed

Baptiste Imbert's avatar
Baptiste Imbert committed

| col      |      type     |description |
|----------:|:-------------:|:-----------|
| sseqid |  str | Target sequence name      |
| sstart |    str   |   Query start (0-based; BED-like; closed)    |
| send | str |    Query end coordinate (0-based; BED-like; open)  |
| qseqid | str |    Query sequence name      |
| strand | str |    ‘+’ if query/target on the same strand; ‘-’ if opposite      |
| pos_poly_on_alignment | int |    Position Polymorphism on the alignment      |
| pos_poly_on_ref | int |    Position Polymorphism on the reference(target)      |
| base_gen | str |    nucleotides on the genome that matches the nucleotide at position (pos_poly_on_alignment + sstart)      |
| value_poly  | str |    polymorphism value for a sequence from fasta file      |
| pident  | float |percentage of identity which comes from the following calculation (Number of matching bases in the mapping / total number of sequence matches, mismatches and gaps in the mapping) * 100|
| plength  | float | percentage of marker alignment that comes from the following calculation (end position - start position) / length of aligned marker * 100 |
Baptiste Imbert's avatar
Baptiste Imbert committed

If you have any problem or some doubts about the value of the polymorphism or anything about the alignment or the result there is a data marker generated in the filterAndStats process where there every information used to generate the tsv file. For the position polymorphism if the value is -1 it means that the position of the polymo
Baptiste Imbert's avatar
Baptiste Imbert committed

![Image Data_marker](pictures/data_marker_file.png)