Secondary operations¶

Note

Data availability for this tutorial: the medium-sized data set of 614 genes and 48 taxa used here can be downloaded here.

In addition to one of the main operations of TriFusion (Conversion, Concatenation and Reverse concatenation), one or more secondary operations can be applied during the processing of alignment files.

How main and secondary operations interact¶

Before starting with the secondary operations that are available on TriFusion, it is worth clarifying how the main and secondary operations interact:

Conversion: Each secondary operation is applied independently on each active input alignment that will be converted.

Concatenation: With the exception of the Filter secondary operation, all remaining secondary operations are performed on the single concatenated alignment file.

Reverse concatenation: Secondary operations will be applied after the reverse concatenation, which means that they will be applied to each partition (output file) that will be generated. It’s similar to the Conversion operation.

The order of operations¶

For performance reasons, operations in TriFusion are executed in a specific order:

Reverse concatenation [main]

Filters [secondary]

Concatenation [main]

Collapse [secondary]

Gap coding [secondary]

Consensus [secondary]

Write to file
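As a rough illustration of the fixed ordering listed above, the following sketch sorts whatever operations a user has selected into that order. The `plan` helper and the `EXECUTION_ORDER` constant are hypothetical names introduced for this example only; they are not part of TriFusion.

```python
# Execution order copied from the list above.
EXECUTION_ORDER = [
    "Reverse concatenation",
    "Filters",
    "Concatenation",
    "Collapse",
    "Gap coding",
    "Consensus",
    "Write to file",
]


def plan(selected):
    """Return the selected operations in execution order (hypothetical helper).

    "Write to file" is always the last step, even if it was not selected.
    """
    unknown = set(selected) - set(EXECUTION_ORDER)
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    steps = [op for op in EXECUTION_ORDER if op in selected]
    if "Write to file" not in steps:
        steps.append("Write to file")
    return steps


print(plan({"Collapse", "Concatenation", "Filters"}))
```

The order in which the operations are switched on in the interface therefore does not matter: Filters always run before Concatenation, and Collapse always runs after it.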

Ok, so let’s start with the tutorial.

Load data¶

As already covered in a separate tutorial (see Load data into TriFusion), alignment data can be loaded into TriFusion in three different ways. Here we will use the file browser to load an entire directory where 614 alignment files are stored.

Navigate to Menu -> Open/View Data and click the Open file(s) button. This will open the main file browser.

The input data type is already correctly set to Alignment/Sequence set, so we’ll leave that as it is. Then, navigate the file browser until you find the directory containing the alignment files. In this case, all alignments are stored in a directory named Version2. Since TriFusion supports the selection of directories (in which case all files inside the specified directory will be loaded), I will only select the Version2 directory and click the Load & go back button. When loading finishes, a popup reports how many files were loaded.

Note

If you know that not all files in the selected directory are alignments, you could still load that particular directory. All invalid alignment files will be ignored when the data is loaded.

You’ll also need to check the general options that are common to all operations.

Displaying secondary operations¶

To display all secondary operations, click the Show additional options button. This will reveal a tabbed panel, where the secondary operations are sorted into categories (with the exception of the Formatting tab, which is not a secondary operation per se).

Collapse¶

Turn the collapse switch ON to activate the operation.

The Collapse secondary operation contains three options:

Save in new output file: This will save the collapsed alignment in another output file, separate from the main concatenated/converted output file. Checking this option will effectively produce two output files: a main output that is only concatenated/converted and another output file with the suffix “_collapsed” that will be concatenated AND collapsed. For now, we will not check this option.

Ignore missing data: If this option is checked, sequences will be collapsed based on alignment columns that do not contain missing data and the output alignment will also contain 0% of missing data. The currently loaded data set has a fair amount of missing data, and is most likely not appropriate for collapsing using this option, so we will also leave this unchecked.

Haplotype prefix: Sets the prefix for the haplotypes that will appear as the taxa names in the output file. An auxiliary file with the suffix “_haplotypes”, which matches each new haplotype name to the original taxon names, will also be generated when performing this operation. Here we can change the default value to anything, like Haplotype.

Note

You can click the Execute button to execute the Collapse operation alone, or first combine it with other secondary operations.
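To make the Collapse options more concrete, here is a minimal sketch of collapsing identical sequences into haplotypes. It is not TriFusion's implementation: the `collapse` function, the `Hap_1`-style numbering and the `missing` parameter are assumptions made for illustration. The returned mapping plays the role of the “_haplotypes” auxiliary file.

```python
def collapse(alignment, prefix="Hap", ignore_missing=False, missing="-N"):
    """Collapse identical sequences (hypothetical helper).

    alignment: dict mapping taxon name -> aligned sequence.
    missing:   symbols treated as missing data; use "-X" for proteins.
    Returns (collapsed alignment, haplotype name -> original taxa).
    """
    seqs = dict(alignment)
    if not seqs:
        return {}, {}
    if ignore_missing:
        # Keep only the columns in which no sequence has missing data.
        length = len(next(iter(seqs.values())))
        keep = [i for i in range(length)
                if all(s[i] not in missing for s in seqs.values())]
        seqs = {t: "".join(s[i] for i in keep) for t, s in seqs.items()}

    groups = {}  # sequence -> taxa sharing that sequence
    for taxon, seq in seqs.items():
        groups.setdefault(seq, []).append(taxon)

    collapsed, haplotypes = {}, {}
    for n, (seq, taxa) in enumerate(groups.items(), start=1):
        name = f"{prefix}_{n}"  # naming scheme is an assumption
        collapsed[name] = seq
        haplotypes[name] = taxa
    return collapsed, haplotypes


aln = {"A": "ACGT", "B": "ACG-", "C": "ACGA"}
print(collapse(aln))
print(collapse(aln, ignore_missing=True))
```

With `ignore_missing=True`, the last column is dropped because taxon B has a gap there, so all three taxa collapse into a single haplotype. This is why the option is a poor fit for data sets with a lot of missing data.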

Consensus¶

Turn the consensus switch ON to activate the operation.

The consensus operation is mainly used to compress multiple sequences in an alignment into one representative sequence. It can be combined with the Concatenation main operation, but that will concatenate all 614 alignments into a single alignment and then create one consensus of that large alignment. In most cases, users are more interested in creating a consensus sequence for each input alignment. With this in mind, this secondary operation should be used with the Conversion main operation.

The Consensus secondary operation contains three options:

Save in new output file: This will save the consensus alignment in another output file, separate from the main concatenated/converted output file. Checking this option will effectively produce two output files: a main output that is only concatenated/converted and another output file with the suffix “_consensus” that has the consensus applied. For now, we will not check this option.

Save consensus in a single file: This option can be checked to merge the consensus sequences of all input alignments into a single file. If it is left unchecked, 614 output files will be created in this case, each with a single representative consensus sequence of the corresponding alignment. However, here we are more interested in merging all consensus sequences in a single file that will later be provided for functional annotation analyses. So we’ll check this option.

Consensus variation handling: Select how you would like to handle variation within each alignment. The appropriate choice is highly dependent on subsequent analyses. In our case, since we want to create a dataset for Blast2GO and our alignment data is fairly variable, we’ll select the First sequence value, where the first sequence of each alignment is selected as a representative.

Note

You can click the Execute button to execute the Consensus operation alone, or first combine it with other secondary operations.
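The combination chosen above (First sequence handling, with all consensus sequences saved in a single file) can be pictured with the following sketch. It is only an illustration: the function name is hypothetical, and using the alignment name as the FASTA header is an assumption about the output layout.

```python
def first_sequence_consensus(alignments):
    """Build one FASTA-formatted string with one record per alignment.

    alignments: dict mapping alignment name -> {taxon: sequence}, where the
    inner dict keeps the order of the sequences in the input file.
    The first sequence of each alignment is used as its representative.
    """
    records = []
    for name, aln in alignments.items():
        if not aln:
            continue  # skip empty alignments
        representative = next(iter(aln.values()))
        records.append(f">{name}\n{representative}")
    return "\n".join(records) + ("\n" if records else "")


data = {
    "gene1": {"taxonA": "ACGT", "taxonB": "ACGA"},
    "gene2": {"taxonC": "TTTT"},
}
print(first_sequence_consensus(data), end="")
```

With the single-file option unchecked, each record would instead go to its own output file, which is how the 614 separate files mentioned above would arise.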

Filters¶

There are several Filter operations that can be applied to the alignments. Turn the filter main switch ON to activate the operation. Now you can specify one or more filters to execute in the same run. Whenever a particular filter is active, the button of the corresponding operation will display Filters set.

Note

The Codon filter operation can only be executed on nucleotide alignments, so it will be disabled when protein alignments are loaded. You can use the small 7-alignment data set for this tutorial.

Taxa filter¶

Click on the button of the Taxa filter option and turn the switch on the popup of this operation ON.

The Taxa filter operation allows users to filter entire alignments depending on whether they contain or exclude a given set of taxa. Here, we will create a fictional case where we are interested in concatenating only alignments that contain at least all taxa with names beginning with a “C”.

The filter mode sets whether the alignments should be filtered if they contain or exclude the taxa group. By default, it is set to Containing, so we’ll leave that unchanged.

As you can see, there are no taxa groups yet defined so we’ll need to create a new one. Click the Set taxa group button to start the data set group creation process and then click the Set manually button. Here, select the desired taxa with names starting with the letter “C” and save the taxa group by clicking Ok.

Once the group has been created, it will be automatically selected in the Taxa filter dialog. Additional groups can be created in the same way. When multiple groups have been defined, they can be selected by clicking the Use taxa group button, and then selecting the desired group.

When you are happy with the Taxa filter settings, click the Ok button. If the Taxa filter switch was turned ON, the button of the Taxa filter option should change to Filters ON.

Finally, press the Execute button at the bottom of the Process screen to execute the filter operation. At the end of Filter operations that may remove alignment files from the final output, a Filter report will pop up showing how many alignments were filtered. In our case, 84 alignments were filtered (By taxa filter) from the final output.
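The effect of the filter mode can be sketched in a few lines. This is a minimal illustration and not TriFusion's code; it assumes that Containing keeps alignments that include every taxon of the group, and that the exclude mode keeps alignments that include none of them. The function name and the "Exclude" label are hypothetical.

```python
def taxa_filter(alignments, group, mode="Containing"):
    """Split alignments into kept and removed ones (hypothetical helper).

    alignments: dict mapping alignment name -> {taxon: sequence}.
    group:      iterable of taxon names forming the taxa group.
    """
    group = set(group)
    kept, removed = {}, []
    for name, aln in alignments.items():
        present = set(aln)
        if mode == "Containing":
            ok = group <= present            # every group taxon is present
        elif mode == "Exclude":
            ok = group.isdisjoint(present)   # no group taxon is present
        else:
            raise ValueError(f"unknown filter mode: {mode}")
        if ok:
            kept[name] = aln
        else:
            removed.append(name)
    return kept, removed


alignments = {
    "gene1": {"Cat": "ACGT", "Cow": "ACGT", "Dog": "ACGA"},
    "gene2": {"Cat": "ACGT", "Dog": "ACGA"},
    "gene3": {"Dog": "ACGA"},
}
kept, removed = taxa_filter(alignments, ["Cat", "Cow"])
print(sorted(kept), removed)
```

The `removed` list corresponds to the count shown in the Filter report.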

Codon position filter¶

Note

This filter is only available for nucleotide alignments.

Turn ON the filter switch to activate the operation. Then click the Set filters button for the Codon position filter option and turn ON the switch on the popup as well.

The Codon position filter operation allows you to remove certain codon positions from the output alignment. Consequently, this option is only available for nucleotide sequences. In many nucleotide alignments it is common to remove the third codon position, as it is generally much more variable and could introduce a substantial amount of phylogenetic noise. Note, however, that this option removes the same positions in all input alignments, counting from the first column of each alignment. For example, if you load 10 alignments in TriFusion and exclude the 3rd codon position, you must make sure that all 10 alignments start in the 1st codon position. If all alignments instead start in the 2nd codon position, removing the 3rd codon position is still possible in TriFusion by excluding the 2nd position (which, in those alignments, actually corresponds to the 3rd codon position).

To exclude a given codon position, simply toggle the corresponding button off. The buttons of included positions always have a blue background.
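The relationship between the excluded position and the reading frame can be checked with a small sketch. The function below is a hypothetical helper, not TriFusion's code; it simply counts columns in cycles of three starting from the first column, which is the behaviour described above.

```python
def filter_codon_positions(seq, keep=(True, True, False)):
    """Keep column i only when keep[i % 3] is True.

    keep: three flags, one per position button (1st, 2nd, 3rd).
    """
    if len(keep) != 3:
        raise ValueError("keep must have exactly three flags")
    return "".join(base for i, base in enumerate(seq) if keep[i % 3])


in_frame = "ATGGCCAAA"   # starts at the 1st codon position
shifted = in_frame[1:]   # same gene, but starts at the 2nd codon position

# Excluding the 3rd position removes the true 3rd codon positions.
print(filter_codon_positions(in_frame, keep=(True, True, False)))
# For the shifted alignment, the 2nd position button must be toggled off instead.
print(filter_codon_positions(shifted, keep=(True, False, True)))
```

Both calls remove the same nucleotides (the true third codon positions); the second result only lacks the leading `A` that the shifted alignment never contained.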

Gap/Missing data filter¶

Turn ON the filter switch to activate the operation. Then click the Set filters button for the Gap/Missing data filter option and turn ON the switch on the popup as well.

The Gap/Missing data filter allows users to filter alignment columns (within alignment) and/or alignments (multiple alignments) based on their missing data content. The two filters can be used together, if both the within alignment and multiple alignments checkboxes are active, or individually.

In this example, we will filter both alignment columns and alignment files, so both checkboxes will remain active. Within an alignment, columns can be filtered depending on the amount of gaps or missing data. Gaps refer to the usual gap symbol (“-”) while missing data refers to the sum of gap symbols AND true missing data (“N” for nucleotides or “X” for proteins). These filters provide maximum threshold values in percentages, above which alignment columns are filtered. For example, if the gap percentage allowed option is set to 25% and the missing data percentage allowed option is set to 50%, then alignment columns with more than 25% of gaps OR more than 50% of gaps + true missing data are filtered.
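The two column thresholds can be expressed as a short sketch. It follows the rule stated above (a column is removed when it exceeds either threshold), but the function and parameter names are hypothetical and the code is not taken from TriFusion.

```python
def filter_columns(alignment, gap_max=25.0, missing_max=50.0,
                   gap="-", missing="N"):
    """Remove columns with too many gaps or too much missing data.

    alignment: dict mapping taxon -> aligned sequence (equal lengths).
    gap_max / missing_max: maximum allowed percentages per column.
    missing: true missing data symbol ("N" for nucleotides, "X" for proteins).
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    n = len(seqs)
    keep = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        gaps = sum(c == gap for c in column)
        # Missing data = gaps + true missing data symbols.
        miss = gaps + sum(c.upper() == missing for c in column)
        if 100 * gaps / n <= gap_max and 100 * miss / n <= missing_max:
            keep.append(i)
    return {t: "".join(s[i] for i in keep) for t, s in alignment.items()}


aln = {"a": "AC-N", "b": "ACGN", "c": "AC-N", "d": "A-GT"}
print(filter_columns(aln))                              # 25% / 50% thresholds
print(filter_columns(aln, gap_max=0, missing_max=0))    # no missing data at all
```

In the first call, the second column survives because its 25% of gaps does not exceed the threshold, while the third (50% gaps) and fourth (75% missing data) columns are removed. Setting both thresholds to 0% keeps only columns that are completely free of gaps and missing data.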

In our case, we are interested in producing an output matrix that contains no missing data, so we will set both sliders to 0%.

Concerning the multiple alignments option, we will be more relaxed. We’ll set the slider to 25%, which means that only alignments with more than 25% of the total data set taxa (12 out of 48 in this case) will be further processed.
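The arithmetic behind the multiple alignments slider is simple: with 48 taxa in the data set, a 25% setting corresponds to 48 × 25 / 100 = 12 taxa. A minimal sketch follows; the function name is hypothetical, and whether an alignment with exactly the threshold number of taxa passes is an assumption here (the comparison is strict, following the "more than" wording above).

```python
def filter_by_taxa_representation(alignments, total_taxa, min_pct):
    """Keep alignments with more than min_pct percent of the data set taxa.

    alignments: dict mapping alignment name -> {taxon: sequence}.
    total_taxa: number of taxa in the whole data set.
    """
    threshold = total_taxa * min_pct / 100
    return {name: aln for name, aln in alignments.items()
            if len(aln) > threshold}


alignments = {
    "twelve_taxa": {f"taxon{i}": "ACGT" for i in range(12)},
    "thirteen_taxa": {f"taxon{i}": "ACGT" for i in range(13)},
}
print(48 * 25 / 100)
print(list(filter_by_taxa_representation(alignments, total_taxa=48, min_pct=25)))
```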

When you are happy with the gap/missing data filter settings, click the Ok button. If the Gap/Missing data filter switch was turned ON, the button of the Gap/Missing data filter option should change to Filters ON.

Finally, press the Execute button at the bottom of the Process screen to execute the filter operation. At the end of Filter operations that may remove alignment files from the final output, a Filter report will pop up showing how many alignments were filtered. In our case, there were actually no filtered alignments, which means that all input alignments already contained more than 25% of the total taxa.

Sequence variation filter¶

Turn ON the filter switch to activate the operation. Then click the Set filters button for the Sequence variation filter option and turn ON the switch on the popup as well.

The sequence variation filter allows users to filter alignment files based on the amount of sequence variation. The two supported types of sequence variation are variable sites and informative sites. The difference between these types is that variable sites include all columns with at least one variant, while informative sites only include variable columns where at least one alternative allele has two or more copies.
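The two definitions can be illustrated with a small counting sketch. It is not TriFusion's implementation: the function name is hypothetical, and ignoring gap and missing data symbols when counting alleles is an assumption. A column counts as informative when at least two different alleles each occur two or more times, which matches the definition above because the most common allele is always at least as frequent as any alternative.

```python
from collections import Counter


def count_sites(alignment, ignore="-N"):
    """Return (variable sites, informative sites) for one alignment.

    alignment: dict mapping taxon -> aligned sequence (equal lengths).
    ignore:    symbols that are not counted as alleles (assumption).
    """
    variable = informative = 0
    for column in zip(*alignment.values()):
        counts = Counter(c for c in column if c not in ignore)
        if len(counts) < 2:
            continue  # invariant column
        variable += 1
        if sum(n >= 2 for n in counts.values()) >= 2:
            informative += 1
    return variable, informative


aln = {"t1": "AAAA", "t2": "AACA", "t3": "ACCA", "t4": "ACGA"}
print(count_sites(aln))
```

In this example the second column (A, A, C, C) is both variable and informative, whereas the third column (A, C, C, G) is variable only, because just one of its alleles occurs twice.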

Here, you can specify multiple combinations of maximum and minimum values for each variation type. When a checkbox is left inactive, it is assumed that there is no boundary for that specific value. For instance, let’s filter our alignments so that only alignments with at least 2 informative sites are processed. To achieve this, check the Minimum box of the informative sites option and set it to 2, but leave the Maximum box unchecked.

If you would like to set an upper limit to the number of informative sites, just check the Maximum box and set a number higher than 2. In this case, let’s put an upper limit of 10 informative sites.

It is also possible to mix both types of sequence variation. For instance, we may want to keep only alignments with at least 2 informative sites and fewer than 200 variable sites.

However, note that certain combinations are redundant. For instance, if you set a minimum of informative sites to 2, setting a minimum of variable sites to 1 will have no effect on the final output, because every informative site is also a variable site.
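The way unchecked boxes and mixed bounds behave can be sketched as follows. The names are hypothetical, `None` stands for an unchecked box, and treating the bounds as inclusive is an assumption.

```python
def within_bounds(value, minimum=None, maximum=None):
    """True when value respects the bounds; None means no boundary."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def passes_variation_filter(variable, informative,
                            var_min=None, var_max=None,
                            inf_min=None, inf_max=None):
    """Decide whether an alignment is kept (hypothetical helper)."""
    return (within_bounds(variable, var_min, var_max)
            and within_bounds(informative, inf_min, inf_max))


# Only a minimum of 2 informative sites, Maximum left unchecked.
print(passes_variation_filter(variable=30, informative=2, inf_min=2))
print(passes_variation_filter(variable=30, informative=1, inf_min=2))

# Redundant combination: informative sites are a subset of variable sites,
# so adding var_min=1 to inf_min=2 never changes the outcome.
for variable, informative in [(0, 0), (1, 0), (3, 2), (250, 40)]:
    assert (passes_variation_filter(variable, informative, inf_min=2)
            == passes_variation_filter(variable, informative,
                                       inf_min=2, var_min=1))
```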

When you are happy with the sequence variation filter settings, click the Ok button. If the Sequence variation filter switch was turned ON, the button of the Sequence variation filter option should change to Filters ON.

Finally, press the Execute button at the bottom of the Process screen to execute the filter operation. At the end of Filter operations that may remove alignment files from the final output, a Filter report will pop up showing how many alignments were filtered. In our case, if we execute filter options of at least 2 informative sites and fewer than 200 variable sites, a total of 539 alignments will be filtered.

https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/process_variation_filter_report.png

Gap coding¶

Turn ON the gap coding switch to activate the operation.

The Gap coding operation enables the codification of gaps as a binary matrix that is appended to the end of the alignment matrix. This option is available only when Nexus is the only output format selected. Currently, it contains a single available option:

Save in new output file: This will save the alignment with coded gaps in another output file, separate from the main concatenated/converted output file. Checking this option will effectively produce two output files: a main output that is only concatenated/converted and another output file with the suffix “_gcoded” that will have the coded gaps. For now, we will not check this option.

The Gap coding method is currently restricted to the one described in Simmons and Ochoterena (2000); additional methods are expected to be added in future releases.
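To give an idea of what a binary gap matrix appended to an alignment looks like, here is a deliberately simplified sketch. Each distinct gap, defined by its start and end columns, becomes one binary character that is 1 for sequences carrying exactly that gap and 0 otherwise. It does not attempt to reproduce every rule of the published method or TriFusion's Nexus output, and the function name is hypothetical.

```python
import re


def code_gaps(alignment, gap="-"):
    """Append a presence/absence matrix of gaps to each sequence.

    alignment: dict mapping taxon -> aligned sequence (equal lengths).
    Returns (coded alignment, sorted list of (start, end) gap spans).
    """
    pattern = re.compile(re.escape(gap) + "+")
    # For each taxon, the set of (start, end) spans of its gap runs.
    runs = {taxon: {m.span() for m in pattern.finditer(seq)}
            for taxon, seq in alignment.items()}
    indels = sorted(set().union(*runs.values()))
    coded = {
        taxon: seq + "".join("1" if indel in runs[taxon] else "0"
                             for indel in indels)
        for taxon, seq in alignment.items()
    }
    return coded, indels


aln = {"a": "AC--GT", "b": "ACTTGT", "c": "A---GT"}
coded, indels = code_gaps(aln)
print(indels)
for taxon, seq in coded.items():
    print(taxon, seq)
```

Here two distinct gaps are found, spanning columns 1 to 3 and 2 to 3 (zero-based, end exclusive in the printed spans), so two binary characters are appended to every sequence.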

Combination of three secondary operations¶

Until now, we have only dealt with the activation and usage of individual secondary operations. However, many of these operations combine rather naturally. Here I’ll demonstrate how a data set of 614 alignments with 48 taxa can be concatenated, collapsed and filtered in a single run, with the condition that the collapsed alignment has to be generated in an independent alignment file.

After loading the data, select the Concatenation main operation in the Process screen. To keep things simple, let’s leave the Data set options in the default values, select only the Nexus output format and provide an output file name (here it will be my_concatenation).

https://raw.githubusercontent.com/ODiogoSilva/TriFusion-tutorials/master/tutorials/images/process_general_opts_secops.png

Setting up collapse operation¶

Open the secondary operations tabbed menu by clicking the Show additional options button, click on the Collapse tab and turn the switch ON.

Since we want to save the collapsed alignment in a separate output that is independent of the remaining operations, we’ll check the Save in new output file box. Our data set contains a fair amount of missing data, so we’ll leave the Ignore missing data box unchecked. Finally, we can leave the haplotype prefix in its default Hap value.

Setting up taxa filter¶

Here we are interested in creating an output data set with alignments that contain any taxon whose name starts with the letter “C”.

Click on the Filter tab and turn the switch ON. Then, click on the Set filters button for the Taxa filter option and activate the switch in the popup. Change the Filter mode to Contain and then click on the Set taxa group button to define the new taxa group.

Let’s manually create a taxa group with all taxa names that start with the letter “C” by clicking the Set manually button.

Once the group has been created, check that this group is correctly selected in the Taxa filter dialog.

If all checks out, click Ok and the button of the Taxa filter option should now display Filters ON.

Setting up missing data filter¶

Here we are interested in filtering ONLY alignments that contain less than 50% of the total taxa in the data set. Since we are not interested in the within alignment filtering, let’s uncheck the within alignment box and set the Multiple alignments slider to 50%.

Then select the Ok button, and both the Taxa filter and Gap/Missing data filter buttons should now display Filters ON.

Checking selected options¶

All currently active options can be viewed by clicking the View Queue button at the bottom of the Process screen. This will open the Menu side panel and show that:

The main operation is Concatenation;

There are two active secondary operations: Collapse and Filter;

The Nexus output format is the only one selected;

There are two expected output files: the main output, my_concatenation, and the separate output file that will only contain the result of the concatenation and collapse operations, my_concatenation_collapsed.

Execution¶

If everything checks out, click the Execute button at the bottom of the Process screen to show the small popup that displays a summary of the process execution and then click the Execute button in that popup to begin the execution.

At the end of the execution, a filter report will appear showing the number of alignments that were filtered by the active filters. Since we only activated two of the four filters that can remove alignments from the final output, the values for the other two filters display a Not applied message. For the active filters, the number of alignments removed due to that filter is displayed. In this case, no alignment was removed by the Gap/Missing data filter (it seems all alignments already contained more than 50% of the total taxa) and 84 alignments were removed by the Taxa filter.