SPInDel workbench
*****************

.. sidebar:: Download **SPInDel workbench**
    

   - **Version 1.1 (1 February 2012)**

   Source code
   `SPInDel workbench version 1.1 <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.1_source.zip>`_.
   
   Windows 32bit
   `SPInDel workbench version 1.1 win32 <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.1.exe>`_.   

   - **Version 1.0.1 (19 July 2010)**

   Source code
   `SPInDel workbench version 1.0.1 <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.0.1_source.zip>`_.
   
   Windows 32bit
   `SPInDel workbench version 1.0.1 win32 <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.0.1.exe>`_.
   
   Linux 32bit
   `SPInDel workbench version 1.0.1 linux32 <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.0.1_exe.linux-i686-2.6.zip>`_.
   
   - **Version 1.0**

   Windows 32bit
   `SPInDel workbench version 1.0 (20 April 2010) <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.0.exe>`_.
   
   Linux 32bit
   `SPInDel workbench version 1.0 (13 May 2010) <http://www.portugene.com/SPInDel/SPInDel_download/SPInDel_v1.0_exe.linux-i686-2.6.zip>`_.

   
   .. raw:: html
 
         <span style="font-size: 8pt">
         <br>
         <p>Softpedia guarantees that SPInDel Workbench 1.0.1 is 100% CLEAN,
         which means it does not contain any form of malware, including spyware, viruses, trojans and backdoors.<br>
         <a href="http://www.softpedia.com/get/Science-CAD/SPInDel.shtml"><img border="0" src="http://www.softpedia.com/images/softpedia_download_small.gif"/></a>
         </p></span>

.. topic:: Workbench

   The SPInDel workbench is a computational platform that facilitates the planning and 
   management of SPInDel projects, the alignment of nucleotide sequences, the visualization 
   and selection of conserved regions, the calculation of PCR primer properties, the prediction 
   of SPInDel profiles, and diverse statistical and phylogenetic analyses.
   It includes a large dataset comprising nearly 1,800 numeric profiles for the identification 
   of eukaryotic, prokaryotic and viral species.


SPInDel - version 1.1 Documentation (1 February 2012)
*****************************************************


1. About
========

   **SPInDel workbench version 1.1**
   

   Population Genetics group (http://www.portugene.com)

   IPATIMUP - Institute of Molecular Pathology and Immunology of the University of Porto, Portugal (http://www.ipatimup.pt)
   
   .. raw:: html
   
      Copyright &copy; 2009, 2010, 2011, 2012 by IPATIMUP. All rights reserved. 
      Software developed by Jo&#227;o Carneiro and Filipe Pereira


Related publications:
---------------------

.. Note:: Please cite these articles if you use the SPInDel workbench.

   .. raw:: html

      <span style="font-size: 11pt">
      <p>Filipe Pereira, Jo&#227;o Carneiro, Rune Matthiesen, Barbara van Asch, N&#225;dia Pinto, Leonor Gusm&#227;o and Ant&#243;nio Amorim.<br>
      "Identification of species by multiplex analysis of variable-length sequences."<br>
      Nucleic Acids Research. 2010. 38 (22): e203.<br>
      doi:10.1093/nar/gkq865</p>
      <p>Jo&#227;o Carneiro, Filipe Pereira, and Ant&#243;nio Amorim.<br>
      "A multifunction workbench for species identification in ecology and wildlife forensic investigations using insertion/deletion variants."<br>
      Submitted.<br>
      </p></span>
      

2. License Agreement
====================

   Terms of license:

   The SPInDel workbench is provided "as is", "with all faults" and without any express or implied warranty. In no event shall the authors or IPATIMUP be held liable for 
   any damages arising out of the use of or inability to use this software, even if its authors or IPATIMUP has been advised of the possibility of such damages.
   If you do not want to accept the terms of this license, you must not install the SPInDel workbench. By choosing to install this software you are accepting these terms.


3. System requirements
======================
 
   - Windows 95/98/NT/2000/XP/Vista/7 or Linux.
   - At least 100 MB of free hard disk space.
   - A minimum of 32 MB of RAM.


4. Installation
===============
   Windows
   
   - Download the SPInDel_v1.1.exe file from http://www.portugene.com or http://sourceforge.net/projects/spindel/files/ to any directory.
   - Execute SPInDel_v1.1.exe and run the Installation Wizard with administrative privileges.
   
   Linux
   
   - Download the SPInDel_v1.0.1_exe.linux-i686-2.6.zip file from http://www.portugene.com or http://sourceforge.net/projects/spindel/files/ to any directory.
   - Extract files from the zip file to the directory where you want to install the program.
   - Run the executable SPInDel_v1.


5. General Features
===================

-> Projects viewer
------------------

        Displays current SPInDel projects.

	* 'New project' button: Creates a new project by loading a DNA sequence alignment in the FASTA format (projects can be added or removed at any point). 

	* 'Remove project' button: Deletes the currently selected project from the SPInDel workbench.

-> Alignment editor
-------------------

        Displays the DNA sequence alignment from the currently loaded project.

	1. SPInDel project box:

		* 'Undo all changes' button: Undoes all previous changes made to a SPInDel project.

		* 'Save project' button: Saves all alterations made on a project.

		* 'Add sequences' button: Adds sequences to a project (sequences must be in a FASTA file).

		* 'Remove sequences' button: Removes the selected sequences from the current project. 
	
	2. Conserved region box:
	
		* 'Add' button: Adds a conserved region to the current project. 

		* 'Remove' button: Removes a conserved region.
	
	3. Profiles:

		* 'Calculate profiles' button: Retrieves the list of numeric profiles defined by selected conserved regions (see theoretical background for details on calculations).

	4. Graphic options - Shows basic features of the current sequence alignment:
		
		* 'Track 1' combobox: Selects the current alignment feature to be displayed in track 1. 

		* 'Track 2' combobox: Selects the current alignment feature to be displayed in track 2. 
	
		* 'Window (track 2)' combobox: Selects the window length to be used in the feature displayed in track 2. 

		* 'Step (track 2)' combobox: Selects the step value to calculate the feature displayed in track 2. 

		* 'In (Zoom box)' combobox: Zooms in on the selected column range in the graphical display of track 2.

		* 'Out (Zoom box)' combobox: Zooms out of the column range in the graphical display of track 2.

-> SPInDel alignment options
----------------------------

        Performs sequence alignments using PyCogent TreeAlign.


-> SPInDel profiles frame
-------------------------

        Shows profiles and general statistics.

	1. Hypervariable regions box:
	
		* 'Undo changes': Recalculates general statistics using all regions defined in the alignment. 

		* 'Remove selected': Recalculates general statistics using unselected columns. 
	
		* 'Remove unselected': Recalculates general statistics using selected columns.
	
	2. SPInDel calculations box:
	
		* 'Region by region' button: Calculates the frequency of species-specific alleles and average pairwise differences for each hypervariable region.

		* 'Mismatch distribution' button: Calculates the number of pairwise differences between all profiles.

		* 'UPGMA tree' button: Calculates the UPGMA tree using the matrix of pairwise differences between profiles.
		
		* 'Primers properties' button: Calculates several PCR primer properties (sequence length, Tm, GC content).

			* 'Export primers' button: Exports PCR primer properties in Excel CSV format. 
	
		* 'Combinations' button: Generates m-combinations without repetition, i.e., subsets of m distinct elements of the set of all possible regions. For each m-combination, all Nsp and Ndp values are displayed in tables and graphs. The algorithm also includes a 'multiplex PCR option' to retrieve only m-combinations that do not share conserved regions.

		* 'Search profile' button: Identifies an unknown profile in current databases.
		
		* 'PCA analysis' button: Performs a principal component analysis using profiles matrix.
			
	3. SPInDel exporter:
	
		* 'Profiles' button: Exports profiles in Excel CSV format. 

		* 'Pairwise matrix' button: Exports the matrix of pairwise differences in Excel CSV format. 

		* 'General statistics' button: Exports general statistics in text format. 

		* 'UPGMA tree' button: Exports the UPGMA tree in Newick format. 

		* 'PCA' button: Exports principal component analysis results.
		
		* 'Print' button: Prints profiles and general statistics. 
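
        The 'Combinations' option described above can be pictured with a short Python 3 sketch. It is an illustration only, not the workbench's own code: the function name ``region_combinations`` and the representation of each hypervariable region as a pair of flanking conserved-region identifiers are assumptions made for this example.

        .. code-block:: python

           from itertools import combinations


           def region_combinations(regions, m, multiplex=False):
               """Yield m-combinations of hypervariable regions.

               regions: dict mapping a region name to the pair of conserved
               regions that flank it (an assumed representation).
               multiplex: when True, skip combinations in which two regions
               share a conserved region.
               """
               for combo in combinations(sorted(regions), m):
                   if multiplex:
                       flanks = [c for name in combo for c in regions[name]]
                       # A repeated identifier means a shared conserved region.
                       if len(flanks) != len(set(flanks)):
                           continue
                   yield combo


           # Hypothetical example: three contiguous hypervariable regions.
           regions = {"H1": (1, 2), "H2": (2, 3), "H3": (3, 4)}
           all_pairs = list(region_combinations(regions, 2))
           multiplex_pairs = list(region_combinations(regions, 2, multiplex=True))

        In this example, only the pair H1 and H3 passes the multiplex filter, because H2 shares a conserved region with each of its neighbours.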
		
-> SPInDel profiles evaluation frame
------------------------------------

        Shows results of f(ts) and f(dp) for combinations of profiles with n regions:

	1. SPInDel profiles box:
	
		* 'Standard' or 'Multiplex PCR' combobox: Filters hypervariable regions for standard SPInDel profiles (defined by all conserved regions) or multiplex PCR SPInDel profiles (only hypervariable regions not sharing conserved regions).

		* 'Graph': Displays a graphic representation of f(sp) for all profiles from 1 to n regions. 
	
	2. Exporter tools box:
	
		* 'PCR primers' button: Exports PCR primers to an Excel \*.csv file. 

		* 'Tables' button: Exports tables with f(sp) and f(dp) values. 

-> SPInDel search frame
-----------------------

        Identifies unknown samples in the current database.

	1. Profiles box:
	
		* 'Add': Adds target profile.
		
		* 'Remove': Removes profile.
		
		* 'Search': Retrieves profiles from the database equal or similar to the target profile.
		
	2. k-nearest neighbor box:
	
		* Combobox: Selects the *k* value to use in *k*-nearest neighbor calculations.

		* 'Cross-validation' button: Gives the prediction accuracy of *k*-nearest neighbor model for the selected *k* value.

		
| The SPInDel software was written in Python 2.6 using Biopython (http://biopython.org/), SciPy (http://www.scipy.org/),
| GenomeDiagram (http://bioinf.scri.ac.uk/lp/programs.php), matplotlib (http://matplotlib.sourceforge.net/), NumPy (http://numpy.scipy.org/),
| PyCogent (http://pycogent.sourceforge.net/) and Pythia (http://pythia.sourceforge.net). 
| The graphical interface was created using the VisualWX Rapid Application Development (RAD) environment (http://visualwx.altervista.org), 
| and the Eclipse platform was used to debug and test the software.
| A single EXE file was created using the Inno Setup software (http://www.jrsoftware.org/isinfo.php) for installation purposes.


6. Theoretical background
=========================
	
	**Alignment calculations (Identity, GC and AT content, GC and AT skews)**
	
	   An identity value is plotted for each nucleotide position by estimating the frequency of the most common nucleotide in that
	   position (indels are ignored). Conserved regions can be easily identified by observing the graphic output for identity values
	   (highest conservation represented in green and lowest represented in red) and can be defined directly in the alignment window
	   using column selection. GC and AT skews and content were implemented using the GenomeDiagram Utilities.
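
           The identity calculation can be sketched in a few lines of Python 3. This is a minimal illustration rather than the workbench's implementation; the function name ``column_identity`` is hypothetical, and reporting 0.0 for a column made only of gaps is an assumption.

           .. code-block:: python

              from collections import Counter


              def column_identity(alignment, gap_chars="-"):
                  """Return, for each column, the frequency of its most common nucleotide.

                  alignment: list of equal-length aligned sequences (strings).
                  Gap characters are ignored when counting.
                  """
                  length = len(alignment[0])
                  if any(len(seq) != length for seq in alignment):
                      raise ValueError("all aligned sequences must have the same length")
                  identities = []
                  for column in zip(*alignment):
                      bases = [b.upper() for b in column if b not in gap_chars]
                      if not bases:
                          identities.append(0.0)  # assumption: all-gap column
                          continue
                      most_common_count = Counter(bases).most_common(1)[0][1]
                      identities.append(most_common_count / len(bases))
                  return identities


              # Hypothetical three-sequence alignment.
              identity = column_identity(["ACGT-", "ACGA-", "AC-AT"])

           Columns with an identity close to 1 are candidates for conserved regions; in the example, only the fourth column (T, A, A) falls below 1.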
	
	**Alignment algorithm**
	 
	   Projects with aligned sequences can be loaded directly, and unaligned sequences can be aligned with the PyCogent progressive alignment implemented in the workbench.
	   The user can select among different nucleotide substitution models (JC69, F81, HKY85 and GTR) to perform the alignment.
	
	**Calculations on SPInDel profiles (pairwise differences, mismatch distribution, f(ts), f(sh) and f(dp) )**


	   | **'SPInDel conserved regions'**: regions with no or small variability at the sequence level.
	   |
           | **'SPInDel hypervariable regions'**: regions containing multiple indels across species that potentially allow for differentiation by the determination of sequence length.
           |
           | **'Standard SPInDel profile'**: the combination of the fragment length of all contiguous SPInDel hypervariable regions observed in a sequence.
           |
	   | **'Multiplex PCR SPInDel profile'**: similar to a standard profile but only including SPInDel hypervariable regions that do not share the same conserved region.
	   |
	   | **'Species-specific SPInDel profiles'**: profiles that are only found in one species within a taxonomic group and allow their unequivocal identification. 
	   |
	   | **'Frequency of species-specific SPInDel profiles'**: 

           .. raw:: html
           
              <br>
              f<sub>n</sub><sup>G</sup>= N<sub>sp</sub>/N,<br>
              <br>
	      where G denotes the taxonomic group under investigation according to a two-letter code, n is the number of SPInDel hypervariable regions included in the profile, 
	      N<sub>sp</sub> is the number of species-specific SPInDel profiles and N is the total number of sequences represented in group G.<br>
	      'Number of species-shared profiles' (N<sub>sh</sub>): number of profiles that were found in more than one species inside a taxonomic group.<br><br> 
	      <b>'Average number of pairwise differences'</b>:<br>  
           
           .. raw:: html
           
              <br>
	      p<sub>n</sub><sup>G</sup>=(&#8721;<sup>N</sup><sub>k=1</sub>&#8721;<sup>N</sup><sub>l>k</sub>d<sub>kl</sub> )/(N(N-1)/2)<br>
	      <br>
	      where k and l are indices that refer to individual SPInDel profiles, d<sub>kl</sub> is the number of SPInDel hypervariable regions (from the total set of n) 
	      that differ in length between profiles k and l, and N is the total number of sequences represented in group G.<br><br> 
            <b>'Average number of pairwise differences per locus'</b>:<br>
            <br>(p<sub>n</sub><sup>G</sup>)/n,<br>
            <br>where n is the number of loci (i.e., hypervariable regions). 
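
           The quantities defined above can be computed with a short Python 3 sketch. It is illustrative only: the function names are hypothetical, profiles are assumed to be tuples of fragment lengths, and species-specific and species-shared profiles are counted over distinct profiles, which is one possible reading of the definitions.

           .. code-block:: python

              from collections import defaultdict
              from itertools import combinations


              def count_differences(profile_a, profile_b):
                  """d_kl: number of hypervariable regions whose lengths differ."""
                  return sum(1 for x, y in zip(profile_a, profile_b) if x != y)


              def profile_statistics(records):
                  """records: list of (species, profile) pairs, one per sequence.

                  Returns (N_sp, N_sh, f, p, p_per_locus); needs at least two records.
                  """
                  n_sequences = len(records)
                  n_regions = len(records[0][1])

                  # Collect the species in which each distinct profile is observed.
                  species_by_profile = defaultdict(set)
                  for species, profile in records:
                      species_by_profile[tuple(profile)].add(species)

                  n_sp = sum(1 for s in species_by_profile.values() if len(s) == 1)
                  n_sh = sum(1 for s in species_by_profile.values() if len(s) > 1)
                  f = n_sp / n_sequences

                  # Sum d_kl over all pairs k < l, then divide by N(N-1)/2.
                  total = sum(count_differences(p, q)
                              for (_, p), (_, q) in combinations(records, 2))
                  p = total / (n_sequences * (n_sequences - 1) / 2)
                  return n_sp, n_sh, f, p, p / n_regions


              # Hypothetical data: four sequences, three hypervariable regions.
              records = [
                  ("sp1", (100, 200, 300)),
                  ("sp2", (100, 200, 301)),
                  ("sp3", (150, 250, 350)),
                  ("sp4", (150, 250, 350)),
              ]
              n_sp, n_sh, f, p, p_per_locus = profile_statistics(records)

           Here two profiles are species-specific and one is shared by sp3 and sp4, so f = 2/4; the six pairwise comparisons sum to 13 differences, giving p = 13/6.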

	
        **UPGMA tree**
	
 	   UPGMA (Unweighted Pair Group Method with Arithmetic mean) is used to build a guide tree to discriminate between species in each database.
	   The distance between any two profiles A and B is taken to be the average of all distances between pairs of hypervariable regions "x" in A and "y" in B, 
	   that is, the mean distance between elements of each profile. The PyCogent UPGMA algorithm is used to cluster profiles based on the dissimilarity matrix 
	   obtained from the number of differences between profiles in each database.
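
           As a rough illustration of the clustering step, the following Python 3 sketch builds a UPGMA topology from the number of differences between profiles. It is a plain-Python stand-in, not the PyCogent routine used by the workbench; the function names are hypothetical, and branch lengths are omitted for brevity.

           .. code-block:: python

              def count_differences(profile_a, profile_b):
                  """Number of hypervariable regions whose lengths differ."""
                  return sum(1 for x, y in zip(profile_a, profile_b) if x != y)


              def upgma_newick(labels, profiles):
                  """Return the UPGMA topology as a Newick string (no branch lengths)."""
                  # Each cluster id maps to (Newick text, number of leaves).
                  clusters = {i: (labels[i], 1) for i in range(len(labels))}
                  dist = {}
                  for i in range(len(labels)):
                      for j in range(i + 1, len(labels)):
                          dist[(i, j)] = float(count_differences(profiles[i], profiles[j]))
                  next_id = len(labels)
                  while len(clusters) > 1:
                      # Merge the two closest clusters.
                      (a, b), _ = min(dist.items(), key=lambda item: item[1])
                      name_a, size_a = clusters.pop(a)
                      name_b, size_b = clusters.pop(b)
                      new_dist = {}
                      for c in clusters:
                          d_ac = dist[tuple(sorted((a, c)))]
                          d_bc = dist[tuple(sorted((b, c)))]
                          # Size-weighted mean equals the average over all leaf pairs.
                          new_dist[(c, next_id)] = (
                              (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
                          )
                      dist = {k: v for k, v in dist.items() if a not in k and b not in k}
                      dist.update(new_dist)
                      clusters[next_id] = ("(%s,%s)" % (name_a, name_b), size_a + size_b)
                      next_id += 1
                  newick, _ = next(iter(clusters.values()))
                  return newick + ";"


              # Hypothetical profiles for three species.
              tree = upgma_newick(
                  ["A", "B", "C"],
                  [(100, 200, 300), (100, 200, 301), (150, 250, 350)],
              )

           In this example A and B differ in one region and are joined first; C differs from both in all three regions and joins last.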
	
	
	**Primers properties**
	
	   Calculations on PCR primers were implemented using `Oligocalc <http://www.basic.northwestern.edu/biotools/oligocalc.html>`_.
	   For sequences shorter than 14 nucleotides, 

	   Tm= (wA+xT)*2 + (yG+zC)*4 - 16.6*log10(0.050) + 16.6*log10([Na+]) 

	   where w, x, y and z are the numbers of the bases A, T, G and C in the sequence, respectively. The term 16.6*log10([Na+]) adjusts the Tm for changes in the salt concentration, 
	   and the term 16.6*log10(0.050) sets the reference point, so that the net salt correction is zero at 50 mM Na+. Other monovalent and divalent salts also affect the Tm of the oligonucleotide, 
	   but sodium ions are much more effective at forming salt bridges between DNA strands and therefore have the greatest effect in stabilizing double-stranded DNA, 
	   although trace amounts of divalent cations have significant and often overlooked effects (see Nakano et al. (1999) Nucleic Acids Res. 27:2957-65). 
	
	   For sequences longer than 13 nucleotides, 

	   Tm= 100.5 + (41 * (yG+zC)/(wA+xT+yG+zC)) - (820/(wA+xT+yG+zC)) + 16.6*log10([Na+])
	
	   This equation is accurate for sequences in the 18-25mer range (Howley,P.M., Israel,M.F., Law,M-F., and Martin,M.A. (1979) J Biol Chem 254:4876-4883). 
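
           The two formulas above translate directly into Python 3. The sketch below is illustrative and is not taken from the workbench source; the function name ``melting_temperature`` and the default of 50 mM Na+ are assumptions made for this example.

           .. code-block:: python

              import math


              def melting_temperature(sequence, na_molar=0.050):
                  """Estimate the Tm (degrees Celsius) of a primer.

                  sequence: primer sequence; only A, T, G and C are counted.
                  na_molar: sodium concentration in mol/L (assumed default: 50 mM).
                  """
                  seq = sequence.upper()
                  a, t, g, c = (seq.count(base) for base in "ATGC")
                  n = a + t + g + c
                  if n == 0:
                      raise ValueError("sequence contains no A, T, G or C")
                  salt = 16.6 * math.log10(na_molar)
                  if n < 14:
                      # Short primers: 2 degrees per A/T and 4 degrees per G/C.
                      return (a + t) * 2 + (g + c) * 4 - 16.6 * math.log10(0.050) + salt
                  # Primers of 14 nucleotides or more.
                  return 100.5 + 41.0 * (g + c) / n - 820.0 / n + salt


              tm_short = melting_temperature("ATGCATGC")
              tm_long = melting_temperature("ATGCATGCATGCATGCATGC")

           At 50 mM Na+ the two salt terms of the short-primer formula cancel, so the 8-mer above evaluates to 24 (up to floating-point rounding).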

	**K-nearest neighbors implementation in 'Search profile'**
	
	   SPInDel profiles of unknown origin can be predicted by a k-nearest neighbor method using a database of known profiles. The k-nearest neighbor algorithm 
	   is a supervised learning approach that finds the k closest matches in a database of known profiles using a distance metric. 
	   SPInDel uses the discrete metric:
	
	   if x = y then d(x,y) = 0; otherwise, d(x,y) = 1.

	   .. raw:: html

                We implemented the algorithm using Biopython and added the discrete distance metric. 
                Classification accuracy can be estimated within the SPInDel workbench by testing the performance of the k-nearest neighbor model with a modified
                leave-one-out cross-validation on profiles from known species. The modified leave-one-out cross-validation ensures that classes
                with only one species or genus are not removed from the reference set. 
                To run the leave-one-out cross-validation test, profile labels in the dataset should follow this format: 
                "taxonomic-level-1 + underscore + taxonomic-level-2".
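
           The search and its cross-validation can be illustrated with a plain-Python 3 sketch. It does not reproduce the Biopython-based implementation of the workbench. The function names are hypothetical, and two points are assumptions: the discrete metric is applied to each hypervariable region and summed over the profile, and the class used in cross-validation is the part of the label before the underscore.

           .. code-block:: python

              from collections import Counter


              def discrete_distance(profile_a, profile_b):
                  """Sum of d(x, y) over regions, with d = 0 if x == y and 1 otherwise."""
                  return sum(1 for x, y in zip(profile_a, profile_b) if x != y)


              def knn_predict(target, reference, k=1):
                  """Return the majority label among the k profiles closest to target.

                  reference: list of (label, profile) pairs.
                  """
                  ranked = sorted(reference,
                                  key=lambda item: discrete_distance(target, item[1]))
                  votes = Counter(label for label, _ in ranked[:k])
                  return votes.most_common(1)[0][0]


              def leave_one_out_accuracy(dataset, k=1):
                  """Modified leave-one-out: classes with a single member are never left out."""
                  classes = [label.split("_")[0] for label, _ in dataset]
                  class_sizes = Counter(classes)
                  tested = correct = 0
                  for i, (_, profile) in enumerate(dataset):
                      if class_sizes[classes[i]] < 2:
                          continue  # keep singleton classes in the reference set
                      reference = [(classes[j], p)
                                   for j, (_, p) in enumerate(dataset) if j != i]
                      tested += 1
                      if knn_predict(profile, reference, k) == classes[i]:
                          correct += 1
                  return correct / tested if tested else float("nan")


              # Hypothetical dataset with labels in the "level-1_level-2" format.
              dataset = [
                  ("Canis_lupus", (100, 200, 300)),
                  ("Canis_latrans", (100, 200, 301)),
                  ("Felis_catus", (150, 250, 350)),
                  ("Felis_silvestris", (150, 250, 351)),
                  ("Ursus_arctos", (90, 180, 270)),
              ]
              accuracy = leave_one_out_accuracy(dataset, k=1)
              best_match = knn_predict((150, 250, 999), dataset, k=1)

           In this example the single Ursus profile is never left out, and each of the four remaining profiles is assigned to its own genus when removed from the reference set.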

7. SPInDel Versions
===================

	**Version 1.1 Win32** (1 February 2012)
	
	New features:
		
		- Three taxonomic groups added to database.

        Bug fixes:

		- Multiple graphics output with PyQt4 design.

	**Version 1.0.1 Win32 and Linux** (19 July 2010)
	
	New features:
		
		- Leave-one-out cross validation in search profile module.
	

	**Version 1.0 Linux32** (13 May 2010)


	**Version 1.0 Win32** (20 April 2010)

	
8. Support
==========

   If you experience problems with the SPInDel workbench, please contact jcarneiro@ipatimup.pt or fpereira@ipatimup.pt.