CoreGenes 5.0

Enter one accession number per text box; at least two accession numbers are required to submit. Accession numbers must already be available on the NCBI website. For datasets not yet on NCBI, use the Custom Dataset option on the left side of the site.

Larger genomes may take longer to process. The optional email notification sends the user a link when the results are available, and also provides a downloadable .csv-formatted version of the results page.

Results are displayed in a table, with each column listing the core genes in one of the input genomes. The top row of the table shows the name and accession of each input genome. Hyperlinked accession numbers open the associated NCBI page.
The download link provides a .csv version of the displayed table, which can be opened in Excel or other spreadsheet software.
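
For users who prefer to process the downloaded .csv file with a script, the following is a minimal Python sketch. It assumes only what is described above: the first row holds one header per input genome, and each column below it lists that genome's core genes. The function name read_core_gene_table and the sample content are made up for illustration; a real export may differ in detail.

```python
import csv
import io


def read_core_gene_table(handle):
    """Return {column header: [non-empty cells]} for a column-per-genome CSV."""
    rows = list(csv.reader(handle))
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    columns = {name: [] for name in header}
    for row in body:
        # zip stops at the shorter of header/row, so ragged rows are tolerated.
        for name, cell in zip(header, row):
            if cell.strip():
                columns[name].append(cell.strip())
    return columns


# Made-up sample content standing in for a downloaded results file.
sample = io.StringIO(
    "Genome A (ACC_A),Genome B (ACC_B)\n"
    "protein a1,protein b1\n"
    "protein a2,protein b2\n"
)
table = read_core_gene_table(sample)
```

To read an actual download, open the file with open(path, newline="") and pass the file object in place of the sample.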

CDS Lookup

The CDS Lookup option provides an easy-to-use interface for obtaining the coding sequences for valid NCBI accession numbers. The sequences are returned in a zipped folder of .fasta-formatted files. This tool also lets the user see each gene used for analysis in the input genome.

File Upload

For queries with more than 20 accession numbers that are available on NCBI, the File Upload option can be used to upload .txt files of comma-separated accession numbers. Either CoreGenes v.5.0 or v.5.0-IA (Iterative Algorithm) can be used with the File Upload option.

With either CoreGenes v.5.0 or v.5.0-IA, the results are displayed in the same format as the CoreGenes 5.0 table.
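
As an illustration of preparing an upload file, the Python sketch below builds the comma-separated text from a list of accession numbers. The function name format_accession_file and the placeholder accessions are hypothetical. The sketch also assumes that the two-accession minimum applies to uploads and that no spaces are wanted after the commas; neither is confirmed by this page.

```python
def format_accession_file(accessions):
    """Join accession numbers with commas, dropping blanks and duplicates."""
    cleaned = []
    for acc in accessions:
        acc = acc.strip()
        if acc and acc not in cleaned:
            cleaned.append(acc)
    # Assumption: the two-accession minimum also applies to uploaded files.
    if len(cleaned) < 2:
        raise ValueError("at least two accession numbers are required")
    return ",".join(cleaned)


# Placeholder values, not real NCBI accession numbers.
text = format_accession_file(["ACCESSION_1", " ACCESSION_2 ", "", "ACCESSION_1"])
# The text can then be saved, for example with:
#   pathlib.Path("accessions.txt").write_text(text)
```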

Custom Dataset

The Custom Dataset option is used for datasets that are not yet available on NCBI. Two or more .fasta files can be uploaded, with each header in the format >Accession (space) Protein Name. Files must follow this format exactly to be processed. Both CoreGenes 5.0 and 5.0-IA are available for custom datasets.
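
Because the header format matters, it can help to check files before uploading. The Python sketch below flags header lines that do not look like >Accession (space) Protein Name. The regular expression is one interpretation of that format (an accession without spaces, a single space, then a non-empty protein name), and the function name check_fasta_headers is hypothetical; the site itself may be stricter or more lenient.

```python
import re

# Assumed reading of the format: ">" + accession (no spaces) + one space + name.
HEADER = re.compile(r"^>(\S+) (\S.*)$")


def check_fasta_headers(text):
    """Return (line_number, line) for each header line that does not match."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">") and not HEADER.match(line.rstrip()):
            problems.append((number, line))
    return problems


# Made-up FASTA content: the second header lacks a protein name.
sample = ">P1 capsid protein\nMKT\n>P2\nMAA\n"
problems = check_fasta_headers(sample)
```

An empty list means every header matched the assumed pattern.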

Results for a Custom Dataset are similar to the CoreGenes 5.0 table, but without the hyperlinks. The top row of the table shows the name of each uploaded file, with each column showing the core genes in that file.

Iterative Comparison Algorithm

The Iterative Comparison Algorithm uses the same iterative algorithm as CoreGenes 3.5, with the first accession number serving as the initial reference genome. The main change is in the sequence comparison method, which switches from Washington University BLAST (WU-BLAST) to MMseqs2. The newer algorithm provides faster results and allows for larger input genomes; however, CoreGenes 5.0 Group Clustering is better optimized for large genomes.
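
The general idea of an iterative, reference-based comparison can be sketched in Python as follows. This is a simplified illustration, not the CoreGenes implementation: it assumes that the reference set starts as the proteins of the first genome and is narrowed after each comparison to the proteins that found a match. The function iterative_core and the toy same_name matcher are hypothetical; in the real tool, matches are determined by MMseqs2 sequence comparison rather than by name.

```python
def iterative_core(genomes, is_homolog):
    """Narrow the first genome's proteins to those matched in every other genome.

    genomes: list of protein lists; the first list is the initial reference.
    is_homolog: function (reference_protein, query_protein) -> bool.
    """
    if not genomes:
        return []
    reference = list(genomes[0])
    for genome in genomes[1:]:
        # Keep only reference proteins with at least one match in this genome;
        # the survivors become the reference for the next comparison.
        reference = [p for p in reference if any(is_homolog(p, q) for q in genome)]
    return reference


def same_name(a, b):
    # Toy stand-in for a sequence similarity search (assumption for this sketch).
    return a.lower() == b.lower()


genomes = [
    ["polA", "capsid", "tail"],
    ["capsid", "polA"],
    ["polA", "lysin"],
]
core = iterative_core(genomes, same_name)
```

In this sketch the result is always a subset of the first genome's proteins, which is why the choice of the first accession number can matter.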

The results table has the same format for all versions. Hypothetical proteins are highlighted in red so they are easy to locate.