Data Resources

Databases in Bioinformatics and Genomics

The Georgakopoulos-Soares Lab contributes to and maintains databases and computational tools that capture genomic variation and regulatory elements. Explore these resources to support your own research and collaborations.

Featured datasets & portals

Each entry includes key highlights, scale, and keywords to help you identify the best fit for your work.

kmerDB: A database encompassing genomic and proteomic k-mers

kmerDB is a database of genomic and proteomic k-mers (subsequences of length k), allowing researchers to analyze short stretches of DNA and protein sequence. K-mers are fundamental to bioinformatics and are used in sequence alignment, genome assembly, motif discovery, and evolutionary analysis. This resource enables large-scale querying and cross-species comparisons.

Keywords

k-mers
genomics
proteomics
bioinformatics

Highlights

Includes both DNA and peptide k-mers
Supports motif discovery and functional associations
Useful for metagenomics and proteomics
Provides a unified k-mer search framework

Scale & coverage

Billions of k-mer entries (exact count not specified)
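
To make the k-mer concept concrete, here is a minimal sketch of how overlapping k-mers can be counted in a DNA or peptide sequence. The function name `count_kmers` and the toy sequences are assumptions for illustration only; this is not kmerDB's own code or API.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence.

    Hypothetical helper for illustration; works for DNA and peptide strings alike.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy inputs, not taken from any database.
dna_counts = count_kmers("ATGCGATGCA", 3)
peptide_counts = count_kmers("MKVLAAGMKV", 2)
print(dna_counts.most_common(2))
print(peptide_counts.most_common(2))
```

Because the same function handles nucleotide and amino-acid alphabets, it mirrors the idea of a unified k-mer view across genomes and proteomes, though a production resource would rely on far more compact data structures.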

MPRAbase: A Massively Parallel Reporter Assay database

MPRAbase is a curated database of Massively Parallel Reporter Assay (MPRA) results. It consolidates experimental data on the activity of regulatory DNA elements across different cell types and species. This allows functional analysis of gene regulation and the impact of variants on enhancer and promoter activity.

Keywords

MPRA
functional genomics
regulatory elements
gene expression

Highlights

Contains 130 MPRA experiments
Covers 35 cell types across 4 species
Over 17 million tested elements
Provides downloadable datasets and query interface

Scale & coverage

17,718,677 elements
130 experiments
35 cell types
4 organisms
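
As a rough illustration of what "activity" means for a tested element, MPRA analyses commonly summarize it as the log2 ratio of RNA (transcribed) to DNA (input) counts. The sketch below assumes that convention; the function name `mpra_activity`, the pseudocount, and the toy counts are hypothetical, and it does not reflect how MPRAbase itself processes or stores data.

```python
import math


def mpra_activity(rna_count: float, dna_count: float, pseudocount: float = 1.0) -> float:
    """Return log2((RNA + pseudocount) / (DNA + pseudocount)) for one element.

    Simplified on purpose: real pipelines usually also normalize for
    sequencing depth and aggregate over barcodes and replicates.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


# Toy counts for two hypothetical elements.
active = mpra_activity(rna_count=7, dna_count=1)    # RNA enriched over DNA
neutral = mpra_activity(rna_count=3, dna_count=3)   # no enrichment
print(active, neutral)
```

Positive values indicate that an element drives more reporter expression than expected from its input abundance; values near zero indicate little regulatory activity under this simplified measure.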

metagRoot: A comprehensive database of protein families in plant root microbiomes

metagRoot compiles protein families from plant root microbiomes, enabling functional analysis of microbial communities associated with roots. It allows researchers to explore how different protein families contribute to nutrient cycling, symbiosis, and plant health, and to compare microbiomes across plant species and conditions.

Keywords

root microbiome
protein families
metagenomics
plant–microbe interactions

Highlights

Focus on root-associated microbiomes
Catalog of protein families from metagenomic data
Links microbial proteins to ecological and functional roles
Supports cross-species comparisons of microbiomes

Scale & coverage

Tens of thousands of protein families (exact count not specified)

Darling v2.0: Mining disease-related literature

Darling v2.0 is a web-based application that mines biomedical literature and databases to extract associations between genes, diseases, drugs, and proteins. It integrates multiple data sources and applies advanced text mining techniques to provide structured knowledge graphs for biomedical research.

Keywords

text mining
biomedical informatics
disease associations
knowledge discovery

Highlights

Extracts gene–disease–drug relationships
Integrates multiple biomedical databases
Supports visualization of associations
Useful for drug discovery and bioinformatics pipelines

Scale & coverage

Tens to hundreds of thousands of associations (exact count not specified)

invertiaDB: A database of inverted repeats across genomes

invertiaDB provides a comprehensive collection of inverted repeats (IRs) from over 118,000 genomes, totaling more than 30 million sequences. IRs can form hairpins and cruciform DNA structures, which affect genome stability, replication, and rearrangements. The database supports advanced filtering and large-scale downloads.

Keywords

inverted repeats
DNA secondary structures
genome instability
hairpins

Highlights

Over 30 million inverted repeats
Covers >118,000 genomes
Searchable by length, GC content, and spacer size
Correlation observed between IR density and bacterial growth temperature

Scale & coverage

30,067,666 inverted repeats
118,070 genomes
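
The properties used for filtering (arm length, spacer size, GC content) can be illustrated with a naive scan for inverted repeats: an arm followed, after a short spacer, by its reverse complement. This is a minimal sketch with assumed names (`find_inverted_repeats`, `gc_content`) and fixed arm length; it is not the detection method used to build invertiaDB.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_inverted_repeats(seq: str, arm_len: int = 4, max_spacer: int = 3):
    """Return (start, arm, spacer) for each arm whose reverse complement
    occurs within max_spacer bases downstream. Arms are a fixed length here;
    a real tool would extend arms to their maximal length."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        arm = seq[start:start + arm_len]
        if set(arm) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        target = reverse_complement(arm)
        for spacer in range(max_spacer + 1):
            right = start + arm_len + spacer
            if seq[right:right + arm_len] == target:
                hits.append((start, arm, seq[start + arm_len:right]))
    return hits


# Toy sequence: arm GGCA, spacer TT, then TGCC (the reverse complement of GGCA).
hits = find_inverted_repeats("AAGGCATTTGCCAA")
print(hits)
print(gc_content("GGCA"))
```

When single-stranded, the two arms of such a hit can pair with each other to form a hairpin stem, with the spacer forming the loop.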

Microsatellites Explorer: A database of short tandem repeats across genomes

Microsatellites Explorer compiles short tandem repeats (STRs, microsatellites) from more than 117,000 genomes. STRs are repetitive DNA elements that are important markers for population genetics, forensics, and evolutionary biology. The platform provides a searchable interface and supports statistical and comparative analysis of STR distribution.

Keywords

short tandem repeats
microsatellites
population genetics
genomic variation

Highlights

Database of STRs from more than 117,000 genomes
Covers >117,000 organisms
Supports research in diversity, forensics, and comparative genomics
Provides visualization and download features

Scale & coverage

~117,253 genomes
Millions of STR sequences (exact count not specified)
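
For readers new to STRs, the sketch below shows one simple way to locate perfect tandem repeats with a regular expression. The function name `find_strs` and its thresholds (motifs of 1 to 6 bp, at least 3 copies) are assumptions chosen for illustration; they are not the criteria used by Microsatellites Explorer.

```python
import re


def find_strs(seq: str, min_unit: int = 1, max_unit: int = 6, min_copies: int = 3):
    """Return (start, unit, copies) for each perfect tandem repeat.

    The lazy quantifier makes the regex prefer the shortest repeating unit,
    so CACACACA is reported as CA x 4 rather than CACA x 2.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    results = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        results.append((match.start(), unit, copies))
    return results


# Toy sequence containing (CA)4 and (GAT)3.
strs = find_strs("GCACACACAGGATGATGATC")
print(strs)
```

This only finds exact, uninterrupted repeats; imperfect or interrupted microsatellites would need a more tolerant approach.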

Quadrupia: A comprehensive catalog of G-quadruplexes across genomes from the tree of life

Quadrupia provides a large-scale catalog of G-quadruplexes (G4s) across more than 100,000 genomes spanning the tree of life. G4s are non-canonical DNA structures formed by stacked guanine tetrads and play roles in gene regulation and genome stability. This resource allows comparative and evolutionary analyses of G4 distribution across species.

Keywords

G-quadruplex
DNA structures
genome annotation
comparative genomics

Highlights

Systematic mapping of G4 sequences across species
Over 140 million G4 structures identified
Supports comparative and regulatory genomics
Useful for studying DNA stability and transcriptional regulation

Scale & coverage

~108,449 genomes
~140,181,277 G4 sequences
~319,784 G4 clusters
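
As an illustration of what a putative G4 sequence looks like, the sketch below scans for the widely used consensus motif of four runs of at least three guanines separated by loops of 1 to 7 bases. The function name `find_g4_motifs` is hypothetical, and this motif search is only one common heuristic; it is not presented as the detection pipeline behind Quadrupia.

```python
import re

# Four G-runs (>= 3 Gs each) separated by three loops of 1-7 bases.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_g4_motifs(seq: str):
    """Return (start, motif) for each non-overlapping consensus G4 motif
    on the given strand. To cover the opposite strand, run the same scan
    on the reverse complement."""
    return [(m.start(), m.group(0)) for m in G4_PATTERN.finditer(seq.upper())]


# Toy sequence with a single putative G4 motif.
g4_hits = find_g4_motifs("TTGGGAGGGTGGGCAGGGTT")
print(g4_hits)
```

A sequence match indicates potential to fold into a G4 rather than proof of structure formation, which depends on conditions such as ion concentration and sequence context.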