Data Resources

Databases in Bioinformatics and Genomics

The Georgakopoulos-Soares Lab contributes to and maintains databases that capture genomic variation, regulatory elements, and computational tools. Explore these resources to support your own research and collaborations.

Featured datasets & portals

Each entry includes key highlights, scale, and keywords to help you identify the best fit for your work.

kmerDB: A database encompassing genomic and proteomic k-mers

kmerDB is a database of genomic and proteomic k-mers, allowing researchers to analyze sub-sequences of DNA and proteins. K-mers are fundamental to bioinformatics, used in sequence alignment, genome assembly, motif discovery, and evolutionary analysis. This resource enables large-scale querying and cross-species comparisons.
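
To make the idea of a k-mer concrete, here is a minimal Python sketch that enumerates the overlapping sub-sequences of length k in a sequence and compares two sequences by their shared k-mers. It is an illustration only and does not use or represent the kmerDB interface; the function names `count_kmers` and `shared_kmers` are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA or protein sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers
    # (none when the sequence is shorter than k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(seq_a: str, seq_b: str, k: int) -> set:
    """Return the k-mers that occur in both sequences."""
    return set(count_kmers(seq_a, k)) & set(count_kmers(seq_b, k))


if __name__ == "__main__":
    print(count_kmers("ATGATG", 3))
    print(shared_kmers("ATGATG", "GATTACA", 3))
```

The same functions work on peptide strings, since nothing in the counting step depends on the alphabet.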

Keywords

  • k-mers
  • genomics
  • proteomics
  • bioinformatics

Highlights

  • Includes both DNA and peptide k-mers
  • Supports motif discovery and functional associations
  • Useful for metagenomics and proteomics
  • Provides a unified k-mer search framework

Scale & coverage

  • Billions of k-mer entries (exact count not specified)

MPRAbase: A Massively Parallel Reporter Assay database

MPRAbase is a curated database of Massively Parallel Reporter Assay (MPRA) results. It consolidates experimental data on the activity of regulatory DNA elements across different cell types and species. This allows functional analysis of gene regulation and the impact of variants on enhancer and promoter activity.
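
As a rough illustration of what "activity" and "variant impact" can mean in an MPRA, the sketch below uses one common convention: activity as the log2 ratio of RNA to DNA barcode counts, and a variant effect as the difference in activity between the alternate and reference alleles. This is an assumption for illustration, not a description of how MPRAbase processes its datasets; the function names and the pseudocount are hypothetical, and library-size normalization is omitted for brevity.

```python
import math


def mpra_activity(rna_count: float, dna_count: float, pseudocount: float = 1.0) -> float:
    """Log2 RNA/DNA ratio for one tested element (pseudocount avoids log of zero)."""
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


def variant_effect(ref_counts, alt_counts, pseudocount: float = 1.0) -> float:
    """Activity of the alternate allele minus activity of the reference allele.

    Each argument is an (rna_count, dna_count) pair.
    """
    ref_activity = mpra_activity(ref_counts[0], ref_counts[1], pseudocount)
    alt_activity = mpra_activity(alt_counts[0], alt_counts[1], pseudocount)
    return alt_activity - ref_activity


if __name__ == "__main__":
    # Hypothetical counts, not taken from any MPRAbase experiment.
    print(mpra_activity(7, 1))
    print(variant_effect(ref_counts=(7, 1), alt_counts=(3, 1)))
```

A negative effect under this convention means the alternate allele drives less reporter expression than the reference.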

Keywords

  • MPRA
  • functional genomics
  • regulatory elements
  • gene expression

Highlights

  • Contains 130 MPRA experiments
  • Covers 35 cell types across 4 species
  • Over 17 million tested elements
  • Provides downloadable datasets and query interface

Scale & coverage

  • 17,718,677 elements
  • 130 experiments
  • 35 cell types
  • 4 organisms

Quadrupia: A comprehensive catalog of G-quadruplexes across genomes from the tree of life

Quadrupia provides a large-scale catalog of G-quadruplexes (G4s) across more than 100,000 genomes spanning the tree of life. G4s are non-canonical DNA structures formed by stacked guanine tetrads and play roles in gene regulation and genome stability. This resource allows comparative and evolutionary analyses of G4 distribution across species.
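
For readers unfamiliar with how candidate G-quadruplex sequences are typically located, the sketch below scans a sequence with the widely used consensus motif of four runs of three or more guanines separated by loops of 1 to 7 bases. This is a generic regular-expression approach offered as an assumption; it is not necessarily the method or the parameters used to build Quadrupia, and `find_g4_motifs` is a hypothetical name.

```python
import re

# Four G-runs (>= 3 Gs) separated by loops of 1-7 bases, on the given strand.
G_RICH = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")
# The same motif on the opposite strand appears as C-runs.
C_RICH = re.compile(r"(?:C{3,}[ACGT]{1,7}){3}C{3,}")


def find_g4_motifs(sequence: str) -> list:
    """Return (start, end, strand, motif) tuples for non-overlapping matches."""
    seq = sequence.upper()
    hits = []
    for strand, pattern in (("+", G_RICH), ("-", C_RICH)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand, match.group()))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_g4_motifs("TTGGGAGGGTGGGAGGGTT"):
        print(hit)
```

Coordinates are 0-based and end-exclusive. Because `finditer` reports non-overlapping matches, overlapping candidates on the same strand are collapsed to the first one found.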

Keywords

  • G-quadruplex
  • DNA structures
  • genome annotation
  • comparative genomics

Highlights

  • Systematic mapping of G4 sequences across species
  • Over 140 million G4 structures identified
  • Supports comparative and regulatory genomics
  • Useful for studying DNA stability and transcriptional regulation

Scale & coverage

  • ~108,449 genomes
  • ~140,181,277 G4 sequences
  • ~319,784 G4 clusters

metagRoot: A comprehensive database of protein families in plant root microbiomes

metagRoot compiles protein families from plant root microbiomes, enabling functional analysis of microbial communities associated with roots. It allows researchers to explore how different protein families contribute to nutrient cycling, symbiosis, and plant health, and to compare microbiomes across plant species and conditions.

Keywords

  • root microbiome
  • protein families
  • metagenomics
  • plant–microbe interactions

Highlights

  • Focus on root-associated microbiomes
  • Catalog of protein families from metagenomic data
  • Links microbial proteins to ecological and functional roles
  • Supports cross-species comparisons of microbiomes

Scale & coverage

  • Tens of thousands of protein families (exact count not specified)

Darling v2.0: Mining disease-related literature

Darling v2.0 is a web-based application that mines biomedical literature and databases to extract associations between genes, diseases, drugs, and proteins. It integrates multiple data sources and applies advanced text mining techniques to provide structured knowledge graphs for biomedical research.

Keywords

  • text mining
  • biomedical informatics
  • disease associations
  • knowledge discovery

Highlights

  • Extracts gene–disease–drug relationships
  • Integrates multiple biomedical databases
  • Supports visualization of associations
  • Useful for drug discovery and bioinformatics pipelines

Scale & coverage

  • Tens to hundreds of thousands of associations (exact count not specified)

invertiaDB: A database of inverted repeats across genomes

invertiaDB provides a comprehensive collection of inverted repeats (IRs) from over 118,000 genomes, totaling more than 30 million sequences. IRs can form hairpins and cruciform DNA structures, which affect genome stability, replication, and rearrangements. The database supports advanced filtering and large-scale downloads.
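
An inverted repeat consists of two arms, the second being the reverse complement of the first, optionally separated by a spacer. The brute-force sketch below finds such pairs for a fixed arm length and reports the properties the database lists as searchable (length, GC content, spacer size). It is a simplified illustration under stated assumptions (exact matches only, fixed arm length), not the pipeline behind invertiaDB; all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_inverted_repeats(sequence: str, arm_length: int, max_spacer: int) -> list:
    """Find exact inverted repeats with arms of arm_length and spacer <= max_spacer."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_length + 1):
        left = seq[start:start + arm_length]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_length + spacer
            right = seq[right_start:right_start + arm_length]
            if len(right) < arm_length:
                break  # ran off the end of the sequence
            if right == target:
                full = seq[start:right_start + arm_length]
                hits.append({
                    "start": start,
                    "arm": left,
                    "spacer": spacer,
                    "length": len(full),
                    "gc": gc_content(full),
                })
    return hits


if __name__ == "__main__":
    for hit in find_inverted_repeats("TTGCGCATTTTTATGCGCTT", arm_length=6, max_spacer=5):
        print(hit)
```

A spacer of zero corresponds to a perfect palindrome, which can extrude as a cruciform; a non-zero spacer leaves an unpaired loop at the tip of each hairpin.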

Keywords

  • inverted repeats
  • DNA secondary structures
  • genome instability
  • hairpins

Highlights

  • Over 30 million inverted repeats
  • Covers >118,000 genomes
  • Searchable by length, GC content, and spacer size
  • Correlation observed between IR density and bacterial growth temperature

Scale & coverage

  • 30,067,666 inverted repeats
  • 118,070 genomes

Microsatellites Explorer: A database of short tandem repeats across genomes

Microsatellites Explorer compiles short tandem repeats (STRs, microsatellites) from more than 117,000 genomes. STRs are repetitive DNA elements that are important markers for population genetics, forensics, and evolutionary biology. The platform provides a searchable interface and supports statistical and comparative analysis of STR distribution.
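
To show what counts as a short tandem repeat, the following sketch uses a back-referencing regular expression to find a unit of 1 to 6 bases repeated consecutively at least a minimum number of times. The unit-size range and the repeat threshold are illustrative assumptions, not the criteria used by Microsatellites Explorer, and `find_strs` is a hypothetical name.

```python
import re


def find_strs(sequence: str, min_unit: int = 1, max_unit: int = 6, min_repeats: int = 3) -> list:
    """Return (start, end, unit, repeat_count) for perfect tandem repeats."""
    seq = sequence.upper()
    # The lazy quantifier tries the shortest unit first; \1 then requires
    # at least (min_repeats - 1) further consecutive copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        repeats = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, repeats))
    return hits


if __name__ == "__main__":
    for hit in find_strs("TTACACACACGG"):
        print(hit)
```

Only perfect repeats are detected; interrupted or imperfect microsatellites would need a more tolerant approach.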

Keywords

  • short tandem repeats
  • microsatellites
  • population genetics
  • genomic variation

Highlights

  • Catalogs STRs from more than 117,000 genomes
  • Supports research in diversity, forensics, and comparative genomics
  • Provides visualization and download features

Scale & coverage

  • ~117,253 genomes
  • Millions of STR sequences (exact count not specified)