CTGA: a web-based functional genomic resource for Cyamopsis tetragonoloba (L.) Taub.

Abstract

BACKGROUND: Guar (Cyamopsis tetragonoloba), an industrially important crop, is valued for the galactomannan gum derived from its seeds. Recent advances in genomic and transcriptomic research have produced valuable resources, including a reference genome and several sets of gene expression profiles. However, these data are fragmented, and accessing and analyzing them requires bioinformatics expertise. Although several genome assemblies have recently been published, no bioinformatics platform is currently dedicated to guar genomics and transcriptomics.

AIM: To address this challenge, we have developed CTGA, a comprehensive functional genomic web portal for guar.

METHODS: The backend and frontend of the genomic platform were developed using Flask together with widely used Python, CSS, and HTML libraries.

RESULTS: We performed a de novo structural and functional annotation of the guar genome, predicting 57,019 protein-coding genes with UTRs. In addition, expression data from 85 public RNA-seq libraries representing various tissues and conditions were collected to create a normalized gene expression atlas. CTGA provides an intuitive web interface with interactive tools: a genome browser (IGV), BLAST for homology searches, a Gene Ontology enrichment analysis tool, a tool for working with guar genomic sequences, and a heatmap generator for analyzing guar gene expression across tissues and experimental conditions. It also includes detailed functional annotations from several sources (eggNOG, Mercator4, GO, and KEGG) and instant visualization of gene expression profiles.

CONCLUSION: CTGA is available at: https://guar.arriam.ru/

BACKGROUND

Guar (Cyamopsis tetragonoloba (L.) Taub.) is an important industrial, feed, and food crop globally, primarily valued for its seed endosperm gum, a storage polysaccharide with extensive applications in the food, oil, textile, pharmaceutical, and cosmetic industries [1].

Traditional guar breeding was based on phenotypic selection, but with the advent of next-generation sequencing (NGS), genomic and transcriptomic approaches have emerged. Recent research has focused on elucidating the molecular mechanisms of galactomannan biosynthesis. One of the first studies, by M. Naoumkina et al. [2], identified key candidate genes using cDNA libraries from developing seeds.

Subsequent RNA-seq studies comparing guar varieties with varying gum yields revealed that the expression of mannan synthase and sucrose synthase peaks during the mid-stage of seed development, coinciding with gum accumulation [3]. Y. Hu et al. (2019), using quantitative RNA-Seq, highlighted the role of the cellulose synthase-like A (CsLA) gene family, which includes mannan synthase [4]. These findings were further supported by Sharma and coauthors, who provided spatio-temporal insights into galactomannan regulation [5].

A significant advancement was made with the first genome assembly by Gaikwad and coauthors [6]. This enabled the precise mapping of genes involved in galactomannan biosynthesis and their regulatory elements. In parallel, efforts have expanded genomic resources, including the development of transcriptome-derived single nucleotide polymorphism (SNP) markers [7, 8]. Grigoreva and colleagues created an SNP panel for use in marker-assisted selection, utilizing a draft genome sequence [9].

Research has expanded to include traits other than gum production. Integrated transcriptome and metabolome analyses have identified genes and metabolites associated with flowering time [10, 11]. Furthermore, the complete chloroplast genome has been sequenced, facilitating phylogenetic studies [12].

Collectively, these advances have transformed guar from an understudied crop into a molecularly characterized one, with foundational resources such as a reference genome, expression profiles, and molecular markers. However, working with these data requires specialized skills for access and analysis. To streamline guar research, we have developed a user-friendly web-based platform that includes an interactive genome browser, a BLAST service [13], functional gene annotations, expression profiles from all publicly available RNA-Seq data, and other useful tools for working with the genomic sequence. CTGA is available at https://guar.arriam.ru/.

METHODS

Genome sequence retrieval and structural reannotation

To reannotate the genes in the C. tetragonoloba genome [14], the reference assembly in FASTA format was downloaded from the National Center for Biotechnology Information (NCBI) database (available under BioProject ID: PRJNA1055737 or GenBank ID: GCA_037177725.1). De novo gene annotation was performed using BRAKER2 (version 2.1.6) [15] with default parameters. BRAKER2 predicted gene structures automatically by combining ab initio evidence from GeneMark-EP+ [16] and AUGUSTUS [17] with RNA sequencing alignment data (85 samples in total; Table 1). To increase the completeness of the annotation, untranslated regions (UTRs) were additionally predicted using the functionality built into the BRAKER2/AUGUSTUS pipeline. As a result, a GFF3 file containing the coordinates of the predicted genes, mRNAs, exons, and UTRs was generated, along with the corresponding protein sequences. Aberrant CDS and UTR features were fixed or removed from the annotation using a custom Python script.
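The custom clean-up script is not described in detail, so the following is only a minimal sketch of one plausible sanity check on a GFF3 annotation: flagging transcripts whose summed CDS length is not a multiple of three. The function names and the example records are hypothetical and are not taken from the CTGA code base.

```python
from collections import defaultdict


def cds_length_by_transcript(gff3_text):
    """Sum CDS feature lengths per parent transcript in GFF3-formatted text."""
    totals = defaultdict(int)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        parent = attrs.get("Parent")
        if parent is not None:
            # GFF3 coordinates are 1-based and inclusive
            totals[parent] += end - start + 1
    return dict(totals)


def flag_aberrant(gff3_text):
    """Return transcript IDs whose total CDS length is not divisible by 3."""
    lengths = cds_length_by_transcript(gff3_text)
    return sorted(t for t, n in lengths.items() if n % 3 != 0)


# Hypothetical records for illustration only
example = "\n".join([
    "##gff-version 3",
    "chr1\tAUGUSTUS\tCDS\t101\t190\t.\t+\t0\tID=cds1;Parent=g1.t1",
    "chr1\tAUGUSTUS\tCDS\t301\t360\t.\t+\t0\tID=cds2;Parent=g1.t1",
    "chr1\tAUGUSTUS\tCDS\t501\t600\t.\t-\t0\tID=cds3;Parent=g2.t1",
])
print(flag_aberrant(example))
```

A check of this kind only detects candidates; whether a flagged model should be repaired or removed would still depend on the criteria of the actual script.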

 

Table 1. A brief overview of the datasets utilized for genome annotation and the construction of the expression atlas

Run | LibraryName | SampleName | BioProject | Tissue | Age | Treatment | Genotype
SRR10120601 | Salinity_Stress_1 | Salinity_Stress | PRJNA564412 | Leaf | 55_days | Salinity stress | BWP_5595
SRR10120602 | Drought_Stress_3 | Drought_Stress.2 | PRJNA564412 | Leaf | 55_days | Drought stress | BWP_5595
SRR10120603 | Drought_Stress_2 | Drought_Stress.1 | PRJNA564412 | Leaf | 55_days | Drought stress | BWP_5595
SRR10120604 | Drought_Stress_1 | Drought_Stress | PRJNA564412 | Leaf | 55_days | Drought stress | BWP_5595
SRR10120605 | Heat_Stress_3 | Heat_Stress.2 | PRJNA564412 | Leaf | 55_days | Heat stress | BWP_5595
SRR10120606 | Heat_Stress_2 | Heat_Stress.1 | PRJNA564412 | Leaf | 55_days | Heat stress | BWP_5595
SRR10120607 | Heat_Stress_1 | Heat_Stress | PRJNA564412 | Leaf | 55_days | Heat stress | BWP_5595
SRR10120608 | Control_3 | Control.2 | PRJNA564412 | Leaf | 55_days | Control | BWP_5595
SRR10120609 | Salinity_Stress_3 | Salinity_Stress.2 | PRJNA564412 | Leaf | 55_days | Salinity stress | BWP_5595
SRR10120610 | Salinity_Stress_2 | Salinity_Stress.1 | PRJNA564412 | Leaf | 55_days | Salinity stress | BWP_5595
SRR10120611 | Control_2 | Control.1 | PRJNA564412 | Leaf | 55_days | Control | BWP_5595
SRR10120612 | Control_1 | Control | PRJNA564412 | Leaf | 55_days | Control | BWP_5595
SRR12855380 | Drought_stressed_1_Cluster_bean_RGC-1025 | Clusterbean_RGC_1025_Droughtstress | PRJNA669348 | Leaf | 30_days | Drought stress | RGC-1025
SRR12855381 | Control_1_Cluster_bean_RGC-1025 | Clusterbean_RGC_1025_Droughtstress.1 | PRJNA669348 | Leaf | 30_days | Drought stress | RGC-1025
SRR13375879 | 4 | TF2_R1 | PRJNA687332 | Leaf | Seedling_7 | Control | MDU1_mutant
SRR13375880 | 3 | TF1_R1 | PRJNA687332 | Leaf | Seedling_5 | Control | MDU1_mutant
SRR13375881 | 2 | GC2_R1 | PRJNA687332 | Leaf | Seedling_3 | Control | MDU1_mutant
SRR13375882 | 1 | GC1_R1 | PRJNA687332 | Leaf | Seedling_1 | Control | MDU1_mutant
SRR15980315 | C03L-3 | C03L2 | PRJNA763938 | Leaf | 1_month | Control | Matador
SRR15980316 | C03L-2 | C03L2.1 | PRJNA763938 | Leaf | 1_month | Control | Matador
SRR15980317 | C03L-1 | C03L1 | PRJNA763938 | Leaf | 1_month | Control | Matador
SRR15987996 | C03R-3 | C03R2 | PRJNA763938 | Root | 1_month | Control | Matador
SRR15987998 | C03R-2 | C03R2.1 | PRJNA763938 | Root | 1_month | Control | Matador
SRR15988000 | C03R-1 | C03R1 | PRJNA763938 | Root | 1_month | Control | Matador
SRR16036104 | C03L-3 | C22L3 | PRJNA763938 | Leaf | 1_month | Control | PI-340261
SRR16036105 | C03L-2 | C22L2 | PRJNA763938 | Leaf | 1_month | Control | PI-340261
SRR16036106 | C03L-1 | C22L1 | PRJNA763938 | Leaf | 1_month | Control | PI-340261
SRR16089914 | C22R-3 | C22R3 | PRJNA763938 | Root | 1_month | Control | PI-340261
SRR16089915 | C22R-2 | C22R2 | PRJNA763938 | Root | 1_month | Control | PI-340261
SRR16089916 | C22R-1 | C22R1 | PRJNA763938 | Root | 1_month | Control | PI-340261
SRR16091634 | T03L-3 | T03L2 | PRJNA763938 | Leaf | 1_month | Salinity stress | Matador
SRR16091635 | T03L-2 | T03L2.1 | PRJNA763938 | Leaf | 1_month | Salinity stress | Matador
SRR16091636 | T03L-1 | T03L1 | PRJNA763938 | Leaf | 1_month | Salinity stress | Matador
SRR16098083 | T03R-3 | T03R2 | PRJNA763938 | Root | 1_month | Salinity stress | Matador
SRR16098084 | T03R-2 | T03R2.1 | PRJNA763938 | Root | 1_month | Salinity stress | Matador
SRR16098085 | T03R-1 | T03R1 | PRJNA763938 | Root | 1_month | Salinity stress | Matador
SRR16118530 | T22R-3 | T22R3 | PRJNA763938 | Root | 1_month | Salinity stress | PI-340261
SRR16118531 | T22R-2 | T22R2 | PRJNA763938 | Root | 1_month | Salinity stress | PI-340261
SRR16118532 | T22R-1 | T22R1 | PRJNA763938 | Root | 1_month | Salinity stress | PI-340261
SRR16131876 | T22L-3 | T22L3 | PRJNA763938 | Leaf | 1_month | Salinity stress | PI-340261
SRR16131877 | T22L-2 | T22L2 | PRJNA763938 | Leaf | 1_month | Salinity stress | PI-340261
SRR16131878 | T22L-1 | T22L1 | PRJNA763938 | Leaf | 1_month | Salinity stress | PI-340261
SRR22187661 | cib329-1 | cib329.1 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR22188060 | cib329-2 | cib329.1.1 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR22201601 | cib329-3 | cib329.3 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR22201687 | cib329-4 | cib329.4 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR22201863 | cib329-5 | cib329.5 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR22318672 | cib329-6 | cib329.6 | PRJNA898087 | Leaf | 0_days | NA | NA
SRR3218523 | NA | S.1 | PRJNA312055 | Leaf | 3_weeks | NA | M-83
SRR3729737 | H7WF5BBXX_HG365S1 | HG365stage1 | PRJNA326981 | Reproductive_Tissue | stage1 | NA | HG365
SRR3729738 | H7WF5BBXX_HG365S2 | HG365stage2 | PRJNA326981 | Reproductive_Tissue | stage2 | NA | HG365
SRR3729739 | H7WF5BBXX_HG870S5 | HG870stage5 | PRJNA326981 | Reproductive_Tissue | stage5 | NA | HG870
SRR3729740 | H7WF5BBXX_HG870S7 | HG870stage7 | PRJNA326981 | Reproductive_Tissue | stage7 | NA | HG870
SRR3729741 | H7WF5BBXX_Pods_Young | Pods_Young | PRJNA326981 | Pods_Young | NA | NA | NA
SRR3729742 | H7WF5BBXX_Pods_Mature | Pods_Mature | PRJNA326981 | Pods_Mature | NA | NA | NA
SRR3729745 | H7WF5BBXX_HG365S3 | HG365stage3 | PRJNA326981 | Reproductive_Tissue | stage3 | NA | HG365
SRR3729746 | H7WF5BBXX_HG365S4 | HG365stage4 | PRJNA326981 | Reproductive_Tissue | stage4 | NA | HG365
SRR3729747 | H7WF5BBXX_HG365S5 | HG365stage5 | PRJNA326981 | Reproductive_Tissue | stage5 | NA | HG365
SRR3729748 | H7WF5BBXX_HG365S7 | HG365stage7 | PRJNA326981 | Reproductive_Tissue | stage7 | NA | HG365
SRR3729749 | H7WF5BBXX_HG870S1 | HG870stage1 | PRJNA326981 | Reproductive_Tissue | stage1 | NA | HG870
SRR3729750 | H7WF5BBXX_HG870S2 | HG870stage2 | PRJNA326981 | Reproductive_Tissue | stage2 | NA | HG870
SRR3729751 | H7WF5BBXX_HG870S3 | HG870stage3 | PRJNA326981 | Reproductive_Tissue | stage3 | NA | HG870
SRR3729752 | H7WF5BBXX_HG870S4 | HG870stage4 | PRJNA326981 | Reproductive_Tissue | stage4 | NA | HG870
SRR5204324 | NA | S.1.1 | PRJNA312055 | Leaf | 3_weeks | NA | M-83
SRR5428802 | CT_S | Cyamopsis_tetragonoloba | PRJNA382073 | Shoot | Young | NA | RGC-936
SRR5428803 | CT_F | Cyamopsis_tetragonoloba.1 | PRJNA382073 | Flower | Young | NA | RGC-936
SRR5428804 | CT_L | Cyamopsis_tetragonoloba.2 | PRJNA382073 | Leaf | Young | NA | RGC-936
SRR7785593 | ESP_30 | ESP_30 | PRJNA486400 | endosperm | 30DAF | NA | CT1
SRR7785594 | EMBCOL_40_3 | EMBCOL_40_3 | PRJNA486400 | embryo | 40DAF | NA | CT1
SRR7785595 | EMBCOL_30_3 | EMBCOL_30_3 | PRJNA486400 | embryo | 30DAF | NA | CT1
SRR7785596 | EMBCOL_30_2 | EMBCOL_30_2 | PRJNA486400 | embryo | 30DAF | NA | CT1
SRR7785597 | EMBCOL_40_2 | EMBCOL_40_2 | PRJNA486400 | embryo | 40DAF | NA | CT1
SRR7785598 | EMBCOL_40_1 | EMBCOL_40_1 | PRJNA486400 | embryo | 40DAF | NA | CT1
SRR7785599 | SEED_20_2 | SEED_20_2 | PRJNA486400 | seed | 20DAF | NA | CT1
SRR7785600 | SEED_20_1 | SEED_20_1 | PRJNA486400 | seed | 20DAF | NA | CT1
SRR7785601 | EMBCOL_30_1 | EMBCOL_30_1 | PRJNA486400 | embryo | 30DAF | NA | CT1
SRR7785602 | SEED_20_3 | SEED_20_3 | PRJNA486400 | seed | 20DAF | NA | CT1
SRR7785603 | ESP_40 | ESP_40 | PRJNA486400 | endosperm | 40DAF | NA | CT1
SRR8082057 | Not_applicable | Sample.A | PRJNA497670 | Root | 25_days | NA | RGC-1066
SRR8082058 | Not_-applicable | Sample.B | PRJNA497670 | Root | 25_days | NA | M-83
SRR8434717 | CT1 | CT1 | PRJNA514706 | Leaf | Vegetative | Treated | CT
SRR8434718 | TCT1 | TCT1 | PRJNA514706 | Leaf | Vegetative | Control | TCT
SRR8434719 | CT2 | CT2 | PRJNA514706 | Leaf | Vegetative | Control | CT
SRR8434720 | TCT2 | TCT2 | PRJNA514706 | Leaf | Vegetative | Treated | TCT
SRR9176900 | D4_R50_R1 | Late_pod_1R | PRJNA545776 | Pod | 50DAF | Control | RGC-936
SRR9176901 | E4_R50_R2 | Late_pod_2R | PRJNA545776 | Pod | 50DAF | Control | RGC-936
SRR9176902 | F2_R39_R1 | Mid._pod.1R | PRJNA545776 | Pod | 39DAF | Control | RGC-936
SRR9176903 | G2_R39_R2 | Mid._pod.2R | PRJNA545776 | Pod | 39DAF | Control | RGC-936
SRR9176904 | B4_M39_R1 | Mid._pod.1Rm | PRJNA545776 | Pod | 39DAF | Control | M-83
SRR9176905 | C4_M39_R2 | Mid._pod.2Rm | PRJNA545776 | Pod | 39DAF | Control | M-83
SRR9176906 | B2_R25_R1 | Early_pod1R | PRJNA545776 | Pod | 25DAF | Control | RGC-936
SRR9176907 | C2_R25_R2 | Early_pod2R | PRJNA545776 | Pod | 25DAF | Control | RGC-936
SRR9176908 | D2_M25_R1 | Early_pod1Rm | PRJNA545776 | Pod | 25DAF | Control | M-83
SRR9176909 | E2_M25_R2 | Early_pod2Rm | PRJNA545776 | Pod | 25DAF | Control | M-83
SRR9176910 | F4_M50_R1 | Late_pod.1Rm | PRJNA545776 | Pod | 50DAF | Control | M-83
SRR9176911 | G4_M50_R2 | Late_pod_2Rm | PRJNA545776 | Pod | 50DAF | Control | M-83

Protein functional annotation and quality assessment

An integrated approach was applied to assign functional annotations to the predicted protein sequences. The primary functional annotation, including the prediction of Gene Ontology (GO) terms [18, 19], metabolic pathways (KEGG) [20], and domain architecture, was performed using eggNOG-mapper (version 2.1.9) [21] against the eggNOG database (v5.0) in the DIAMOND homology search mode. Additionally, to categorize genes in the context of biological pathways and to enable comparison with other plant species, annotation was performed using Mercator4 [22, 23]. This tool assigned each protein to one of 70 hierarchical MapMan BIN categories based on hidden Markov models.

The quality of the predicted genes was assessed using BUSCO [24] with the “embryophyta_odb10” database.

Raw RNA-seq reads processing

All publicly available RNA-seq datasets for C. tetragonoloba were obtained from the NCBI Sequence Read Archive (SRA) using the SRA Toolkit version 3.0.0 [25]. Datasets were searched for and selected using species-specific keywords, which resulted in the download of 89 libraries representing various tissue types and experimental conditions.

Initial processing of raw reads was carried out to ensure high-quality data for subsequent analysis. Adapter sequences, technical artifacts, and low-quality reads were filtered using BBDuk version 38.96 [26]. Parameters used included ktrim=r, k=23, mink=11, hdist=1, tbo, tpe, qtrim=rl, trimq=20, minlen=50.
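For illustration, the trimming parameters listed above could be assembled into a BBDuk command line as in the sketch below. The paired-end layout, the file names, and the adapter reference path (`adapters.fa`) are assumptions made for this example; only the trimming options themselves come from the text.

```python
import shlex


def bbduk_command(in1, in2, out1, out2, adapters):
    """Build a BBDuk argument list for one paired-end library."""
    return [
        "bbduk.sh",
        f"in1={in1}", f"in2={in2}",
        f"out1={out1}", f"out2={out2}",
        f"ref={adapters}",  # adapter FASTA; the path is an assumption
        # Trimming options as reported in the text
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tbo", "tpe",
        "qtrim=rl", "trimq=20", "minlen=50",
    ]


# Hypothetical file names for one library from Table 1
cmd = bbduk_command(
    "SRR10120601_1.fastq.gz", "SRR10120601_2.fastq.gz",
    "SRR10120601_1.clean.fastq.gz", "SRR10120601_2.clean.fastq.gz",
    "adapters.fa",
)
print(shlex.join(cmd))
```

The argument list can be passed to `subprocess.run(cmd, check=True)` on a machine where BBTools is installed; single-end libraries would need only `in=` and `out=`.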

Reads mapping and count matrix construction

The high-quality reads were mapped to the guar reference genome using STAR version 2.7.10a [27] in two stages. In the first stage, splice junctions were detected; in the second stage, these junctions were used as additional annotation to guide the alignment, which improved mapping accuracy.

Based on the BAM-formatted mapping results obtained using STAR, a count matrix was created using the featureCounts program [28]. This utility calculated the number of reads uniquely mapped to each BRAKER2 annotated gene for each library.
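As a minimal sketch of how such output can be turned into a count matrix, the function below parses the standard featureCounts table (a leading comment line, then a header whose first six columns are Geneid, Chr, Start, End, Strand, and Length, followed by one column per BAM file). The gene identifiers, file names, and counts in the example are invented for illustration.

```python
import csv


def read_featurecounts(text):
    """Parse featureCounts tabular output into {gene_id: {sample: count}}."""
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[6:]  # columns 1-6 describe the gene, the rest are samples
    counts = {}
    for row in reader:
        counts[row[0]] = {s: int(v) for s, v in zip(samples, row[6:])}
    return counts


# Invented example with two genes and two libraries
example = "\n".join([
    "# Program:featureCounts; Command: ...",
    "\t".join(["Geneid", "Chr", "Start", "End", "Strand", "Length",
               "SRR10120601.bam", "SRR10120612.bam"]),
    "\t".join(["g1", "chr1", "101", "900", "+", "800", "25", "40"]),
    "\t".join(["g2", "chr1", "1500", "2300", "-", "801", "0", "7"]),
])
matrix = read_featurecounts(example)
print(matrix["g1"])
```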

Data normalization and expression atlas construction

To compare expression levels between samples, which differ significantly in the total number of sequenced reads (library size), the count matrix was normalized using the Counts Per Million (CPM) method. The resulting normalized CPM matrix served as the basis for constructing the guar expression atlas.
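CPM normalization divides each raw count by the library size of its sample and multiplies the result by one million. The sketch below assumes that the library size is the sum of all counts assigned to genes in that sample; the normalization script used for the atlas may define it differently, and the example values are invented.

```python
def cpm(counts):
    """Convert {gene: {sample: raw_count}} into {gene: {sample: CPM}}."""
    # Library size per sample = sum of counts over all genes (an assumption)
    library_size = {}
    for per_sample in counts.values():
        for sample, value in per_sample.items():
            library_size[sample] = library_size.get(sample, 0) + value

    normalized = {}
    for gene, per_sample in counts.items():
        normalized[gene] = {
            sample: (value / library_size[sample] * 1e6
                     if library_size[sample] else 0.0)
            for sample, value in per_sample.items()
        }
    return normalized


# Sample B was sequenced ten times deeper than sample A
raw = {
    "g1": {"A": 10, "B": 100},
    "g2": {"A": 30, "B": 300},
    "g3": {"A": 60, "B": 600},
}
atlas = cpm(raw)
print(atlas["g1"])
```

In this example the raw counts of every gene differ tenfold between the two samples, yet the CPM values are identical, which is exactly the library-size effect the normalization is meant to remove.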

The implementation of a web-based functional genomics resource for guar

A specialized web service dedicated to Cyamopsis tetragonoloba has been developed to provide convenient and interactive access to the obtained genomic, transcriptomic and functional data.

The server side of the application is implemented in Python (version 3.11) using the Flask framework (version 2.3.2) [29]. The application provides routing, query processing, and programmatic access to data (the annotated genome, a pre-built BLAST database, the expression matrix, and the functional annotation) stored in a structured form on the server.

The client side is built using standard web technologies: HTML5, CSS3 and JavaScript. Bootstrap and Chart.js (version 4.3.0) libraries are used to create interactive and dynamic user interface elements.

The web service includes four main functional modules.

For visual analysis of the annotated genome, the IGV.js component (Integrative Genomics Viewer [30], version 2.13.2) was integrated. Pre-generated reference genome (FASTA) and annotation (GFF3) files are loaded on the client side, allowing users to navigate through chromosomes, zoom in on loci of interest, and visualize predicted gene structures, including exons, introns, and UTRs.

The gene expression analysis module allows the user to enter a gene identifier, after which the Flask server application extracts the corresponding normalized expression values (CPM) for all samples from the prepared matrix. The data are transmitted to the client side, where an interactive boxplot is automatically generated using the Chart.js library to display the expression profile of the requested gene across tissues and conditions.

Searching by gene identifier also gives the user comprehensive functional information. Data obtained from eggNOG and Mercator4 are displayed on a separate page or in a pop-up window, including the predicted protein function, Gene Ontology (GO) terms, KEGG pathways, and the detailed Mercator4 functional description.

To enable searches for homologous genes in the guar genome by nucleotide or amino acid sequence, BLAST+ was integrated into the web service. A local BLAST database containing annotated coding sequences (CDS) and protein sequences was created on the server. The user interface includes a form for entering an ID or uploading a sequence in FASTA format and for configuring basic parameters. After the request is submitted, the Flask server application runs the BLAST+ utility, processes the results, and returns an interactive HTML page with alignments, E-values, and percentage identity, with direct links to the homologous genes in the other modules of the service (genome browser, gene expression, annotation).
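The BLAST step can be sketched as follows: build the BLAST+ command line, then parse the tabular output (`-outfmt 6`, whose twelve default columns are fixed by BLAST+) into records that a results page could sort and filter. The function names, database path, default E-value, and example hits are assumptions for illustration and do not reproduce the CTGA server code.

```python
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast_command(program, query_path, db_path, evalue=1e-5):
    """Build a BLAST+ argument list that requests tabular output."""
    if program not in {"blastn", "blastp"}:
        raise ValueError(f"unsupported program: {program}")
    return [program, "-query", query_path, "-db", db_path,
            "-evalue", str(evalue), "-outfmt", "6"]


def parse_hits(tabular_text):
    """Parse default outfmt 6 lines into dicts, best hits first."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        rec = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            rec[key] = float(rec[key])
        for key in ("length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send"):
            rec[key] = int(rec[key])
        hits.append(rec)
    # Lowest E-value first; ties broken by highest bit score
    return sorted(hits, key=lambda h: (h["evalue"], -h["bitscore"]))


# Invented hits for illustration
output = "\n".join([
    "query1\tg200.t1\t80.0\t150\t30\t0\t1\t150\t5\t154\t1e-20\t120",
    "query1\tg100.t1\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t350",
])
hits = parse_hits(output)
strong = [h["sseqid"] for h in hits if h["pident"] >= 90.0]
print(blast_command("blastn", "query.fa", "db/guar_cds"), strong)
```

Passing the query through a file and the command as an argument list (rather than a shell string) avoids interpolating user-supplied sequence text into a shell command.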

The Gene Ontology enrichment analysis was implemented using custom R and Python scripts and the topGO R package [31]. The calculation was carried out using Fisher’s exact test and the weight01 algorithm.
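To make the statistical core concrete, the sketch below computes the one-sided Fisher's exact test p-value for a single GO term as the upper tail of the hypergeometric distribution. It deliberately omits the weight01 algorithm, which additionally accounts for the GO graph structure inside topGO; the function name and the numbers are illustrative assumptions.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for term over-representation.

    N: annotated genes in the background
    K: background genes carrying the GO term
    n: genes in the query set
    k: query genes carrying the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# 3 of 4 query genes carry a term found in 5 of 20 background genes
p = fisher_enrichment_p(k=3, n=4, K=5, N=20)
print(round(p, 4))
```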

The web service is available at https://guar.arriam.ru/ and it can be used by the scientific community for in-depth analysis of the guar genome.

RESULTS

The guar reference genome was assembled at the chromosome level and annotated in 2024, but the annotation is not currently publicly available. To obtain a high-quality set of predicted genes for future work, we performed a de novo genome annotation.

Re-annotation and characterization of the guar genome

In this study, we performed a comprehensive de novo annotation of the published guar (Cyamopsis tetragonoloba) genome using the BRAKER2 pipeline. This approach resulted in the prediction of 57,019 protein-coding genes encoding 82,042 proteins. A key advance of our annotation over the existing one is the precise prediction of untranslated regions (UTRs) for the gene models. UTRs are crucial for the regulation of gene expression, and their annotation is especially important when the analysis relies on technologies that capture RNA by the poly(A) tail and therefore sequence only the 3' end of transcripts (for example, the 3' MACE technology [32]).

To assess the quality of the structural annotation, BUSCO was run on the predicted proteins using the embryophyta_odb10 database. Of the 1,614 BUSCO groups searched, 97.5% were recovered as complete, while only 1.3% were fragmented and 1.2% were missing, indicating the high quality of the annotation (Table 2).

 

Table 2. BUSCO based analysis of completeness of annotation

Protein categories | Number | Percentage
Complete BUSCOs (C) | 1574 | 97.5%
Complete and single-copy BUSCOs (S) | 1538 | 95.3%
Complete and duplicated BUSCOs (D) | 36 | 2.2%
Fragmented BUSCOs (F) | 21 | 1.3%
Missing BUSCOs (M) | 19 | 1.2%
Total BUSCO groups searched | 1614 | 100%

The functional annotation of the predicted genes using EggNOG-mapper successfully assigned putative functions to 36,998 genes (Table 3), providing Gene Ontology (GO) terms, KEGG pathways, and domain architectures.

 

Table 3. Statistics of the genome annotation for Cyamopsis tetragonoloba

Feature | Count
Genes with functional annotation (EggNOG) | 36,998 (65%)
Genes assigned to MapMan BINs (Mercator4) | 28,443 (50%)
Average transcript length (bp) | 1,217.35
Average exons per gene | 4.76

Complementary analysis with Mercator4 enabled the categorization of 28,443 genes into the hierarchical MapMan BIN system (see Table 3), facilitating the functional exploration of biological pathways in guar. The predicted genes covered all the major biological processes and metabolic pathways represented by MapMan bins (Fig. 1, a). In addition, galactomannan biosynthesis genes were identified among the annotated genes (Fig. 1, b), which further indicates the high quality of the annotation.

 

Fig. 1. Distribution of annotated genes using Mercator4 by functional groups (a), distribution of galactomannan biosynthesis genes by functional groups (b).

 

Construction of a comprehensive gene expression atlas

To capture the transcriptomic landscape of guar, we collected all publicly available RNA-seq datasets from NCBI SRA, comprising 96 libraries derived from a wide range of tissues and developmental stages (see Table 1). However, not all of the collected samples could be analyzed, because technical errors in some libraries prevented proper processing of the reads. Ultimately, 85 samples were retained for further analysis.

After rigorous quality control and adapter trimming using BBDuk, high-quality reads were aligned to the annotated genome using the STAR aligner.

The resulting expression matrix was normalized using Counts Per Million (CPM) to enable cross-sample comparison. This comprehensive expression atlas reveals the transcript abundance of all predicted genes across the studied conditions. Principal Component Analysis (PCA) of the CPM matrix showed clear separation of samples by tissue type (Fig. 2, b) but not by sequencing run or experiment (Fig. 2, a), demonstrating the biological consistency of the dataset and the quality of normalization.

 

Fig. 2. Principal Component Analysis (PCA) plot of the RNA-seq samples included in the expression atlas, colored by BioProject (a) and tissue (b).

 

Of the 57,019 genes annotated in the guar genome, 27,823 (48.8%) showed appreciable expression, defined as at least 10 CPM (Counts Per Million) summed over all samples.
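The filter described above can be expressed in a few lines. This is a minimal sketch with an invented three-gene CPM matrix; the function name and the data layout (one list of CPM values per gene) are assumptions.

```python
def expressed_genes(cpm_matrix, threshold=10.0):
    """Return genes whose CPM summed over all samples reaches the threshold."""
    return sorted(
        gene for gene, values in cpm_matrix.items()
        if sum(values) >= threshold
    )


# Invented CPM values for three genes across four samples
toy = {
    "g1": [0.0, 0.5, 1.0, 0.2],   # sum 1.7  -> below threshold
    "g2": [2.0, 3.0, 4.0, 1.0],   # sum 10.0 -> retained
    "g3": [50.0, 0.0, 0.0, 0.0],  # sum 50.0 -> retained
}
kept = expressed_genes(toy)
print(kept, f"{len(kept) / len(toy):.1%}")
```

Note that a sum-based threshold retains genes expressed strongly in a single sample (like `g3`) as well as genes expressed weakly everywhere (like `g2`).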

Development of an interactive guar genomic resource

To make the annotated genome, gene expression data, functional annotations, and several research tools freely accessible in a user-friendly form, we developed a comprehensive web resource using the Flask framework. The platform integrates several key modules (Fig. 3).

 

Fig. 3. A schematic representation of the key modules and data flow within the developed guar genomics web service.

 

First, the user can retrieve information either by the identifier of a specific gene or by a nucleotide or amino acid sequence from guar or a closely related organism, which is pasted into the appropriate field and used as a BLAST query (Fig. 4, a). The user can adjust the E-value threshold to filter the BLAST results. On the results page, the user can select the best hits based on several parameters, such as percentage identity, alignment length, number of mismatches, and E-value (Fig. 4, b).

 

Fig. 4. The start page of the resource and fields for searching for genes via identifiers or sequence (a). Homology BLAST search results page (b). A section with information about the functional annotation of a gene (c).

 

By clicking on the selected gene, the user is taken to the next page with detailed information about the gene. The platform provides instant access to comprehensive functional annotation (via EggNOG and Mercator4 databases) for each gene. In addition, KEGG and GO terms are assigned to each gene, along with a brief description, which facilitates subsequent analysis (see Fig. 4, c).

An embedded IGV.js instance allows intuitive visualization of the genomic context of any gene, including exon-intron structures and predicted UTRs (Fig. 5, a). The genome browser is available for all genes, regardless of whether the user reaches it through the BLAST service or the ID search. In the genome browser section, the user can also extract the complete sequence of the gene of interest, or only its CDS or protein sequence, with one click (Fig. 5, a).

 

Fig. 5. Several genes displayed in the IGV genome browser of the developed resource (a). Expression of a specific gene across all samples shown as a barplot (b) or as boxplots (c).

 

For any gene of interest, users can generate an interactive barplot and boxplot displaying its normalized expression level (CPM) across all integrated RNA-seq samples, which facilitates quick assessment of its expression pattern. Each box reflects the median and the first and third quartiles (Q1 and Q3) of the normalized expression values for all available replicates, except where the data are publicly available only as a single replicate (see Fig. 5, b, c).
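The statistics behind one box can be sketched with the Python standard library, as below. The quartile convention (`method="inclusive"`) is an assumption; a charting library may interpolate quartiles differently, so the plotted boxes need not match these numbers exactly. The single-replicate case is handled by collapsing the box to one value.

```python
from statistics import quantiles


def box_stats(values):
    """Five-number summary of the replicate CPM values of one condition."""
    data = sorted(values)
    if not data:
        raise ValueError("no replicate values supplied")
    if len(data) == 1:
        # A single replicate has no spread: the box collapses to a line
        v = data[0]
        return {"min": v, "q1": v, "median": v, "q3": v, "max": v}
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1]}


# Invented CPM values for four replicates of one tissue
summary = box_stats([8.0, 2.0, 6.0, 4.0])
print(summary)
```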

Integration of a BLAST service allows users to search for homologous sequences within the guar genome using nucleotide or protein queries, directly linking results to the genome browser and expression modules.

To facilitate the work of researchers in guar genomics and transcriptomics, we have added the ability to download all files necessary for analysis. These include the structural genome annotation, the eggNOG functional annotation with GO/KEGG identifiers, the Mercator4 functional annotation, and the gene expression data. All the listed datasets are available in the “Downloads” section.

Researchers conducting transcriptomic and genomic analyses with the available guar reference genome assembly and the datasets obtained in this study can use a dedicated tool to perform Gene Ontology enrichment analysis on their own gene sets. Based on user-entered gene identifiers, the R script performs Gene Ontology enrichment analysis for one of three categories: biological process, molecular function, or cellular component. As output, the user receives a barplot and a table of statistically significant Gene Ontology terms (Fig. 6). This service is available in the “Tools” section.

 

Fig. 6. An example of what the result of the Gene Ontology enrichment analysis tool looks like.

 

Given a set of genes or genomic regions of interest, users can extract sequences directly from the genome assembly by coordinates using the “Sequence Extractor” tool implemented in our resource, which is also available on the “Tools” page.
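Coordinate-based extraction can be sketched as follows. The example assumes 1-based, inclusive coordinates (the GFF3 convention) and returns the reverse complement for features on the minus strand; the function names and the toy sequence are hypothetical, and the actual tool may follow different conventions.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_fasta(text):
    """Read FASTA text into {sequence_id: sequence}."""
    genome, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def extract(genome, seqid, start, end, strand="+"):
    """Return the sequence at seqid:start-end (1-based, inclusive)."""
    seq = genome[seqid]
    if start < 1 or end < start or end > len(seq):
        raise ValueError("coordinates outside the sequence")
    sub = seq[start - 1:end]
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]  # reverse complement
    return sub


toy_genome = parse_fasta(">chr1 toy sequence\nACGTAC\nGGTTAA\n")
print(extract(toy_genome, "chr1", 3, 8), extract(toy_genome, "chr1", 3, 8, "-"))
```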

When working with many genes, scientists often need to estimate their expression levels across different organs and tissues of the organism. For these cases, we developed the “Heatmap Generator” tool, which creates a heatmap from all currently available RNA-sequencing libraries using gene identifiers as a query (Fig. 7). This tool is available online in the “Tools” section.

 

Fig. 7. Demonstration of a “Heatmap Generator” tool for creating heatmaps based on expression data and gene identifiers as a query.

 

This tool allows users to quickly create a heatmap with flexible options for z-score standardization and for gene or sample clustering, and it offers a choice of color palettes.
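Row-wise z-score standardization, commonly applied before drawing an expression heatmap, can be sketched as below. The use of the population standard deviation and the handling of constant rows are assumptions of this example (some heatmap functions use the sample standard deviation instead), and the values are invented.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each gene (row) to mean 0 and unit standard deviation."""
    scaled = {}
    for gene, values in matrix.items():
        mu = mean(values)
        sd = pstdev(values)  # population SD; an assumption of this sketch
        if sd == 0:
            # A gene with constant expression carries no pattern to display
            scaled[gene] = [0.0] * len(values)
        else:
            scaled[gene] = [(v - mu) / sd for v in values]
    return scaled


# Invented CPM values for two genes across three samples
z = zscore_rows({"g1": [1.0, 2.0, 3.0], "g2": [5.0, 5.0, 5.0]})
print([round(v, 3) for v in z["g1"]], z["g2"])
```

Standardizing per gene puts highly and weakly expressed genes on the same color scale, so the heatmap highlights the pattern across samples rather than the absolute expression level.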

DISCUSSION

In recent decades, owing to the development of next-generation sequencing methods, guar has been transformed from a poorly studied agricultural crop into a genetically well-characterized species. Initial studies successfully identified key genes associated with galactomannan biosynthesis [3, 5, 6], developed a comprehensive collection of molecular markers [9, 11], and produced a chromosome-level genome assembly [6, 14]. However, the potential of these diverse genomic resources has yet to be fully realized, because accessing and integrating them requires significant bioinformatics expertise, which poses a barrier for many researchers and breeders. Our research aimed to address this problem.

We present a comprehensive genomic resource for C. tetragonoloba that combines a structural and functional annotation with an extensive expression atlas and is available through a user-friendly web interface. High-quality de novo gene prediction, performed using BRAKER2 and confirmed by a high BUSCO completeness score (97.5%), provides an accurate and reliable set of genes (Table 2). Accurate prediction of untranslated regions (UTRs) is essential for studying post-transcriptional regulation and for research based on the 3' MACE sequencing technology [32]. The predicted genes covered all major functional categories of Mercator4, indicating both successful gene prediction and high-quality annotation.

The value of a genome annotation increases considerably when the conditions and tissues in which genes are expressed are known.

The expression atlas we have built, based on 85 publicly available RNA-seq libraries covering various tissues and developmental conditions, provides an unprecedented overview of the guar transcriptome landscape. The clear separation of samples by experimental conditions, genotypes, and tissue in the Principal Component Analysis (PCA) highlights the high quality of this integrated dataset (see Fig. 2) and shows that gene expression can be compared across samples regardless of factors such as the sequencing run and the origin of the data.

The service is designed to simplify researchers' work with these data. By integrating the interactive IGV genome browser, a BLAST server, and instant visualization of expression profiles and functional annotations, it allows researchers to work without command-line tools or local data processing.

In our view, the tools provided in this resource have the potential to significantly enhance research opportunities in guar biology. The sequence extractor supports accurate genomic analysis without programming skills: researchers can rapidly obtain specific sequences of exons, introns, and intergenic regions in order to design experiments and characterize genes. The Gene Ontology enrichment analysis tool assists in identifying key biological processes and functions associated with, for instance, stress tolerance or seed maturation in guar. Heatmaps make it possible to detect trends in gene expression patterns, and clustering genes by these patterns helps to identify shared regulatory pathways and associations with physiological processes. Together, these instruments make complex genomic and transcriptomic data more approachable and interpretable for plant biologists, facilitating a better understanding of plant physiological mechanisms and developmental processes.
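
To illustrate what an enrichment analysis of this kind tests, the following Python sketch computes a one-sided hypergeometric p-value for a single GO term. It is only a simplified illustration: the portal's tool may use a different test or account for the structure of the GO graph, and the function name and all counts except the total of 57,019 predicted genes are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """One-sided hypergeometric test: P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    selected:   genes in the user's query list
    overlap:    query genes carrying the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the probabilities of observing `overlap` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical query: 200 genes, 8 of which carry a term that annotates
# 120 of the 57,019 predicted protein-coding genes.
p_value = enrichment_p_value(population=57019, annotated=120, selected=200, overlap=8)
print(f"p = {p_value:.3g}")
```

When many GO terms are tested at once, the resulting p-values would additionally need a multiple-testing correction before terms are reported as enriched.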

Thus, while previous works have provided important data for genomic and transcriptomic studies of guar, our study integrates these data into a single centralized platform. By lowering the technical barrier, this resource will allow a wider range of scientists and breeders to contribute to guar research, which we hope will eventually lead to the creation of high-yielding, disease-resistant varieties that meet global agricultural and industrial requirements. We are aware of the limitations of the current platform and intend to address them in the future by incorporating new data, functional modules, and features that further improve the user experience.

CONCLUSION

This study presents CTGA, a comprehensive functional genomics resource that significantly advances research on C. tetragonoloba. By integrating a high-quality de novo genome annotation with extensive expression data from 85 RNA-seq libraries, we have created a unified platform that addresses the fragmentation of existing genomic resources. The resource offers accurate gene models, functional annotations from the eggNOG and Mercator4 databases, detailed expression profiles across various tissues and developmental stages, and several analytical tools.

ADDITIONAL INFO

Author contributions: V.A. Zhukov, E.A. Zorin, M.A. Vishnyakova: conceptualization, writing (original draft preparation), writing (review and editing); E.A. Zorin: methodology, investigation; M.A. Vishnyakova: funding acquisition. All authors approved the manuscript (the version for publication) and agreed to be responsible for all aspects of this work, ensuring proper consideration and resolution of issues related to the accuracy and integrity of any part of it.

Funding sources: This research was funded by the Russian Science Foundation, project No. 23-16-00195 dated 15 May 2023.

Disclosure of interests: The authors have no relationships, activities, or interests for the last three years related to for-profit or not-for-profit third parties whose interests may be affected by the content of the article.

Statement of originality: No previously obtained or published material (text, images, or data) was used in this study or article.

Data availability statement: All data obtained in this study are presented in the article and are available at https://guar.arriam.ru/ in the Downloads tab.

Generative AI: No generative artificial intelligence technologies were used to prepare this article.

Provenance and peer-review: This work was submitted to the journal on the authors' own initiative and reviewed under the standard procedure. The peer review process involved one external reviewer, a member of the Editorial Board, and the in-house science editor.

Disclaimer: This article is published as submitted by the authors. The authors are solely responsible for the content and style of the manuscript.

About the authors

Evgeny A. Zorin

N.I. Vavilov All-Russian Institute of Plant Genetic Resources; All-Russia Research Institute for Agricultural Microbiology

Author for correspondence.
Email: ezorin@arriam.ru
ORCID iD: 0000-0001-5666-3020
SPIN-code: 5048-0203

Cand. Sci. (Biology)

Russian Federation, Saint Petersburg; Pushkin, Saint Petersburg

Margarita A. Vishnyakova

N.I. Vavilov All-Russian Institute of Plant Genetic Resources

Email: m.vishnyakova@vir.nw.ru
ORCID iD: 0000-0003-2808-7745
SPIN-code: 2802-9614

Dr. Sci. (Biology), Professor

Russian Federation, Saint Petersburg

Vladimir A. Zhukov

N.I. Vavilov All-Russian Institute of Plant Genetic Resources; All-Russia Research Institute for Agricultural Microbiology

Email: vzhukov@arriam.ru
ORCID iD: 0000-0002-2411-9191
SPIN-code: 2610-3670

Cand. Sci. (Biology)

Russian Federation, Saint Petersburg; Pushkin, Saint Petersburg

References

  1. Thombare N, Jha U, Mishra S, et al. Guar gum as a promising starting material for diverse applications: a review. Int J Biol Macromol. 2016;88:361–372. doi: 10.1016/j.ijbiomac.2016.04.001 EDN: WRSQWT
  2. Naoumkina M, Torres-Jerez I, Allen S, et al. Analysis of cDNA libraries from developing seeds of guar (Cyamopsis tetragonoloba (L.) Taub). BMC Plant Biol. 2007;7:62. doi: 10.1186/1471-2229-7-62 EDN: NUVATW
  3. Chaudhury A, Kaila T, Gaikwad K. Elucidation of galactomannan biosynthesis pathway genes through transcriptome sequencing of seeds collected at different developmental stages of commercially important Indian varieties of cluster bean (Cyamopsis tetragonoloba L.). Sci Rep. 2019;9(1):11539. doi: 10.1038/s41598-019-48072-w EDN: GEKMME
  4. Hu H, Wang H, Zhang Y, et al. Characterization of genes in guar gum biosynthesis based on quantitative RNA-sequencing in guar bean (Cyamopsis tetragonoloba). Sci Rep. 2019;9(1):10991. doi: 10.1038/s41598-019-47518-5 EDN: XRKTMY
  5. Sharma S, Tyagi A, Srivastava H, et al. Exploring the edible gum (galactomannan) biosynthesis and its regulation during pod developmental stages in clusterbean using comparative transcriptomic approach. Sci Rep. 2021;11(1):4000. doi: 10.1038/s41598-021-83507-3 EDN: JYLGKG
  6. Gaikwad K, Ramakrishna G, Srivastava H, et al. The chromosome-scale genome assembly of cluster bean provides molecular insight into edible gum (galactomannan) biosynthesis family genes. Sci Rep. 2023;13(1):9941. doi: 10.1038/s41598-023-33762-3 EDN: ACCLTJ
  7. Rawal HC, Kumar S, Mithra SVA, et al. High quality unigenes and microsatellite markers from tissue specific transcriptome and development of a database in clusterbean (Cyamopsis tetragonoloba, L. Taub). Genes. 2017;8(11):313. doi: 10.3390/genes8110313 EDN: YIOLPP
  8. Thakur O, Randhawa GS. Identification and characterization of SSR, SNP and InDel molecular markers from RNA-Seq data of guar (Cyamopsis tetragonoloba, L. Taub.) roots. BMC Genomics. 2018;19(1):951. doi: 10.1186/s12864-018-5205-9 EDN: HZOERI
  9. Grigoreva E, Barbitoff Y, Changalidi A, et al. Development of SNP set for the marker-assisted selection of guar (Cyamopsis tetragonoloba (L.) Taub.) based on a custom reference genome assembly. Plants. 2021;10(10):2063. doi: 10.3390/plants10102063 EDN: OZJRYH
  10. Arkhimandritova S, Shavarda A, Potokina E. Key metabolites associated with the onset of flowering of guar genotypes (Cyamopsis tetragonoloba (L.) Taub). BMC Plant Biol. 2020;20(Suppl 1):291. doi: 10.1186/s12870-020-02498-x EDN: VMAHIS
  11. Grigoreva E, Tkachenko A, Arkhimandritova S, et al. Identification of key metabolic pathways and biomarkers underlying flowering time of guar (Cyamopsis tetragonoloba (L.) Taub.) via integrated transcriptome-metabolome analysis. Genes. 2021;12(7):952. doi: 10.3390/genes12070952 EDN: XBDJOP
  12. Kaila T, Chaduvla PK, Rawal HC, et al. Chloroplast genome sequence of clusterbean (Cyamopsis tetragonoloba L.): genome structure and comparative analysis. Genes. 2017;8(9):212. doi: 10.3390/genes8090212
  13. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2 EDN: SGUJER
  14. Li JH, Li MJ, Li WL, et al. Leguminous industrial crop guar (Cyamopsis tetragonoloba): the chromosome-level reference genome de novo assembly. Ind Crops Prod. 2024;216:118748. doi: 10.1016/j.indcrop.2024.118748 EDN: AIAYXK
  15. Brůna T, Hoff KJ, Lomsadze A, et al. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 2021;3(1):lqaa108. doi: 10.1093/nargab/lqaa108 EDN: UMWJZR
  16. Brůna T, Lomsadze A, Borodovsky M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom Bioinform. 2020;2(2):lqaa026. doi: 10.1093/nargab/lqaa026 EDN: AOAAGH
  17. Stanke M, Steinkamp R, Waack S, et al. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004;32(Web Server issue):W309–W312. doi: 10.1093/nar/gkh379 EDN: IUCVSZ
  18. The Gene Ontology Consortium, Aleksander SA, Balhoff J, et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031. doi: 10.1093/genetics/iyad031 EDN: OIDRSY
  19. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556 EDN: SPYGDX
  20. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27 EDN: IUQVVD
  21. Cantalapiedra CP, Hernández-Plaza A, Letunic I, et al. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021;38(12):5825–5829. doi: 10.1093/molbev/msab293 EDN: PFFLTN
  22. Bolger M, Schwacke R, Usadel B. MapMan visualization of RNA-Seq data using Mercator4 functional annotations. Methods Mol Biol. 2021;2354:195–212. doi: 10.1007/978-1-0716-1609-3_9 EDN: DMLUEF
  23. Schwacke R, Ponce-Soto GY, Krause K, et al. MapMan4: a refined protein classification and annotation framework applicable to multi-omics data analysis. Mol Plant. 2019;12(6):879–892. doi: 10.1016/j.molp.2019.01.003
  24. Manni M, Berkeley MR, Seppey M, et al. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021;38(10):4647–4654. doi: 10.1093/molbev/msab199 EDN: LULTEI
  25. github.com [Internet]. SRA-toolkit. Available from: https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit Accessed: October 02, 2025.
  26. sourceforge.net [Internet]. BBMap. Available from: https://sourceforge.net/projects/bbmap/ Accessed: October 02, 2025.
  27. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. doi: 10.1093/bioinformatics/bts635
  28. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–930. doi: 10.1093/bioinformatics/btt656 EDN: YCINGW
  29. flask.palletsprojects.com [Internet]. Flask. Available from: https://flask.palletsprojects.com/en/stable/ Accessed: October 02, 2025.
  30. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–192. doi: 10.1093/bib/bbs017
  31. Alexa A, Rahnenführer J. topGO. Bioconductor: Buffalo, NY, USA; 2017.
  32. Boneva S, Schlecht A, Böhringer D, et al. 3' MACE RNA-sequencing allows for transcriptome profiling in human tissue samples after long-term storage. Lab Invest. 2020;100(10):1345–1355. doi: 10.1038/s41374-020-0446-z EDN: RJUKRN

Supplementary files

Fig. 1. Distribution of annotated genes using Mercator4 by functional groups (a), distribution of galactomannan biosynthesis genes by functional groups (b).
Fig. 2. Principal Component Analysis (PCA) plot of the RNA-seq samples included in the expression atlas, colored by BioProject (a) and tissue (b).
Fig. 3. A schematic representation of the key modules and data flow within the developed guar genomics web service.
Fig. 4. The start page of the resource and fields for searching for genes via identifiers or sequence (a). Homology BLAST search results page (b). A section with information about the functional annotation of a gene (c).
Fig. 5. Representation of several genes via the IGV genomic browser on a developed genomic resource (a). Section with the expression of a specific gene for all samples in the form of: b, barplot or c, boxplots.
Fig. 6. An example of what the result of the Gene Ontology enrichment analysis tool looks like.
Fig. 7. Demonstration of a “Heatmap Generator” tool for creating heatmaps based on expression data and gene identifiers as a query.

Copyright (c) 2025 Eco-Vector

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The mass media outlet is registered with the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor).
Registration number and date of the registration decision: series PI No. FS 77 - 89324, dated 21 April 2025.