About Our Platform

The Plant Genomics Chatbot Hub is an AI-powered suite of tools designed to simplify access to specialized plant genomics data. Using natural language, researchers and students can now explore complex datasets related to tRFs, fusion transcripts, and other molecular classes across numerous plant species without writing a single line of code.

Technological Framework

Large Language Models (LLMs)

Our system uses a dual-LLM strategy via the Groq™ inference engine for speed and accuracy. A lightweight model (Llama-3.1-8B-Instant) classifies user intent, while a more powerful model (GPT-OSS-120B or Llama-3.3-70B-Versatile) handles complex reasoning, SQL generation, and summarization.
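
As a rough illustration of the dual-model split, the sketch below maps a task label to a model name. The task labels and the `select_model` helper are hypothetical, and the model names are copied from the text above; the exact identifiers expected by the inference API may differ.

```python
# Model names as written in the text; the task labels below are hypothetical.
LIGHT_MODEL = "Llama-3.1-8B-Instant"
HEAVY_MODELS = ("GPT-OSS-120B", "Llama-3.3-70B-Versatile")

# Tasks the text assigns to the more powerful model.
HEAVY_TASKS = {"reasoning", "sql_generation", "summarization"}


def select_model(task: str, heavy_index: int = 0) -> str:
    """Return the model name for a task label.

    heavy_index chooses between the two larger models named in the text.
    """
    if task == "intent_classification":
        return LIGHT_MODEL
    if task in HEAVY_TASKS:
        return HEAVY_MODELS[heavy_index]
    raise ValueError(f"unknown task: {task!r}")
```

Keeping the cheap classification call on the small model means the larger model is only invoked when a query actually needs SQL generation or summarization.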

Prompt Engineering

We build prompts with LangChain's `PromptTemplate` class. Each prompt is filled at run time with the relevant database schema, knowledge-graph context, and conversation history, which steers the LLM toward outputs that are contextually aware, syntactically valid, and semantically coherent.
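
The idea of injecting context into a prompt can be sketched with plain Python string formatting, used here as a dependency-free stand-in for LangChain's `PromptTemplate`. The template wording, the `build_prompt` helper, and the variable names are assumptions for illustration; the platform's real prompts are not shown on this page.

```python
# Hypothetical template text, not the platform's actual prompt.
SQL_PROMPT = (
    "You are a SQL assistant for a plant genomics database.\n"
    "Schema:\n{schema}\n\n"
    "Conversation so far:\n{history}\n\n"
    "Question: {question}\n"
    "Return a single SQLite SELECT statement."
)


def build_prompt(schema: str, history: list[str], question: str) -> str:
    """Fill the template with the schema, prior turns, and the new question."""
    history_text = "\n".join(history) if history else "(none)"
    return SQL_PROMPT.format(
        schema=schema, history=history_text, question=question
    )
```

For example, `build_prompt("tRFs(id, species)", [], "How many tRFs are in rice?")` yields a prompt whose schema section contains `tRFs(id, species)` and whose history section reads `(none)`.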

Retrieval-Augmented Generation (RAG)

To answer questions beyond the scope of our SQL databases, we employ a RAG architecture. User queries are matched against a ChromaDB vector store containing embedded research articles, providing the LLM with relevant context to formulate accurate, fact-based answers.
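
The retrieval step can be sketched without any external services. The example below swaps the ChromaDB vector store and a real embedding model for a toy word-count embedding with cosine similarity, purely to show the retrieve-then-prompt flow; all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Place the retrieved passages ahead of the question as context."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a real deployment the ranking would come from the vector store's nearest-neighbour search over embedded article chunks rather than from word overlap.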

How It Works: A Multi-Stage Pipeline

Every query is processed through an orchestrated workflow to ensure accuracy and relevance, transforming your question into a clear, data-backed answer.

1. Intent Classification

First, your query is classified to determine its intent (e.g., data retrieval, metadata question, or general conversation). This allows the system to choose the most efficient path: querying the database, consulting our knowledge base via RAG, or providing a direct conversational reply.
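
A minimal sketch of the routing step, assuming the classifier returns a short text label. The intent labels, the fallback to conversation, and the pairing of each intent with a handler are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical intent labels, following the three examples in the text.
INTENTS = ("data_retrieval", "metadata", "conversation")


def normalize_intent(raw_label: str) -> str:
    """Map raw classifier output to a known intent.

    Unrecognized labels fall back to a plain conversational reply.
    """
    label = raw_label.strip().lower()
    return label if label in INTENTS else "conversation"


def route(
    raw_label: str,
    handlers: dict[str, Callable[[str], str]],
    query: str,
) -> str:
    """Dispatch the query to the handler registered for its intent."""
    return handlers[normalize_intent(raw_label)](query)
```

Normalizing the label before dispatch guards against stray whitespace or unexpected output from the classifier model.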

2. SQL Query Planning

For data-related questions, the larger LLM, guided by the database schema and a curated knowledge graph, translates your natural language query into one or more precise, executable SQL statements. This plan ensures the correct tables and columns are queried.
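
One way to give the model the schema it needs is to read it directly from the SQLite file. The sketch below uses the standard `sqlite3` module; the `describe_schema` helper and its output format are hypothetical, not the platform's actual code.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> str:
    """Return one 'table(col TYPE, ...)' line per user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    lines = []
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        col_text = ", ".join(f"{c[1]} {c[2]}".strip() for c in cols)
        lines.append(f"{table}({col_text})")
    return "\n".join(lines)
```

The resulting text can be placed in the prompt so that generated SQL refers only to tables and columns that exist.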

3. Data Retrieval & Processing

The generated SQL queries are executed on read-only copies of our SQLite databases, so they cannot modify the underlying data. The raw results are then processed with pandas to compute statistical summaries and prepare them for interpretation; large result sets are sampled to keep processing efficient.
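
The sketch below shows one way to enforce read-only access and to sample large results, using only the standard library. It opens the database through SQLite's `mode=ro` URI parameter, and it substitutes the `random` and `statistics` modules for pandas to stay dependency-free. The helper names, the row limit, and the uniform random sampling are assumptions; the platform's actual sampling method is not described here.

```python
import random
import sqlite3
import statistics
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that writes fail at the engine level."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def run_select(conn, sql: str, max_rows: int = 1000, seed: int = 0):
    """Run a query and return (columns, rows), sampling large results."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    if len(rows) > max_rows:
        # Uniform random sample; a fixed seed keeps the result reproducible.
        rows = random.Random(seed).sample(rows, max_rows)
    return columns, rows


def summarize_numeric(values: list[float]) -> dict:
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }
```

With a connection opened this way, an `INSERT`, `UPDATE`, or `DELETE` produced by the model raises `sqlite3.OperationalError` instead of changing the data.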

4. Summary Generation

Finally, the LLM transforms the structured data and statistical reports into a coherent, easy-to-understand conversational summary. The final response, along with a link to download the full dataset as a CSV, is delivered to you in a clean JSON format.
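
As an illustration of the final payload, the sketch below serializes a summary and a download link to JSON. The field names (`summary`, `row_count`, `download_url`) and the example path are hypothetical; the platform's actual response schema is not documented on this page.

```python
import json


def build_response(summary: str, row_count: int, csv_url: str) -> str:
    """Serialize the chatbot reply; field names are illustrative only."""
    payload = {
        "summary": summary,
        "row_count": row_count,
        "download_url": csv_url,
    }
    # ensure_ascii=False keeps species names and symbols readable.
    return json.dumps(payload, ensure_ascii=False)
```

A client can then call `json.loads` on the reply and show the summary text next to a link built from the download field.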

Integrated Databases

Our chatbots are connected to a comprehensive suite of specialized, in-house plant genomics databases developed by our research team.

Database Name	Description
PbtRFdb	Documents biotic stress-responsive tRNA-derived fragments (tRFs).
PtRFdb	Provides tRF sequences from multiple species with expression and functional annotations.
PtncRNAdb	Contains plant tRNA-derived fragments with expression profiles and functional annotations.
AtFusionDB	Catalogs Arabidopsis thaliana fusion transcripts with detailed metadata.
PFusionDB	A repository of fusion transcripts from multiple plant species.
PlantPepDB	Houses plant-derived antimicrobial peptides (AMPs) and their properties.
Athisomir	Contains Arabidopsis isomiR profiles.
Cotton ncRNA Atlas	Contains data on cotton lncRNAs and miRNAs.
AlnC	A collection of long intergenic non-coding RNAs (lincRNAs).
ANNinter	Comprises annotated RNA-RNA interactions.

Chatbot Architecture

Single vs. Multi-Database Bots

To provide both focused and integrative analyses, our bots are organized into two classes:

Single-Database Bots: Tightly coupled to the schema of one resource (e.g., the AtFusionDB Bot). This ensures high precision for exploring a single dataset.
Multi-Database Bots: Designed to query across multiple databases with a shared molecular theme (e.g., the tRF Bot). This enables powerful comparative analyses.

Figure: Chatbots are categorized as either single-database specialists or multi-database integrators to support different research queries.
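
The two bot classes can be modelled with a small configuration object. This is a sketch under stated assumptions: the `BotConfig` class is hypothetical, and the list of databases assigned to the tRF Bot is a guess based on the database descriptions, not a confirmed configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotConfig:
    """A chatbot and the databases it is allowed to query."""

    name: str
    databases: tuple[str, ...]

    @property
    def is_multi_database(self) -> bool:
        return len(self.databases) > 1


# Single-database bot named in the text.
atfusion_bot = BotConfig("AtFusionDB Bot", ("AtFusionDB",))

# Assumption: the tRF-themed databases are grouped under the tRF Bot.
trf_bot = BotConfig("tRF Bot", ("PbtRFdb", "PtRFdb", "PtncRNAdb"))
```

A single-database bot only ever needs one schema in its prompt, while a multi-database bot has to supply every schema in its group.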

Computational Resources

The entire backend is deployed on a dedicated server running Ubuntu 22.04 LTS. Its multi-core CPUs and RAM are provisioned to handle concurrent user requests, data processing with pandas, and in-memory caching, giving the scientific community a responsive and scalable service.