The Plant Genomics Chatbot Hub is an AI-powered suite of tools designed to simplify access to specialized plant genomics data. Using natural language, researchers and students can explore complex datasets on tRNA-derived fragments (tRFs), fusion transcripts, and other molecular classes across numerous plant species without writing a single line of code.
Our system uses a dual-LLM strategy via the Groq™ inference engine for speed and accuracy. A lightweight model (Llama-3.1-8B-Instant) classifies user intent, while a more powerful model (GPT-OSS-120B or Llama-3.3-70B-Versatile) handles complex reasoning, SQL generation, and summarization.
We build our prompts with LangChain's `PromptTemplate` class. Each template is filled at request time with the database schema, a knowledge graph, and the conversation history, which steers the LLM toward outputs that are contextually aware, syntactically valid, and semantically coherent.
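The idea of filling a prompt with a schema and conversation history can be sketched as follows. This is a minimal illustration, not the Hub's actual prompt: it uses the standard library's `string.Template` as a stand-in for LangChain's `PromptTemplate`, and the template wording, the `trf` table, and the `build_prompt` helper are all hypothetical.

```python
from string import Template  # stdlib stand-in for LangChain's PromptTemplate

# Hypothetical template text; the real prompts are not shown in this document.
SQL_PROMPT = Template(
    "You are a SQL assistant for a plant genomics database.\n"
    "Schema:\n$schema\n\n"
    "Conversation so far:\n$history\n\n"
    "Question: $question\n"
    "Return a single SQLite SELECT statement."
)


def build_prompt(schema: str, history: list[str], question: str) -> str:
    """Fill the template with the schema, the most recent turns, and the new question."""
    recent = "\n".join(history[-3:]) or "(none)"  # keep only the last three turns
    return SQL_PROMPT.substitute(schema=schema, history=recent, question=question)


prompt = build_prompt(
    schema="CREATE TABLE trf (trf_id TEXT, species TEXT, length INTEGER);",
    history=["user: Which species are covered?"],
    question="How many tRFs are longer than 20 nt?",
)
print(prompt)
```

Keeping the schema and history as named slots means the same template can serve every database: only the injected values change between requests.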
To answer questions beyond the scope of our SQL databases, we employ a RAG architecture. User queries are matched against a ChromaDB vector store containing embedded research articles, providing the LLM with relevant context to formulate accurate, fact-based answers.
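The retrieval step can be illustrated with a toy example. The sketch below assumes a bag-of-words "embedding" and an in-memory list in place of ChromaDB and a learned embedding model; the passages and function names are invented for illustration only.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


# Hypothetical passages standing in for embedded research articles.
passages = [
    "tRNA-derived fragments respond to biotic stress in plants",
    "fusion transcripts arise from chimeric RNA in Arabidopsis",
    "antimicrobial peptides are short plant-derived sequences",
]
question = "which fragments respond to stress"
context = retrieve(question, passages, k=1)

# The retrieved text is placed in the prompt so the LLM answers from it.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(context)
    + f"\nQuestion: {question}"
)
```

In a production setup the vector store handles both the embedding lookup and the nearest-neighbor search; the structure of the final prompt stays the same.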
Every query is processed through an orchestrated workflow to ensure accuracy and relevance, transforming your question into a clear, data-backed answer.
First, your query is classified to determine its intent (e.g., data retrieval, metadata question, or general conversation). This allows the system to choose the most efficient path: querying the database, consulting our knowledge base via RAG, or providing a direct conversational reply.
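A rough sketch of this classify-then-route step is shown below. The keyword rules stand in for the lightweight classifier model, and the intent labels, handler names, and the mapping of metadata questions to the RAG path are assumptions made for illustration.

```python
from typing import Callable


def classify_intent(query: str) -> str:
    """Keyword stand-in for the lightweight classifier model (hypothetical rules)."""
    q = query.lower()
    if any(w in q for w in ("how many", "list", "show", "count")):
        return "data_retrieval"
    if any(w in q for w in ("what is", "describe", "which database")):
        return "metadata"
    return "conversation"


# Placeholder handlers; each would call the larger model in a real system.
def handle_sql(query: str) -> str:
    return f"[sql] {query}"


def handle_rag(query: str) -> str:
    return f"[rag] {query}"


def handle_chat(query: str) -> str:
    return f"[chat] {query}"


# Assumed mapping from intent to processing path.
ROUTES: dict[str, Callable[[str], str]] = {
    "data_retrieval": handle_sql,
    "metadata": handle_rag,
    "conversation": handle_chat,
}


def route(query: str) -> str:
    """Classify the query, then dispatch it; unknown intents fall back to chat."""
    intent = classify_intent(query)
    return ROUTES.get(intent, handle_chat)(query)


print(route("How many tRFs are in rice?"))
print(route("What is a tRF?"))
print(route("hello there"))
```

Routing before any heavy work means simple conversational messages never trigger SQL generation or retrieval.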
For data-related questions, the larger LLM, guided by the database schema and a curated knowledge graph, translates your natural language query into one or more precise, executable SQL statements. Grounding the translation in the schema helps ensure that the correct tables and columns are queried.
The generated SQL queries are executed on read-only copies of our SQLite databases, so a query cannot modify the underlying data. The raw results are then processed with pandas to compute statistical summaries and prepare them for interpretation; large result sets are sampled to keep this step efficient.
Finally, the LLM transforms the structured data and statistical reports into a coherent, easy-to-understand conversational summary. The final response, along with a link to download the full dataset as a CSV file, is delivered to you as JSON.
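The shape of such a response might look like the sketch below. The field names (`summary`, `row_count`, `preview`, `download_url`), the example path, and both helper functions are hypothetical; the actual response schema is not described in this document.

```python
import csv
import io
import json


def rows_to_csv(columns: list[str], rows: list[tuple]) -> str:
    """Serialize the full result set as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def build_response(
    summary: str, columns: list[str], rows: list[tuple], download_url: str
) -> str:
    """Bundle the summary, a short preview, and the CSV link into one JSON string."""
    payload = {
        "summary": summary,
        "row_count": len(rows),
        "preview": [dict(zip(columns, r)) for r in rows[:2]],  # first rows only
        "download_url": download_url,
    }
    return json.dumps(payload)


columns = ["trf_id", "length"]
rows = [("t1", 18), ("t2", 22), ("t3", 31)]

csv_text = rows_to_csv(columns, rows)
response = build_response(
    "Three tRFs matched; lengths range from 18 to 31 nt.",
    columns,
    rows,
    "/downloads/example.csv",  # placeholder path, not a real endpoint
)
print(response)
```

Sending only a preview in the JSON body keeps responses small, while the CSV link gives access to every row.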
Our chatbots are connected to a comprehensive suite of specialized, in-house plant genomics databases developed by our research team.
| Database Name | Description |
|---|---|
| PbtRFdb | Documents biotic stress-responsive tRNA-derived fragments (tRFs). |
| PtRFdb | Provides tRF sequences from multiple species with expression and functional annotations. |
| PtncRNAdb | Contains plant tRNA-derived fragments with expression profiles and functional annotations. |
| AtFusionDB | Catalogs Arabidopsis thaliana fusion transcripts with detailed metadata. |
| PFusionDB | A repository of fusion transcripts from multiple plant species. |
| PlantPepDB | Houses plant-derived antimicrobial peptides (AMPs) and their properties. |
| Athisomir | Contains Arabidopsis isomiR profiles. |
| Cotton ncRNA Atlas | Contains data on cotton lncRNAs and miRNAs. |
| AlnC | A collection of long intergenic non-coding RNAs (lincRNAs). |
| ANNinter | Comprises annotated RNA-RNA interactions. |
To provide both focused and integrative analyses, our bots are organized into two classes: single-database specialists and multi-database integrators.
Figure: Chatbots are categorized as either single-database specialists or multi-database integrators to support different research queries.
The entire backend framework is deployed on a dedicated server running Ubuntu 22.04 LTS. The server's multi-core CPUs and RAM are provisioned to handle concurrent user requests, complex data processing with pandas, and in-memory caching, so the service stays responsive for the scientific community.