Home | PlantXBot

About Our Platform

The Plant Genomics Chatbot Hub is an AI-powered suite of tools designed to simplify access to specialized plant genomics data. Using natural language, researchers and students can now explore complex datasets related to tRFs, fusion transcripts, and other molecular classes across numerous plant species without writing a single line of code.

Technological Framework

Large Language Models (LLMs)

Our system uses a dual-LLM strategy via the Groq™ inference engine for speed and accuracy. A lightweight model (Llama-3.1-8B-Instant) classifies user intent, while a more powerful model (GPT-OSS-120B or Llama-3.3-70B-Versatile) handles complex reasoning, SQL generation, and summarization.
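The split between the two models can be pictured as a simple lookup from pipeline task to model. The sketch below is illustrative only: the task names and the `pick_model` helper are hypothetical, and the model identifier strings are assumed from the model names above rather than taken from our configuration.

```python
# Hypothetical task-to-model table; task names are assumptions for illustration.
LIGHT_MODEL = "llama-3.1-8b-instant"
HEAVY_MODEL = "llama-3.3-70b-versatile"

LIGHT_TASKS = {"intent_classification"}
HEAVY_TASKS = {"sql_generation", "reasoning", "summarization"}


def pick_model(task: str) -> str:
    """Return the model name assigned to a pipeline task."""
    if task in LIGHT_TASKS:
        return LIGHT_MODEL
    if task in HEAVY_TASKS:
        return HEAVY_MODEL
    raise ValueError(f"unknown task: {task!r}")
```

Keeping the cheap classification call on the small model means the larger model is only invoked for tasks that need it.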

Prompt Engineering

We use advanced prompt engineering with LangChain's `PromptTemplate` framework. Prompts are dynamically injected with database schemas, knowledge graphs, and conversation history to guide the LLM, ensuring outputs are contextually aware, syntactically valid, and semantically coherent.
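As a rough illustration of dynamic injection, the sketch below fills a template with a schema, prior turns, and the new question. It uses plain Python string formatting as a stand-in for LangChain's `PromptTemplate`; the prompt wording, the `build_prompt` helper, and the example table are hypothetical.

```python
# Plain-Python stand-in for a prompt template; the wording and field names
# are hypothetical, not the platform's actual prompt.
SQL_PROMPT = (
    "You are a SQL assistant for a plant genomics database.\n"
    "Schema:\n{schema}\n\n"
    "Conversation so far:\n{history}\n\n"
    "Question: {question}\n"
    "Return a single SQLite SELECT statement."
)


def build_prompt(schema: str, history: list[str], question: str) -> str:
    """Fill the template with the schema, recent turns, and the new question."""
    history_text = "\n".join(history) if history else "(none)"
    return SQL_PROMPT.format(schema=schema, history=history_text, question=question)


prompt = build_prompt(
    schema="fusions(gene_a TEXT, gene_b TEXT, tissue TEXT)",  # made-up table
    history=["User: Which tissues are covered?"],
    question="How many fusions are reported in root tissue?",
)
```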

Retrieval-Augmented Generation (RAG)

To answer questions beyond the scope of our SQL databases, we employ a RAG architecture. User queries are matched against a ChromaDB vector store containing embedded research articles, providing the LLM with relevant context to formulate accurate, fact-based answers.
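The retrieval step can be sketched as a nearest-neighbour search over embedded passages. The example below is a dependency-free stand-in for a vector store such as ChromaDB: the three-dimensional vectors are toy values, and `cosine` and `top_k` are hypothetical helpers, not part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy three-dimensional "embeddings"; real ones come from an embedding model.
store = [
    ("tRFs accumulate under biotic stress.", [0.9, 0.1, 0.0]),
    ("Fusion transcripts join two genes.", [0.1, 0.9, 0.0]),
    ("Cotton lncRNAs vary by tissue.", [0.0, 0.2, 0.9]),
]
context = top_k([0.8, 0.2, 0.1], store, k=1)
```

The retrieved passages would then be placed in the prompt as context for the answer.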

How It Works: A Multi-Stage Pipeline

Every query is processed through an orchestrated workflow to ensure accuracy and relevance, transforming your question into a clear, data-backed answer.

1. Intent Classification

First, your query is classified to determine its intent (e.g., data retrieval, metadata question, or general conversation). This allows the system to choose the most efficient path: querying the database, consulting our knowledge base via RAG, or providing a direct conversational reply.
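A minimal sketch of this routing step follows. The intent labels, branch names, and the pairing of metadata questions with the RAG branch are assumptions made for illustration; the fallback to a conversational reply on an unrecognized label is likewise a design choice of the sketch.

```python
from enum import Enum


class Intent(Enum):
    DATA_RETRIEVAL = "data_retrieval"
    METADATA = "metadata"
    CONVERSATION = "conversation"


def parse_intent(label: str) -> Intent:
    """Normalize a raw classifier label, falling back to conversation."""
    try:
        return Intent(label.strip().lower())
    except ValueError:
        return Intent.CONVERSATION


def route(intent: Intent) -> str:
    """Map a classified intent to the pipeline branch that should handle it."""
    routes = {
        Intent.DATA_RETRIEVAL: "sql_pipeline",
        Intent.METADATA: "rag_pipeline",  # assumed pairing
        Intent.CONVERSATION: "direct_reply",
    }
    return routes[intent]
```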

2. SQL Query Planning

For data-related questions, the larger LLM, guided by the database schema and a curated knowledge graph, translates your natural language query into one or more executable SQL statements. The resulting plan specifies which tables and columns are queried.
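One way to supply the schema to the model is to read it directly from SQLite and render it as text. The sketch below does this with the standard `sqlite3` module; `describe_schema` is a hypothetical helper and the `trf` table is made up for the example.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> str:
    """Render each table as 'name(col TYPE, ...)' for inclusion in a prompt."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        col_text = ", ".join(f"{c[1]} {c[2]}" for c in cols)
        lines.append(f"{table}({col_text})")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trf (trf_id TEXT, species TEXT, length INTEGER)")  # made-up table
schema_text = describe_schema(conn)
```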

3. Data Retrieval & Processing

The generated SQL queries are executed on read-only copies of our SQLite databases, so a query cannot modify the data. The raw results are then processed with pandas to compute statistical summaries and prepare them for interpretation; large result sets are sampled to keep processing efficient.

4. Summary Generation

Finally, the LLM transforms the structured data and statistical reports into a coherent, easy-to-understand conversational summary. The final response, along with a link to download the full dataset as a CSV, is delivered to you in a clean JSON format.
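The shape of such a response might look like the sketch below. The key names, the `build_response` helper, and the download path are hypothetical placeholders; the actual response fields are not specified here.

```python
import json


def build_response(summary: str, csv_url: str, row_count: int) -> str:
    """Serialize the final answer; all key names here are hypothetical."""
    payload = {
        "summary": summary,
        "download_url": csv_url,
        "row_count": row_count,
    }
    return json.dumps(payload, ensure_ascii=False)


body = build_response(
    summary="Found 42 tRFs matching the query.",
    csv_url="/downloads/example.csv",  # placeholder path
    row_count=42,
)
```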

Integrated Databases

Our chatbots are connected to a comprehensive suite of specialized, in-house plant genomics databases developed by our research team.

Database Name       Description
PbtRFdb             Documents biotic stress-responsive tRNA-derived fragments (tRFs).
PtRFdb              Provides tRF sequences from multiple species with expression and functional annotations.
PtncRNAdb           Contains plant tRNA-derived fragments with expression profiles and functional annotations.
AtFusionDB          Catalogs Arabidopsis thaliana fusion transcripts with detailed metadata.
PFusionDB           A repository of fusion transcripts from multiple plant species.
PlantPepDB          Houses plant-derived antimicrobial peptides (AMPs) and their properties.
Athisomir           Contains Arabidopsis isomiR profiles.
Cotton ncRNA Atlas  Contains data on cotton lncRNAs and miRNAs.
AlnC                A collection of long intergenic non-coding RNAs (lincRNAs).
ANNinter            Comprises annotated RNA-RNA interactions.

Chatbot Architecture

Single vs. Multi-Database Bots

To provide both focused and integrative analyses, our bots are organized into two classes:

  • Single-Database Bots: Tightly coupled to the schema of one resource (e.g., the AtFusionDB Bot). This ensures high precision for exploring a single dataset.
  • Multi-Database Bots: Designed to query across multiple databases with a shared molecular theme (e.g., the tRF Bot). This enables powerful comparative analyses.

Figure: Chatbots are categorized as either single-database specialists or multi-database integrators to support different research queries.
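The two bot classes can be modelled as a small configuration record that lists the databases each bot queries. This is a sketch under assumptions: `BotConfig` is a hypothetical type, and the set of databases assigned to the tRF Bot is chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotConfig:
    name: str
    databases: tuple[str, ...]

    @property
    def is_multi_database(self) -> bool:
        """A bot is multi-database when it queries more than one resource."""
        return len(self.databases) > 1


# Database groupings are illustrative; the tRF Bot's exact members are assumed.
BOTS = [
    BotConfig("AtFusionDB Bot", ("AtFusionDB",)),
    BotConfig("tRF Bot", ("PbtRFdb", "PtRFdb", "PtncRNAdb")),
]

multi = [bot.name for bot in BOTS if bot.is_multi_database]
```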

Computational Resources

The entire backend framework is deployed on a dedicated server running Ubuntu 22.04 LTS. The system is equipped with multi-core CPUs and sufficient RAM to handle concurrent user requests, complex data processing with pandas, and efficient in-memory caching. This robust infrastructure ensures a responsive and scalable experience for the scientific community.