DataGPT Documentation

Step-by-step guide to deploy, configure, and optimize your self-hosted AI analytics solution.

1. Quick Start

Clone the latest release from GitHub:

git clone https://github.com/coomia-ai/datagpt.git
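
git clone checks out the default branch. If you want a specific tagged release instead (assuming the project tags its releases), check one out after cloning:

cd datagpt
git fetch --tags
git checkout "$(git describe --tags --abbrev=0)"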

Before running, make sure you have:

- Docker and Docker Compose installed
- Network access to the databases you plan to connect (PostgreSQL by default)
- An API key for your chosen LLM provider (see the LLM config below)

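To confirm the container tooling is in place:

docker --version
docker-compose --version
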
2. Configuration

Copy .env.example to .env and adjust the parameters to match your setup.
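
To create the file from the template:

cp .env.example .env

The annotated template below walks through each section: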

###########################################################
#                     DATABASE CONFIG                      #
###########################################################
# Primary metadata database for DataGPT
DB_DRIVER=postgresql
DB_HOST=postgres
DB_USER=postgres
DB_PASS=postgresql123
DB_PORT=5432
DB_NAME=datagpt
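
# NOTE: hostnames in this file (postgres, redis, trino, ollama-service)
# are assumed to be Docker Compose service names; when connecting from
# the host machine instead, use localhost and the published ports.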


###########################################################
#                         CACHE CONFIG                    #
###########################################################
# Cache backend: "memory" (default) or "redis"
CACHE_TYPE=memory

# If you want to use Redis, uncomment these:
# CACHE_TYPE=redis
# REDIS_HOST=redis
# REDIS_PORT=6379
# REDIS_DB=0
# REDIS_PASSWORD=123456    # Optional
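
# NOTE: REDIS_PASSWORD must match the requirepass value configured on
# your Redis server; omit it if Redis runs without authentication.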


###########################################################
#                   DATA SOURCE CONFIG                    #
###########################################################
# Database to search for data and analysis
# Supported: postgresql / mysql / trino / clickhouse / doris
SEARCH_DB_TYPE=postgresql
SEARCH_DB_HOST=postgres
SEARCH_DB_PORT=5432
SEARCH_DB_USER=postgres
SEARCH_DB_PASSWORD=postgresql123
SEARCH_DB_SCHEMA=public
SEARCH_DB_NAME=datagpt

###########################################################
#                   DATA QUALITY EXECUTION ENGINE         #
###########################################################
TRINO_HOST=trino
TRINO_PORT=8080
TRINO_USER=admin
TRINO_CATALOG=postgresql  # The catalog you want to use for data quality checks
TRINO_SCHEMA=public # The schema you want to use for data quality checks
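
# NOTE: the catalog and schema must already exist in your Trino
# deployment (catalogs are defined via Trino's etc/catalog/*.properties files).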


###########################################################
#                 VECTOR DATABASE CONFIG                  #
###########################################################
# Default vector store is pgvector (embedded in PostgreSQL)
VECTOR_DB=pgvector
VECTOR_COLLECTION=schema_info
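
# NOTE: pgvector needs the extension enabled in PostgreSQL; if your
# image does not do this automatically, run:
#   CREATE EXTENSION IF NOT EXISTS vector;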

# To use Qdrant instead, uncomment:
# VECTOR_DB=qdrant
# QDRANT_HOST=qdrant
# QDRANT_PORT=6333

# Minimum interval (in seconds) between vector reindex operations used
# to detect schema changes. Default: 86400 (24 hours).
#VECTOR_REINDEX_INTERVAL_SECONDS=86400

###########################################################
#                EMBEDDING MODEL CONFIG                   #
###########################################################
# Embedding model provider:
# - ollama   → local inference (no token cost, keeps data on-premises)
# - openai   → best quality, recommended for production
EMBEDDING_MODEL_TYPE=ollama
EMBEDDING_MODEL_NAME=nomic-embed-text
EMBEDDING_MODEL_DIMENSIONS=768
OLLAMA_HOST=http://ollama-service:11434
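
# Pull the embedding model into Ollama before first use:
#   ollama pull nomic-embed-text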

# OpenAI embeddings example:
# EMBEDDING_MODEL_TYPE=openai
# EMBEDDING_MODEL_NAME=text-embedding-3-small
# OPENAI_API_KEY=your_openai_api_key


###########################################################
#                     LARGE LLM CONFIG                    #
###########################################################
# Supported LLM providers:
#   openai     → OpenAIChat
#   claude     → Claude / Anthropic
#   anthropic  → Claude
#   deepseek   → DeepSeek
#   qwen       → DashScope (Alibaba)
#   kimi       → Kimi
#   doubao     → Doubao (ByteDance)
#   google     → Gemini
#   meta       → LlamaOpenAI
#   xai        → xAI (Grok)

# Recommended selections:
# - China mainland: qwen / kimi / doubao (low latency)
# - Global users: openai / claude / deepseek (high quality)
# - Local inference: meta (Llama via OpenAI-compatible server)

MODEL_PROVIDER=qwen
MODEL_ID=qwen-flash-2025-07-28
MODEL_API_KEY=your_api_key_here
#MODEL_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1


###########################################################
#                      API SERVER CONFIG                  #
###########################################################
API_PORT=8000
AGENT_URL=http://api:8000/
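
Once the .env is filled in, you can sanity-check the external services before starting DataGPT. A minimal sketch, assuming the commands run somewhere the Compose service hostnames resolve (e.g. inside the Compose network); from the host machine, substitute localhost and the published ports:

psql "postgresql://postgres:postgresql123@postgres:5432/datagpt" -c "SELECT 1;"   # metadata database
curl -s http://trino:8080/v1/info                                                 # Trino coordinator
redis-cli -h redis -p 6379 -a 123456 ping                                         # Redis, if enabled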

LLM & Embedding Recommendations

Model                          Use Case                        Performance  Notes
OpenAI (gpt-4o / gpt-4o-mini)  General-purpose                 Fast         Recommended for users with a ChatGPT account; pair with text-embedding-3-small
Qwen / Kimi                    Large-scale schema & analytics  Fast         Recommended for users in mainland China
DeepSeek                       Advanced analytics              Slower       Good for deep querying, but slower than the others
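
For example, to follow the OpenAI row above, the relevant .env lines would look like this (a sketch; keys are placeholders, and text-embedding-3-small returns 1536-dimensional vectors by default):

MODEL_PROVIDER=openai
MODEL_ID=gpt-4o-mini
MODEL_API_KEY=your_openai_api_key
EMBEDDING_MODEL_TYPE=openai
EMBEDDING_MODEL_NAME=text-embedding-3-small
EMBEDDING_MODEL_DIMENSIONS=1536
OPENAI_API_KEY=your_openai_api_key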

3. Running DataGPT

Before bringing the system up, update the license.json file in the repository with your own license information. Then start the stack with Docker Compose:

docker-compose up -d

Access the system:

- API server: http://localhost:8000 (assuming the default API_PORT of 8000 is published by your compose file)
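
To confirm the containers came up and to follow the API logs (the api service name is an assumption based on AGENT_URL above):

docker-compose ps
docker-compose logs -f api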

4. Scaling & Resources

Recommended system specs based on dataset size:

Dataset       CPU        RAM        Usage
<1M rows      2 cores    8 GB       Development
1–100M rows   4–8 cores  16–32 GB   Production
>100M rows    8+ cores   64+ GB     Distributed mode
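
If you cap container resources in Docker Compose, a production-tier allocation from the table might look like this (a sketch; the api service name is assumed, and you may want limits on the database and Trino services too):

services:
  api:
    deploy:
      resources:
        limits:
          cpus: "8"
          memory: 32G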

5. Tips & Best Practices