This is a living repo gathering the most important evals, monitoring tooling, and test cases for the continuous development and improvement of agents. We'll add monitoring and evaluation tooling plus standardized capability test cases (e.g. function calling, agent communication) to a basic agent application.
Check out the Weave Workspace here!
- Install `requirements_verbose.txt` in your environment (for Mac Silicon)
- Set up `benchmark.env` in `./configs` with the required API key (`WANDB_API_KEY`) and optional keys (`HUGGINGFACEHUB_API_TOKEN`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`)
- Set the variables in `general_config.yaml` accordingly - set Entity and Project (device is CPU only for now); see the config-loading sketch after this list
- Set `setup = True` the first time you run to extract the data and generate the dataset
- Adjust the chat model, embedding model, judge model, prompts, and params as you want
- Run `main.py` with different configs, or run `streamlit run chatbot.py` to track interactions with an already deployed model
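For orientation, here is a minimal sketch of how the env file and the YAML config might be loaded before the run. Only the file paths and `WANDB_API_KEY` come from the steps above; the YAML keys shown are hypothetical.

```python
# Minimal sketch (assumption: main.py wires up the configs roughly like this).
import os

import yaml
from dotenv import load_dotenv  # python-dotenv

# Load API keys from ./configs/benchmark.env (WANDB_API_KEY is required;
# HUGGINGFACEHUB_API_TOKEN / OPENAI_API_KEY / ANTHROPIC_API_KEY are optional).
load_dotenv("./configs/benchmark.env")
assert os.environ.get("WANDB_API_KEY"), "WANDB_API_KEY is needed for Weave tracking"

# Load entity, project, models, prompts, and params from the central config.
with open("./configs/general_config.yaml") as f:
    config = yaml.safe_load(f)

print(config.get("entity"), config.get("project"))  # hypothetical key names
```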
- `main.py` - contains the main application flow and serves as an example of bringing everything together
- `setup.py` - contains utility functions for the RAG model `RagModel(weave.Model)` and the data extraction and dataset generation functions
- `evaluatie.py` - contains the `weave.flow.scorer.Scorer` classes to evaluate correctness, hallucination, and retrieval performance (see the first sketch after this file list)
- `./configs` - the configs of the project
- `./configs/benchmark.env` - should contain env vars for your W&B account and the model providers you want to use (HuggingFace, OpenAI, Anthropic, Mistral, etc.)
- `./configs/requirements.txt` - environment file to install the dependencies needed to run the RAG model
- `./configs/sources_urls.csv` - a CSV listing all the websites and PDFs that should be considered by the RAG model
- `./configs/general_config.yaml` - the central config file with models, prompts, and params
- `annotate.py` - can be run with `streamlit run annotate.py` to annotate existing datasets, or to fetch datasets based on production function calls, annotate them, and save them as a new dataset
- `chatbot.py` - can be run with `streamlit run chatbot.py` to serve the RAG model from Weave and track the questions asked to it (see the second sketch below)
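To make the file descriptions more concrete, here is a hedged sketch of the `weave.Model` / `weave.flow.scorer.Scorer` pattern behind `setup.py` and `evaluatie.py`. The field names, the `predict` signature, and the scoring logic are illustrative assumptions, not the repo's actual implementation.

```python
# Illustrative sketch only; field names, signatures, and logic are assumptions.
import weave
from weave.flow.scorer import Scorer


class RagModel(weave.Model):
    chat_model: str
    system_prompt: str

    @weave.op()
    def predict(self, question: str) -> dict:
        # The real model retrieves context and calls the chat model here.
        return {"answer": "stub answer", "context": []}


class CorrectnessScorer(Scorer):
    @weave.op()
    def score(self, output: dict, question: str) -> dict:
        # The real scorers would use the judge model to grade correctness,
        # hallucination, or retrieval quality.
        return {"correct": len(output["answer"]) > 0}
```

And a rough sketch of the `chatbot.py` flow: pull a published `RagModel` from Weave, serve it in a Streamlit chat, and let Weave trace each question. The entity/project string and the `RagModel:latest` reference are assumptions.

```python
# Hedged sketch of serving the published model; names and refs are assumptions.
import streamlit as st
import weave

weave.init("my-entity/my-project")          # assumption: entity/project from general_config.yaml
model = weave.ref("RagModel:latest").get()  # assumption: a previously published RagModel

question = st.chat_input("Ask the RAG model a question")
if question:
    with st.chat_message("user"):
        st.write(question)
    result = model.predict(question)        # each call is traced by Weave
    with st.chat_message("assistant"):
        st.write(result)
```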