Hypothesis: "If we train an LLM on detecting peacock behavior, then we can learn if it can detect this policy violation with at least >70% precision and >50% recall and ultimately, decide if said LLM is effective enough to power a new Edit Check and/or Suggested Edit."
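For reference, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the counts of true positives, false positives, and false negatives on a labeled test set. In other words, the hypothesis asks the model to keep false alarms low (precision above 0.70) while still catching at least half of the true violations (recall above 0.50).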
Conclusion
Confirm if the hypothesis was supported or contradicted
The hypothesis was contradicted: the LLMs we tested did not reach the expected precision and recall. Other AI-based models (i.e., smaller language models) came close to the target precision and recall, but their numbers were still below the thresholds.
Briefly describe what was accomplished over the course of the hypothesis work.
- We tested two LLMs (AYA23 and Gemma) on the peacock-detection task, trying a set of prompts and prompting strategies (zero-shot, few-shot); a zero-shot sketch follows this list.
- Specifically, we tested on 522 examples (261 positive and 261 negative).
- The best LLM result came from AYA23 with a zero-shot approach: a precision of 0.54 and a recall of 0.24.
- Although the LLMs captured a signal (they performed better than random), their precision and recall were low compared with simpler models such as smaller language models (BERT-like).
- The baseline model, a fine-tuned BERT model, was tested on the same data and achieved a precision of 0.72 and a recall of 0.40: much better than the LLM approach, but still below our target.
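As a rough illustration of the zero-shot setup mentioned above, here is a minimal sketch of how an LLM can be prompted as a binary peacock classifier and scored for precision and recall. It assumes a Hugging Face text-generation pipeline; the model ID, prompt wording, and placeholder test examples are all illustrative assumptions, not the exact prompts or data used in the experiments.

```python
# Minimal zero-shot sketch (illustrative; model ID, prompt, and data are
# assumptions, not the exact setup used in the experiments above).
from transformers import pipeline
from sklearn.metrics import precision_score, recall_score

generator = pipeline("text-generation", model="CohereForAI/aya-23-8B")  # assumed checkpoint

PROMPT = (
    "Wikipedia's 'peacock' guideline forbids promotional, puffery language.\n"
    "Does the following text violate this guideline? Answer YES or NO.\n\n"
    "Text: {text}\nAnswer:"
)

def predict(text: str) -> int:
    """Return 1 if the model flags the text as peacock language, else 0."""
    out = generator(PROMPT.format(text=text), max_new_tokens=3)[0]["generated_text"]
    answer = out.split("Answer:")[-1].strip().upper()
    return 1 if answer.startswith("YES") else 0

# Tiny placeholder test set; the real evaluation used 522 labeled examples
# (261 positive, 261 negative).
test_set = [
    ("Acme is a world-renowned, award-winning visionary in its field.", 1),
    ("Acme was founded in 1998 and is headquartered in Berlin.", 0),
]
y_true = [label for _, label in test_set]
y_pred = [predict(text) for text, _ in test_set]
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall:", recall_score(y_true, y_pred, zero_division=0))
```

Swapping the model ID (e.g., to a Gemma checkpoint) reuses the same loop unchanged, which is the flexibility advantage of LLMs noted in the lessons below.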
Major lessons
- LLMs are a promising technology that makes it simple to build an AI-based model: with a single prompt, the LLM was able to capture a signal for "promotional language" in Wikipedia articles.
- However, the precision and recall on this type of task are still not high enough to build a reliable product.
- On the other hand, more established technologies, such as BERT, require more work to train a model but give better results (see the fine-tuning sketch after this list).
- In the future, when building AI-based models for policy-violation detection, both approaches should be tested.
- LLMs offer a lot of flexibility, can work in scenarios without much training data, and potentially a single LLM could handle several policies/tasks.
- In contrast, smaller models offer higher accuracy and need significantly fewer computational resources, but they require training data and may imply greater maintenance effort because they are task/policy specific.
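To make the trade-off concrete, below is a minimal sketch of the fine-tuning route for a smaller, task-specific classifier, assuming the Hugging Face transformers and datasets libraries; the checkpoint, hyperparameters, and placeholder data are illustrative assumptions, not the configuration of the baseline reported above.

```python
# Minimal fine-tuning sketch for a BERT-like peacock classifier
# (illustrative; checkpoint, hyperparameters, and data are assumptions,
# not the exact setup of the reported baseline).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder data: real training would use many labeled peacock/non-peacock examples.
train = Dataset.from_dict({
    "text": ["An award-winning, world-class visionary leader.",
             "The company reported revenue of $10M in 2020."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peacock-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```

The extra setup (labeled data, tokenization, a training run per policy) is the cost side of the trade-off; the payoff is the higher precision reported above at a much lower inference cost than an LLM.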