Let's talk about prompt injection detection, and how, with relatively little effort, we can improve it significantly.
Most detection systems are trained on a specific public dataset. We asked: why not train a model on all of the available datasets? It is a straightforward approach, but it yielded significant results.
Our new model, built by fine-tuning Microsoft’s DeBERTaV3, achieved approximately 99% accuracy, compared to the roughly 90% we observed with other models, along with a significant reduction in false positives. In our evaluation, the model achieved a false positive rate of 0.002, or about one false positive for every 500 detections, surpassing a well-accepted public model's rate of 0.017, which corresponds to a false positive roughly every 60 detections.
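To make these figures concrete, here is a minimal sketch of how accuracy and false positive rate can be computed for a binary injection classifier with scikit-learn. The labels and predictions shown are hypothetical placeholders, not our evaluation data.

```python
# Minimal sketch: accuracy and false positive rate for a binary injection
# classifier (1 = injection, 0 = benign). The values below are placeholders.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]   # ground-truth labels (hypothetical)
y_pred = [0, 1, 1, 1, 0, 1]   # model predictions (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = accuracy_score(y_true, y_pred)
fpr = fp / (fp + tn)          # benign prompts incorrectly flagged as injections

print(f"accuracy={accuracy:.3f}, false positive rate={fpr:.3f}")
```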
Challenges in Dataset Integration: Overcoming Data Engineering Hurdles
We initially assessed two leading pre-trained models, protectai/deberta-v3-base-prompt-injection and deepset/deberta-v3-base-injection, both of which report over 99% accuracy on the HuggingFace platform. We then evaluated their performance on 17 HuggingFace and GitHub datasets, including deepset/prompt-injections, JasperLS/prompt-injections, Harelix/Prompt-Injection-Mixed-Techniques-2024, imoxto/prompt_injection_cleaned_dataset-v2, and others, to gauge their real-world efficacy.
During testing, we could not reproduce the published HuggingFace metrics on our evaluation sets. The 'protectai/deberta-v3-base-prompt-injection' model achieved 90% accuracy, compared to the reported 99.99%, and the 'deepset/deberta-v3-base-injection' model achieved 84%, compared to the reported 99.14%.
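For readers who want to run a similar check, here is a minimal sketch of scoring one of these off-the-shelf models against a public dataset with the transformers and datasets libraries. The dataset split, column names, and the 'INJECTION' label string are assumptions that should be verified against the specific model and dataset cards.

```python
# Minimal sketch: evaluating an off-the-shelf injection classifier on a
# public dataset. Column names and label conventions vary per dataset, so
# the mapping below is an assumption, not our exact evaluation harness.
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="protectai/deberta-v3-base-prompt-injection",
)

ds = load_dataset("deepset/prompt-injections", split="test")  # assumed split

correct = 0
for example in ds:
    pred = classifier(example["text"], truncation=True)[0]["label"]
    pred_label = 1 if pred == "INJECTION" else 0  # assumed label string
    correct += int(pred_label == example["label"])

print(f"accuracy: {correct / len(ds):.3f}")
```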
Given the critical role of dataset diversity in fine-tuning, we consolidated the 17 HuggingFace and GitHub datasets containing prompt injection and jailbreak data into a single, comprehensive dataset, which yielded superior detection accuracy.
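A simplified sketch of that consolidation step is shown below, using the datasets library to normalize each source to a shared text/label schema and merge them. The source list and column names here are illustrative rather than our exact pipeline.

```python
# Minimal sketch: merging several prompt-injection datasets into one corpus.
# Column names differ across sources in practice, so each source may need its
# own normalization; this sketch assumes a shared "text"/"label" schema.
from datasets import concatenate_datasets, load_dataset

sources = [
    "deepset/prompt-injections",
    "JasperLS/prompt-injections",
    # ... remaining HuggingFace / GitHub sources
]

normalized = []
for name in sources:
    ds = load_dataset(name, split="train")
    # Keep only the shared schema: a prompt string and a binary injection label.
    ds = ds.select_columns(["text", "label"])
    normalized.append(ds)

combined = concatenate_datasets(normalized).shuffle(seed=42)
splits = combined.train_test_split(test_size=0.2, seed=42)
print(splits)
```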
For regular updates and insights from Knostic research, follow us on LinkedIn.