Malicious Query Detection Using Machine Learning and a Lexical Analyzer

This project improves application-layer security by identifying and blocking malicious SQL queries before they execute. It combines rule-based lexical analysis with data-driven machine learning to provide a lightweight and effective intrusion detection mechanism.

Problem Statement

Traditional web applications are vulnerable to SQL injection attacks due to inadequate query validation. Detecting such malicious queries before execution remains a critical challenge. The goal is to build a model that can distinguish between benign and harmful SQL inputs using lexical and statistical features.

Results

Several machine learning models were trained and evaluated, and the Support Vector Machine (SVM) performed best. On the test set, the SVM achieved 99.37% accuracy, 99.44% precision, and a 99.32% F1-score. Logistic Regression followed closely with 99.08% accuracy. Random Forest had a high true positive rate but produced 1194 false positives, which makes it unsuitable for real-time systems, where every false positive blocks a legitimate query. The final model was selected to minimize both false positives and computational cost.

The resulting system adds a practical, efficient layer of defense for web applications against one of the most common vulnerabilities. The lexical analysis stage also adds explainability: the extracted tokens and features help show why a query was flagged.
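For reference, the accuracy, precision, and F1-score figures above are all derived from confusion-matrix counts. The following is a minimal sketch of that arithmetic in Python. The function name classification_metrics and the counts in the usage lines are illustrative assumptions, not the project's code or results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts.

    tp/fn: malicious queries flagged / missed.
    fp/tn: benign queries flagged / passed.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Share of benign queries that would be wrongly blocked.
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Made-up counts for illustration only (not the project's results).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A model can keep a high true positive rate (recall) while its false positive count grows, which is why precision and the false positive rate are worth checking alongside accuracy when comparing candidates.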

Methodology

This project was designed to identify and prevent SQL injection attacks before they reach the database. I started by collecting a dataset of 30,873 queries, which included both malicious and benign examples. The queries were tokenized using a lexical analyzer built with Flex, which breaks each input into identifiers, keywords, and operators and also captures structure-level properties such as nesting and clause frequency. From these tokens, I engineered a range of features, including query length, the frequency of dangerous patterns such as tautologies (e.g., OR 1=1), and overall keyword weight.
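The tokenize-then-extract pipeline described above can be sketched roughly as follows. This is not the project's Flex lexer: it is a simplified Python stand-in built on the re module, and the token classes, the KEYWORD_WEIGHTS values, and the function names tokenize and extract_features are all assumptions made for illustration.

```python
import re

# Hypothetical keyword weights; the project's actual values are not given.
KEYWORD_WEIGHTS = {
    "SELECT": 1, "FROM": 1, "WHERE": 1, "AND": 1, "OR": 2,
    "INSERT": 2, "UPDATE": 2, "DELETE": 2, "UNION": 3, "DROP": 3,
}

# One named group per token class, tried in order (comments before operators
# so that "--" is not read as two minus signs).
TOKEN_RE = re.compile(r"""
    (?P<comment>--[^\n]*|\#[^\n]*|/\*.*?\*/)
  | (?P<string>'(?:[^'\\]|\\.)*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<operator><=|>=|<>|!=|=|<|>|\+|-|\*|/)
  | (?P<punct>[(),;.])
""", re.VERBOSE | re.DOTALL)


def tokenize(query):
    """Return a list of (kind, text) pairs; unmatched characters are skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(query):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            # Words in the keyword table are keywords, the rest identifiers.
            if text.upper() in KEYWORD_WEIGHTS:
                kind, text = "keyword", text.upper()
            else:
                kind = "identifier"
        tokens.append((kind, text))
    return tokens


def extract_features(query):
    """Compute a few lexical features of the kind described in the text."""
    tokens = tokenize(query)
    depth = max_depth = tautologies = 0
    for i, (kind, text) in enumerate(tokens):
        if kind == "punct" and text == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif kind == "punct" and text == ")":
            depth = max(depth - 1, 0)
        # Simple tautology check: OR <literal> = <identical literal>.
        if (kind, text) == ("keyword", "OR") and i + 3 < len(tokens):
            left, op, right = tokens[i + 1:i + 4]
            if (left[0] in ("number", "string")
                    and op == ("operator", "=") and left == right):
                tautologies += 1
    keywords = [text for kind, text in tokens if kind == "keyword"]
    return {
        "length": len(query),
        "token_count": len(tokens),
        "keyword_count": len(keywords),
        "keyword_weight": sum(KEYWORD_WEIGHTS[k] for k in keywords),
        "tautology_count": tautologies,
        "max_nesting": max_depth,
        "comment_count": sum(1 for kind, _ in tokens if kind == "comment"),
    }


print(extract_features("SELECT * FROM users WHERE id = 1 OR 1=1 --"))
```

The sketch only catches tautologies whose two sides are the same literal token. Payloads that rely on unbalanced quotes or on expressions such as 2>1 would need more lexer rules, which is the kind of case a dedicated Flex specification is better suited to handle.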

Conclusion

This project demonstrates how traditional compiler techniques can be successfully integrated with machine learning to improve web application security. It adds a layer of intelligent query filtering that is fast, interpretable, and ready for deployment in real-world applications.