Chapter 2 – Machine Learning Basics for Security

2.1 Core Machine‑Learning Concepts

  • Supervised vs. Unsupervised: Labelled data for anomaly detection vs. clustering of unknown patterns.
  • Feature Engineering: Transform raw logs, network flows, or threat‑intel into numeric vectors.
  • Model Evaluation: Accuracy, precision/recall, ROC‑AUC, and the importance of a realistic validation set.
  • Overfitting & Regularization: Techniques such as dropout, L1/L2 penalties, and cross‑validation.

2.2 Data Pipelines for Security

  • Ingestion: Beats, Logstash, or custom collectors feeding into a central store (Elasticsearch, PostgreSQL, or a vector database).
  • Pre‑processing: Normalization, tokenization, and embedding generation (e.g., Sentence‑Transformers for log text).
  • Storage: Vector databases (FAISS, Milvus) for similarity search; relational DBs for structured telemetry.
  • Serving: REST/GraphQL endpoints or batch jobs that score new data against the trained model.

2.3 Model Types & Use‑Cases

Model Typical Security Use‑Case Example Tool
Logistic Regression Binary threat vs. benign classification scikit‑learn
Random Forest Feature‑rich anomaly detection scikit‑learn
Autoencoder Unsupervised anomaly detection PyTorch / TensorFlow
LLM (e.g., Llama‑2) Log summarization, IOC extraction Hugging Face Transformers
Graph Neural Network Threat‑intel graph analysis PyTorch Geometric

2.4 Open‑Source AI Tools for Security

  • Llama‑2: Large‑language model for natural‑language log analysis and policy generation.
  • Mistral: Lightweight LLM for on‑prem inference.
  • Sentence‑Transformers: Generate dense embeddings for semantic similarity.
  • FAISS: Efficient similarity search for large embedding collections.
  • Stable‑Baselines3: Reinforcement‑learning framework for automated play‑book generation.

This chapter equips readers with the foundational ML knowledge needed to build and evaluate security‑specific models, and introduces the open‑source tools that will be explored in later chapters.