Machine Learning & Artificial Intelligence in Cybersecurity: Hype vs Reality
by Amit Sinha, CTO and Executive Vice President of Engineering and Cloud Operations at Zscaler
In the last few years, we have witnessed a renaissance in machine learning (ML) and artificial intelligence (AI). AI broadly refers to the ability of machines to "think" like humans and perform tasks considered "smart," without being explicitly programmed to do so. ML is a subset of AI. ML algorithms build a mathematical model based on training data, and they leverage the model to make predictions when new data is provided. For example, a computer-vision ML model can be trained with millions of sample images that have been labeled by humans so that it can automatically classify objects in a new image.
AI and ML principles have been around for decades. AI's recent surge in popularity is a direct result of two factors. First, AI/ML algorithms are computationally intensive. The availability of cloud computing has made it feasible to run these algorithms practically. Second, training AI/ML models requires massive amounts of data. The availability of big data platforms and digital data has improved the effectiveness of AI/ML, making them better than humans in many applications.
Cybersecurity is a promising area for AI/ML. In theory, if a machine has access to everything you currently know is bad, and everything you currently know is good, you can train it to find new malware and anomalies when they surface. In practice, there are three fundamental requirements for this to work. First, you need access to data -- lots of it. The more malware and benign samples you have, the better your model will be. Second, you need data scientists and data engineers to be able to build a pipeline to process the samples continuously and design models that will be effective. Third, you need security domain experts to be able to classify what is good and what is bad and be able to provide insights into why that is the case. In my opinion, many companies touting AI/ML-powered security solutions lack one or more of these pillars.
A core principle of security is defense in depth. Defense in depth refers to having multiple layers of security and not relying on just one technology (like AI/ML). There is hype around new AI/ML-powered endpoint security products that claim to “do it all.” But if you want to protect a user from cyber threats, you need to make sure all content the user accesses is scanned, and you have to keep the user’s system patched and up to date. Scanning all files before allowing download requires the ability to intercept SSL-encrypted communications between the user’s client and the destination server. Otherwise, the scanner will be blind to that traffic. Scanning all files also takes time and can introduce latency, resulting in user experience issues. As such, quickly blocking the obviously bad stuff and immediately allowing already-white-listed stuff is a good way to balance security with user experience.
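As a rough illustration of that fast path, the sketch below looks up a file's hash in two sets before any slower scanning happens. The names `KNOWN_BAD`, `KNOWN_GOOD` and `triage`, and the sample byte strings, are hypothetical and chosen only for this example; a real deployment would draw its lists from a threat-intelligence feed.

```python
import hashlib

# Hypothetical threat-intelligence sets, keyed by the SHA-256 digest of file content.
KNOWN_BAD = {hashlib.sha256(b"known-malware-sample").hexdigest()}
KNOWN_GOOD = {hashlib.sha256(b"trusted-installer").hexdigest()}


def triage(content: bytes) -> str:
    """Return 'block', 'allow', or 'unknown' for a file's raw bytes."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in KNOWN_BAD:
        return "block"      # obviously bad: stop it immediately
    if digest in KNOWN_GOOD:
        return "allow"      # already white-listed: no added latency
    return "unknown"        # no verdict yet: needs deeper analysis


if __name__ == "__main__":
    for sample in (b"known-malware-sample", b"trusted-installer", b"never-seen-before"):
        print(sample, "->", triage(sample))
```

Only files that come back as "unknown" would need the slower, deeper inspection, which is what keeps the common case fast for the user.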
Once known threat intelligence has been applied and no verdict is available, we enter the realm of unknown threats, also known as zero-day threats. Zero-day threats don’t have known, recognizable signatures. Sandboxing is used to analyze such unknown threats. Sandboxing involves running a suspicious file in a virtual machine sandbox that mimics the end user’s computer and then determining whether the file is good or bad based on its observed behavior. This process, during which the user’s file is quarantined, can take several minutes. Users love instant gratification, and they hate waiting. A properly trained AI/ML model can deliver a good or bad verdict for such files in milliseconds. New attacks often use exploit kits, and they may borrow delivery and exfiltration techniques from previous attacks. AI/ML models can be trained to detect these polymorphic variants.
An important consideration when using AI/ML for malware detection is the ability to provide a reasonable explanation as to why a sample was classified as malicious. When a customer asks why a file was blocked, the answer cannot be “because our AI/ML said so.” Having security domain experts who understand which attributes or behaviors were triggered and who are able to analyze false positives and false negatives is important, not just for understanding why a certain prediction was made, but also for iteratively improving model prediction accuracy.
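One simple way to make a verdict explainable is to use a model whose score breaks down into per-attribute contributions. The sketch below assumes a linear (logistic) model with hand-picked weights. The attribute names, weights and bias are invented for illustration and do not describe any particular product's model.

```python
import math

# Hypothetical weights: positive values push the score toward "malicious".
WEIGHTS = {
    "writes_to_startup_folder": 2.0,
    "packed_executable": 1.5,
    "contacts_new_domain": 1.0,
    "signed_by_known_vendor": -3.0,
}
BIAS = -2.0


def score(features: dict) -> float:
    """Probability-like maliciousness score from a logistic model."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def explain(features: dict) -> list:
    """List (attribute, contribution) pairs, largest absolute effect first."""
    contributions = [(name, WEIGHTS[name] * value) for name, value in features.items()]
    triggered = [c for c in contributions if c[1] != 0]
    return sorted(triggered, key=lambda c: abs(c[1]), reverse=True)


if __name__ == "__main__":
    sample = {
        "writes_to_startup_folder": 1,
        "packed_executable": 1,
        "contacts_new_domain": 0,
        "signed_by_known_vendor": 0,
    }
    print("score:", round(score(sample), 3))
    for name, contribution in explain(sample):
        print(f"  {name}: {contribution:+.1f}")
```

With a breakdown like this, an analyst can tell a customer which behaviors drove the decision, and can spot a weight that is causing false positives.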
When it comes to training AI/ML models, a popular debate is whether “supervised” or “unsupervised” learning should be used. Supervised learning derives a prediction model from labeled data and the features extracted from that data. For malware, this means human experts classify each sample in the data set as good or bad, and feature engineering is performed prior to training to determine which attributes of the samples are relevant to the prediction model. Unsupervised learning gleans patterns and determines structure from data that is not labeled or categorized. Unsupervised learning proponents claim that it is not limited by the boundaries of human classification and remains free from feature-selection bias. However, the effectiveness of fully unsupervised learning in security still needs to be proven at scale. With unsupervised models, it can also be hard to explain why something was marked good or bad.
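The difference can be seen in a toy example. The sketch below uses two deliberately simple stand-in techniques: a nearest-centroid classifier for the supervised case and a two-cluster k-means for the unsupervised case. The two features per sample (say, an entropy score and a count of suspicious API imports) and all data values are made up for illustration.

```python
def mean(points):
    """Component-wise average of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_supervised(samples, labels):
    """Supervised: use the human-provided labels to build one centroid per class."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: mean(points) for label, points in groups.items()}


def predict(centroids, sample):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda label: dist2(centroids[label], sample))


def two_means(samples, iterations=10):
    """Unsupervised: split samples into two clusters without using any labels."""
    centers = [samples[0], samples[-1]]  # naive deterministic starting points
    assignment = []
    for _ in range(iterations):
        assignment = [min((0, 1), key=lambda c: dist2(centers[c], s)) for s in samples]
        for c in (0, 1):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean(members)
    return assignment


# Hypothetical feature vectors: (entropy score, suspicious import count).
SAMPLES = [(0.20, 1), (0.30, 2), (0.25, 0), (0.90, 9), (0.80, 8), (0.95, 10)]
LABELS = ["good", "good", "good", "bad", "bad", "bad"]

if __name__ == "__main__":
    model = fit_supervised(SAMPLES, LABELS)
    print("supervised verdict:", predict(model, (0.85, 7)))
    print("unsupervised clusters:", two_means(SAMPLES))
```

The supervised model returns a named verdict because experts supplied the names. The unsupervised routine returns only cluster numbers, so someone still has to decide what each cluster means, which is one reason such results can be hard to explain.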
Some classes of security challenges are better suited for AI/ML than others. Phishing detection, for example, has a significant visual component. An adversary will use logos, images and other “look-and-feel” elements to make a fake website look like its legitimate counterpart. Significant advances in AI/ML vision algorithms have resulted in the ability to apply techniques to detect fake websites designed to trick unsuspecting users. AI/ML algorithms can also be used for detecting anomalous user behavior, learning a baseline of what a user normally does and flagging when there is a significant departure from the norm.
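The behavior-baseline idea can be sketched with basic statistics. The example below is a simplification that stands in for a learned model: it flags an observation when it falls more than a chosen number of standard deviations from a user's history. The function name `is_anomalous`, the threshold of 3.0 and the upload figures are assumptions made for this illustration.

```python
import statistics


def is_anomalous(baseline, observed, threshold=3.0):
    """Flag `observed` if it deviates from the baseline by more than `threshold` standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # needs at least two baseline values
    if sigma == 0:
        return observed != mu  # a perfectly constant baseline: any change is a departure
    return abs(observed - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical history: megabytes uploaded per day by one user.
    daily_upload_mb = [12, 15, 11, 14, 13, 12, 16]
    print("14 MB:", is_anomalous(daily_upload_mb, 14))
    print("250 MB:", is_anomalous(daily_upload_mb, 250))
```

A production system would typically track many signals per user and adapt the baseline over time, but the principle is the same: learn what is normal, then flag significant departures.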
When trained properly by experts with data science and cybersecurity expertise, AI/ML can be an important addition to the cybersecurity defense-in-depth arsenal. However, we are still far away from naming AI/ML the panacea for preventing all cyber threats.