Today’s post combines two of my favorite topics – Machine Learning and Cybersecurity.
Geneva, short for Genetic Evasion is a program developed out of the University of Maryland to use a Genetic Algorithm to Evade censorship controls in restrictive or otherwise heavily monitored environments, such as the Great Firewall of China or others.
Researchers at the University of Maryland developed a new tool based on genetic evolution that automatically learned to evade censorship in China, India and Kazakhstan
“Internet censorship by authoritarian governments prohibits free and open access to information for millions of people around the world. Attempts to evade such censorship have turned into a continually escalating race to keep up with ever-changing, increasingly sophisticated internet censorship. Censoring regimes have had the advantage in that race, because researchers must manually search for ways to circumvent censorship, a process that takes considerable time.”
Genetic Algorithms (GAs) are a Machine Learning method that loosely emulate the biological processes of evolution, namely incremental improvement in the traits of a population over time. The building blocks of GAs are Genes and Individuals. Each Gene represents a possible part of a solution and several Genes combine to form an Individual. An Individual then represents a possible complete solution to problem. How well the Individual solves the problem is determined by a Fitness function. To avoid going too deep into please take a look here for more details on GAs: https://en.wikipedia.org/wiki/Genetic_algorithm
To me, the best descriptive example of using GAs is for solving the Knapsack Problem, which you tackle every time you pack for a trip (and CompSci students will be faced with and lose sleep over). The problem is deceptively simple – you’re preparing for a trip and have knapsack (bag) of a certain volume (V) and several items I1,I2,I3…I(n) of different volumes and relative values from which to choose to place in the bag. You can’t take everything, so how do you determine the “best” combination of items (S) to place in the bag for your trip such that that it uses maximizes both the space used in the bag and collective value of the items? This is easy for small numbers of N, but intractable when N is large. You could try random combinations, try each combination in order, or use some heuristic to help guide your decision e.g. I’m going to the beach so I won’t take my winter coat. In any case, when N is large finding the optimal solution can can take a long time. This is known as a (combinatorial) constrained optimization problem and is typically wheres GA shine. Take a look here a working example of using GA in Python to find a solution https://www.dataminingapps.com/2017/03/solving-the-knapsack-problem-with-a-simple-genetic-algorithm/
In the case of Geneva, the “genes” are implemented as IP/UDP/TCP packet headers along with a respective list of actions that can be performed on the packet on ingress or egress from the host. The individual, termed the Strategy in this usage, is the concatenation of those headers, the actions and a payload. The Fitness of an Individual is roughly determined by if it made it past the censor i.e it was not dropped. The team behind the tool explain it much better here: Geneva: Evolving Censorship Evasion Strategies (PAPER)
and in this appearance on the Packet Pushers Heavy Netwoking Podcast Heavy Networking 488: Using Genetic Algorithms To Avoid Internet Censorship
If you’ve read this far you’re probably wondering what the cyber element here is beyond censorship evasion. To me, this technique and tool can be easily modded to perform data exfiltration – or at least be used to find the optimal method by which to exfiltrate data from an environment.
There are already Python libs out there https://github.com/ytisf/PyExfil to facilitate data exfil via a number of methods, namely DNS tunneling, HTTP, BGP Open messages, etc. just to name a few. A typical malicious package may include a few methods to call-home, trying each one until it gets lucky. Theoretically, given the right constraints and a representative data model of each method, a GA could find the optimal configuration of a given method or a combination of several methods – including the best packet size to use, to best exfil the data without raising any alarms. Of course, this is also where tools such as the Cisco NGIPS and Stealthwatch excel – so when I have time in the lab I have to try both sides and report the results in a future post.