
Using Keras, I want to build an LSTM neural net to analyze user behavior in my system. One of my features is a string containing the user's IP address, which could be IPv4 or IPv6.

As I see it, I need to embed the address so it can be used as a feature. The Keras documentation has no clear explanation of how to do such a thing.

What would be a good place to start?

Shlomi Schwartz
  • What sort of feature are you looking for? If it's just an opaque, unique identifier, that should be easy. If you need to figure out, or keep track of, which IP addresses belong to the same owner or AS network, for example, that's somewhat more challenging. – tripleee Jan 15 '18 at 09:00
  • Don't treat them as strings. IPv4 is a 32-bit number; IPv6 is a 128-bit number, with the prefix-bits having individual meanings (hierarchically selecting subnets). To match Internet topology, I'd encode them as 128 binary inputs. This should be powerful enough not just to capture "these two rows are from the same IP", but also "these two rows come from the same subnet". – Amadan Jan 15 '18 at 09:05
  • Thank you for the comments. Basically, I would like to see if fraudulent traffic comes from specific IPs, uses more IPs than normal traffic, or shows any other outlier behavior. Encoding as 128 bits sounds good; can you point me to how to get started with that? – Shlomi Schwartz Jan 15 '18 at 09:12
  • `[int(b) for b in format(int(ipaddress.IPv6Address('2001:4860:4860::8888')), '0128b')]` – Amadan Jan 15 '18 at 09:13
  • Based on this study (https://hammer.purdue.edu/articles/thesis/Encoding_IP_Address_as_a_Feature_for_Network_Intrusion_Detection/11307287), splitting the IP address into four numbers is another option. – keramat Sep 29 '21 at 12:04

3 Answers


The optimal way to encode IP addresses in your model depends on their semantics with respect to your problem. There are several options:

One-hot encoding

This approach assumes no relationship between IP addresses at all: 1.2.3.4 is assumed to be as different from 1.2.3.5 as it is from 255.255.255.255. To avoid having 2^32 features, you only encode the IP addresses in your training data as features and treat new IPs as unknown. One way to achieve this is sklearn's LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer

train_data = ['127.0.0.1', '8.8.8.8', '231.58.91.112', '127.0.0.1']
test_data = ['8.8.8.8', '0.0.0.0']

# One column per distinct training IP; unseen IPs map to all zeros
ip_encoder = LabelBinarizer()
print('Train Inputs:\n', ip_encoder.fit_transform(train_data))
print('Test Inputs:\n', ip_encoder.transform(test_data))

This prints:

Train Inputs:
 [[1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]
Test Inputs:
 [[0 0 1]
 [0 0 0]]

Note the difference between one-hot encoding and dummy encoding.

Using 32 or 128 features

Here, you use one feature per bit in the IP.
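A minimal sketch of this encoding, building on Amadan's comment above (ip_to_bits is an illustrative helper; mapping IPv4 into the IPv6 space as ::ffff:a.b.c.d is one possible convention, not the only option):

import ipaddress

def ip_to_bits(ip_string):
    # Parse either an IPv4 or an IPv6 address string
    addr = ipaddress.ip_address(ip_string)
    if isinstance(addr, ipaddress.IPv4Address):
        # Assumption: embed IPv4 in the IPv6 space as ::ffff:a.b.c.d
        addr = ipaddress.IPv6Address('::ffff:' + str(addr))
    # 128 features, one per bit
    return [int(b) for b in format(int(addr), '0128b')]

print(ip_to_bits('192.168.0.1')[-32:])  # the last 32 bits hold the IPv4 part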

Advantages:

  1. The model can more easily identify IPs that belong to the same subnet.
  2. The number of features remains small even for a large number of distinct IP addresses in your training data.

Disadvantages:

  1. The model doesn't know how subnets work. If your training data actually justifies generalizing multiple IPs to their subnet, there is a high probability that the model won't apply the subnet mechanism 100% correctly. That is, it might learn to use the 2nd and 3rd parts of 1.1.1.1 and 1.1.1.2 to detect this specific subnet and thus treat 0.1.1.1 as an IP of this subnet as well.
  2. Reducing the number of features is great, but it also makes it harder for the model to detect whether two IP addresses are the same. With one-hot encoding, this information is directly in the features, while with this approach the model would need to learn 32 / 128 'if' statements internally to see whether an IP address is the same. A neural network is unlikely to learn this completely if fewer 'if' statements suffice to discriminate correctly. This is analogous to the treatment of subnets: if '1.2.3.4' is a very discriminative IP in your training data, i.e. this IP makes a specific outcome very likely, the model will probably learn to detect this IP based on a specific subset of its bits. Thus, different IPs with the same values for these specific bits will be treated similarly by the model.

Overall, this approach needs to be treated carefully.

One-hot encoding frequent IPs

If the number of distinct IPs is too high to create a new feature for each IP, you can check whether each IP is actually important enough to be incorporated into the model. For example, you might check the histogram of IPs: IPs with only a few samples in the training data might be worth ignoring, since the model is likely to either overfit on them or ignore them completely. So you could one-hot encode, say, the 1000 most frequent IPs in your training data and add one feature for all other IPs, as in the sketch below. Similarly, you could do some data preprocessing and cluster the IPs based on their location etc.
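A rough sketch of the top-k idea (fit_top_ips and encode_ips are hypothetical helper names, not library functions):

from collections import Counter
import numpy as np

def fit_top_ips(train_ips, k=1000):
    # Keep only the k most frequent IPs seen in the training data
    return [ip for ip, _ in Counter(train_ips).most_common(k)]

def encode_ips(ips, top_ips):
    # One-hot encode; every IP outside top_ips shares the last 'other' column
    index = {ip: i for i, ip in enumerate(top_ips)}
    vectors = np.zeros((len(ips), len(top_ips) + 1), dtype=np.int8)
    for row, ip in enumerate(ips):
        vectors[row, index.get(ip, len(top_ips))] = 1
    return vectors

top_ips = fit_top_ips(['8.8.8.8', '8.8.8.8', '127.0.0.1'], k=2)
print(encode_ips(['8.8.8.8', '1.2.3.4'], top_ips))
# [[1 0 0]
#  [0 0 1]]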

Using numerical inputs

It might be tempting to use a single int32 feature or four int8 features for an IPv4 address. This is a bad idea, as it allows the model to do arithmetic on IPs, such as 1.1.1.1 + 2.2.2.2 = 3.3.3.3.

Word Embeddings

This is the approach you linked to in the question (https://keras.io/layers/embeddings/). These embedding layers are intended for words and should be trained on sentences / text. They generally shouldn't be used for encoding IPs.

Kilian Obermeier
  • Good analysis of pros and cons of possible approaches. I had much the same thoughts, but self-censored thinking one-hot might generate too many inputs... – Amadan Jan 15 '18 at 10:00
  • Excellent answer. I think I might go with the top-1000 approach and see if it holds; my data contains millions of records. – Shlomi Schwartz Jan 15 '18 at 10:13

You can use Python's built-in ipaddress library.

With it, you can convert IP address strings into integer values:

>>> import ipaddress
>>> str(ipaddress.IPv4Address('192.168.0.1'))
'192.168.0.1'
>>> int(ipaddress.IPv4Address('192.168.0.1'))
3232235521
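The conversion also works in reverse, in case you need to map the integer back to an address:

>>> ipaddress.IPv4Address(3232235521)
IPv4Address('192.168.0.1')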
lasmeninas

None of these answers seemed right to me. I thought what you should do is find a sentence encoder à la Sentence-Transformers (GitHub) that maps not just the address into an n-dimensional space but also understands the semantics of whether an IP is public or private, a netmask or not, the public routing table, etc. I looked into it, and such an embedding doesn't exist, but you could train one.

So, to echo and extend a previous answer, you should probably roll your own IP encoding that stores more than the address, using the built-in Python 3 ipaddress module's IPv4Address class. Look at the properties of that class and encode them into a fixed-length vector. That is your hand-built semantic embedding for an IP address.
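For instance, a minimal sketch along these lines; the particular set of boolean properties used here is an illustrative choice, not a fixed standard:

import ipaddress

def ip_semantic_features(ip_string):
    # Fixed-length vector of semantic flags for an IPv4 or IPv6 address;
    # the property selection is one possible choice among those the module exposes
    addr = ipaddress.ip_address(ip_string)
    return [
        int(addr.is_private),
        int(addr.is_global),
        int(addr.is_multicast),
        int(addr.is_loopback),
        int(addr.is_link_local),
        int(addr.is_reserved),
    ]

print(ip_semantic_features('127.0.0.1'))  # [1, 0, 0, 1, 0, 0]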

FWIW, I'm taking my own advice right now.

rjurney