I am working on a project to detect anomalies in web users' activity in real time. Any malicious activity by a user has to be detected as it happens. The input is users' clickstream data. Each click record contains a user ID (unique per user), a click URL (the URL of the web page), click text (the text or function on the website that the user clicked), and information (anything typed by the user). The project is similar to an intrusion detection system (IDS). I am using Python 3.6 and have the following questions and constraints:
- What is the best approach to data preprocessing, considering that all the attributes in the dataset are categorical?
- Encoding methods such as one-hot encoding or label encoding could be applied, but the data has to be processed in real time, which makes them difficult to apply.
- As per the project requirements, three columns (click URL, click text, and typed information) are considered feature columns.
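To make the encoding constraint concrete: one-hot and label encoding both need a vocabulary fitted in advance, which a live stream with unseen URLs and free text does not provide. One stateless alternative that may be worth evaluating is feature hashing (the hashing trick), which is not part of the project as described. The sketch below is only an illustration under assumptions: the field names `click_url`, `click_text`, and `information`, the sample event, and the bucket count of 32 are all hypothetical.

```python
import hashlib

N_BUCKETS = 32  # hypothetical vector size; would need tuning on real data


def hash_bucket(column, value, n_buckets=N_BUCKETS):
    """Map a (column, value) pair to a stable bucket index.

    hashlib is used instead of the built-in hash() so that the
    result is the same across processes and restarts.
    """
    key = "{}={}".format(column, value).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % n_buckets


def encode_click(click, n_buckets=N_BUCKETS):
    """Encode one click event as a fixed-length count vector.

    No fitted vocabulary is needed, so values never seen before
    can be encoded as they arrive.
    """
    vector = [0] * n_buckets
    # Hypothetical names for the three feature columns.
    for column in ("click_url", "click_text", "information"):
        value = click.get(column) or ""
        vector[hash_bucket(column, value, n_buckets)] += 1
    return vector


# Hypothetical click event, for illustration only.
event = {
    "user_id": "u42",
    "click_url": "/account/settings",
    "click_text": "Change password",
    "information": "",
}
features = encode_click(event)
```

Each event yields a vector of the same length regardless of which values appear, at the cost of occasional collisions between unrelated values. Whether that trade-off is acceptable for anomaly detection on this data is an open question.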
I am really confused about how to approach data preprocessing. Any insights or suggestions would be appreciated.