
I am new to PySpark and I am trying to select the best features using ChiSqSelector. I have a dataset of 78 features. The steps I followed are:

1. Dropped NaNs and applied Imputer.
2. Converted the string label column to int using StringIndexer.
3. Applied VectorAssembler.
4. Applied VectorIndexer.
5. Applied StandardScaler.
6. Applied ChiSqSelector, which produced the error below.

As suggested in the post "SparkException: Chi-square test expect factors", I applied VectorIndexer, but it is still not working. What data preparation steps should I do for ChiSqSelector? Thanks in advance.

I am using the CICIDS2017 security dataset, which has 78 features and a string label.

CODE

````
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("exp").getOrCreate()
raw_data = spark.read.csv("SCX.csv", inferSchema=True, header=True)


# na.drop() returns a new DataFrame; assign the result or the drop has no effect
raw_data = raw_data.na.drop()
cols = raw_data.columns
cols.remove("Label")
from pyspark.ml.feature import Imputer
# outputCols must have the same length as inputCols
imputer = Imputer(
    inputCols=['Destination Port',
               'FlowDuration',
               'TotalFwdPackets',
               'TotalBackwardPackets',
               'TotalLengthofFwdPackets',
               'TotalLengthofBwdPackets'],
    outputCols=['Destination Port',
                'FlowDuration',
                'TotalFwdPackets',
                'TotalBackwardPackets',
                'TotalLengthofFwdPackets',
                'TotalLengthofBwdPackets'])
model = imputer.fit(raw_data)
raw_data1 = model.transform(raw_data)
raw_data1.show(5)

#RAW DATA2  => After doing String indexer on label column
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Label', outputCol='_LabelIndexed')
raw_data2 = indexer.fit(raw_data1).transform(raw_data1)

#RAW DATA3 => After applying vector assembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=cols,outputCol="features")
# Now let us use the transform method to transform our dataset
raw_data3=assembler.transform(raw_data2)
raw_data3.select("features").show(truncate=False)

#RAW DATA 4 => After applying Vector Indexer
from pyspark.ml.feature import VectorIndexer
vindexer = VectorIndexer(inputCol="features", outputCol="vindexed",
                         maxCategories=9999)
vindexerModel = vindexer.fit(raw_data3)

categoricalFeatures = vindexerModel.categoryMaps
print("Chose %d categorical features: %s" %
      (len(categoricalFeatures),
       ", ".join(str(k) for k in categoricalFeatures.keys())))

# Create new column "vindexed" with categorical values transformed to indices
raw_data4 = vindexerModel.transform(raw_data3)
raw_data4.show()

#RAW DATA 5 => After applying Standard Scaler
from pyspark.ml.feature import StandardScaler

standardscaler = StandardScaler().setInputCol("vindexed").setOutputCol("Scaled_features")
raw_data5 = standardscaler.fit(raw_data4).transform(raw_data4)

train, test = raw_data5.randomSplit([0.8, 0.2], seed=456)

# Feature selection using ChiSqSelector
# (selectorType must be set to 'fpr', otherwise the fpr param is ignored
# and the default numTopFeatures selector is used)
from pyspark.ml.feature import ChiSqSelector
chi = ChiSqSelector(featuresCol='Scaled_features', outputCol='Selected_f',
                    labelCol='_LabelIndexed', selectorType='fpr', fpr=0.05)
train = chi.fit(train).transform(train)
#test = chi.fit(test).transform(test)
#test.select("Selected_f").show(5, truncate=False)
````

But this code throws an error while fitting the selector:

````
Py4JJavaError: An error occurred while calling o568.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 32.0 failed 1 times, most recent failure: Lost task 2.0 in stage 32.0 (TID 69, 192.168.1.15, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 14.
````

````
raw_data.printSchema()
````

````
 |-- Destination Port: integer (nullable = true)
 |-- FlowDuration: integer (nullable = true)
 |-- TotalFwdPackets: integer (nullable = true)
 |-- TotalBackwardPackets: integer (nullable = true)
 |-- TotalLengthofFwdPackets: integer (nullable = true)
 |-- TotalLengthofBwdPackets: integer (nullable = true)
 |-- FwdPacketLengthMax: integer (nullable = true)
 |-- FwdPacketLengthMin: integer (nullable = true)
 |-- FwdPacketLengthMean: double (nullable = true)
 |-- FwdPacketLengthStd: double (nullable = true)
 |-- BwdPacketLengthMax: integer (nullable = true)
 |-- BwdPacketLengthMin: integer (nullable = true)
 |-- BwdPacketLengthMean: double (nullable = true)
 |-- BwdPacketLengthStd: double (nullable = true)
 |-- FlowBytesPersec: double (nullable = true)
 |-- FlowPacketsPersec: double (nullable = true)
 |-- FlowIATMean: double (nullable = true)
 |-- FlowIATStd: double (nullable = true)
 |-- FlowIATMax: integer (nullable = true)
 |-- FlowIATMin: integer (nullable = true)
 |-- FwdIATTotal: integer (nullable = true)
 |-- FwdIATMean: double (nullable = true)
 |-- FwdIATStd: double (nullable = true)
 |-- FwdIATMax: integer (nullable = true)
 |-- FwdIATMin: integer (nullable = true)
 |-- BwdIATTotal: integer (nullable = true)
 |-- BwdIATMean: double (nullable = true)
 |-- BwdIATStd: double (nullable = true)
 |-- BwdIATMax: integer (nullable = true)
 |-- BwdIATMin: integer (nullable = true)
 |-- FwdPSHFlags: integer (nullable = true)
 |-- BwdPSHFlags: integer (nullable = true)
 |-- FwdURGFlags: integer (nullable = true)
 |-- BwdURGFlags: integer (nullable = true)
 |-- FwdHeaderLength_1: integer (nullable = true)
 |-- BwdHeaderLength: integer (nullable = true)
 |-- FwdPackets/s: double (nullable = true)
 |-- BwdPackets/s: double (nullable = true)
 |-- MinPacketLength: integer (nullable = true)
 |-- MaxPacketLength: integer (nullable = true)
 |-- PacketLengthMean: double (nullable = true)
 |-- PacketLengthStd: double (nullable = true)
 |-- PacketLengthVariance: double (nullable = true)
 |-- FINFlagCount: integer (nullable = true)
 |-- SYNFlagCount: integer (nullable = true)
 |-- RSTFlagCount: integer (nullable = true)
 |-- PSHFlagCount: integer (nullable = true)
 |-- ACKFlagCount: integer (nullable = true)
 |-- URGFlagCount: integer (nullable = true)
 |-- CWEFlagCount: integer (nullable = true)
 |-- ECEFlagCount: integer (nullable = true)
 |-- Down/UpRatio: integer (nullable = true)
 |-- AveragePacketSize: double (nullable = true)
 |-- AvgFwdSegmentSize: double (nullable = true)
 |-- AvgBwdSegmentSize: double (nullable = true)
 |-- FwdHeaderLength_2: integer (nullable = true)
 |-- FwdAvgBytes/Bulk: integer (nullable = true)
 |-- FwdAvgPackets/Bulk: integer (nullable = true)
 |-- FwdAvgBulkRate: integer (nullable = true)
 |-- BwdAvgBytes/Bulk: integer (nullable = true)
 |-- BwdAvgPackets/Bulk: integer (nullable = true)
 |-- BwdAvgBulkRate: integer (nullable = true)
 |-- SubflowFwdPackets: integer (nullable = true)
 |-- SubflowFwdBytes: integer (nullable = true)
 |-- SubflowBwdPackets: integer (nullable = true)
 |-- SubflowBwdBytes: integer (nullable = true)
 |-- Init_Win_bytes_forward: integer (nullable = true)
 |-- Init_Win_bytes_backward: integer (nullable = true)
 |-- act_data_pkt_fwd: integer (nullable = true)
 |-- min_seg_size_forward: integer (nullable = true)
 |-- ActiveMean: double (nullable = true)
 |-- ActiveStd: double (nullable = true)
 |-- ActiveMax: integer (nullable = true)
 |-- ActiveMin: integer (nullable = true)
 |-- IdleMean: double (nullable = true)
 |-- IdleStd: double (nullable = true)
 |-- IdleMax: integer (nullable = true)
 |-- IdleMin: integer (nullable = true)
 |-- Label: string (nullable = true)
````


Dataset Reference - Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization”, 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018


For your information, I was using all 78 features; to keep the code short, I show only 6 columns in the Imputer step above.
