
First I removed every column containing NA values using the following Python code:

import pandas as pd

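# Load both input files
a = pd.read_csv("true.csv", low_memory=False)
b = pd.read_csv("false.csv", low_memory=False)

# Stack the rows of both files; DataFrame.append was removed in pandas 2.0,
# so pd.concat is used instead
merged = pd.concat([a, b], ignore_index=True)
# Drop every column that contains at least one NA value
merged = merged.dropna(axis=1)
merged.to_csv("out.csv", index=False)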
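
Note that dropping NA values along axis 1 removes whole columns, not rows: a column disappears as soon as one row lacks a value for it, which includes columns that exist in only one of the two files. The following is a minimal standard-library sketch of that behaviour; the two tiny tables and the helper name `drop_columns_with_missing` are made up for illustration and are not taken from true.csv or false.csv.

```python
def drop_columns_with_missing(rows):
    """Keep only the columns that have a value in every row."""
    # Union of all column names, in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    # A column survives only if no row lacks it or holds None.
    kept = [c for c in columns if all(r.get(c) is not None for r in rows)]
    return [{c: r[c] for c in kept} for r in rows]


# Made-up stand-ins for the two input files.
true_rows = [{"Activity": 1, "x1": 0.5, "x2": 2.0}]
false_rows = [{"Activity": 0, "x1": 0.7, "x3": 9.0}]  # no x2, extra x3

cleaned = drop_columns_with_missing(true_rows + false_rows)
print(cleaned)
```

Here only `Activity` and `x1` survive, because `x2` and `x3` are each missing from one of the tables.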

After that I used Rattle and found that two columns are categorical, but I only want numeric data, so I removed those columns using the following R code:

cat("\nSTART\n")
startTime = proc.time()[3]
startTime

#--------------------------------------------------------------
# Step 1: Include Library
#--------------------------------------------------------------
cat("\nStep 1: Library Inclusion")
library(randomForest)
library(FSelector)

#--------------------------------------------------------------
# Step 2: Variable Declaration
#--------------------------------------------------------------
cat("\nStep 2: Variable Declaration")
modelName <- "randomForest"
modelName

InputDataFileName="out.csv"
InputDataFileName

training = 70      # Defining Training Percentage; Testing = 100 - Training

#--------------------------------------------------------------
# Step 3: Data Loading
#--------------------------------------------------------------
cat("\nStep 3: Data Loading")
dataset <- read.csv(InputDataFileName)      # Read the datafile
dataset <- dataset[sample(nrow(dataset)),]  # Shuffle the data row wise.

#result <- cfs(Features ~ ., dataset)

head(dataset)   # Show Top 6 records
nrow(dataset)   # Show number of records
names(dataset)  # Show fields names or columns names

#--------------------------------------------------------------
# Step 4: Count total number of observations/rows.
#--------------------------------------------------------------
cat("\nStep 4: Counting dataset")
totalDataset <- nrow(dataset)
totalDataset

# Keep only the numeric columns
nums <- sapply(dataset, is.numeric)
dataset <- dataset[, nums]

#--------------------------------------------------------------
# Step 5: Choose Target variable
#--------------------------------------------------------------
cat("\nStep 5: Choose Target Variable")
target  <- names(dataset)[1]   # i.e. RMSD
target

#data(dataset)

result <- cfs(Activity ~ ., dataset)

The last line of the code above performs feature selection with cfs from FSelector.

Executing that line gives the following error:

Error in if (sd(vec1) == 0 || sd(vec2) == 0) return(0) :
missing value where TRUE/FALSE needed

out.csv: https://drive.google.com/open?id=0B3UWvP6zFBQnN3JiamloOWl3T28

  • You need to include data before and after the python cleaning so that it is possible to give you a meaningful answer! – sconfluentus Jul 27 '17 at 19:31
  • No, actually I don't want any NA values in my dataset, so I first deleted them in Python. After that I found that there are some columns that do not contain numeric values (which are required when you are using RandomForest), so I deleted them. But I still get the error shown above. – Mohit Khandelwal Jul 27 '17 at 19:39
  • I am saying you need to show us data, not just code, if you want help with the error. Likely the problem is that you are serving your function data it cannot digest. – sconfluentus Jul 27 '17 at 19:41
  • https://drive.google.com/open?id=0B3UWvP6zFBQnN3JiamloOWl3T28 this is my out.csv – Mohit Khandelwal Jul 27 '17 at 19:52
  • Convert the target to a factor: `dataset$Activity = factor(dataset$Activity)` – Lars Kotthoff Jul 27 '17 at 22:01
  • @Lars I did this, but after executing this line (`result <- cfs(Activity ~ ., dataset)`) I didn't get any output, and it keeps running indefinitely. – Mohit Khandelwal Jul 28 '17 at 03:42
  • Well you do have a very large data set, so it'll take some time. – Lars Kotthoff Jul 28 '17 at 03:46
  • @Lars It has been running for 3 hours, but there is still no output. – Mohit Khandelwal Jul 28 '17 at 06:17
  • It looked like a categorical variable to me, so I converted it. In principle it should work as a continuous variable as well, not sure why it didn't here. – Lars Kotthoff Jul 28 '17 at 17:31

1 Answer


Before the last line

result <- cfs(Activity ~ ., dataset)

add

dataset$Activity = factor(dataset$Activity)

This converts the target to a categorical variable. The call will take some time to execute because the dataset is very large.
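
As a side note on the original error: "missing value where TRUE/FALSE needed" means that `sd()` returned `NA` or `NaN` inside `cfs`, which in R happens when a vector contains missing or infinite values or has fewer than two elements. If it is unclear which columns might be involved, one option is to scan the CSV before loading it into R. The following is only a sketch using the Python standard library; the function name `suspicious_columns` and the sample data are hypothetical, and it assumes the file has a header row.

```python
import csv
import io
import math


def suspicious_columns(csv_file):
    """Return {column: reason} for columns with missing, non-numeric or non-finite cells."""
    problems = {}
    for row in csv.DictReader(csv_file):
        for name, cell in row.items():
            # Skip overflow fields (name is None) and columns already flagged.
            if name is None or name in problems:
                continue
            text = (cell or "").strip()
            if text == "" or text.upper() == "NA":
                problems[name] = "missing value"
                continue
            try:
                value = float(text)
            except ValueError:
                problems[name] = "non-numeric value"
                continue
            if not math.isfinite(value):
                problems[name] = "NaN or infinite value"
    return problems


# Tiny in-memory stand-in for out.csv (made-up data).
sample = io.StringIO("Activity,x1,x2,x3\n1,0.5,inf,a\n0,0.7,2.0,b\n")
report = suspicious_columns(sample)
print(report)
```

To run it on the real file, open it with `open("out.csv", newline="")` and pass the file object to `suspicious_columns`. Columns flagged here are only candidates for the cause, not a confirmed diagnosis.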