I have a basic, working neural network implementation in PyBrain:
import numpy
from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

# Load the training and test data.
train_input = numpy.loadtxt('train_input.csv', delimiter=',')
test_input = numpy.loadtxt('test_input.csv', delimiter=',')
train_output = numpy.loadtxt('train_output.csv', delimiter=',')
test_output = numpy.loadtxt('test_output.csv', delimiter=',')

# Scale every column into [0, 1] by dividing by its maximum.
train_input = train_input / train_input.max(axis=0)
test_input = test_input / test_input.max(axis=0)
train_output = train_output / train_output.max(axis=0)
test_output = test_output / test_output.max(axis=0)

# Build the supervised dataset: two inputs, one output per sample.
ds = SupervisedDataSet(2, 1)
for x in range(len(train_input)):
    ds.addSample(train_input[x], train_output[x])

# Feed-forward network with a single hidden layer of 25 units.
fnn = buildNetwork(ds.indim, 25, ds.outdim, bias=True)
trainer = BackpropTrainer(fnn, ds, learningrate=0.01, momentum=0.1)

# Train for 100000 epochs, reporting the error every 10000 epochs.
for epoch in range(100000):
    error = trainer.train()
    if epoch % 10000 == 0:
        print 'Epoch: ', epoch
        print 'Error: ', error

# Run the trained network over the test inputs.
result = numpy.array([fnn.activate(x) for x in test_input])
I can run this by submitting it to Apache Spark with spark-submit, and it works. Without changing the code, however, I assume I gain nothing from Spark, since the whole script presumably still executes as ordinary Python on the driver.
EDIT
I noticed someone voted to close this, so perhaps I'm being too vague. To rephrase my questions:
- If I run this code as a Spark job, without customising it in any way, will it run just the same as if I ran it as a standard Python script?
- To rewrite it so that it is best exploited by Spark, should my key focus be on moving the datasets from numpy arrays to Spark RDDs (see the first sketch below)?
- How would I change the for loop that actually trains the network so that it runs in parallel via Spark (see the second sketch below)?
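To make the second question concrete, here is the kind of conversion I imagine. This is only a sketch under my own assumptions: sc is a SparkContext I create myself, the numpy arrays from the script above are already loaded, and the partition count of 4 is an arbitrary choice, not anything Spark requires.

from pyspark import SparkContext

sc = SparkContext(appName='pybrain-on-spark')

# Pair each input row with its target so that one RDD element is one
# complete training sample, then distribute the pairs across the cluster.
train_rdd = sc.parallelize(list(zip(train_input, train_output)), numSlices=4)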
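For the third question, the only scheme I can come up with is crude parameter averaging: train an independent network on each partition, then average the weight vectors back on the driver. A rough sketch, reusing the imports from the script above and the hypothetical train_rdd from the previous sketch, and assuming PyBrain is installed on every worker node. net.params and _setParameters come from PyBrain's ParameterContainer; averaging independently initialised networks may not converge well, so treat this as a starting point rather than a solution:

def train_partition(samples):
    # Each worker builds and trains its own network on its slice of the data.
    ds = SupervisedDataSet(2, 1)
    for inp, target in samples:
        ds.addSample(inp, target)
    net = buildNetwork(ds.indim, 25, ds.outdim, bias=True)
    trainer = BackpropTrainer(net, ds, learningrate=0.01, momentum=0.1)
    for _ in range(1000):
        trainer.train()
    # Ship the flat weight vector back to the driver.
    yield net.params.copy()

# Collect one weight vector per partition and average them on the driver.
all_params = train_rdd.mapPartitions(train_partition).collect()
averaged = numpy.mean(all_params, axis=0)

# Load the averaged weights into a network of the same shape.
fnn = buildNetwork(2, 25, 1, bias=True)
fnn._setParameters(averaged)

Whether something like this actually beats single-machine training presumably depends on the partition sizes and on repeating the train-and-average round more than once, which is part of what I'm asking.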