Defining a function that automates production of a k-means cluster diagram, taking 3 arguments

Question

I have the lines of code needed to produce a k-means cluster diagram. Rather than repeat that code for each data set, I want to create a function that automates it.

I envisage it taking 3 arguments: x, y and z.

Below is what I have got so far. I'd really welcome any assistance.

I am using Python 3 in a Jupyter Notebook with the pandas, Matplotlib and scikit-learn packages.

x = chosen correlation (moving average data set - plotted on x-axis)

y = chosen index change (y axis data set)

z = corresponding subset (various dataframes which hold the different x & y combinations)

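import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

def make_cluster(x, y, z):
    # Fit k-means on the standardised columns of z
    model = KMeans(n_clusters=6)
    model.fit(scale(z))
    z.plot.scatter(x=x, y=y)

    plt.xlabel('Correlation')
    plt.ylabel('Daily Return')
    plt.grid()
    plt.title(str(x) + " Day / " + str(y) + " Daily Performance")
    plt.show()

    # This is the part that does not work yet (see the comments below)
    groups = z.groupby('cluster')
    fig, ax = plt.subplots()
    for name, group in groups:
        ax.plot(group.x, group.y, marker='o', linestyle='', label=name)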

Examples of the x, y and z variables are as follows:

# Z Example
UK30 = Raw[['Cor30', 'FTSE100change']]

# X Example
Cor30 = 'Cor30'

# Y Example
FTSE100change = 'FTSE100change'

I am trying to get to the point where I can run "make_cluster(x, y, z)" and have it produce the clustering diagram for the relevant arguments.

Whatever is passed in as the arguments should be used in the code wherever the corresponding "x", "y" and "z" appear.

Hopefully this makes sense!
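To illustrate how the arguments are meant to flow through the function, here is a minimal sketch of passing column names as strings and looking them up with square brackets. The data values and the helper name `pick_columns` are made up for illustration; only the column names come from the examples above.

```python
import pandas as pd

# Hypothetical stand-in for the Raw dataframe; the numbers are invented.
Raw = pd.DataFrame({"Cor30": [0.1, 0.4, 0.7],
                    "FTSE100change": [1.2, -0.5, 0.3]})


def pick_columns(x, y, z):
    """Return the two columns of z named by the strings x and y."""
    # z[x] looks up the column whose name is stored in the variable x.
    # z.x would instead look for a column literally called "x".
    return z[x], z[y]


x_vals, y_vals = pick_columns("Cor30", "FTSE100change", Raw)
print(x_vals.tolist())    # [0.1, 0.4, 0.7]
print(hasattr(Raw, "x"))  # False: there is no column named "x"
```

Because the names are plain strings, the same call works for any other pair of columns without touching the function body.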

Hi, this is my first post and I am struggling to understand how I can reduce the scope of the question (it has been put on hold for being "too broad"). If somebody could offer some assistance as to how to avoid it being put on hold, I can learn from this and repost accordingly. — Paul Allen, Jan 13 '19 at 05:53
I have just added additional tags (X3) so perhaps it was the fact I had only tagged "python-3.x" that was causing it to be considered too broad? — Paul Allen, Jan 13 '19 at 05:55
What is the problem? Are you *sure* that `scale` is appropriate on such data? Why do you need x, y, and z? Shouldn't it be `Cor30=Raw["Cor30"]` in your example? Wouldn't it be better to make the method only plot - and then call `plotkmeans(x, y, kmeans_result)`? — Has QUIT--Anony-Mousse, Jan 13 '19 at 09:18
@anony-mousse, the thought process behind having the x, y, z arguments was to make everything more efficient. I need to run the clustering code multiple times for various datasets and, rather than copy/paste and then go into the code to change x, y, z manually, I wanted to be able to do something like make_cluster(cor30, ftse100change, uk30). This would produce 1 cluster diagram. I then might run make_cluster(cor365, ftse100change, uk365) and so on. It would save a lot of time and code. I'll look into the need for the scale code. Thanks for your response! — Paul Allen, Jan 14 '19 at 04:26
If you look in the code block that defines the function, the z.plot.scatter line makes reference to x='x', y='y'. In the example that I have given, these should be replaced with 'Cor30' (being x) and 'FTSE100change' (being y). — Paul Allen, Jan 14 '19 at 04:42
You may need to vary n_clusters at some point, that is why I would use separate functions for clustering and for plotting. And you seem to not know when to use a string, and when to use a variable. Review the difference between `x` and `'x'`. But as is, you need to explain what the problem is - what is not working? What is the error? — Has QUIT--Anony-Mousse, Jan 14 '19 at 07:54
@anony-mousse, I have taken on board your comments about variables/strings and have updated the code above. The function now works to the extent that it produces the scatter chart with arguments x & y. The code falls down from the point of `groups = z.groupby('cluster')`. I get `KeyError: 'cluster'`. — Paul Allen, Jan 14 '19 at 09:32
I think I need to add the column 'cluster' to the respective 'z' data frames with the following: `z['cluster'] = model.labels_`. When I add this below the `z.plot.scatter(x=x, y=y)` line and rerun, I get `AttributeError: 'DataFrame' object has no attribute 'x'`. — Paul Allen, Jan 14 '19 at 10:02
When I check the respective dataframe with `z.head()` I can see that there are now three columns in total - "x", "y" and "cluster". — Paul Allen, Jan 14 '19 at 10:18
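
Pulling the comments above together, one possible sketch splits the work into a clustering step and a plotting step, as suggested, so that n_clusters can be varied. It attaches the labels as a 'cluster' column (avoiding the `KeyError`) and indexes with `group[x]` rather than `group.x` (avoiding the `AttributeError`). The helper names `cluster_frame` and `plot_clusters`, the `n_init` and `random_state` settings, and the shortened title are assumptions for illustration, not part of the original code. `scale` is kept as in the question; whether it suits this data is a separate matter.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale


def cluster_frame(z, n_clusters=6, random_state=0):
    """Return a copy of z with a 'cluster' column of k-means labels."""
    # n_init and random_state are assumed settings for repeatable output.
    model = KMeans(n_clusters=n_clusters, n_init=10,
                   random_state=random_state)
    model.fit(scale(z))
    out = z.copy()  # leave the caller's dataframe untouched
    out["cluster"] = model.labels_
    return out


def plot_clusters(x, y, clustered):
    """Plot column x against column y, one marker series per cluster."""
    fig, ax = plt.subplots()
    for name, group in clustered.groupby("cluster"):
        # x and y hold column names, so use group[x], not group.x
        ax.plot(group[x], group[y], marker="o", linestyle="", label=name)
    ax.set_xlabel("Correlation")
    ax.set_ylabel("Daily Return")
    ax.set_title(f"{x} / {y}")
    ax.grid(True)
    ax.legend(title="cluster")
    return fig, ax


def make_cluster(x, y, z, n_clusters=6):
    """Cluster on columns x and y of z, then draw the diagram."""
    return plot_clusters(x, y, cluster_frame(z[[x, y]], n_clusters))
```

With this shape, a call such as `make_cluster('Cor30', 'FTSE100change', Raw)` should draw one diagram, and repeating it for another pair of columns only means changing the arguments. Working on a copy also means that rerunning the function does not feed a previously added 'cluster' column back into the model.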
