
I have been struggling to draw a 3D scatter plot from two different dataframes. The goal is a single 3D scatter plot with two legends that reports the results of a clustering algorithm. Say the main dataframe df1 contains three features, as below:

+-----+------------+----------+----------+
|   id|           x|         y|         z|
+-----+------------+----------+----------+
| row0|  -6.0776997|-2.9096103|-1.5181729|
| row1|  -1.0122601|  7.322841|-5.4424076|
| row2|   -8.297007| 6.3228936| 1.1672047|
| row3|  -3.5071216|  4.784812|-5.4449472|
| row4|   -5.122823|-3.3220499|-0.5069805|
| row5|  -2.4764006|  8.255791|  4.409478|
| row6|   7.3153954| -5.079449| -7.291215|
| row7|  -2.0167463|  9.303454|  7.095179|
| row8|  -0.2338185| -4.892681| 2.1228876|
| row9|    6.565442| -6.855994|-6.7983212|
|row10|  -5.6902847|-6.4827404|-0.9246967|
|row11|-0.017986143| 2.7632365| -8.814824|
|row12|  -6.9042625|-6.1491723|-3.5354295|
|row13|  -10.389865|  9.537853|  0.674591|
|row14|   3.9688683|-6.0467844| -5.462389|
|row15|   -7.337052|-3.7689247| -5.261122|
|row16|   -8.991589|  8.738728|  3.864116|
|row17| -0.18098584|  5.482743| -4.900118|
|row18|   3.3193955|-6.3573766| -6.978025|
|row19|  -2.0266335|-3.4171724|0.48218703|
+-----+------------+----------+----------+

I also have the output of the clustering algorithm in the form of the dataframe df2, built as below:

print("==========================Short report==================================== ")

n_clusters = model.summary.k
#n_clusters
print("Number of predicted clusters: " + str(n_clusters))

cluster_Sizes = model.summary.clusterSizes
#cluster_Sizes 

col = ['size']
df2 = pd.DataFrame(cluster_Sizes, columns=col).sort_values(by=['size'], ascending=True)  #sorting
cluster_Sizes = df2["size"].unique()
print("Size of predicted clusters: " + str(cluster_Sizes))
clusterSizes

#==========================Short report==================================== 
#Number of predicted clusters: 10
#Size of predicted clusters: [ 486  496  504  529  985  998  999 1003 2000]

+-----+----------+
|     |      size|
+-----+----------+
|    2|       486|
|    6|       496|
|    0|       504|
|    8|       529|
|    5|       985|
|    9|       998|
|    7|       999|
|    3|      1003|
|    1|      2000|
|    4|      2000|
+-----+----------+

Here the index column holds the predicted cluster labels. I could add the predicted cluster labels to the main dataframe, but not the cluster sizes, as below:

+-----+----------+------------+----------+----------+
|   id|prediction|           x|         y|         z|
+-----+----------+------------+----------+----------+
| row0|         9|  -6.0776997|-2.9096103|-1.5181729|
| row1|         4|  -1.0122601|  7.322841|-5.4424076|
| row2|         1|   -8.297007| 6.3228936| 1.1672047|
| row3|         4|  -3.5071216|  4.784812|-5.4449472|
| row4|         3|   -5.122823|-3.3220499|-0.5069805|
| row5|         1|  -2.4764006|  8.255791|  4.409478|
| row6|         5|   7.3153954| -5.079449| -7.291215|
| row7|         1|  -2.0167463|  9.303454|  7.095179|
| row8|         7|  -0.2338185| -4.892681| 2.1228876|
| row9|         5|    6.565442| -6.855994|-6.7983212|
|row10|         3|  -5.6902847|-6.4827404|-0.9246967|
|row11|         4|-0.017986143| 2.7632365| -8.814824|
|row12|         9|  -6.9042625|-6.1491723|-3.5354295|
|row13|         1|  -10.389865|  9.537853|  0.674591|
|row14|         2|   3.9688683|-6.0467844| -5.462389|
|row15|         9|   -7.337052|-3.7689247| -5.261122|
|row16|         1|   -8.991589|  8.738728|  3.864116|
|row17|         4| -0.18098584|  5.482743| -4.900118|
|row18|         2|   3.3193955|-6.3573766| -6.978025|
|row19|         7|  -2.0266335|-3.4171724|0.48218703|
+-----+----------+------------+----------+----------+
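One possible way to also attach the cluster size to each row is a label lookup with `Series.map`. The sketch below is only an illustration: the small frames are toy stand-ins for df1 and df2, and the column name `cluster_size` is an assumption, not something from the notebook.

```python
import pandas as pd

# Toy stand-ins for the frames in the question (a few rows, rounded values).
df1 = pd.DataFrame({
    "id": ["row0", "row1", "row2", "row3"],
    "prediction": [9, 4, 1, 4],
    "x": [-6.08, -1.01, -8.30, -3.51],
})

# Cluster sizes indexed by cluster label, in the same layout as df2.
df2 = pd.DataFrame({"size": [2000, 2000, 998]}, index=[1, 4, 9])

# Look up the size of each row's cluster from its predicted label.
# "cluster_size" is a hypothetical column name chosen for this sketch.
df1["cluster_size"] = df1["prediction"].map(df2["size"])
print(df1)
```

The lookup is a single vectorized pass over the label column, so it adds one integer column rather than duplicating any of the feature data.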

Now I want to report the results in a 3D scatter plot with two separate legends, using the following function:

color_names = ["red", "blue", "yellow", "black", "pink", "purple", "orange"]

def plot_3d_transformed_data(df, title, colors="red"):
 
  # Imports.
  import matplotlib as mpl
  import matplotlib.pyplot as plt
  from mpl_toolkits.mplot3d import Axes3D
  import pandas as pd
  import numpy as np
  import plotly.express as px
  import matplotlib.cm as cm

  # Figure.
  figure = plt.figure(figsize=(12, 10))
  ax = figure.add_subplot(projection="3d")
  ax.set_xlabel("PC1: x")
  ax.set_ylabel("PC2: y")
  ax.set_zlabel("PC3: z")
  ax.set_title("scatter 3D legend") 

  # Data and 3D scatter.
  #colors = ["red", "blue", "yellow", "black", "pink", "purple", "orange", "black", "red" ,"blue"]
  colors = cm.rainbow(np.linspace(0, 1, len(cluster_Sizes)))

  # Create your plot
  #px.scatter(df1, x='x', y='y', size=df2['size'], color='jet')
  sc = ax.scatter(df1.x, df1.y, df1.z, alpha=0.6, c=colors, sizes=df2['size'], marker="o")

  # Legend 1.
  handles, labels = sc.legend_elements(prop="sizes", alpha=0.6)
  legend1 = ax.legend(handles, labels, bbox_to_anchor=(1, 1), loc="upper right", title="Sizes")
  ax.add_artist(legend1) # <- this is important.

  # Legend 2.
  unique_colors = set(colors)
  handles = []
  labels = []
  for n, color in enumerate(unique_colors, start=1):
      artist = mpl.lines.Line2D([], [], color=color, lw=0, marker="o")
      handles.append(artist)
      labels.append(str(n))
  legend2 = ax.legend(handles, labels, bbox_to_anchor=(0.05, 0.05), loc="lower left", title="Classes")

  figure.show() 

The problem is twofold. First, I need to build a proper color list that covers all cluster labels (to avoid ValueError: 'c' argument has 9 elements, which is inconsistent with 'x' and 'y' with size 10000.). Second, I need to resolve the size mismatch between the two dataframes (to avoid ValueError: s must be a scalar, or the same size as x and y) in this call:

sc = ax.scatter(df1.x,
                df1.y,
                df1.z,
                alpha=0.6,
                c=colors,   #colors=cm.rainbow(np.linspace(0, 1, len(cluster_Sizes)))
                s=df2['size'],
                marker="o")

One idea is to assign df2['size'] to df1, but that seems expensive and not a good idea. So I was wondering whether there is an elegant way to update plot_3d_transformed_data() so that a single plot shows both the predicted cluster labels and the cluster sizes. I have provided a Colab notebook for quick debugging.
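Both errors come from passing one value per cluster where `ax.scatter()` expects either a scalar or one value per point. A minimal sketch of building per-point arrays with NumPy indexing, without adding columns to df1, might look like this. The arrays `labels` and `cluster_sizes` are hypothetical stand-ins for df1's predicted labels and df2's sizes, and the factor of 100 is an arbitrary scaling choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins: one predicted label per point (like the prediction
# column) and one size per cluster, where position i holds the size of cluster i.
labels = np.array([9, 4, 1, 4, 3, 1, 5, 1, 7, 5])
cluster_sizes = np.array([504, 2000, 486, 1003, 2000, 985, 496, 999, 529, 998])

k = len(cluster_sizes)

# One RGBA color per cluster: shape (k, 4).
palette = plt.get_cmap("rainbow")(np.linspace(0, 1, k))

# Index by label to expand per-cluster values to per-point values.
point_colors = palette[labels]                                     # shape (n, 4)
point_sizes = cluster_sizes[labels] / cluster_sizes.max() * 100.0  # shape (n,)

print(point_colors.shape, point_sizes.shape)
```

These temporary arrays can be passed as `c=point_colors` and `s=point_sizes`. They have the same length as x, y and z, which is what the two ValueError messages are asking for.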

The expected output is illustrated below:

img

Mario

1 Answer


I tried to reproduce only the plotting part with the updated Colab. I traced your code and noticed that the error when running the function is caused by the number of colors not matching the number of data points. The graph below is drawn from 20 data points, keeping only the essential parts of your code.

color_names = ["red", "blue", "yellow", "black", "pink", "purple", "orange"]

def plot_3d_transformed_data(df, title, colors="red"):

  # Imports.
  import matplotlib as mpl
  import matplotlib.pyplot as plt
  from mpl_toolkits.mplot3d import Axes3D
  import pandas as pd
  import numpy as np
  import plotly.express as px
  import matplotlib.cm as cm

  #clusterSizes = pd.read_csv(io.StringIO(data1), delim_whitespace=True)

  #pddf_pred = df_pred.set_index('id')
  cluster_Sizes = clusterSizes["size"].unique()
  x_train = np.random.randint(20, 500, (20,))  # random marker sizes, one per data point

  # Figure.
  figure = plt.figure(figsize=(12, 10))
  ax = figure.add_subplot(projection="3d")
  ax.set_xlabel("PC1: x")
  ax.set_ylabel("PC2: y")
  ax.set_zlabel("PC3: z")
  ax.set_title("scatter 3D legend") 

  colors2 = ["red", "blue", "yellow", "black", "pink", "purple", "orange", "black", "red" ,"blue"]
  colors = cm.rainbow(np.linspace(0, 1, 20))

  # Create 3D scatter plot
  sc = ax.scatter(pddf_pred.x.values, pddf_pred.y.values, pddf_pred.z.values, alpha=0.6, s=x_train, c=colors, marker="o")

  # Legend 1.
  handles, labels = sc.legend_elements(prop="sizes", alpha=0.6)
  legend1 = ax.legend(handles, labels, bbox_to_anchor=(1.2, 1), loc="upper right", title="Sizes")
  ax.add_artist(legend1)

  # Legend 2.
  unique_colors = set(colors2)
  handles = []
  labels = []
  for n, color in enumerate(unique_colors, start=1):
      artist = mpl.lines.Line2D([], [], color=color, lw=0, marker="o")
      handles.append(artist)
      labels.append(str(n))
  legend2 = ax.legend(handles, labels, bbox_to_anchor=(-0.05, 0.05), loc="lower left", title="Classes")

  plt.show()
r-beginners
  • Would you double-check your answer and update `def plot_3d_transformed_data():` in the [colab notebook](https://colab.research.google.com/drive/1DMBMlICT-iq5_i5Oz-NC5WS4eBPRdgrB?usp=sharing)? – Mario Sep 07 '21 at 09:31
  • Are you referring to the function part in the comment as the answer? – r-beginners Sep 07 '21 at 09:43
  • Yes, I'd like to adapt your solution into a function that can be used generally after any clustering algorithm. Also, it would be great if one of the legends could show the exact cluster sizes and not only ranges. Would you re-check the notebook to test your solution there? – Mario Sep 07 '21 at 09:54
  • The error in the latest notebook is caused by the number of marker sizes not matching the number of scatter points. s=20 is the default value, so try setting it back to the default and running it once. – r-beginners Sep 07 '21 at 10:01
  • I couldn't adapt your solution to the notebook to test it. I get `'c' argument has 10 elements, which is inconsistent with 'x' and 'y' with size 10000.` I think the problem is still the size/shape of the lists of values passed to the `ax.scatter()` parameters. – Mario Sep 07 '21 at 10:31
  • The color and size arguments of the scatter plot must each be either a fixed value or the same length as the data. Set the color and size to fixed values first and check that it runs. – r-beginners Sep 07 '21 at 10:35
  • Then I need to add the clustering results, such as the *predicted cluster labels* and *their sizes*, as extra columns on the main dataframe, which means I'm increasing the dimension of the data unnecessarily just for visualization, which seems expensive! – Mario Sep 07 '21 at 10:43
  • [Sample notebook](https://colab.research.google.com/drive/1kyDgqBMKAS0N06uRBEL4EebipZZfIJ-N?usp=sharing) with 1000 data for number of scatter. There will be 1000 colors for the color map. In fact, we are just extending 256 to 1000. And for clusters, I used the same color map and specified the number of classes to determine the number of colors. Now the legend and scatter colors match. How would you like the colors and sizes to be? – r-beginners Sep 07 '21 at 13:00
  • I commented on the problems in my [colab notebook](https://colab.research.google.com/drive/1DMBMlICT-iq5_i5Oz-NC5WS4eBPRdgrB?usp=sharing), if you'd like to take a look. I noticed that you set `s=df.z*10`, so how about making it more realistic by multiplying by the number of instances in each cluster (the cluster sizes) in a loop, `s=df.z*cluster_Sizes[i]`, where `i` runs from 0 to the last cluster number? – Mario Sep 07 '21 at 14:22
  • I understand that the size of a scatter plot indicates a quantitative element. Doesn't your size represent the number of cases? The purpose of it and the two ranges are the quantitative legend and the category legend. – r-beginners Sep 07 '21 at 14:29
  • It does. If you check the colab notebook, `cluster_Sizes = clusterSizes["size"].unique()` provides an array with the number of instances in each cluster. Meanwhile `cluster_numbers = clusterSizes["size"].count() -1` indicates predicted cluster labels. Alternatively, `cluster_df` holds the same information as a dataframe. The legends still don't match my expectation, please see [here](https://i.imgur.com/mITTHk6.jpg) – Mario Sep 07 '21 at 15:23
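For the request in the comments (a reusable function whose legends show the class colors and the exact cluster sizes), one possible shape is sketched below. This is not the code from either notebook: the function name `plot_3d_clusters`, the `max_area` parameter, and the demo frame are all assumptions made for illustration. It counts the cluster sizes from the label column itself and builds both legends from proxy `Line2D` handles.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.lines import Line2D


def plot_3d_clusters(df, label_col="prediction", max_area=200.0):
    """Hypothetical helper: color by cluster, marker area scaled by cluster size."""
    counts = df[label_col].value_counts().sort_index()   # label -> exact size
    labels = list(counts.index)
    cmap = plt.get_cmap("rainbow")
    denom = max(len(labels) - 1, 1)
    palette = {lab: cmap(i / denom) for i, lab in enumerate(labels)}
    area = counts / counts.max() * max_area               # marker area per cluster

    # Per-point arrays with len(df) entries, as scatter requires.
    point_colors = [palette[lab] for lab in df[label_col]]
    point_sizes = df[label_col].map(area).to_numpy()

    fig = plt.figure(figsize=(12, 10))
    ax = fig.add_subplot(projection="3d")
    ax.scatter(df["x"], df["y"], df["z"],
               c=point_colors, s=point_sizes, alpha=0.6, marker="o")
    ax.set_xlabel("PC1: x")
    ax.set_ylabel("PC2: y")
    ax.set_zlabel("PC3: z")

    # Legend 1: exact cluster sizes. markersize is in points, scatter's s is
    # in points squared, hence the square root.
    size_handles = [Line2D([], [], lw=0, marker="o", color="gray", alpha=0.6,
                           markersize=float(np.sqrt(area.loc[lab])))
                    for lab in labels]
    size_labels = [f"cluster {lab}: {counts.loc[lab]} points" for lab in labels]
    legend1 = ax.legend(size_handles, size_labels, bbox_to_anchor=(1.2, 1),
                        loc="upper right", title="Sizes")
    ax.add_artist(legend1)  # keep the first legend when the second is created

    # Legend 2: one colored marker per cluster label.
    class_handles = [Line2D([], [], lw=0, marker="o", color=palette[lab])
                     for lab in labels]
    ax.legend(class_handles, [str(lab) for lab in labels],
              bbox_to_anchor=(-0.05, 0.05), loc="lower left", title="Classes")
    return fig, ax


# Demo on the first eight rows of the table in the question.
demo = pd.DataFrame({
    "prediction": [9, 4, 1, 4, 3, 1, 5, 1],
    "x": [-6.0776997, -1.0122601, -8.297007, -3.5071216,
          -5.122823, -2.4764006, 7.3153954, -2.0167463],
    "y": [-2.9096103, 7.322841, 6.3228936, 4.784812,
          -3.3220499, 8.255791, -5.079449, 9.303454],
    "z": [-1.5181729, -5.4424076, 1.1672047, -5.4449472,
          -0.5069805, 4.409478, -7.291215, 7.095179],
})
fig, ax = plot_3d_clusters(demo)
fig.savefig("clusters_3d.png")
```

Because the sizes are derived with `value_counts()` on the label column, the function needs only one dataframe. If the sizes have to come from the model summary instead, the same lookup works with that series in place of `counts`.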