I have this code:

import apache_beam as beam
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

class distKmeans(beam.DoFn):

  # __init__ stores the K-means parameters
  def __init__(self, n_clusters,rseed=2):
    self.n_clusters = n_clusters
    self.rseed = rseed
    self.centers = None

#The function "process" implements the main functionality of the K-means algorithm
  def process(self,element):
    if self.centers is None:
     
        rng = np.random.RandomState(self.rseed)
        #we use len instead of shape because element is a PCOLLECTION
        i = rng.permutation(element.shape[0])[:self.n_clusters]
        self.centers = element[i]

    # b1. Calculate the closest center μ to xi
    labels = pairwise_distances_argmin(element, self.centers)

    # b2. Update the center
    new_centers = np.array([element[labels == i].mean(0)
                                for i in range(self.n_clusters)])
        
    # c. 
    if np.all(self.centers == new_centers):
      return
    self.centers = new_centers

    yield self.centers, labels

  
with beam.Pipeline() as pipeline:
  mydata = pipeline | beam.Create(X)
  mydata = mydata |beam.ParDo(distKmeans(3))
  mydata |"write" >> beam.io.WriteToText("sample_data/output.txt")

I'm trying to create a distributed K-means with Apache Beam. My data was generated using this code:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

n_samples=200
n_features=2
X, y = make_blobs(n_samples=n_samples,centers=3, n_features=n_features)
data = np.c_[X,y]
plt.scatter(data[:, 0], data[:, 1], s=50);

X is then defined as:

X = data[['X1','X2']].to_numpy()
X = X[1:]

Its shape is (200, 2).

The code seems correct, but I always get the following error even though my data is a 2D array:

Expected 2D array, got 1D array instead:
array=[-6.03120913 11.30181549].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. [while running '[54]: ParDo(distKmeans)']

The error is raised on this line:

 labels = pairwise_distances_argmin(element, self.centers)

Olaf Kock
Nadia Nadou
  • Did you try: labels = pairwise_distances_argmin(element.reshape(-1, 1), self.centers) ? – Pren Ven Jan 28 '23 at 13:40
  • Yes, I did, and it keeps showing the same error :/ – Nadia Nadou Jan 28 '23 at 13:45
  • Can you check if you are passing the right array to the function? In your code, you are passing the element variable to the pairwise_distances_argmin function. Are you sure element is a 2D array? Also, when you reshape the array, you should be specifying the number of columns and the number of rows. If you don't know the number of rows, you should use -1. For example, element = element.reshape(-1, 1). – Pren Ven Jan 28 '23 at 13:49
  • So I added a print(X) under class distKmeans(...): and it printed the whole data, which has the correct (200, 2) shape. I added a print(element) before labels = pairwise_distances_argmin(element, self.centers) and it prints only the first row of the data. – Nadia Nadou Jan 28 '23 at 14:23
  • Can you clarify your requirements and which [Beam Pipeline Runner](https://beam.apache.org/get-started/beam-overview/) you are using? – kiran mathew Jan 29 '23 at 08:30
  • I am unable to reproduce your data (I get an index error trying to execute `X = data[['X1','X2']].to_numpy()`). However, could it be that you simply have to use `beam.Create([X])`? By simply putting `X` into `beam.Create`, each entry of the numpy array is propagated individually down your pipeline and thus no longer in the correct shape? – CaptainNabla Jan 29 '23 at 13:56
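
The last comment's point can be sketched without Beam at all. This is only an illustration, not the asker's pipeline. It assumes that beam.Create(values) iterates over values and emits each item as one element, which list(values) mimics here. Plain nested lists stand in for the NumPy array, and emitted_elements and shape are hypothetical helpers written for this sketch, not Beam or NumPy functions.

```python
# Plain-Python stand-in (no Beam, no NumPy): nested lists play the role of X.
# Assumption: beam.Create(values) iterates over `values` and emits each item
# as a separate element; list(values) mimics that behaviour.

def emitted_elements(values):
    """Hypothetical helper: the items a Create-like source would emit."""
    return list(values)

def shape(obj):
    """Return the nested-list shape, e.g. (200, 2) for 200 rows of 2 floats."""
    dims = []
    while isinstance(obj, list):
        dims.append(len(obj))
        obj = obj[0]
    return tuple(dims)

# 200 samples with 2 features, like the (200, 2) array in the question.
X = [[float(i), float(i) + 0.5] for i in range(200)]

per_row = emitted_elements(X)    # mimics beam.Create(X)
whole = emitted_elements([X])    # mimics beam.Create([X])

print(len(per_row), shape(per_row[0]))  # 200 elements, each a 1D row
print(len(whole), shape(whole[0]))      # 1 element, still 2D
```

If that assumption holds, process() in the question receives one 1D row at a time, which would match both the "Expected 2D array, got 1D array" message and the print(element) output showing only the first row. Wrapping the array as beam.Create([X]) would instead hand the whole 2D array to process() as a single element.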
