
Suppose I have a dataset that I want to run a Mahout clustering job on. I want each data point to have a unique identifier, such as an ID number. I don't want to append the ID to the vector, because it would then be included in the clustering calculations. How can I attach an identifier to each data point without the algorithm using the ID in its calculations? Is there a way to make the input a key-value pair, where the key is the ID and the value is the Vector I want to run the algorithm on?

Panther
Alison

1 Answer


Alison, before worrying about this, look at the output first. In many cases the output consists of lines of assigned cluster IDs, and the line order of the input and output files is the same. For example, the node on the first line of your input file will be on the first line of the output file. So you can keep the IDs in a separate file and their vectors in the input file. Then you can combine the ID file with the output file, line by line, to see which node is assigned to which cluster.
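A minimal sketch of that line-by-line join, in Python. It assumes the output really does preserve the input order and holds one cluster assignment per line, which is worth checking on your own data first. The function name `join_ids_with_clusters` and the sample IDs and cluster labels are hypothetical, made up for illustration.

```python
from itertools import zip_longest


def join_ids_with_clusters(ids, assignments):
    """Pair the i-th ID with the i-th cluster assignment.

    Both arguments are iterables of lines (lists or open file handles).
    Assumes line i of the ID file describes the same point as line i of
    the clustering output.
    """
    pairs = []
    for point_id, cluster in zip_longest(ids, assignments):
        # A length mismatch means the two files cannot be aligned by position.
        if point_id is None or cluster is None:
            raise ValueError("ID file and output file have different line counts")
        pairs.append((point_id.strip(), cluster.strip()))
    return pairs


# Hypothetical stand-ins for the contents of the two files.
ids = ["id-17\n", "id-42\n", "id-99\n"]
assignments = ["CL-0\n", "CL-1\n", "CL-0\n"]
result = join_ids_with_clusters(ids, assignments)
```

With real files you would pass the two open file handles instead of the lists. The length check is there because a silent truncation would attach the wrong ID to every later point.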

cuneyt
  • forgot to add. The R does that. – cuneyt Jul 20 '12 at 19:02
  • Thanks for your response, @cuneyt. I've looked at the output again and interestingly enough there is some sort of order to the output but it does not completely match up with the order of the input. For example, the first few points in my input file are listed consecutively in the output file, yet the first of these points doesn't appear until halfway through the output file, under a heading that reads CL-595, which I originally thought was a cluster ID followed by the points in the cluster. Have you seen this before? Am I reading the output file incorrectly? – Alison Jul 20 '12 at 19:21
  • Paste 10-20 lines of the output here; I am sure it will be self-explanatory enough for us. – cuneyt Jul 23 '12 at 12:20
  • here is some of the output (separated into multiple comments): CL-592{n=57 c=[30.726, 29.813, 30.744, 29.337, 29.865, 29.284, 29.719, 29.716, 28.154, 28.816, 27.901, 28.527, 28.006, 28.643, 28.464, 27.317, 27.985, 27.138, 27.178, 27.804, 27.598, 25.966, 26.486, 24.031, 23.986, 23.804, 24.387, 22.373, 23.139, 22.572, 21.657, 21.324, 21.325, 20.816, 20.613, 20.931, 20.134, 20.353, 19.669, 20.701, 20.136, 20.429, 19.707, 18.946, 18.342, 18.807, 18.924, 18.014, 19.538, 18.749, 18.329, 19.114, 17.410, 16.727, 18.531, 17.307, 17.218, 17.721, 16.722, 17.235] – Alison Aug 06 '12 at 13:09
  • r=[3.528, 3.597, 3.258, 3.315, 3.628, 3.271, 3.776, 3.754, 3.850, 3.553, 3.210, 3.890, 3.304, 3.653, 3.804, 3.861, 3.458, 3.916, 3.950, 4.090, 4.322, 4.370, 4.636, 4.640, 4.111, 4.641, 4.398, 4.035, 3.777, 4.081, 3.760, 3.383, 4.199, 3.450, 3.351, 4.139, 3.648, 3.516, 3.812, 3.688, 3.632, 3.707, 3.572, 3.896, 3.606, 3.977, 3.700, 4.472, 4.248, 3.967, 3.407, 3.958, 3.537, 4.399, 3.166, 4.216, 4.144, 3.861, 4.655, 4.598]} – Alison Aug 06 '12 at 13:11
  • Weight : [props - optional]: Point: 1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486, 34.447, 27.469, 33.179, 32.280, 31.612, 33.215, 30.145, 25.664, 26.510, 23.344, 22.746, 23.703, 25.613, 27.950, 30.915, 27.055, 32.099, 28.053, 25.602, 25.857, 23.649, 23.729, 20.707, 26.265, 24.739, 23.297, 28.814, 28.322, 24.125, 27.636, 19.490, 20.211, 23.685, 17.537, 24.913, 23.852, 17.429, 18.166, 26.208, 16.250, 18.389, 19.903, 17.949, 26.284, 16.435, 22.171, 16.566, 14.734, 20.814, 15.615, 25.051, 17.750, 22.335, 12.816, 20.545, 17.145, 16.969] – Alison Aug 06 '12 at 13:12
  • 1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188, 25.658, 26.887, 23.573, 34.070, 32.134, 24.226, 32.835, 28.736, 22.764, 27.075, 31.695, 23.068, 28.177, 30.347, 21.692, 23.520, 25.869, 20.738, 26.484, 25.945, 26.356, 26.610, 27.923, 22.344, 18.341, 25.289, 17.043, 23.898, 21.450, 21.012, 26.453, 19.442, 19.780, 23.152, 16.660, 23.176, 24.844, 21.370, 24.335, 22.465, 17.060, 12.203, 11.832, 15.639, 14.378, 17.319, 18.499, 10.786, 17.209, 15.585, 17.023, 19.042, 18.056, 17.958, 15.153, 9.625, 17.562] – Alison Aug 06 '12 at 13:12
  • 1.0 : [distance=28.97727289419493]: [30.696, 32.667, 34.223, 33.183, 34.835, 33.391, 33.175, 32.804, 24.116, 25.190, 22.739, 25.053, 32.679, 31.196, 32.160, 29.381, 23.589, 31.786, 24.265, 30.298, 21.200, 26.239, 30.859, 29.984, 21.029, 27.869, 18.415, 19.499, 23.458, 24.589, 25.958, 23.921, 26.189, 27.101, 27.984, 21.713, 20.958, 20.110, 16.171, 26.001, 21.950, 22.971, 17.464, 19.791, 14.989, 15.000, 24.850, 20.741, 23.414, 16.101, 15.681, 15.673, 23.288, 17.766, 21.817, 16.371, 12.139, 18.997, 17.320, 12.940] – Alison Aug 06 '12 at 13:13
  • 1.0 : [distance=21.999685652862784]: [32.702, 35.219, 30.143, 24.275, 28.156, 26.281, 26.887, 29.739, 28.588, 32.115, 28.952, 31.654, 23.860, 24.503, 26.140, 26.283, 25.897, 31.719, 24.259, 31.153, 27.673, 28.435, 27.952, 23.764, 20.125, 24.848, 27.495, 24.808, 20.754, 24.518, 18.523, 22.455, 25.533, 19.716, 17.452, 17.822, 18.375, 18.684, 17.331, 20.561, 21.989, 20.922, 15.342, 21.997, 23.533, 19.106, 20.590, 15.386, 23.640, 15.969, 16.974, 18.554, 18.152, 14.431, 18.404, 12.034, 16.727, 17.414, 10.661, 12.707] – Alison Aug 06 '12 at 13:13
  • 1.0 : [distance=20.02515456205999]: [30.343, 33.085, 28.130, 31.294, 28.719, 30.306, 26.441, 29.986, 25.757, 26.601, 27.699, 27.233, 29.376, 31.373, 30.535, 24.821, 23.137, 24.924, 30.362, 29.024, 28.737, 19.135, 19.318, 22.184, 24.326, 21.256, 24.222, 24.839, 24.351, 18.481, 21.962, 20.152, 18.972, 22.825, 22.988, 23.799, 18.610, 17.205, 17.968, 22.920, 21.987, 22.731, 18.080, 19.168, 20.863, 19.833, 16.373, 19.790, 16.253, 15.409, 16.462, 19.237, 14.938, 12.695, 16.116, 19.813, 17.155, 19.612, 19.827, 13.522] – Alison Aug 06 '12 at 13:14
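The excerpt above reads as a cluster header (`CL-592{n=57 c=[...] r=[...]}`) followed by one `weight : [distance=...]: [vector]` line per point, so the points are grouped by cluster rather than listed in input order. The following is a rough Python sketch that groups such lines by cluster, assuming that reading of the format is right. The regular expressions are inferred only from the pasted excerpt, and `parse_dump` and the shortened sample values are hypothetical.

```python
import re

# Cluster header, e.g. "CL-592{n=57 c=[...] r=[...]}".
CLUSTER_RE = re.compile(r"^\s*([A-Z]+-\d+)\{n=(\d+)")
# Point entry, e.g. "1.0 : [distance=27.45]: [24.672, 35.261, ...]".
POINT_RE = re.compile(r"([\d.]+)\s*:\s*\[distance=([^\]]+)\]:\s*\[([^\]]*)\]")


def parse_dump(lines):
    """Return {cluster_id: [(distance, vector), ...]} from dump lines."""
    clusters = {}
    current = None
    for line in lines:
        header = CLUSTER_RE.match(line)
        if header:
            current = header.group(1)
            clusters[current] = []
            continue
        point = POINT_RE.search(line)
        # Ignore point lines that appear before any cluster header.
        if point and current is not None:
            vector = [float(x) for x in point.group(3).split(",")]
            clusters[current].append((float(point.group(2)), vector))
    return clusters


# Shortened, made-up sample in the same shape as the excerpt.
sample = [
    "CL-592{n=2 c=[30.726, 29.813] r=[3.528, 3.597]}",
    "\tWeight : [props - optional]:  Point:",
    "\t1.0 : [distance=27.5]: [24.672, 35.261]",
    "\t1.0 : [distance=27.7]: [25.592, 29.951]",
]
clusters = parse_dump(sample)
```

Once the points are grouped this way, each vector can be matched back to its ID by looking it up in the original input, provided the vectors are unique.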