-1

The format of my input file is the following:

PERSON1 BUILDING1
PERSON2 BUILDING4
PERSON3 BUILDING4
PERSON5 BUILDING3
PERSON3 BUILDING2
PERSON3 BUILDING1
PERSON5 BUILDING6
PERSON4 BUILDING6
1000 more rows like this

Each row should be read as "person X visited building Y".

I simply want to have clusters like this:

Cluster 1 : Persons that visited only 1 building (the same building)
Cluster 2 : Persons that visited only 2 buildings (the same buildings, let's say building 1 & 2)
Cluster 3 : Persons that visited only 2 buildings (the same buildings, let's say building 3 & 4)
Cluster 4 : Persons that visited only 3 buildings (the same buildings)
etc..

What would be the best way to do this? Is there software, ideally with data visualization, that can do it? I tried KNIME with no success.

Learthgz

2 Answers

0

You need to reformat your data appropriately.

Then use a group_by operation based on the set of buildings visited.

This is much simpler than clustering.
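As a rough illustration of the two steps above, here is a minimal Python sketch using only the standard library. It assumes each row is a whitespace-separated "PERSON BUILDING" pair as in the question; the function name `group_by_building_set` is made up for this example.

```python
from collections import defaultdict


def group_by_building_set(rows):
    """Group persons by the exact set of buildings they visited."""
    # Step 1: reformat the data to one entry per person,
    # holding the set of buildings that person visited.
    buildings_by_person = defaultdict(set)
    for line in rows:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        person, building = parts
        buildings_by_person[person].add(building)

    # Step 2: group by that set. A frozenset is hashable,
    # so it can serve as a dictionary key.
    persons_by_set = defaultdict(list)
    for person, buildings in buildings_by_person.items():
        persons_by_set[frozenset(buildings)].append(person)
    return dict(persons_by_set)


# The sample rows from the question.
rows = [
    "PERSON1 BUILDING1",
    "PERSON2 BUILDING4",
    "PERSON3 BUILDING4",
    "PERSON5 BUILDING3",
    "PERSON3 BUILDING2",
    "PERSON3 BUILDING1",
    "PERSON5 BUILDING6",
    "PERSON4 BUILDING6",
]

groups = group_by_building_set(rows)

# Print groups ordered by number of buildings, then by building names.
for buildings, persons in sorted(
    groups.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(buildings), "->", sorted(persons))
```

Each key of `groups` is one distinct set of buildings, and its value lists the persons who visited exactly that set, which matches the "clusters" described in the question.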

Has QUIT--Anony-Mousse
0

I second @Anony-Mousse: the solution is closer to a "group by" than to clustering. To prove that it works, I built a simple workflow in KNIME that produces the expected result. For the visualization part you mention, a correspondence analysis might be useful.

[Image: correspondence analysis chart]

This chart is implemented in R (you can use an R node) and shows how closely one entity (say, visitors, in blue) is related to another (say, buildings, in red). Of course, the proper chart depends on your full data and intentions.
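A correspondence analysis works on a two-way table of persons against buildings. As a hedged sketch of that preparation step only (not the analysis or the chart itself), the following Python builds a 0/1 person-by-building table from rows in the question's format; the function name `incidence_table` is an assumption for this example.

```python
def incidence_table(rows):
    """Return (persons, buildings, table), where table[i][j] is 1
    if persons[i] visited buildings[j] and 0 otherwise."""
    # Keep only well-formed "PERSON BUILDING" rows, de-duplicated.
    pairs = set()
    for line in rows:
        parts = line.split()
        if len(parts) == 2:
            pairs.add((parts[0], parts[1]))

    # Sorted labels give a stable row and column order.
    persons = sorted({p for p, _ in pairs})
    buildings = sorted({b for _, b in pairs})

    table = [
        [1 if (p, b) in pairs else 0 for b in buildings]
        for p in persons
    ]
    return persons, buildings, table


persons, buildings, table = incidence_table(
    ["PERSON1 BUILDING1", "PERSON3 BUILDING1", "PERSON3 BUILDING2"]
)
for person, row in zip(persons, table):
    print(person, row)
```

The resulting table can be written to CSV and handed to whatever correspondence analysis routine you prefer.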

Jason Angel