Time-Based Clustering of Multidimensional Data

Question

I'm trying to do clustering of a large number of people based on the pattern of their hours worked across a week. This is an example of the data I'm working with:

<table>
  <tr>
  <th>Name</th>
  <th>Monday (00:00 to 07:59)</th>
  <th>Monday (08:00 to 15:59)</th>
  <th>Monday (16:00 to 23:59)</th>
  </tr>
  <tr>
  <td>Guy1</td>
  <td>3</td>
  <td>5.5</td>
  <td>0.5</td>
  </tr>
  <tr>
  <td>Guy2</td>
  <td>0</td>
  <td>7</td>
  <td>2</td>
  </tr>
  <tr>
  <td>Guy3</td>
  <td>4</td>
  <td>4</td>
  <td>1</td>
  </tr>
</table>

I want to find clusters based on the pattern of their work hours. The actual data set I'm working with has over 10,000 rows (distinct individuals) and 42 columns (intervals of hours). I am using RStudio.

I want to see "profiles" of different individuals, which will be based on the similarity of the pattern of work hours in the week. For example, maybe one person's work hours are focused on 9am to 6pm on weekdays, showing that he belongs to the cluster of employees with regular schedules, while another's work hours are focused in the nighttime, indicating that the person works the night shift.

Note that I am an intern who hasn't graduated yet, and I just learned R today. This is also my first StackOverflow question, so pardon me for sounding ignorant or uninformed.

score 1 · Answer 1 · answered Jun 27 '17 at 10:21

You may want to have a look at the theory of clustering first, for example by looking at that post, and then follow up with some R code.

The reason is that clustering depends heavily on your data and on what you want to achieve. There is often no perfect solution, so you have to judge for yourself whether what you have done is good enough.

You can do some research on k-means and hierarchical clustering; there are plenty of resources on the internet. My favorite is the R help, which you can find in the Help tab of RStudio. Look up `hclust` or `kmeans` to get examples of how these functions work.
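To make the two functions concrete, here is a minimal sketch using base R only. The data is simulated (300 made-up people, 42 intervals, three artificial shift types), and the choice of `k = 3` and Ward linkage are assumptions for illustration, not recommendations for the real data.

```r
# Simulated stand-in for the real data: 300 people x 42 intervals
# (7 days x 3 eight-hour blocks). All values here are made up.
set.seed(42)
n_per_group <- 100
# Column order assumed: Mon night, Mon day, Mon evening, Tue night, ...
blocks <- rep(c("night", "day", "evening"), times = 7)

make_group <- function(n, main_block) {
  # People in a group work mostly in one block of each day
  means <- ifelse(blocks == main_block, 6, 0.5)
  m <- matrix(rnorm(n * 42, mean = rep(means, each = n), sd = 0.5), nrow = n)
  pmin(pmax(m, 0), 8)  # keep hours within an 8-hour block
}

hours <- rbind(make_group(n_per_group, "day"),
               make_group(n_per_group, "night"),
               make_group(n_per_group, "evening"))

# k-means: k = 3 is an assumption; nstart reruns from several random starts
km <- kmeans(hours, centers = 3, nstart = 20)
table(km$cluster)             # cluster sizes
round(km$centers[, 1:3], 1)   # average Monday profile of each cluster

# Hierarchical clustering on the same matrix
hc <- hclust(dist(hours), method = "ward.D2")
hc_cluster <- cutree(hc, k = 3)
table(km$cluster, hc_cluster) # cross-tabulate the two solutions
```

The rows of `km$centers` are the average hours per interval in each cluster, which is one way to read off the "profiles" the question asks about.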

You can also have a look at `hts`, which lets you create clusters of time series. That may solve the issue you can run into when creating a 10k × 10k distance matrix.
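As a rough check on the size of that distance matrix, the arithmetic can be sketched in R. `dist_size` is a hypothetical helper written for this illustration; it assumes 8 bytes per stored distance and ignores the small overhead of the R object itself.

```r
# dist() stores only the lower triangle: n * (n - 1) / 2 values
dist_size <- function(n, bytes_per_value = 8) {
  n_pairs <- n * (n - 1) / 2
  c(pairs = n_pairs, megabytes = n_pairs * bytes_per_value / 1e6)
}

dist_size(10000)
# 49,995,000 pairwise distances, about 400 MB as doubles

# A full square 10k x 10k matrix (e.g. as.matrix() on a dist object)
# takes twice that:
10000^2 * 8 / 1e6   # 800 MB
```

Whether that is a problem depends on the memory available and on how many copies of the matrix a given function makes internally.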

Thanks for your suggestions! Would you know if the package "TSclust" would be applicable to my situation? — NewbCoder, Jun 30 '17 at 02:56
`TSclust` is a very good package for time series clustering, but I fear you'll have to sample your dataset, as clustering 10K rows may be too much for the package. Try and give feedback :) — YCR, Jun 30 '17 at 08:19
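
The sampling idea from the comment above could be sketched like this. It uses base R (`hclust`, not `TSclust`), random made-up data in place of the real matrix, and an arbitrary sample size of 1,000 and `k = 4`; all of these are assumptions for illustration.

```r
set.seed(1)
# Made-up stand-in for the real 10,000 x 42 matrix of hours
hours <- matrix(runif(10000 * 42, min = 0, max = 8), nrow = 10000)

# 1. Cluster a random sample of rows (sample size and k are arbitrary)
sample_idx <- sample(nrow(hours), 1000)
hc <- hclust(dist(hours[sample_idx, ]), method = "ward.D2")
k <- 4
sample_cluster <- cutree(hc, k = k)

# 2. Mean profile (centroid) of each cluster found in the sample
centroids <- t(sapply(seq_len(k), function(g)
  colMeans(hours[sample_idx[sample_cluster == g], , drop = FALSE])))

# 3. Assign every row, sampled or not, to its nearest centroid
#    (squared Euclidean distance)
sq_dist <- sapply(seq_len(k), function(g)
  rowSums(sweep(hours, 2, centroids[g, ])^2))
all_cluster <- max.col(-sq_dist, ties.method = "first")
table(all_cluster)
```

Only the 1,000 sampled rows go through `dist()`, so the distance object stays small, while step 3 still gives every individual a cluster label.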
