Modified version of Student Project Allocation algorithm

Question

I am working on a project for a non-profit organization where they are trying to help students with special needs to match to different project topics. Each student will have four preferences and a set of supervisors will have their list of preferences on the topics they supervise.

The solution I look for is an algorithm which can find an optimum solution to match students to project topics and supervisors.

I have done extensive reading on SPA, HR and other Greedy Algorithms and even tried a flavour of Genetic algorithm. So far I have nothing but stress.

Here is the flow of the program.

We have a pool of topics for supervisors to show their interest. Supervisors can pick topics where they like to supervise and they can also suggest a topic and decide how many project groups they would like to supervise.

P1, P2, P3, P4, P5 ...... Pn ... SP1, SP2, SP3 .... SPn

In the above list, P1 ... Pn are existing topics and SP1...SPn are suggested topics.

Let's say after this round, we have a list of supervisors with the following preference.

supervisor | Topics of Interest | No. Of Groups
L1         | P1, P3, P4         |     2
L2         | P5, P2, P9         |     1
L3         | P1, P3, P4         |     1
L4         | P1, P3, P4         |     4
L5         | SP1, P3, P8        |     3
L6         | P32, P3, P40       |     3

After the above round, we know that there are only supervisors who can Supervise students on the following topics.

P1, P2, P3, P4, P8, P9, P32, P40, SP1

When we open the topics for the students, they will only be able to pick projects from the above list, with their preference/priority. For example

student    | Pref1 | Pref 2 | Pref 3 | Pref 4 |
S1         |  P4   |  P1    |  SP1   |   P5   |
S2         |  P1   |  P9    |  SP1   |   P5   |
S3         |  P3   |  P1    |  P2    |   P5   |
S4         |  P4   |  P1    |  P40   |   P5   |
S5         |  P4   |  P32   |  SP1   |   P5   |
...
Sn         |  P9   |  P1    |  SP1   |   P5   |

Now, once the students pick the preference, We will then decide a number MAX_GROUP_SIZE and we will run our algorithm to group these students into partitions where we will

a. Group students with similar interests into same group ( eg. We add Students who picked P1 as their pref1 and fill in the rest with pref2, pref3, pref4 when they don't have groups for their first choices). b. Assign a supervisor to a group where he has shown interest in the project ( Ideally, every students first preferences or best-matched project) c. We need to make sure that we don't overload the supervisor, if a supervisor has shown interest in P1, P2, P3 and mentioned that he can only supervise 2 projects, then we should only add him to 2 projects.

So far, I have been trying the above-explained algorithms and I still don't think I have a justified solution for the students. I prefer a solution which is more biased towards the students since they have special needs. If anyone can point me in the right direction or can provide me with a well-defined algorithm or an implementation I would not only appreciate the effort but I would buy you a coffee as well.

Linear programming make marvels for this kind of allocation problems, but you would have to define an objective function you wish to optimize. If you want to use a meta heuristic algorithm (such as a genetic algorithm as you tagged), you will have to define one too (objective function will be called 'fitness' in genetic algorithms). So I'd suggest you to start with that. — m.raynal, Jul 06 '20 at 13:10

David Eisenstat · Answer 1 · 2020-07-09T14:00:08.243

Speaking as someone who does this kind of thing for a living, the core of this problem is pretty similar to a standard problem called "capacitated facility location", which at the scales I imagine you're dealing with can be handled cleanly by integer programming. I can vouch for the free Google OR-Tools (disclaimer: yep, that's my employer; nope, not speaking for them), but you have several other free and paid options (SCIP, lpsolve, Gurobi, CPLEX).

Integer programming is pretty nice: you declare some variables, write some constraints and an objective on those variables, push a button and get a(n often optimal) solution.

Here we're going to have the following binary variables:

For each pair of (student i, potential project j for student i), we have a 0-1 variable Assign[i,j] that is 1 if that student does that project and 0 otherwise.
For each pair of (advisor k, potential project j for advisor k), we have a 0-1 variable Avail[k,j] that is 1 if that advisor does that project and 0 otherwise.

The objective is

minimize sum_{i,j} PreferenceValue[i,j] Assign[i,j],

where PreferenceValue[i,j] has lower values to indicate a student's more preferred projects. You could use 1,2,3,4 for example for first, second, third, fourth choice; or bias toward first choices with 1,2,2,2; or bias toward fairness with 1,4,9,16. Lots to play with, have fun. As requested, this objective does not care what we make the advisors do.

The constraints are

for each student i, sum_j Assign[i,j] = 1,

i.e., each student is assigned exactly one project;

for each advisor k, sum_j Avail[k,j] ≤ MaxGroups[k],

i.e., no advisor has more work than they want;

for each student i and project j, Assign[i,j] ≤ sum_k Avail[k,j],

i.e., each student can only be assigned to a project if it is available;

for each project j, sum_i Assign[i,j] ≤ MaxGroupSize,

i.e., each group has at most MaxGroupSize students.

OR-Tools doesn't let you write "for each" and "sum" like that, so you'll have to write a short program to expand them. Read the OR-Tools documentation.

Hopefully this is enough of a start that when you build it and it inevitably disappoints your expectations, you can figure out how to add more constraints to prevent solutions that you don't want. Good luck!

ldog · Answer 2 · 2020-07-27T17:10:24.873

There is ambiguity in your problem statement that, depending on how it is resolved, will change the algorithm that you would want to employ. I will discuss the ambiguity later.

As others have suggested, this falls into the domain of combinatorial optimization and there are many different OR tools that could be used to solve this.

To start, I would suggest employing a sequence of weighted bipartite matchings with (possibly) solution pruning.

Here is a solution written in python using networkx based on a sequence of two bipartite matchings (the first being a weighted one for students, the second being unweighted.)

#!/usr/bin/python

"""
filename: student_assign.py
purpose:  demonstrate that the problem described in 
          https://stackoverflow.com/questions/62755778/modified-version-of-student-project-allocation-algorithm
          can be solved as a sequence of assignment problems solved through a weighted bipartite matching.
"""


import networkx as nx
import numpy as np

# For this demonstration we take data directly from the problem description
#supervisor | Topics of Interest | No. Of Groups
#L1         | P1, P3, P4         |     2
#L2         | P5, P2, P9         |     1
#L3         | P1, P3, P4         |     1
#L4         | P1, P3, P4         |     4
#L5         | SP1, P3, P8        |     3
#L6         | P32, P3, P40       |     3


supervisors = {
        'L1' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 2},
        'L2' : { 'topics' : ['P5', 'P2', 'P9'], 'num_groups' : 1},
        'L3' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 1},
        'L4' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 4},
        'L5' : { 'topics' : ['SP1', 'P3', 'P8'], 'num_groups' : 3},
        'L6' : { 'topics' : ['P32', 'P3', 'P40'], 'num_groups' : 3},
        }

all_topics = sorted(list({ t  for s in supervisors for t in supervisors[s]['topics'] }))


# assuming there is a typo in the problem specification and 'supervisor' = 'student' below
#supervisor | Pref1 | Pref 2 | Pref 3 | Pref 4 |
#S1         |  P4   |  P1    |  SP1   |   P5   |
#S2         |  P1   |  P9    |  SP1   |   P5   |
#S3         |  P3   |  P1    |  P2    |   P5   |
#S4         |  P4   |  P1    |  P40   |   P5   |
#S5         |  P4   |  P32   |  SP1   |   P5   |
#S6         |  P9   |  P1    |  SP1   |   P5   |

students = {
        'S1' : ['P4', 'P1', 'SP1', 'P5'] ,
        'S2' : ['P1', 'P9', 'SP1', 'P5'] ,
        'S3' : ['P3', 'P1', 'P2', 'P5'] ,
        'S4' : ['P4', 'P1', 'P40', 'P5'] ,
        'S5' : ['P4', 'P32', 'SP1', 'P5'] ,
        'S6' : ['P9', 'P1', 'SP1', 'P5'] ,
        }

MAX_GROUP_SIZE = 2


def get_student_assignments_to_topics(all_topics,students,max_group_size=MAX_GROUP_SIZE):
    G = nx.DiGraph()
    G.add_node('sink',demand=len(students))

    for topic in all_topics:
        G.add_node(topic)
        G.add_edge(topic, 'sink', weight = 0, capacity = max_group_size)

    for student in students:
        prefs = students[student]
        G.add_node(student,demand=-1)
        # add increasing weight edges from student to preferences (lowest == best)
        for i, topic in enumerate(prefs):
            G.add_edge(student, topic, weight = i, capacity = 1)

    # solve the weighted matching
    flow_dict = nx.min_cost_flow(G)

    # decode which student is assigned to which topic
    student_assignments = { t : [] for t in all_topics}
    for student in students:
        adjacency = flow_dict[student]
        prefs = students[student]
        for pref in prefs:
            if adjacency[pref]:
                student_assignments[pref].append(student)
                break

    return student_assignments


def get_topic_assignments_to_supervisors(student_assignments,supervisors):
    non_empty_student_assignments = { topic : student_assignments[topic] for topic in student_assignments if len(student_assignments[topic]) > 0}

    G = nx.DiGraph()
    G.add_node('sink',demand=len(non_empty_student_assignments))

    for topic in non_empty_student_assignments:
        G.add_node(topic,demand=-1)

    for supervisor in supervisors:
        supervisor_properties = supervisors[supervisor]
        for topic in supervisor_properties['topics']:
            if topic in non_empty_student_assignments:
                G.add_edge(topic, supervisor, weight = 0, capacity = 1)
        G.add_edge(supervisor, 'sink', weight = 0, capacity = supervisor_properties['num_groups'])

    # solve the unweighted matching
    flow_dict = nx.min_cost_flow(G)

    # decode which supervisor is assigned to which topic
    topic_assignments = { s : [] for s in supervisors}
    for supervisor in supervisors:
        supervisor_properties = supervisors[supervisor]
        for topic in supervisor_properties['topics']:
            if topic in non_empty_student_assignments:
                adjacency = flow_dict[topic]
                if adjacency[supervisor]:
                    topic_assignments[supervisor].append(topic)

    return topic_assignments




# assign students to topics by preference
student_assignments = get_student_assignments_to_topics(all_topics,students)
# assign all topics with at least one student to a supervisor who fits the criteria
topic_assignments = get_topic_assignments_to_supervisors(student_assignments,supervisors)

print 'These are the assignment of students to topics based on preference:'
print student_assignments
print 'These are the assignment of topics to supervisors based on availability:'
print topic_assignments

This script outputs:

These are the assignment of students to topics based on preference:
{'P2': [], 'P3': ['S3'], 'P1': ['S2', 'S1'], 'P4': ['S5', 'S4'], 'P5': [], 'SP1': [], 'P8': [], 'P9': ['S6'], 'P32': [], 'P40': []}
These are the assignment of topics to supervisors based on availability:
{'L6': [], 'L4': ['P1', 'P3'], 'L5': [], 'L2': ['P9'], 'L3': ['P4'], 'L1': []}

Ambiguity

There is ambiguity as to how you want to handle important edge cases:

what if topics have no interest from students?
what if a topic only has one interested student?
students may have to rank all possible topics to ensure a solution exists?
should supervisors have preference for topics as well (if so whose preference takes precedence?)
should supervisor assignment to topics be load balanced (solutions with all supervisors having similar amount of work being preferred)?

Answers to these specific questions that disambiguiate are very important and will shape the type of solution you craft (as well as being able to communicate to users of your algorithm what exactly is optimized.)

I definitely recommend you spend more time disambiguiating your problem.

Solution existence

The sequential bipartite matching algorithm presented here will find optimal solutions; however, it may not find a solution even if one exists.

This can happen if the solution of the first matching produces a set of projects for which there is no assignment of supervisors.

One possible way to address this is to systematically search through subsets of possible projects until a solution exists (see pruning below.)

Pruning solutions

If some assignments of students to topics is unfavorable, an easy way to prevent that solution from being possible is to set weights of student-topic assignment very high (infinity.) This gives a structured way to prune unwanted pairings:

Solve the weighted bipartite matching
Identify undesirable student-topic pairing
Set weight to infinity or remove edge between student-topic pairing, resolve.

Efficiency

Here python was used with networkx to optimize prototyping ability not efficiency. If you wish to scale this solution to large problem sizes, I would recommend the lemon MCF library (in particular the cost scaling MCF algorithm) or Andrew V Goldberg's original cost-scaling MCF algorithm implementation.

In my experience benchmarking MCF, these are the two most competitive implementations. I don't have experience with Google-OR's implementation of MCF.

ldog · Accepted Answer · 2020-07-15T18:16:13.043

This is a (more correct) answer based on the same approach as the previous answer, however, it solves the entire problem as a single weighted bipartite matching.

The same considerations apply as for the previous answer; however, this answer will find an answer if one exists. It has to, however, condition on the number of projects used in the final solution, so it can find multiple "good" solutions for different numbers of projects used (a project is considered used if it has 1 or more students.)

#!/usr/bin/python

"""
filename: student_assign.py
purpose:  demonstrate that the problem described in 
          https://stackoverflow.com/questions/62755778/modified-version-of-student-project-allocation-algorithm
          can be solved as an instance of MCF.
"""


import networkx as nx

# For this demonstration we take data directly from the problem description
#supervisor | Topics of Interest | No. Of Groups
#L1         | P1, P3, P4         |     2
#L2         | P5, P2, P9         |     1
#L3         | P1, P3, P4         |     1
#L4         | P1, P3, P4         |     4
#L5         | SP1, P3, P8        |     3
#L6         | P32, P3, P40       |     3


supervisors = {
        'L1' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 2},
        'L2' : { 'topics' : ['P5', 'P2', 'P9'], 'num_groups' : 1},
        'L3' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 1},
        'L4' : { 'topics' : ['P1', 'P3', 'P4'], 'num_groups' : 4},
        'L5' : { 'topics' : ['SP1', 'P3', 'P8'], 'num_groups' : 3},
        'L6' : { 'topics' : ['P32', 'P3', 'P40'], 'num_groups' : 3},
        }

all_topics = sorted(list({ t  for s in supervisors for t in supervisors[s]['topics'] }))


# assuming there is a typo in the problem specification and 'supervisor' = 'student' below
#supervisor | Pref1 | Pref 2 | Pref 3 | Pref 4 |
#S1         |  P4   |  P1    |  SP1   |   P5   |
#S2         |  P1   |  P9    |  SP1   |   P5   |
#S3         |  P3   |  P1    |  P2    |   P5   |
#S4         |  P4   |  P1    |  P40   |   P5   |
#S5         |  P4   |  P32   |  SP1   |   P5   |
#S6         |  P9   |  P1    |  SP1   |   P5   |

students = {
        'S1' : ['P4', 'P1', 'SP1', 'P5'] ,
        'S2' : ['P1', 'P9', 'SP1', 'P5'] ,
        'S3' : ['P3', 'P1', 'P2', 'P5'] ,
        'S4' : ['P4', 'P1', 'P40', 'P5'] ,
        'S5' : ['P4', 'P32', 'SP1', 'P5'] ,
        'S6' : ['P9', 'P1', 'SP1', 'P5'] ,
        }

MAX_GROUP_SIZE = 2


def get_student_topic_supervisor_assignments(all_topics,students,supervisors,num_topics_used,max_group_size=MAX_GROUP_SIZE,do_supervisor_load_balancing=False):

    G = nx.DiGraph()
    G.add_node('sink',demand=len(students) - num_topics_used)

    for topic in all_topics:
        G.add_node(topic)
        G.add_edge(topic, 'sink', weight = 0, capacity = max_group_size-1)

    for student in students:
        prefs = students[student]
        G.add_node(student,demand=-1)
        # add increasing weight edges from student to preferences (lowest == best)
        for i, topic in enumerate(prefs):
            G.add_edge(student, topic, weight = i, capacity = 1)


    G.add_node('sink_2',demand=num_topics_used)

    for topic in all_topics:
        G.add_node(topic + "_2")
        G.add_edge(topic, topic + "_2", weight = 0, capacity = 1 )

    for supervisor in supervisors:
        supervisor_properties = supervisors[supervisor]
        for topic in supervisor_properties['topics']:
            G.add_edge(topic + "_2", supervisor, weight = 0, capacity = 1)
        if do_supervisor_load_balancing:
            for i in range(supervisor_properties['num_groups']):
                G.add_node(supervisor + "_dummy")
                G.add_edge(supervisor, supervisor + "_dummy", weight = i, capacity = 1)
                G.add_edge(supervisor + "_dummy", 'sink_2', weight = 0, capacity = 1)
        else:
            G.add_edge(supervisor, 'sink_2', weight = 0, capacity = supervisor_properties['num_groups'])

    # solve the weighted matching
    flow_dict = nx.min_cost_flow(G)
    
    for topic in all_topics:
        edges = flow_dict[topic]
        if edges['sink'] and  not edges[topic+"_2"]:
            raise RuntimeError('Solution with num_topics_used={n} is not valid.'.format(n=num_topics_used))

    # decode solution
    topic_assignments = {t : [] for t in all_topics}
    for student in students:
        edges = flow_dict[student]
        for target in edges:
            if edges[target]:
                topic_assignments[target].append(student)
                break

    supervisor_assignments = {s : [] for s in supervisors}
    for topic in all_topics:
        edges = flow_dict[topic+"_2"]
        for target in edges:
            if edges[target]:
                supervisor_assignments[target].append(topic)
    
    return topic_assignments, supervisor_assignments

num_students = len(students)
for n in range(1,num_students+1):
    try:
        topic_assignments, supervisor_assignments =\
                get_student_topic_supervisor_assignments(all_topics,students,supervisors,num_topics_used=n)
        print ' An optimal solution was found with `num_topics_used`={n}'.format(n=n)
        print ' Topic assignments:\n', topic_assignments
        print ' Supervisor assignments:\n', supervisor_assignments
    except Exception as e:
        pass

This outputs:

 An optimal solution was found with `num_topics_used`=4
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2', 'S4'], 'P4': ['S1', 'S5'], 'P5': [], 'SP1': [], 'P8': [], 'P9': ['S6'], 'P32': [], 'P40': []}
 Supervisor assignments:
{'L6': ['P3'], 'L4': ['P4'], 'L5': [], 'L2': ['P9'], 'L3': ['P1'], 'L1': []}
 An optimal solution was found with `num_topics_used`=5
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2'], 'P4': ['S1', 'S4'], 'P5': [], 'SP1': [], 'P8': [], 'P9': ['S6'], 'P32': ['S5'], 'P40': []}
 Supervisor assignments:
{'L6': ['P3', 'P32'], 'L4': ['P1'], 'L5': [], 'L2': ['P9'], 'L3': ['P4'], 'L1': []}
 An optimal solution was found with `num_topics_used`=6
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2'], 'P4': ['S4'], 'P5': [], 'SP1': ['S1'], 'P8': [], 'P9': ['S6'], 'P32': ['S5'], 'P40': []}
 Supervisor assignments:
{'L6': ['P3', 'P32'], 'L4': ['P1'], 'L5': ['SP1'], 'L2': ['P9'], 'L3': ['P4'], 'L1': []}

Supervisor Load Balancing

An update of this solution added an extra parameter to the function do_supervisor_load_balancing, which (when set to true) will prefer solutions where the number of topics each supervisor is assigned is similar.

Note, that using load balancing can potentially set the two criteria at odds:

balancing supervisor loads
giving students preference on which projects they work

Setting the weights of one higher than the other (by an order of magnitude) will ensure that criteria is much more highly weighted. As it stands, the solution presented here gives about equal weight to both criteria.

In the above example, when load balancing is used the following is output:

 An optimal solution was found with `num_topics_used`=4
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2', 'S4'], 'P4': ['S1', 'S5'], 'P5': [], 'SP1': [], 'P8': [], 'P9': ['S6'], 'P32': [], 'P40': []}
 Supervisor assignments:
{'L6': ['P3'], 'L4': [], 'L5': [], 'L2': ['P9'], 'L3': ['P4'], 'L1': ['P1']}
 An optimal solution was found with `num_topics_used`=5
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2'], 'P4': ['S1', 'S4'], 'P5': [], 'SP1': [], 'P8': [], 'P9': ['S6'], 'P32': ['S5'], 'P40': []}
 Supervisor assignments:
{'L6': ['P32'], 'L4': [], 'L5': ['P3'], 'L2': ['P9'], 'L3': ['P4'], 'L1': ['P1']}
 An optimal solution was found with `num_topics_used`=6
 Topic assignments:
{'P2': [], 'P3': ['S3'], 'P1': ['S2'], 'P4': ['S4'], 'P5': [], 'SP1': ['S1'], 'P8': [], 'P9': ['S6'], 'P32': ['S5'], 'P40': []}
 Supervisor assignments:
{'L6': ['P32'], 'L4': ['P3'], 'L5': ['SP1'], 'L2': ['P9'], 'L3': ['P4'], 'L1': ['P1']}

@Idog I am interested in the load balancing, May I know how you would approach it? — Fawzan, Jul 15 '20 at 03:25
@Fawzan added supervisor load balancing to the solution. Note, use with caution, load balancing can be directly at odds with giving students their preferential topics, so re-weight your problem definition accordingly. — ldog, Jul 15 '20 at 18:17