In Search of Similar Courses

Or, ‘How I went round and round in circles… and then round and round some more’

One of the ideas suggested at the beginning of the project was to mine course data in order to provide suggestions for similar courses. Since we now have access to the APMS data (a lot more data than my dummy set), suggesting similar courses is something that I can attempt properly.

The first step in the process was deciding on a method of finding keywords within the various text descriptions of programmes and modules. One option was OpenCalais, which had been used in a previous project at the university, JISCPress.

OpenCalais, which has an easy-to-use API, takes a body of text and returns a series of keywords, broken down by type, each with a relevancy score that indicates how strongly the keyword relates to the body of text. Initially I looked at using an existing PHP library to interface with the API (reinventing the wheel and all that), but the ones I found did not return all of the data I wanted in an easy-to-use manner, so I wrote my own code, which will be available on GitHub.
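As a rough illustration of the data described above (a Python sketch, not the code written for the project), each keyword can be held as a small record with a name, a type and a relevancy score. The `Keyword` class, the `group_by_type` helper and the sample values are hypothetical; they are not the actual OpenCalais response format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    name: str         # the keyword text
    type: str         # the keyword type, e.g. "City" or "Person"
    relevance: float  # 0.0 to 1.0: how strongly the keyword relates to the text


def group_by_type(keywords):
    """Return {type: [Keyword, ...]}, each list sorted by descending relevance."""
    grouped = defaultdict(list)
    for kw in keywords:
        grouped[kw.type].append(kw)
    return {
        kw_type: sorted(kws, key=lambda k: k.relevance, reverse=True)
        for kw_type, kws in grouped.items()
    }


# Made-up sample values, for illustration only.
sample = [
    Keyword("Lincoln", "City", 0.31),
    Keyword("forensic science", "IndustryTerm", 0.62),
    Keyword("crime scene analysis", "IndustryTerm", 0.74),
]
by_type = group_by_type(sample)
```

Keeping the type alongside each keyword matters later on, because whole types of keyword end up being ignored.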

When I first started this process, I had access to data relating to 6436 modules of study, which are part of 878 individual programmes of study for a total of 349 courses. (If one course has two different years’ intake, it will be represented by 2 programmes. A similar situation exists with modules.)

In the first instance, I looked at generating keywords for all programmes of study (a mistake, which I cover later) and modules. I decided to use the following fields to generate keywords for programmes:

  • ‘Aims and Objectives’
  • ‘Introduction’
  • ‘Specialist Facilities’
  • ‘Career Opportunities’

I also used the ‘Synopsis’ field for modules.
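A minimal sketch of how those fields might be combined into a single body of text before being sent for keyword extraction. The dictionary-based record and the `build_text` helper are assumptions made for illustration; the post does not say how the fields were actually combined.

```python
PROGRAMME_FIELDS = (
    "Aims and Objectives",
    "Introduction",
    "Specialist Facilities",
    "Career Opportunities",
)
MODULE_FIELDS = ("Synopsis",)


def build_text(record, fields):
    """Join the non-empty description fields of a record into one body of text."""
    parts = (record.get(field) or "" for field in fields)
    return "\n\n".join(part.strip() for part in parts if part.strip())


# Made-up programme record; missing or empty fields are simply skipped.
programme = {
    "Introduction": "  An introduction to forensic science. ",
    "Aims and Objectives": "To develop laboratory skills.",
    "Career Opportunities": None,
}
programme_text = build_text(programme, PROGRAMME_FIELDS)
module_text = build_text({"Synopsis": "Crime scene basics."}, MODULE_FIELDS)
```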

This process generated 3,335 keywords, broken down into 36 types of keywords. 17,835 links were generated between programmes of study and keywords and 19,255 links between keywords and modules.

By ‘joining’ programmes that have a keyword in common and setting a minimum relevancy threshold, the following numbers of links between programmes were generated.

| Minimum Relevance Score | Links between Programmes |
|---|---|
| 0.0 | 1,641,336 |
| 0.1 | 1,453,374 |
| 0.2 | 1,212,506 |
| 0.3 | 886,996 |
| 0.4 | 778,221 |
| 0.5 | 762,308 |
| 0.6 | 721,267 |
| 0.7 | 675,011 |
| 0.8 | 657,049 |
| 0.9 | 254,229 |
| 1.0 | 94,951 |
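The ‘join’ behind a table like this can be sketched as follows. This is a Python illustration under stated assumptions, not the query used for the project: each pair of programmes is counted once per shared keyword, and the threshold is applied to both programmes' scores for that keyword. The post does not say whether pairs were counted in one direction or both, so absolute numbers from this sketch are not comparable with the table.

```python
from collections import defaultdict


def count_links(tags, min_relevance=0.0):
    """Count programme pairs that share a keyword.

    `tags` is an iterable of (programme_id, keyword, relevance) tuples.
    A pair is counted once per shared keyword, and only when both
    programmes score at least `min_relevance` for that keyword.
    """
    programmes_by_keyword = defaultdict(set)
    for programme_id, keyword, relevance in tags:
        if relevance >= min_relevance:
            programmes_by_keyword[keyword].add(programme_id)
    # n programmes sharing a keyword give n * (n - 1) / 2 unordered pairs.
    return sum(len(ids) * (len(ids) - 1) // 2 for ids in programmes_by_keyword.values())


# Made-up data: three programmes, two keywords.
tags = [
    ("P1", "Education", 0.9),
    ("P2", "Education", 0.85),
    ("P3", "Education", 0.2),
    ("P1", "Forensics", 0.6),
    ("P2", "Forensics", 0.4),
]
links_by_threshold = {t: count_links(tags, t) for t in (0.0, 0.5, 0.95)}
```

Because the count grows with the square of the number of programmes sharing a keyword, a single very common keyword can dominate the total.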

With no minimum threshold applied, there were enough links for every programme to be joined to every other programme at least twice on average (878 programmes give 770,006 ordered pairs, against 1,641,336 links). This is obviously far from ideal, as it suggests that every programme is similar to whichever programme a student is interested in. The high numbers can also be attributed to the fact that, at this point, I was still including multiple iterations of the same course, which will naturally have very similar (if not identical) descriptors, meaning they are very tightly linked together.

Breaking these figures down into a count of each keyword (for the links with a relevance of over 0.8, all 657,049 of them) shows that some keywords occur far too often to be even remotely useful for such an application. OpenCalais groups together ‘like’ keywords and returns its own ‘lead keyword’ for each grouping, which explains some of the keywords that are listed in the table below.

| Keyword | Count |
|---|---|
| Education | 561,001 |
| Labor | 331,224 |
| Technology_Internet | 30,625 |
| Entertainment_Culture | 8,649 |
| Health_Medical_Pharma | 7,569 |
| Business_Finance | 6,084 |
| Social Issues | 4,096 |
| Law_Crime | 2,401 |
| Environment | 1,681 |
| Hospitality_Recreation | 1,225 |
| Human_Interest | 196 |
| Lincolnshire History | 100 |
| War_Conflict | 64 |
| Disaster_Accident | 64 |
| Politics | 64 |
| Tourism | 36 |
| Acupuncture | 16 |
| Sport | 16 |
| Sports | 16 |
| Religion_Belief | 16 |
| Higher Education Academy | 4 |
| Criminology Benchmark | 4 |
| Internet Computing | 1 |
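A breakdown like the one above can be produced by tallying links per keyword. The sketch below is a Python illustration with made-up data; the `(programme_a, programme_b, keyword)` tuple layout is an assumption, not the project's actual schema.

```python
from collections import Counter


def links_per_keyword(links):
    """Count links per keyword; `links` holds (programme_a, programme_b, keyword) tuples."""
    return Counter(keyword for _a, _b, keyword in links)


# Made-up links for illustration only.
links = [
    ("P1", "P2", "Education"),
    ("P1", "P3", "Education"),
    ("P2", "P3", "Education"),
    ("P1", "P2", "Acupuncture"),
]
ranked = links_per_keyword(links).most_common()
# Share of all links explained by the single most common keyword.
top_share = ranked[0][1] / len(links)
```

A keyword that accounts for most of the links on its own says little about which programmes are actually alike, which is the problem with ‘Education’ here.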

Ignoring keywords such as ‘Education’ and grouping together programmes that are instances of the same course vastly reduces the number of links and results in networks such as the visualization below.

Links between programmes that share common keywords.

Even at this stage there are far too many links between programmes to be of any real use. Using the relationships shown above, suggestions would be made for ‘similar’ programmes that are nothing like each other and in completely different fields of study.

It was at this point that I changed the angle of attack, focusing on the latest validated programme for each course code. Working this way, OpenCalais identified 6,194 links between keywords and courses, of which there were 349 at that time. Creating links between similar courses again produced a very large number of results. Manually inspecting the keywords that were responsible for the majority of these results showed something interesting. OpenCalais identifies high-level topics from the bodies of text that it is passed. These topics can be quite wide-ranging and were responsible for the majority of links between courses, so I decided to ignore them! (One of the identified topics was ‘Education’, hence the ridiculous number of links.)
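Keeping only the latest validated programme for each course code could look something like the sketch below. It is a Python illustration, and the `course_code` and `validated_year` field names are hypothetical; the APMS data may identify the latest validated programme differently.

```python
def latest_programme_per_course(programmes):
    """Keep, for each course code, the programme with the most recent validation year."""
    latest = {}
    for programme in programmes:
        code = programme["course_code"]
        current = latest.get(code)
        if current is None or programme["validated_year"] > current["validated_year"]:
            latest[code] = programme
    return latest


# Made-up records: course C1 has two intakes, course C2 has one.
programmes = [
    {"id": "P1", "course_code": "C1", "validated_year": 2008},
    {"id": "P2", "course_code": "C1", "validated_year": 2010},
    {"id": "P3", "course_code": "C2", "validated_year": 2009},
]
latest = latest_programme_per_course(programmes)
```

With one programme per course, near-identical descriptions from different intakes of the same course no longer link to each other.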

When ignoring any keywords that were identified as ‘Topics’, the number of links between courses is dramatically reduced:

| Minimum Relevance Score | Links between Courses |
|---|---|
| 0.0 | 101,030 |
| 0.1 | 79,972 |
| 0.2 | 49,624 |
| 0.3 | 13,488 |
| 0.4 | 3,292 |
| 0.5 | 1,254 |
| 0.6 | 408 |
| 0.7 | 118 |
| 0.8 | 24 |
| 0.9 | 0 |
| 1.0 | 0 |

The resulting visualization is a *little* more useful than previous iterations, but there are still far too many links between courses.

Visualization of relationships between courses, when ignoring Topics.

Again checking the keywords responsible for the majority of the links, and their ‘types’, showed that geographical keyword types ranked fairly highly: city, country, province or state, facility, continent, region and holiday. Ignoring all of these keyword types reduced the number of resultant links between courses by around 40%.

Once again checking the keywords responsible for the links between courses for misleading, unnecessary or erroneous keyword types showed that 3 of the top 4 most popular keywords referred to departments within the university. Ignoring the departments (identified as Organizations by OpenCalais), names of degree titles (identified as ‘Companies’ ??!??! by OpenCalais) and individual People (some courses were being linked because the same lecturer was mentioned in their descriptions) reduces the number of links to something far more manageable. With no minimum relevancy applied, 44,090 links between courses were created; with a minimum relevancy of 0.5, the number drops to 102 (compared to 762,308 links between programmes at the start of the process).
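Taken together, the filtering steps amount to dropping whole keyword types before any links are built. The Python sketch below illustrates this; the type labels in `IGNORED_TYPES` are assumptions based on the names used in this post, and the exact strings returned by OpenCalais may differ.

```python
# Assumed type labels: topics, geographical types, then organisations,
# companies and people. Check them against real API output before use.
IGNORED_TYPES = frozenset({
    "Topic",
    "City", "Country", "ProvinceOrState", "Facility", "Continent", "Region", "Holiday",
    "Organization", "Company", "Person",
})


def useful_keywords(keywords, min_relevance=0.0, ignored_types=IGNORED_TYPES):
    """Drop keywords of ignored types or below the relevance threshold.

    `keywords` is an iterable of (name, type, relevance) tuples.
    """
    return [
        (name, kw_type, relevance)
        for name, kw_type, relevance in keywords
        if kw_type not in ignored_types and relevance >= min_relevance
    ]


# Made-up keywords for one course.
keywords = [
    ("Education", "Topic", 1.0),
    ("Lincoln", "City", 0.3),
    ("School of Computer Science", "Organization", 0.6),
    ("games development", "IndustryTerm", 0.7),
    ("web services", "IndustryTerm", 0.2),
]
kept = useful_keywords(keywords, min_relevance=0.4)
```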

The resultant visualization of this data (relevancy of >= 0.4) shows small clusters of similar courses, which is more realistic and useful.

Relationships between similar courses

While this visualization is not overly useful in its own right, the logic used to narrow the data down and produce this visualization can be (and will be) used in an application to identify courses based on keywords and then to highlight similar courses where they have keywords in common.
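One way such an application might rank suggestions is by the number of sufficiently relevant keywords two courses share. This is a Python sketch under that assumption, with made-up course names and scores; `similar_courses` is a hypothetical helper, not part of the planned application.

```python
def similar_courses(course_keywords, course_id, min_relevance=0.4):
    """Rank other courses by the number of keywords they share with `course_id`.

    `course_keywords` maps a course id to a {keyword: relevance} dict that has
    already had the unwanted keyword types removed.
    """
    def strong(cid):
        return {kw for kw, rel in course_keywords[cid].items() if rel >= min_relevance}

    wanted = strong(course_id)
    shared_by_course = {}
    for other in course_keywords:
        if other == course_id:
            continue
        shared = wanted & strong(other)
        if shared:
            shared_by_course[other] = sorted(shared)
    # Most shared keywords first; ties broken alphabetically by course id.
    return sorted(shared_by_course.items(), key=lambda item: (-len(item[1]), item[0]))


course_keywords = {
    "Forensic Science": {"crime scene": 0.7, "laboratory analysis": 0.6, "chemistry": 0.5},
    "Criminology": {"crime scene": 0.5, "criminal justice": 0.8},
    "Chemistry": {"laboratory analysis": 0.7, "chemistry": 0.9},
    "Fine Art": {"studio practice": 0.8, "chemistry": 0.1},
}
suggestions = similar_courses(course_keywords, "Forensic Science")
```

Returning the shared keywords alongside each suggestion makes it possible to show a student why two courses were considered similar.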

Future work may also include using keywords at a module level to highlight courses that are similar to one another. As modules are far more targeted in their subject matter and therefore more specialist, when compared to the overarching degree course, even more useful results may be generated.

One thought on “In Search of Similar Courses”

  1. Jamie

    I meant to reply to your excellent blog post here when I first read it a while back. However, re-reading it as part of reviewing your draft JISC report, I thought I’d take this opportunity.

    I wonder if you might have a look also at the AX-S demonstrator that APS and InGenius Solutions produced as a ‘best of breed’ subject search mechanism? It has a widget for sticking on a website, and it might be a useful addition to your search mechanisms.

    Useful links:
    • Our explanation of the AX-S Demonstrator (in a blog post) is here:
    • The full blog on its development is here:
    • The AX-S Demonstrator itself is here:
    • And the open source code for the widget is here:

