Thursday, June 15, 2006

The architecture of a Meta Directory system

One of the major projects at Karolinska Institutet over the last three years is called KIMKAT. It has gotten the moniker of a meta directory system, but in many ways that's not entirely correct. Since I've been one of the lead architects and developers on this project, I wanted to write about the technical architecture of the system, which choices panned out well, and what I would change if I could.

The problem
Karolinska Institutet is a fair-sized university (quite big for Sweden). We have about 20,000 students and somewhere between 5,000 and 7,000 employees. Universities have a tendency to become decentralized, with local solutions for every problem, and KI is no exception. Information about people at KI exists in more than ten disparate systems in the central administration alone. This information costs a lot of money to keep fresh, and maintaining it is very inefficient. The vision for KIMKAT is to have one central source for all data about persons, organizations and resources, where external systems can find current, authoritative data. It should also be possible for these systems to contribute domain-specific data into KIMKAT. (For example, our phone directory system should probably be the source for phone numbers.)

Different parts of the solution
We have worked on this problem in a project that has oscillated between 10 and 20 members over 2½ years. Of course, there are many ways to solve this kind of problem, and the main differentiation between them is how much of a hack you are willing to accept. KI specifically wanted to avoid yet another hack solution (other universities in Sweden have gone down this route, and it seems both costly and becomes unmaintainable in only a year or two), so the focus of the project group was to find and implement a solution that would hold up for many years to come.

The first problem was that we had no definitive source for organization information, nor anything for our affiliates. All that information was stored ad hoc in the mail system, and on paper. So we instituted a regime where each datum has exactly one primary source responsible for maintaining and updating it. For employees this is our HR system (called Primula). For students, it's our student database, LADOK. For organizations we created a new primary source called KOrg, and for affiliates, KAff. We also created a database for all KIMKAT information that we couldn't write back to the other primary sources; this effectively became its own primary source.
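Just to illustrate the principle (the attribute names below are invented for this post, not the real KIMKAT data model), the ownership rule boils down to a routing table from attribute to primary source:

    import java.util.HashMap;
    import java.util.Map;

    public class PrimarySourceRegistry {
        public enum Source { PRIMULA, LADOK, KORG, KAFF, KIMKAT_LOCAL }

        // Every attribute is owned by exactly one primary source.
        private static final Map<String, Source> OWNER = new HashMap<String, Source>();
        static {
            OWNER.put("employment", Source.PRIMULA);      // HR system owns employment data
            OWNER.put("studyRecord", Source.LADOK);       // student database owns study records
            OWNER.put("orgUnit", Source.KORG);            // the organization tree
            OWNER.put("affiliation", Source.KAFF);        // affiliates
            OWNER.put("displayName", Source.KIMKAT_LOCAL); // data with no external owner
        }

        /** Returns the single source allowed to update a given attribute. */
        public static Source ownerOf(String attribute) {
            Source s = OWNER.get(attribute);
            if (s == null) throw new IllegalArgumentException("No primary source for " + attribute);
            return s;
        }
    }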

Our meta directory solution is built around a meta engine: a scripted system that reads and collects data from our primary sources and writes it in a unified data structure to the central KIMKAT database, OmniaS. For this task we evaluated several different products, and also considered writing our own, but in the end we chose IBM Tivoli Directory Integrator (ITDI), which is Java-based and very easy to use for simple situations. It uses BSF to allow scripting, which is almost always needed for more intricate solutions. Suffice it to say, our final system contains lots of ITDI scripting.
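To give a flavor of what one synchronization pass does, here is a rough sketch in plain Java/JDBC rather than ITDI script. The table and column names are made up for illustration:

    import java.sql.*;

    public class SyncPass {
        // One pass: read every employee from the HR source and upsert
        // the corresponding row in the unified OmniaS structure.
        public static void run(Connection primula, Connection omnias) throws SQLException {
            PreparedStatement read = primula.prepareStatement(
                "SELECT person_id, given_name, surname FROM employees");
            PreparedStatement update = omnias.prepareStatement(
                "UPDATE ok_person SET given_name = ?, surname = ? WHERE person_id = ?");
            PreparedStatement insert = omnias.prepareStatement(
                "INSERT INTO ok_person (given_name, surname, person_id) VALUES (?, ?, ?)");
            ResultSet rs = read.executeQuery();
            while (rs.next()) {
                String id = rs.getString("person_id");
                update.setString(1, rs.getString("given_name"));
                update.setString(2, rs.getString("surname"));
                update.setString(3, id);
                if (update.executeUpdate() == 0) { // no row yet, so insert instead
                    insert.setString(1, rs.getString("given_name"));
                    insert.setString(2, rs.getString("surname"));
                    insert.setString(3, id);
                    insert.executeUpdate();
                }
            }
            rs.close();
        }
    }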

The OmniaS database is surrounded by an EJB tier with Hibernate and accessors for reading the data through a well-defined ontology implemented with JavaBeans. There are no Entity EJBs in KIMKAT. (Nor anywhere else at KI, as far as I know - and hope.)
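For a feel of what that tier looks like, here is a hypothetical ontology bean and the kind of stateless session interface that fronts it. These are illustrative names, not the actual KIMKAT classes:

    import java.io.Serializable;

    // A plain JavaBean from the ontology; Hibernate maps it to OmniaS via XML mappings.
    public class Person implements Serializable {
        private String personId;
        private String givenName;
        private String surname;

        public String getPersonId() { return personId; }
        public void setPersonId(String personId) { this.personId = personId; }
        public String getGivenName() { return givenName; }
        public void setGivenName(String givenName) { this.givenName = givenName; }
        public String getSurname() { return surname; }
        public void setSurname(String surname) { this.surname = surname; }
    }

    // Read access goes through stateless session EJBs, not entity beans.
    public interface PersonReader extends javax.ejb.EJBObject {
        Person findById(String personId) throws java.rmi.RemoteException;
    }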

Right now the primary user of the OmniaS information is our web interface, KKAWeb, which is used to update information in our primary sources. It's also used to establish new affiliations and organizations. There was some information that we deemed necessary for KKAWeb and other applications that would eventually display KIMKAT data in some way, but that wasn't really business data. Because of this we created an external data source called KDis for it. KKAWeb and another application called KIKAT use this data extensively, but mostly for things like I18N and sorting.

Since our primary sources are of a very diverse nature, KKAWeb doesn't write to them directly. Instead, the updated value objects are sent to an update topic with JMS, and every primary source has one or more message-driven beans listening for just those messages that pertain to it. Each bean then updates its database.
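In sketch form (the JNDI names and class names are invented for this post), the publishing side looks roughly like this:

    import java.io.Serializable;
    import javax.jms.*;
    import javax.naming.InitialContext;

    public class UpdatePublisher {
        public void publish(Serializable valueObject, String source) throws Exception {
            InitialContext ctx = new InitialContext();
            TopicConnectionFactory tcf =
                (TopicConnectionFactory) ctx.lookup("jms/TopicConnectionFactory");
            Topic updates = (Topic) ctx.lookup("jms/kimkat/updates");
            TopicConnection conn = tcf.createTopicConnection();
            try {
                TopicSession session = conn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
                ObjectMessage msg = session.createObjectMessage(valueObject);
                // Lets each primary source's bean filter with a message selector.
                msg.setStringProperty("primarySource", source);
                session.createPublisher(updates).publish(msg);
            } finally {
                conn.close();
            }
        }
    }

And on the receiving end, something like this message-driven bean, with the selector primarySource = 'PRIMULA' configured in ejb-jar.xml:

    public class PrimulaUpdateBean
            implements javax.ejb.MessageDrivenBean, javax.jms.MessageListener {
        private javax.ejb.MessageDrivenContext ctx;

        public void setMessageDrivenContext(javax.ejb.MessageDrivenContext ctx) { this.ctx = ctx; }
        public void ejbCreate() {}
        public void ejbRemove() {}

        public void onMessage(javax.jms.Message message) {
            try {
                Object valueObject = ((javax.jms.ObjectMessage) message).getObject();
                // ... write the updated value object to the Primula database ...
            } catch (javax.jms.JMSException e) {
                ctx.setRollbackOnly(); // let the container redeliver on failure
            }
        }
    }

The nice thing about the selector is that each primary source only ever sees its own updates, without any routing logic in KKAWeb itself.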

The final piece of the puzzle is another topic to which ITDI sends all updates as soon as something changes in one of the primary sources. This allows external systems to get hold of change events in whatever way they need. The change information is sent with JMS using S-expressions to represent the data, since not all consumers will be Java-based.
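I won't go into the exact S-expression grammar here, but as a rough sketch (the event format below is invented for illustration), publishing a change event looks something like:

    import javax.jms.*;

    public class ChangeEventPublisher {
        /** Renders a change event as an S-expression; the grammar here is illustrative. */
        static String toSexp(String source, String operation, String id) {
            return "(change (source \"" + source + "\")" +
                   " (op " + operation + ")" +
                   " (id \"" + id + "\"))";
        }

        public void publish(TopicSession session, Topic changes,
                            String source, String op, String id) throws JMSException {
            // Plain text, so non-Java consumers only need an S-expression reader.
            TextMessage msg = session.createTextMessage(toSexp(source, op, id));
            session.createPublisher(changes).publish(msg);
        }
    }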

Lessons learned
The KIMKAT project is late, and the deadline has been pushed back several times. There are several reasons for this, but the main one is probably our inability to give good time estimates. There are also a few architectural and design decisions that I would have made differently if we were starting over. The biggest one is the data source called KDis. That was a big mistake for several reasons. First of all, having joins between different databases is a pain, and having constraints between databases is also really cumbersome.

Second, if we did this again, I would probably drop one of the tiers between the session EJBs and the OmniaS database. Right now, a lot of time goes to serializing and deserializing between different kinds of value objects.

And third, our updating service is actually a really good idea, but I think the implementation could have been done in a better way. There is something nagging me about it, but I can't put my finger on it right now.

But all in all, our architecture has been really successful, and it feels like something that will stand the test of time. Now we just have to integrate all the other systems with it, so everyone gets the real benefits of this solution.
