Saturday, January 21, 2006

I thought I might blog a little bit about my archive project.

Some background. The archive is a web site generated automatically from the family's pictures. I think most geeks have built the same thing. I build and maintain it to work on my Java/Servlet/Tomcat skills and because it's fun. The unique points about mine are:
  • Reads the photo information and uses the photo date to map to a calendar. (This has now been picked up by other projects, but I was doing this a couple years ago.) You can see all photos taken on a particular date.
  • Each family member maintains their section on their own computer. They run a sync operation to upload (or remove) any changes and bring the archive up to date. They are in control of their section. This sync operation also acts as a backup for the archive. We even use it within the same house. The sync operation also sends a follow-up email with links to the new photos.
  • Has useful functionality like Blog This, which sets up a Blogger post with a link and a thumbnail of the image.
  • Maintains a recent changes section to see newly added elements.
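
To make the calendar idea concrete, here is a minimal sketch of a date index. It is illustrative only and not the archive's actual code: the class name CalendarIndex, its methods, and the photo IDs are all made up, and it assumes the photo date has already been read from the photo information.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: groups photo IDs by the date each photo was taken.
class CalendarIndex {
    // TreeMap keeps the dates sorted, which is handy for rendering a calendar.
    private final Map<LocalDate, List<String>> byDate = new TreeMap<>();

    // Record that the given photo was taken on the given date.
    void add(LocalDate taken, String photoId) {
        byDate.computeIfAbsent(taken, d -> new ArrayList<>()).add(photoId);
    }

    // All photos taken on a particular date, in the order they were added.
    List<String> photosOn(LocalDate day) {
        List<String> ids = byDate.get(day);
        return ids == null ? Collections.<String>emptyList()
                           : Collections.unmodifiableList(ids);
    }
}
```

A calendar page for a given day would then just ask the index for that day's photos.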
In version 6 and before, the archive was made by running a 20-minute process on the photo collection. The process generated a 998 MB static site for the 30,000+ elements, consisting of about 100,000 HTML pages. The page count was that high because there were index pages for every folder, date, and recent-changes set, and at least 3 versions of every page, since pages had previous / next / up links.

When new photos were added, a process scanned the archive folder structure and generated a static web site to match. This was fine at the beginning, and I made some major performance gains in the generation, but with 30,000 photos we were still looking at about a 20-minute publish time at best.

Each element has an associated XML file with additional attributes: attributes set from reading the photo information, added by the process, or modified by user edits. I like this design better than a database because the storage is all file based and the organization is all file based. Everything matches what is synced. All XML processing is done using Castor.

Tod came along and suggested we add tagging support. I'm still working on that and have some new ideas I haven't seen in other projects. Regardless, I realized that if we were going to add tags, I could not keep generating the site statically. There would be a lot more pages to generate to index and display the elements under every tag. It would also be nearly impossible to regenerate just the right pages when a user added a tag.

So I decided to switch to a totally dynamic generation design.

Around version 5, I added the concept of an ID to the elements. I did this so links from other blogs or sites would keep working, even if a photo was moved. Any photo could be accessed by its unique ID. Whenever the system published the static site, it published an additional master list of IDs and locations in an XML file.

This meant that when the server started up, it read the ID map and was able to resolve requests for a photo by ID by remapping to the location of the element.
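
A minimal sketch of that lookup, assuming the master list has already been parsed into memory. The class name IdMap, the method names, and the example IDs and paths are hypothetical, not the archive's real code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical ID map: element ID -> current location inside the archive.
class IdMap {
    private final Map<String, String> locations = new HashMap<>();

    // Called once per entry while loading the master list at startup.
    // Registering an ID again simply overwrites the old location.
    void register(String id, String location) {
        locations.put(id, location);
    }

    // Resolve a request by ID; empty if the ID is unknown.
    Optional<String> resolve(String id) {
        return Optional.ofNullable(locations.get(id));
    }
}
```

The point of the design is that a link only ever carries the ID. If a photo moves, the next published list carries its new location and old links still resolve.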

In version 7, two major changes were made. All links were replaced by links based on IDs. In addition, all pages are now generated dynamically based on their ID. The ID mapping now contains some additional information about each element, including its children (for folders in the file system hierarchy view), its date (for the calendar), and its title, among other things. The primary source of the mapping information is still the individual XML files associated with each element. There's a 1-2 minute process that can rescan the file system and rebuild the cache.

With the cache in hand, the servlet can dynamically render all pages in the archive without having to read in an element's XML attributes, except when the element is viewed on its own page (in any section). All of the related previous, next, and up characteristics can be determined from the cache. The entire calendar map is pulled from the cache.
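
Here is a rough sketch of how previous / next / up could be worked out from cached entries alone. It covers only the folder view, and everything in it (ArchiveCache, Entry, the field names) is a made-up illustration rather than the real implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory cache of element metadata, keyed by element ID.
class ArchiveCache {
    // One cached element: its ID, its parent folder's ID (null for the
    // root), its title, and the ordered IDs of its children.
    static final class Entry {
        final String id;
        final String parentId;
        final String title;
        final List<String> childIds;

        Entry(String id, String parentId, String title, List<String> childIds) {
            this.id = id;
            this.parentId = parentId;
            this.title = title;
            this.childIds = childIds;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    void put(Entry entry) {
        entries.put(entry.id, entry);
    }

    // "Up" is simply the parent folder; null for the root or an unknown ID.
    String up(String id) {
        Entry e = entries.get(id);
        return e == null ? null : e.parentId;
    }

    String previous(String id) {
        return sibling(id, -1);
    }

    String next(String id) {
        return sibling(id, 1);
    }

    // Finds the neighbour at the given offset in the parent's child list.
    // Returns null at either end of the list or when the lookup fails.
    private String sibling(String id, int offset) {
        Entry e = entries.get(id);
        if (e == null || e.parentId == null) {
            return null;
        }
        Entry parent = entries.get(e.parentId);
        if (parent == null) {
            return null;
        }
        int index = parent.childIds.indexOf(id);
        if (index < 0) {
            return null;
        }
        int target = index + offset;
        return (target >= 0 && target < parent.childIds.size())
                ? parent.childIds.get(target)
                : null;
    }
}
```

Nothing here touches the per-element XML files, which is the whole reason the cache carries children and titles.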

At this point, I'm ready to work on the next phase, which is being able to tell the cache that either a new element has been added or an existing element has been modified. This will support the ability to add a tag and see the change immediately. It will also have a nice side benefit: while a sync is happening, each element can be added to the cache as soon as it is uploaded and be immediately visible on the site (pending thumbnail generation).
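
Since this phase isn't built yet, the following is only a sketch of one way an in-place update could look for tags. The TagIndex class and its methods are hypothetical, and it assumes the cache lives inside the servlet and may be touched by several request threads at once.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical live tag index: tag -> IDs of the elements carrying it.
class TagIndex {
    // Concurrent collections so uploads and page views can overlap safely.
    private final Map<String, Set<String>> idsByTag = new ConcurrentHashMap<>();

    // Add a tag to an element; the change is visible to the next lookup
    // without rescanning the file system.
    void addTag(String elementId, String tag) {
        idsByTag.computeIfAbsent(tag, t -> ConcurrentHashMap.<String>newKeySet())
                .add(elementId);
    }

    // Read-only view of the elements under a tag; empty if the tag is unused.
    Set<String> elementsWith(String tag) {
        Set<String> ids = idsByTag.get(tag);
        return ids == null ? Collections.<String>emptySet()
                           : Collections.unmodifiableSet(ids);
    }
}
```

The per-element XML file would still need to be written so the tag survives a cache rebuild; the sketch leaves that part out.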

Finally, in version 7.1, I have moved the syncing process, which used to be an independent application, into the website. The entire archive runs as a single WAR file under Tomcat. This avoids having to do any inter-process communication when updating the cache.

I'll chat more as things progress. I think the Archive system is very interesting. There are dozens of ways to do a lot of these things, but it's fun to play around with the design and still be pragmatic about how things are done.

An additional note: now that the application is set up in Tomcat as a single WAR, I have an Ant task in Eclipse that deploys new versions with a single click. Nothing like making a code change and putting it in production with a single click! Luckily, with the more recent changes, a deployment can be made and the web site can be back up in about 20 seconds.
