Originally posted to the Apache Development mailing list
Hello! I apologize if this has been discussed in this fashion many times, but I have attempted to read around and wasn’t able to directly find any indication that it has been. Please flame me offlist for my naïveté.
THE MIRRORING PROBLEM
As a website’s popularity grows, it becomes increasingly desirable to have “mirrors” of the website located in various places, in order to spread the processing and bandwidth expense of serving a page across many servers and to reduce the path length traversed by a packet going from server to client. The Apache Group itself uses mirrors, as do the Qmail and Postfix projects, the Linux Kernel site, and innumerable other popular websites.
There are several ways to inform a client as to the availability of a file on alternate servers:
- Click On It Yourself
This approach, the one used by most Open Source project pages, involves a clickable list of mirrors being presented in the HTML body; it is assumed that a “kind” user will find a mirror instead of downloading from the main site. Some sites, like http://qmail.org/, somewhat enforce this usage pattern by prompting for a location before a user can engage the site. Some, like Apache, use a dynamic list of mirrors to reduce the probability that some poor single mirror that was listed first will get all the traffic.
This approach is nicely centralized and is easy to administer, but is a pain for the user. Cookies to remember a user’s preferred location might be useful in helping make localization a one-time effort and not a continuous one. This is also not a standards-based approach: every website must go it alone. Thankfully, this is not hard.
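The dynamic list and the cookie idea can be combined in a few lines. The following Python is only a sketch under stated assumptions: the function name order_mirrors and the example hostnames are hypothetical, and the preferred mirror is assumed to have been read from a cookie by whatever generates the page.

```python
import random


def order_mirrors(mirrors, preferred=None, rng=None):
    """Return the mirror list in display order.

    The visitor's remembered mirror (if it is still in the list) goes
    first; everything else is shuffled so that no single mirror is
    always listed at the top.
    """
    rng = rng or random
    rest = [m for m in mirrors if m != preferred]
    rng.shuffle(rest)
    if preferred in mirrors:
        return [preferred] + rest
    return rest


if __name__ == "__main__":
    # Hypothetical mirror hostnames, for illustration only.
    mirrors = [
        "http://mirror-a.example/",
        "http://mirror-b.example/",
        "http://mirror-c.example/",
    ]
    print(order_mirrors(mirrors, preferred="http://mirror-b.example/"))
```

A preferred mirror that has since been dropped from the list is simply ignored, so a stale cookie does no harm.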
- Use Clever DNS Servers
This is somewhat the IRC-server approach, and more so the approach that Akamai adopted. Most large-scale commercial websites use “clever DNS” servers that can make a reasonable guess as to which webservers are likely to be closest to you and return their IP addresses. This requires no client-side intelligence or user interaction. The seamless, scalable, and elegant nature of this approach has made it strongly compelling for the commercial web. I don’t know what Open Source DNS software is capable of location-based IP issuance; I would love to hear of any.
This approach is equally centralized but requires control over the DNS server, something that many small to midsized websites don’t have. Getting a “smarter DNS” into ISPs that did proximity-based IP address returns wouldn’t require even modifying MX records, and could be a real coup. But this approach also requires mirroring the site in its entirety.
- Use HTTP Redirects
This approach is not used nearly as often as the first two. A script could be written to redirect a web browser wanting to download a given file to a specific mirror where the file resides. This has the advantage of not requiring all files to be on all mirrors, or even the same set of files on all mirrors. This does require writing some (simple) new software to manage the connection redistribution; this could be an Apache module. One of its actions could be to simply let the request be served by the local host until some bandwidth/CPU/memory threshold was crossed, at which point it could begin dishing out redirects to mirrors likely to be near the requestor.
This approach is more powerful than the above two (it’s seamless, but doesn’t require mirroring the whole site). It would work best as an Apache module, which would require control over the web server being used to service requests, but a user could theoretically change their entire site to be served by a CGI that could perform that same function. This would probably require changing the site’s layout and would involve a great deal of work on their part.
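The conditional-redirect logic can be sketched in a few lines of Python. This is an illustration only, not an Apache module: MIRRORS, LOAD_THRESHOLD, and choose_response are hypothetical names, the load figure is assumed to be measured elsewhere and passed in, and the use of status 302 is an assumption of this sketch. Choosing a mirror near the requestor is left to the caller via the pick index.

```python
# Hypothetical table: request path -> mirrors known to hold that file.
# Mirrors need not carry the same set of files.
MIRRORS = {
    "/very/big.movie": [
        "http://mirror.in.co.uk/movserv/the.movie",
        "http://downunder.com.au/mirrors/ms/funny.mov",
    ],
}

# Assumed cutoff, as a fraction of local capacity (bandwidth, CPU, ...).
LOAD_THRESHOLD = 0.8


def choose_response(path, current_load, pick=0):
    """Decide whether to serve locally or redirect to a mirror.

    Returns (status, location). location is None when the local host
    should serve the file itself.
    """
    mirrors = MIRRORS.get(path, [])
    if current_load < LOAD_THRESHOLD or not mirrors:
        # Under the threshold, or nobody else has the file: serve it here.
        return 200, None
    # Over the threshold: hand the client to one of the mirrors.
    return 302, mirrors[pick % len(mirrors)]


if __name__ == "__main__":
    print(choose_response("/very/big.movie", current_load=0.5))
    print(choose_response("/very/big.movie", current_load=0.9))
```

Files with no known mirror are always served locally, whatever the load, so an overloaded host degrades rather than refusing requests.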
- Use HTTP Headers
The next approach is to use two new fields in the HTTP response to a HEAD request: “X-Mirrored-By” and “X-MD5”. A sample HTTP request/response:
    [client] HEAD /very/big.movie HTTP/1.1
    [client] Host: MovieServer.com
    [client]
    [server] HTTP/1.1 200 OK
    [server] Content-Length: 205392839
    [server] Content-Type: movie/quicktime
    [server] X-Mirrored-By: http://mirror.in.co.uk/movserv/the.movie
    [server] X-Mirrored-By: http://downunder.com.au/mirrors/ms/funny.mov
    [server] X-Mirrored-By: http://friend.in.co.tw/movies/big.movie
    [server] X-MD5: 5FD298A9782394C2
This would enable the client to find the mirror closest to it and possibly even download the file simultaneously from multiple locations. The MD5 checksum and content length would ensure that the end result was correct, something that the other methods above don’t provide.
This approach has not yet been implemented; I would like to bring it up for discussion with you, the Apache developers. It could be used today with setups that allow websites to control their own headers.
I’ve reviewed the HTTP 305 (Use Proxy) status code, which seemed like it might be a good fit for this sort of thing, but I then discovered that it is meant only for pointing a client at a proxy, not at a mirror.
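To make the X-Mirrored-By / X-MD5 proposal a little more concrete, here is a minimal Python sketch of what a client might do with such a response. Nothing here is an existing implementation: parse_mirror_headers and verify_download are hypothetical names, the headers are assumed to arrive as a list of text lines, and the checksum is assumed to be a full 32-hex-digit MD5 digest as produced by hashlib.

```python
import hashlib


def parse_mirror_headers(header_lines):
    """Collect mirror URLs, the MD5 value, and the length from a HEAD response.

    Returns (mirrors, md5_hex, length); md5_hex and length are None
    when the corresponding header is absent.
    """
    mirrors, md5_hex, length = [], None, None
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # status line or blank line, not a header
        name, value = name.strip().lower(), value.strip()
        if name == "x-mirrored-by":
            mirrors.append(value)  # the header may repeat, one URL each
        elif name == "x-md5":
            md5_hex = value.lower()
        elif name == "content-length":
            length = int(value)
    return mirrors, md5_hex, length


def verify_download(data, md5_hex, length):
    """Check downloaded bytes against the advertised length and MD5."""
    if length is not None and len(data) != length:
        return False
    if md5_hex is not None and hashlib.md5(data).hexdigest() != md5_hex:
        return False
    return True


if __name__ == "__main__":
    body = b"pretend this is a very big movie"
    response = [
        "HTTP/1.1 200 OK",
        "Content-Length: %d" % len(body),
        "X-Mirrored-By: http://mirror.in.co.uk/movserv/the.movie",
        "X-MD5: " + hashlib.md5(body).hexdigest(),
    ]
    mirrors, md5_hex, length = parse_mirror_headers(response)
    print(mirrors, verify_download(body, md5_hex, length))
```

Because the check runs on the assembled bytes, it would apply equally whether the file came from one mirror or was pieced together from several.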
- Use an Orthogonal Peer-To-Peer System
Finally, some recent companies, such as RedSwoosh, have begun rolling out technologies to intercept HTTP requests and attempt to service them on their own network, using the URL as a content key instead of a destination. These new-style networks have the advantage of not having to conform to existing client-server expectations in the HTTP world and can easily benefit from increased security, multipoint downloads, and so forth, often without requiring any changes at all to be made in the webserver.
The downside is loss of definitive control over the locations from which a file is being distributed and the dependence upon systems that may not be either open or standards based and may only run on certain platforms.
- Use a Generic Index Into Orthogonal Systems
Bitzi, as an example, provides for XML tags that can specify various properties about a file. An intelligent client could do an HTTP HEAD on the web server, grab the MD5 or Tiger-Tree hash of the file to be downloaded, grab the Bitzi tag based on the hash, and query various P2P networks (Gnutella, FastTrack/Morpheus/Kazaa, AudioGalaxy, etc.) for the file as reported by users of Bitzi. This is a much more ad-hoc situation and perhaps better suited for users producing or mirroring informal rich media files. The server-side implementation would only require sending back an MD5 hash of the file, however.
Thoughts? This certainly does cut out a good deal of work for the Open Source community. It’s quite likely that there already exists software to do most of what I’ve discussed here, but that I’m simply unaware of it. The Apache module to do conditional redirection is the one that I’m currently most excited about.
Please upbraid me now.
David E. Weekly