Problems with Zap/Zebra and Merged Results

In a Zap/Zebra configuration with multiple back-end Zebra databases, Zap can be used to query them all and return collated results in a single list. For those familiar with Zap, this functionality is enabled by using "servertotal=s" (either in a .zap script or in the query string of a request).

There is an apparent (and unsurprising!) problem with this. Merged results are a highly problematic, and most likely inefficient, feature to implement. Indeed, Zap 1.4.5 falls over in some circumstances when performing merged results searches.

For more details about the general problems of merged results searching, see below. For details of Zap, read on...

Zap Issues with Merged Results Searching

This problem becomes evident when searching beyond 'the end of hits' from one database, but well inside the hit range of another. This is best illustrated by example:

Imagine having two Zebra databases available. For a given search term, one database returns 42 hits, the other 62. This should (in theory) provide a total of 104 hits to view. In my experimentation, "number" (the page size) was always 10, although you can get the same problems with smaller values too.

In my situation (and all databases will vary in this regard), Zap has problems after hit 44. This seems an unusual number because the first few pages have a reasonably even distribution of results from each database. Anyway, after hit 44, Zap continues to operate (from a browser perspective) but SEGVs (and thus kills Apache children). After hit 52, Zap attempts to allocate negative amounts of memory, which causes grumbles from Apache. These are (I think, from having looked at the source) not handled correctly, and so Zap fails to respond to the browser. Apache resolves the situation eventually, but clearly, Zap has a problem.

Apache reports the following in the error log file:

    [Tue Feb 24 02:34:04 2004] [notice] child pid 4043 exit signal Segmentation fault (11)
    [Tue Feb 24 02:34:52 2004] [notice] child pid 4046 exit signal Segmentation fault (11)

    02:33:18-24/02: [fatal] Out of memory - malloc (-112 bytes)
    02:33:22-24/02: [fatal] Out of memory - malloc (-96 bytes)

Delving into the mod_zap source code (a dangerous thing to do - it's got almost no comments) shows that it may well be possible for negative malloc()s to take place. I hacked in a few changes to try to work around this (by making a few int variables unsigned) but was unsuccessful (besides, there are dozens of malloc()s of various sorts, and I didn't mess with all of them).

There now follows some complete and utter conjecture...

I think Zap performs merged results searching in a simplistic way. That is, if you want the first page of 10 results, then Zap searches all databases, asking each for 10 results. It then sorts them (in whatever manner you prescribe) and drops anything below the 10th before sending them back as HTML.

That's fine, except all kinds of problems occur when you get to the results after that. Anything beyond result 10 is unreliable, because you may not have two databases with evenly distributed results. I suspect Zap doesn't cater for this, and so has problems with merged results beyond the first page. Those problems aren't easily apparent until you get to much later responses, where code issues step in to make things nice and clear ;-)
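To make this conjecture concrete, here's a sketch in Python of what I think Zap does. The data and function names are mine (not mod_zap's actual code), echoing the 42/62-hit example above:

```python
# Two simulated rank-sorted databases. Every hit in db_a outranks every
# hit in db_b, to make the failure obvious.
db_a = [(1000 - 5 * i, f"A{i}") for i in range(42)]   # ranks 1000 down to 795
db_b = [(700 - 10 * i, f"B{i}") for i in range(62)]   # ranks 700 down to 90

def naive_page(databases, page, size=10):
    """The conjectured strategy: ask each database for only `size` hits,
    sort that small merged pool, then slice out the requested page."""
    pool = []
    for db in databases:
        pool.extend(db[:size])        # never fetches beyond hit `size`!
    pool.sort(reverse=True)           # highest rank first
    start = (page - 1) * size
    return pool[start:start + size]

# Page 1 is correct: the true top 10 must sit within some database's top 10.
# Page 2 returns B0-B9, but the true hits 11-20 are A10-A19, which were
# never fetched. Every later page is similarly wrong.
```

If that's roughly what Zap does, it explains why the first page always looks fine and everything after it can't be trusted.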

How to do Merged Results Searching (Properly)

This is a scheme I've devised for doing Merged Results Searching, properly. I have no idea what software implements this, or even if it's the best way to do it. Caveat reader.

For the purposes of this discussion, I'm going to assume you want rank-sorted results. I'm also going to assume just two databases (to keep things simple). The recipe still works with non-rank sorting and more databases, it's just more complex to explain(!). Here's the recipe:

Connect to all databases, perform the search (so all DBs are ready to send you results) and collect one result from each DB.

Now you have to make some choices. Let's say the first hit from database A has a rank of 1000, whilst the first hit from database B has a rank of 700. You have to keep getting results from database A until you get one that has a rank of less than 700. You can put those in a list if you like, because they're in the correct order for sending out (apart from maybe the last one - you might need to shuffle that down a bit).

Once you reach a rank of less than 700, you have to switch over, now collecting responses from database B until you get below the rank of the last hit you got from database A. And so it goes on. Keep going until you have 10 hits to return (or however many you want). Clearly, you have to insert these responses in the right places in your list.
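The walk above can be sketched in Python. Comparing the two "current" hits one at a time is equivalent to collecting runs until the rank crosses over; the data model (lists of (rank, id) tuples, sorted by descending rank) is my own, purely for illustration:

```python
def merge_two(db_a, db_b, want):
    """Two-database version of the recipe: always take the next hit from
    whichever database currently offers the higher rank. This is the
    same as collecting a run from A until its rank drops below B's
    current hit, then switching over, and so on."""
    ia = ib = 0
    out = []
    while len(out) < want and (ia < len(db_a) or ib < len(db_b)):
        # An exhausted database always loses the comparison.
        rank_a = db_a[ia][0] if ia < len(db_a) else float("-inf")
        rank_b = db_b[ib][0] if ib < len(db_b) else float("-inf")
        if rank_a >= rank_b:
            out.append(db_a[ia])
            ia += 1
        else:
            out.append(db_b[ib])
            ib += 1
    return out
```

In a real implementation, ia and ib would be cursors into live result sets, fetching one hit at a time from each back end rather than indexing into in-memory lists.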

Quite easy so far, although probably slower than any other kind of search, because you can't batch transfer responses from the databases.

When returning responses other than the first (ie. subsequent pages), you either need knowledge of the previous search (I'm guessing this is unlikely in most scenarios) or otherwise you have to do the whole thing again, but keep going until you get to the end of the required responses for the page you want to return. Then, throw away all the responses before the page and just return what's left (so, if you want page 2, ie. responses 11-20, then keep going until you have 20, then throw away 1-10).
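Stateless paging, done that way, looks like this sketch (again with my illustrative data model; Python's heapq.merge performs the rank-ordered walk for any number of databases):

```python
import heapq
import itertools

def merged_page(databases, page, size=10):
    """Redo the whole merge up to the end of the requested page, then
    throw away everything before the page. `databases` are lists of
    (rank, id) tuples, each sorted by descending rank (an illustrative
    data model, not a real Zebra API)."""
    end = page * size                               # page 2 -> hits 1..20
    merged = heapq.merge(*databases, reverse=True)  # rank-ordered walk
    hits = list(itertools.islice(merged, end))      # fetch at most `end` hits
    return hits[(page - 1) * size:]                 # discard the earlier pages
```

Note that nothing before `end` can be skipped: the merge has to walk through hits 1 to 10 again even though only 11 to 20 are wanted.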

(Obvious!) Optimisation

Clearly, the above method may be really slow because you'll be asking your databases for one response at a time. Most databases are better at turning out bundles of responses.

There's no reason why you can't request (say) five responses at a time in the above arrangement. You just have to inspect each one in the list to see when you've reached the required rank. This increases the complexity of your code, and also increases the likelihood of fetching more responses than you actually need. However, all said and done, it's probably still faster than doing repeated single fetches. There'll be a trade-off though, so some experimentation with any implementation will be required.
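Here's one way the batching might look: a small buffer per database that refills a few hits at a time. The `fetch(offset, count)` callable is a hypothetical stand-in for the real database call, and the data model is the same illustrative (rank, id) one as before:

```python
class BatchedSource:
    """Wraps one database, fetching `batch` hits per round trip instead
    of one. `fetch(offset, count)` returns a list of (rank, id) tuples
    in descending rank order (hypothetical interface)."""
    def __init__(self, fetch, batch=5):
        self.fetch, self.batch = fetch, batch
        self.buf, self.offset, self.done = [], 0, False

    def peek(self):
        """Return the next hit without consuming it, refilling the
        buffer in batches; None once the database is exhausted."""
        if not self.buf and not self.done:
            self.buf = self.fetch(self.offset, self.batch)
            self.offset += self.batch
            self.done = len(self.buf) < self.batch   # short batch = no more
        return self.buf[0] if self.buf else None

    def pop(self):
        hit = self.peek()
        self.buf.pop(0)
        return hit

def merged(sources, want):
    """Same rank-ordered walk as before, but each comparison now reads
    from a buffer, so database round trips happen only every `batch` hits."""
    out = []
    while len(out) < want:
        live = [s for s in sources if s.peek() is not None]
        if not live:
            break
        best = max(live, key=lambda s: s.peek()[0])
        out.append(best.pop())
    return out
```

The trade-off mentioned above lives in the `batch` parameter: larger batches mean fewer round trips but more over-fetching past the rank you actually needed.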

A word of warning here - you might imagine that getting 10 results from each of two databases would be sufficient for results 1-10 and 11-20 - this is not true! To understand this, look at the extremes of this scenario:

Let's say database A has lots of high-ranking results, whereas database B has only low-ranking results. Your first 10 results will be fine, because you have merged the top scorers together and then sorted the list. However, 11-20 are not actually the 11th to 20th results, because they come from database B, whereas they should come from database A.

If you want to optimise this scenario into performing a single, large batch search on each database, then you have to request as many results as the last result in a page. So, for page 2, that would be result 20 (ie. 11-20). You have to request 20 results from each database, and then sort that list, throwing away 1-10 and 21-40. Clearly, you throw away a lot of results, even at a low page number like this. If you're working on page 10, you're going to be doing a lot of batch transfers, only to throw away 95% (190 results) of what you receive (you'd have to throw away 90% (90 results) even if you were working very efficiently though). This problem gets worse rapidly as the number of databases goes up.
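The arithmetic in that paragraph, as a quick sanity check:

```python
def batch_waste(page, size, n_databases):
    """Fetch/discard counts for the single-batch approach: request
    `page * size` hits from every database, keep only one page."""
    end = page * size                # position of the last hit on the page
    fetched = end * n_databases      # that many hits from each database
    discarded = fetched - size       # everything except the page itself
    return fetched, discarded

# Page 2 of 10, two databases: fetch 40 hits, discard 30.
# Page 10: fetch 200, discard 190 (95%); an "efficient" merge fetching
# exactly the first 100 merged hits would still discard 90 (90%).
```

With more databases, `fetched` grows linearly while `size` stays fixed, which is why the waste gets worse rapidly as databases are added.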