I wonder if it might be reasonable for the slapd frontend to do some preprocessing of search filters to recognize no-ops and such. Most specifically, any search filter of the form (&(objectclass=*)(filter2)) can be reduced to (filter2), right?
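To make the idea concrete, a rough sketch of that reduction over a parsed filter tree might look like the following (Python pseudocode purely for illustration, not slapd code; it assumes attribute names are already normalized to lowercase):

# Drop (objectclass=*) no-op terms from an AND and collapse single-term ANDs.
# Filters are modeled as tuples: ('present', attr), ('eq', attr, val),
# ('and', [subfilters]), ('or', [subfilters]), ('not', subfilter).

def simplify(f):
    op = f[0]
    if op == 'and':
        kept = [simplify(sub) for sub in f[1]]
        # (objectclass=*) is always true, so it contributes nothing to an AND
        kept = [sub for sub in kept if sub != ('present', 'objectclass')]
        if not kept:
            return ('present', 'objectclass')   # everything was a no-op
        if len(kept) == 1:
            return kept[0]      # (&(objectclass=*)(filter2)) -> (filter2)
        return ('and', kept)
    if op == 'or':
        return ('or', [simplify(sub) for sub in f[1]])
    if op == 'not':
        return ('not', simplify(f[1]))
    return f

# (&(objectclass=*)(uid=foo)) reduces to (uid=foo)
print(simplify(('and', [('present', 'objectclass'), ('eq', 'uid', 'foo')])))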
I have a problem in that the first time someone performs a search for 'objectclass=*' after slapd is restarted, the server is really bogged down for a while. Once the search has completed once, this is not a problem. I assume that's due to the IDL cache. However, I currently have to keep the server unavailable after restarting slapd for upwards of half an hour while I do an 'objectclass=*' search the first time.
I suppose a followup might be some preevaluation of ACLs before performing easily-recognized searches. For instance, the backend should not have to cope with a search for objectclass=* if the user doesn't have read access to anything as is often the case with an anonymous bind.
Eric Irrgang wrote:
I wonder if it might be reasonable for the slapd frontend to do some preprocessing of search filters to recognize no-ops and such. Most specifically, any search filter of the form (&(objectclass=*)(filter2)) can be reduced to (filter2), right?
(objectclass=*) is already a no-op in back-bdb/back-hdb. I made that optimization back in November 2001.
So yes, it's a nice idea, been there, done that.
I have a problem in that the first time someone performs a search for 'objectclass=*' after slapd is restarted, the server is really bogged down for a while. Once the search has completed once, this is not a problem. I assume that's due to the IDL cache. However, I currently have to keep the server unavailable after restarting slapd for upwards of half an hour while I do an 'objectclass=*' search the first time.
Changing the behavior of (objectclass=*) isn't going to make any difference. If you specify a filter that is too broad, then slapd has to trawl through every entry in the DB because every entry satisfies the condition.
I suppose a followup might be some preevaluation of ACLs before performing easily-recognized searches. For instance, the backend should not have to cope with a search for objectclass=* if the user doesn't have read access to anything as is often the case with an anonymous bind.
That's not a fair assumption.
On 3/2/07 12:51 PM, Eric Irrgang wrote:
I have a problem in that the first time someone performs a search for 'objectclass=*' after slapd is restarted, the server is really bogged down for a while. Once the search has completed once, this is not a problem. I assume that's due to the IDL cache. However, I currently have to keep the server unavailable after restarting slapd for upwards of half an hour while I do an 'objectclass=*' search the first time.
Does your server have an index for objectclass?
Francis Swasey wrote:
On 3/2/07 12:51 PM, Eric Irrgang wrote:
I have a problem in that the first time someone performs a search for 'objectclass=*' after slapd is restarted, the server is really bogged down for a while. Once the search has completed once, this is not a problem. I assume that's due to the IDL cache. However, I currently have to keep the server unavailable after restarting slapd for upwards of half an hour while I do an 'objectclass=*' search the first time.
Does your server have an index for objectclass?
I was just today thinking about something along the lines of filter preprocessing (at the client level, actually) that would prevent, say, a "contains" search like (telephonenumber=*67530*) on an attribute that the directory has not indexed for substring searches (as is the case for telephonenumber). Something at the server level would be better, of course.
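Purely as an illustration, the naive client-side check I have in mind might look like this (Python; SUBSTR_INDEXED is a made-up, locally maintained list of attributes known to have a substring index on the server):

# Flag filters that use substring matching on attributes with no "sub" index.
# Filters are modeled as tuples: ('substr', attr, pattern),
# ('and'|'or', [subfilters]), ('not', subfilter), or other simple terms.

SUBSTR_INDEXED = {'cn', 'sn', 'mail'}   # made-up example configuration

def has_unindexed_substring(f):
    op = f[0]
    if op == 'substr':
        return f[1] not in SUBSTR_INDEXED
    if op in ('and', 'or'):
        return any(has_unindexed_substring(sub) for sub in f[1])
    if op == 'not':
        return has_unindexed_substring(f[1])
    return False

# (telephonenumber=*67530*) would be flagged before it is ever sent
print(has_unindexed_substring(('substr', 'telephonenumber', '67530')))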
Jon Roberts
Jon Roberts wrote:
I was just today thinking about something along the lines of filter preprocessing (at the client level, actually) that would prevent, say, a "contains" search like (telephonenumber=*67530*) on an attribute that the directory has not indexed for substring searches (as is the case for telephonenumber). Something at the server level would be better, of course.
Something like that was discussed a long time ago when I proposed the "limits" feature (which eventually got into slapd in its current form). It's hard to tell what such a constraint would mean. However, if one only looks at the presence of a substrings filter in a search, unexpected results may occur; for example:
(telephonenumber=*67530*) => reject
but what about
(!(telephonenumber=*67530*)) => ?
or
(&(uid=foo)(telephonenumber=*67530*)) => ?
A better approach, which we recently developed for a customer, would be to define which filters are to be considered acceptable and which are not, and then analyze the logic of the filter to see whether it matches that requirement. For example, logic analysis could determine whether a filter is surely acceptable, surely unacceptable, or "grey"; then, decision making could determine what to do in the "grey" cases.
If what you want to control is searches resulting in large candidate sets, you need to define what may potentially lead to large candidate sets. So you need to define what's "large", and what simple filters could lead to large candidate sets.
For example: (uid=foo) is likely to lead to a unique candidate (or at most to a few, unless the "unique" overlay is in use), but (objectClass=person) is likely to lead to more than one candidate, and in some cases to nearly all the entries in the database. The administrator should define the expected behavior of exact matching on a given attribute.
Then, filter analysis would lead to an answer in terms of likelihood of resulting in large candidate sets.
The filter analyzer would need to use rules like:
- presence => large set
- equality => small or large set, depending on attribute
- approx => small or large set, depending on attribute
- substrings => large
- ordering => large
- not => small to large, large to large
- and => small if at least one small, large otherwise
- or => large if at least one is large
If the rule is: "reject large candidate lists", filter analysis should walk the filter to terminals, stopping as soon as the rule says that the candidate set could be large. If a small set is expected after the filter tree has been entirely traversed, the operation can continue.
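As an illustration only, a sketch of such a walk over a parsed filter might be (Python, not slapd code; EQ_SMALL stands for the admin-supplied knowledge of which equality matches select few entries):

# Classify a filter as yielding a 'small' or 'large' candidate set,
# using the rule table above.  Filters are modeled as tuples:
# ('present', a), ('eq', a, v), ('approx', a, v), ('substr', a, v),
# ('ge', a, v), ('le', a, v), ('and', [..]), ('or', [..]), ('not', f).

EQ_SMALL = {'uid', 'mail'}      # made-up example configuration

def candidate_size(f):
    op = f[0]
    if op == 'present':
        return 'large'
    if op in ('eq', 'approx'):
        return 'small' if f[1] in EQ_SMALL else 'large'
    if op in ('substr', 'ge', 'le'):
        return 'large'
    if op == 'not':
        return 'large'          # small to large, large to large
    if op == 'and':             # small if at least one conjunct is small
        if any(candidate_size(sub) == 'small' for sub in f[1]):
            return 'small'
        return 'large'
    if op == 'or':              # large if at least one disjunct is large
        if any(candidate_size(sub) == 'large' for sub in f[1]):
            return 'large'
        return 'small'
    return 'large'              # unknown terms treated conservatively

# (&(uid=foo)(telephonenumber=*67530*)) -> 'small', so it would be allowed
print(candidate_size(('and', [('eq', 'uid', 'foo'),
                              ('substr', 'telephonenumber', '67530')])))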
It is not simple, and it requires some configuration from the administrator to indicate which equality matches can be assumed to return "small" candidate sets, but it should work.
To make it more general, attributes could be given a weight, or a probability: some equalities may be believed to return small sets, others to return larger sets, so that
(sn=Smith)
may return a considerable set (=> weight 0.1), but (!(sn=Smith)) could return an even larger one (=> weight 1-0.1 = 0.9).
Then, using P() as a "probability operator" [*]:
P(x=v) => admin defined
P(x~=v) => admin defined
P(x=*v*) => admin defined (based on substring length?) but in any case > P(x=v); 0.5 should be reasonable, so that P(!(x=*v*)) is 0.5
P(x>v), P(x<v) => admin defined (based on v?) but in any case > P(x=v)
P(not(f)) => 1-P(f)
P(and(f1,f2)) => min(P(f1),P(f2))
P(or(f1,f2)) => max(P(f1),P(f2))
and the requirement could be to have P(filter) < threshold.
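Again purely as an illustration (Python, not slapd code; the weights and the threshold are made up, and the filter representation is the same tuple form used above), the scoring could look like:

# Score a filter with the P() rules above and compare against a threshold.

P_EQ = {'uid': 0.001, 'sn': 0.1}    # admin-defined weights (made up here)
P_EQ_DEFAULT = 0.05
P_SUBSTR = 0.5
P_ORDERING = 0.5
THRESHOLD = 0.2

def p(f):
    op = f[0]
    if op == 'present':
        return 1.0
    if op in ('eq', 'approx'):
        return P_EQ.get(f[1], P_EQ_DEFAULT)
    if op == 'substr':
        return P_SUBSTR
    if op in ('ge', 'le'):
        return P_ORDERING
    if op == 'not':
        return 1.0 - p(f[1])
    if op == 'and':
        return min(p(sub) for sub in f[1])
    if op == 'or':
        return max(p(sub) for sub in f[1])
    return 1.0                      # unknown terms treated conservatively

def acceptable(f):
    return p(f) < THRESHOLD

# (&(uid=foo)(telephonenumber=*67530*)): min(0.001, 0.5) -> accepted
print(acceptable(('and', [('eq', 'uid', 'foo'),
                          ('substr', 'telephonenumber', '67530')])))
# (!(sn=Smith)): 1 - 0.1 = 0.9 -> rejected
print(acceptable(('not', ('eq', 'sn', 'Smith'))))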
If anyone wants to implement anything like that, I suggest either writing an overlay that intercepts search operations and analyzes the filter (that's the approach we followed for very specific filter analysis rules), or extending limits so that they accept run-time loadable rules.
p.
[*] my statistics are quite naive; one could object that P(and(x,y)) should rather be P(x)*P(y), and so on, but I'm just trying to define a set of rules that make sense. Think of them as "fuzzy" rules rather than consistent maths.
Pierangelo Masarati wrote:
Jon Roberts wrote:
I was just today thinking about something along the lines of filter preprocessing (at the client level, actually) that would prevent, say, a "contains" search like (telephonenumber=*67530*) on an attribute that the directory has not indexed for substring searches (as is the case for telephonenumber). Something at the server level would be better, of course.
Something like that was discussed a long time ago when I proposed the "limits" feature (which eventually got into slapd in its current form). It's hard to tell what such a constraint would mean. However, if one only looks at the presence of a substrings filter in a search, unexpected results may occur; for example:
(telephonenumber=*67530*) => reject
but what about
(!(telephonenumber=*67530*)) => ?
or
(&(uid=foo)(telephonenumber=*67530*)) => ?
A better approach, which we recently developed for a customer, would be to define which filters are to be considered acceptable and which are not, and then analyze the logic of the filter to see whether it matches that requirement. For example, logic analysis could determine whether a filter is surely acceptable, surely unacceptable, or "grey"; then, decision making could determine what to do in the "grey" cases.
If what you want to control is searches resulting in large candidate sets, you need to define what may potentially lead to large candidate sets. So you need to define what's "large", and what simple filters could lead to large candidate sets.
OK, so you want to prevent candidate generation from occurring for filter terms which might result in large candidate sets. First of all, assuming that's even a valid thing to do (noting your issues listed above), I would just define a new limit analogous to sizelimit.unchecked and skip the probability guessing games, e.g. sizelimit.intermediate, which would be checked at intermediate stages of filter evaluation. That would render sizelimit.unchecked moot.
The implementation would apply this limit to each individual filter term lookup, and fail with ADMINLIMIT_EXCEEDED when any term exceeds the limit.
In practice I think this will cause a lot of harm, though; it will cause ANDed filters that would otherwise come in under the unchecked limit to fail.
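Just to illustrate the concern (Python, not slapd code; sizelimit.intermediate is the hypothetical limit proposed above and candidates_for() stands in for the backend's per-term index lookup):

# Per-term limit check for an AND filter: each term's candidate list is
# bounded individually, so one broad term refuses the whole search even
# when the final intersection would be tiny.

class AdminLimitExceeded(Exception):
    pass

def and_candidates(terms, candidates_for, limit_intermediate):
    result = None
    for term in terms:
        ids = candidates_for(term)      # per-term lookup (a set of entry IDs)
        if len(ids) > limit_intermediate:
            raise AdminLimitExceeded(term)
        result = ids if result is None else result & ids
    return result

def candidates_for(term):               # fake lookup just for this example
    return {1, 2} if term == ('eq', 'uid', 'foo') else set(range(100000))

try:
    and_candidates([('eq', 'uid', 'foo'), ('eq', 'objectclass', 'person')],
                   candidates_for, limit_intermediate=50000)
except AdminLimitExceeded as e:
    print("refused on term", e)         # although the AND would return at most 2 entries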
Howard Chu wrote:
OK, so you want to prevent candidate generation from occurring for filter terms which might result in large candidate sets. First of all, assuming that's even a valid thing to do (noting your issues listed above), I would just define a new limit analogous to sizelimit.unchecked and skip the probability guessing games, e.g. sizelimit.intermediate, which would be checked at intermediate stages of filter evaluation. That would render sizelimit.unchecked moot.
The implementation would apply this limit to each individual filter term lookup, and fail with ADMINLIMIT_EXCEEDED when any term exceeds the limit.
In practice I think this will cause a lot of harm, though; it will cause ANDed filters that would otherwise come in under the unchecked limit to fail.
No, not to each filter term: that's the naive (wrong :) way to do it. I'd apply it to the result of the speculation. Each filter term would contribute to computing the final score, which determines if the limit (let's say the action) applies or not.
I'm not saying I want to implement anything like that. I'm saying that's how I'd design a filter analyzer. Actually, that's how we designed it for a slightly different, and much more deterministic problem: detect when a composite filter complies with a certain, very well defined paradigm; sort of:
assuming there's a deterministic way to determine when a simple filter complies with a certain paradigm,
(&(paradigm)(filter)) => complies (intersection)
(|(paradigm)(filter)) => doesn't comply (union)
(!(paradigm)) => doesn't comply (negation)
(!(filter)) => doesn't comply (negation)
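As an illustration, one possible reading of those rules over a parsed filter could be (Python, not slapd code; matches_paradigm() stands for the assumed deterministic test on a simple filter term):

# Propagate "compliance with the paradigm" through a composite filter.
def complies(f, matches_paradigm):
    op = f[0]
    if op == 'and':     # intersection: one complying conjunct is enough
        return any(complies(sub, matches_paradigm) for sub in f[1])
    if op == 'or':      # union: every disjunct must comply
        return all(complies(sub, matches_paradigm) for sub in f[1])
    if op == 'not':     # negation never complies
        return False
    return matches_paradigm(f)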
p.