
Many systems need to derive facts based on rules. For example, a document management system may need to determine attributes of migrated documents based on folder structures in a source system. The folder paths may be inconsistent across different source systems. In this case, the solution may rely heavily on a flexible rule-based application to determine any number of attributes based on the folder path.
The intended approach is to determine the fact based on logical Regular Expression (RE) patterns. In the case of document management, the RE pattern would interpret folder paths. To support this concept in a flexible way, an RE operation can be developed.
The RE operation can use a lookup table where the key is a regular expression.
See the following link for more information about the regular expressions supported by the Java programming language: http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
The regular expression will be compared to the value of a specified source attribute. This operation is similar to the lookup operation.
At runtime, the RE operation holds the regular expressions and corresponding values (typically stored in a hash table). The RE operation can iterate through the given expressions looking for a match on the corresponding source attribute. The value from the hash table is used if the regular expression matches the source attribute value.
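The lookup described above could be sketched roughly as follows. The class name `RegexLookup` and its methods are hypothetical, and a `LinkedHashMap` is assumed here so that the rules are tried in the order they were added; none of this comes from the original implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical first-match lookup whose keys are regular expressions. */
final class RegexLookup {
    // Assumption: insertion order is good enough for rule precedence here.
    private final Map<String, String> rules = new LinkedHashMap<>();

    void addRule(String regex, String value) {
        rules.put(regex, value);
    }

    /** Returns the value of the first rule whose regex matches the whole source value. */
    Optional<String> lookup(String sourceValue) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // String.matches requires the regex to match the entire string.
            if (sourceValue.matches(rule.getKey())) {
                return Optional.of(rule.getValue());
            }
        }
        return Optional.empty();
    }
}
```

Because `String.matches` tests the whole string, rules that look for a fragment of a folder path need a leading and trailing `.*`.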
By default, regular expressions do not have an easy way to chain expressions together using AND/NOT logic. The OR logical expression is supported with the character “|”. The RE operation combines regular expressions with a special syntax to support AND/NOT logic.
The RE operation can chain expressions together with “AND” logic by separating two regular expressions with the string “${AND}”. If any of the regular expressions returns false, the entire expression is false. For example, the regular expression “.*USA.*${AND}.*Greece.*” only returns true if the text contains both “USA” and “Greece”.
The RE operation supports negative logic (NOT) for expressions. This is accomplished by prefixing the expression with “${NOT}”. In the following example, the regular expression “.*
| Complex RE | Value | Matches |
| ${NOT}.* | | False |
| ${NOT}.* | | False |
| ${NOT}.* | | True |
| ${NOT}.* | | False |
| .*Greece.*${AND}${NOT}.*USA.* ${AND}${NOT}.* | | False |
| .*Greece.*${AND}${NOT}.*USA.* ${AND}${NOT}.* | | True |
    /**
     * Supports AND plus NOT operations in regular expressions.
     *
     * @param sourceValue the source value to test
     * @param complexRegularExpression the complex regular expression
     *        (uses ${AND} and ${NOT} to chain expressions together)
     * @return true if the source value matches the complex regular expression
     */
    public static boolean matches(String sourceValue, String complexRegularExpression) {
        try {
            // test whether the expression contains AND
            if (complexRegularExpression.indexOf(AND) > -1) {
                // split on AND
                String[] simpleRegExpressions = split(complexRegularExpression, AND);
                for (int i = 0; i < simpleRegExpressions.length; i++) {
                    // recursive call to test each simple regular expression
                    if (!matches(sourceValue, simpleRegExpressions[i])) {
                        return false;
                    }
                }
                return true; // matched all ANDs
            } else if (complexRegularExpression.indexOf(NOT) > -1) {
                String notRegExp = replace(NOT, "", complexRegularExpression);
                // return the negated match
                return !matches(sourceValue, notRegExp);
            }
        } catch (RuntimeException e) {
            throw new SystemException("complexRegularExpression=" + complexRegularExpression
                    + " sourceValue=" + sourceValue + " " + Debugger.stackTrace(e));
        }
        return sourceValue.matches(complexRegularExpression);
    }
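The listing above relies on helpers that are not shown (the `AND` and `NOT` constants, `split`, `replace`, `SystemException` and `Debugger`). A self-contained sketch of the same idea, using only the standard library, might look like the following. The class name `ComplexRegex` is hypothetical, and this version assumes `${NOT}` appears only as a prefix of a simple expression, as the text describes, rather than anywhere inside it.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-alone variant of the AND/NOT matcher. */
final class ComplexRegex {
    static final String AND = "${AND}";
    static final String NOT = "${NOT}";

    private ComplexRegex() {}

    static boolean matches(String sourceValue, String complexRegex) {
        if (complexRegex.contains(AND)) {
            // Pattern.quote makes split treat "${AND}" as literal text, not as a regex.
            for (String part : complexRegex.split(Pattern.quote(AND))) {
                if (!matches(sourceValue, part)) {
                    return false; // one failed part fails the whole expression
                }
            }
            return true; // every AND-ed part matched
        }
        if (complexRegex.startsWith(NOT)) {
            // Negate the result of the expression that follows the prefix.
            return !matches(sourceValue, complexRegex.substring(NOT.length()));
        }
        return sourceValue.matches(complexRegex);
    }
}
```

With this sketch, `ComplexRegex.matches("Greece", ".*Greece.*${AND}${NOT}.*USA.*")` is true, while the same expression tested against `"Greece and USA"` is false. Whitespace around `${AND}` is not trimmed, so a stray space becomes part of the neighbouring expression.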
The default functionality of the RE operation is to return the first matching value for the regular expressions. The RE operation should also support multiple matching values for single-value or repeating-value attributes.
This can be supported by placing the configuration settings in a property file.
    # Sample properties to support multiple matches
    dm_document.keywords.repeating=true
    dm_document.keywords.re.multiple=TRUE
The information above can inform the RE operation that the attribute keywords of dm_document is both repeating and can match multiple regular expressions.
| Regular Expression | Value |
| null | Not Available |
| .* | |
| .* | |
The following is a sample input and output, given the configuration rules above:
| INPUT | OUTPUT |
| | • Not Available |
| | • |
| | • • |
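For a repeating attribute, collecting every matching value instead of stopping at the first could be sketched as below. The class `MultiMatch` and its parameters are hypothetical. The handling of a null source value is an assumption based on the “null” row of the rule table, read here as "the value to use when the source attribute is missing".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that returns all matching values for a repeating attribute. */
final class MultiMatch {
    private MultiMatch() {}

    /**
     * @param rules       regular expression to value, iterated in the map's own order
     * @param nullValue   assumed meaning of the "null" rule: value for a missing source
     * @param sourceValue the source attribute value, possibly null
     */
    static List<String> matchAll(Map<String, String> rules, String nullValue, String sourceValue) {
        if (sourceValue == null) {
            return Collections.singletonList(nullValue);
        }
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (sourceValue.matches(rule.getKey())) {
                values.add(rule.getValue()); // keep going: every match contributes a value
            }
        }
        return values;
    }
}
```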
The RE operation can use a text separator if multiple matches are supported for a single value attribute. By default, it assumes that all attributes are single value.
The following informs the RE operation that the “title” attribute of dm_document is non-repeating and supports multiple matches separated by “ and”. Note that “\” is used to escape whitespace in the property file value.
    # Sample properties to support multiple matches
    dm_document.title.repeating=false
    dm_document.title.re.multiple=TRUE
    dm_document.title.re.format.separator=\ and
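As a rough illustration of how such settings might be consumed, the sketch below loads property text with `java.util.Properties` and joins the matched values with the configured separator. The class `MatchFormatter`, its methods, and the fallback defaults are assumptions, not part of the original design. The property names follow the pattern used in the samples.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

/** Hypothetical formatter for single-value attributes with multiple matches. */
final class MatchFormatter {
    private MatchFormatter() {}

    /** Parses property-file text; a real system would read it from a file. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props;
    }

    /** Joins the matches when ".re.multiple" is true, otherwise returns the first match. */
    static String format(Properties props, String attribute, List<String> matches) {
        if (matches.isEmpty()) {
            return "";
        }
        // Boolean.parseBoolean ignores case, so "TRUE" and "true" both work.
        boolean multiple = Boolean.parseBoolean(props.getProperty(attribute + ".re.multiple", "false"));
        if (!multiple) {
            return matches.get(0);
        }
        String separator = props.getProperty(attribute + ".re.format.separator", "");
        return String.join(separator, matches);
    }
}
```

`Properties.load` drops unescaped whitespace at the start of a value, which is why the leading space of the separator is written as `\ `. A trailing space, if one is wanted, is safest escaped the same way (`\ and\ `), since editors often trim line endings.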
The RE operation should support template output for values. The idea of a template is to provide text into which certain values can be inserted at runtime.
In the following example, the text “${subtype}” will be replaced with the actual “subtype” value at runtime.
    The subtype is ${subtype}
If the value of the subtype is “Strategy” then the resulting text will be “The subtype is Strategy”.
The following property informs the RE operation that formatting should be used when mapping the “PATH” attribute of dm_document.
    dm_document.PATH.re.format=TRUE
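The template substitution could be implemented along these lines. This is only a sketch: the class `ValueTemplate` is hypothetical, placeholder names are assumed to consist of word characters, and leaving unknown placeholders untouched is a design assumption rather than documented behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical renderer for value templates such as "The subtype is ${subtype}". */
final class ValueTemplate {
    // Matches ${name}, where name is one or more word characters.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    private ValueTemplate() {}

    static String render(String template, Map<String, String> attributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Assumption: a placeholder with no known value is left as written.
            String value = attributes.getOrDefault(name, matcher.group());
            // quoteReplacement keeps "$" and "\" in the value from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Rendering “The subtype is ${subtype}” with `subtype` set to “Strategy” gives “The subtype is Strategy”, as in the example above.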
The RELookup operation attempts to address the following issues and scenarios related to documents linked across multiple cabinets.
The RE operation should use a map sorted by key to look up its regular expressions. Java’s TreeMap data structure guarantees that the map’s key set (which stores the regular expressions) is retrieved in sorted order, which for String keys is lexicographic. The configuration can prefix the regular expressions with a dummy numerical value that maintains the correct sort order for the rules.
In the following examples, the first rule states that the value “MOST IMPORTMENT” will be used if the text has any digit ([0-9]) followed by a “_” or “. ” and then the word “Introduction” or “INTRODUCTION”.
The second rule states that the subtype “IMPORTMENT” will be used if the PATH has any digit followed by a “_” or “. ” and then the word “G-SOLUTIONS”.
| Key | Value |
| (001)|(.*[0-9](_|\. )(Introduction|INTRODUCTION).*) | MOST IMPORTMENT |
| (002)|(.*[0-9](_|\. )(G-SOLUTIONS).*) | IMPORTMENT |
The prefixes to these two rules, (001) and (002), are also regular expressions. The “or” operator “|” separates the prefix from the actual rule. A path with a value of “001” or “002” will not exist, but the prefix allows the first rule to precede the second.
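The effect of the numeric prefixes can be illustrated with a small sketch built on `TreeMap`. The class `OrderedRules` and its methods are hypothetical, and the rules are the two from the table above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical rule set whose evaluation order comes from the sorted keys. */
final class OrderedRules {
    // TreeMap iterates String keys in lexicographic order, so "(001)|..." precedes "(002)|...".
    private final TreeMap<String, String> rules = new TreeMap<>();

    void addRule(String prefixedRegex, String value) {
        rules.put(prefixedRegex, value);
    }

    /** Returns the value of the first rule, in key order, that matches the whole path. */
    Optional<String> firstMatch(String path) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (path.matches(rule.getKey())) {
                return Optional.of(rule.getValue());
            }
        }
        return Optional.empty();
    }
}
```

The order in which rules are added does not matter; only the keys do. Because the ordering is lexicographic rather than numeric, the prefixes should be zero-padded to a fixed width: “(10)” would sort before “(2)”, whereas “(002)” sorts before “(010)”.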
The multiple-match functionality of the RE operation can be used to map one or more rules to values. In this example, a path will be set for each matching regular expression. The value will be set to the first instance in cases of conflicting rules. “Not” RE expressions can be used to distinguish between overlapping rules.
| Reg Exp (PATH) | VALUE |
| (001)|(.*A *) | HAS A |
| (002)|(.*A*)${AND}${NOT}.*B.* | HAS and NOT B |