g-Solutions

  Logical Regular Expressions (RE) Design Pattern

 

1.1      Background

Many systems need to derive facts based on rules. For example, a document management system may need to determine attributes of migrated documents based on folder structures in a source system. The folder paths may be inconsistent across different source systems. In this case, the solution may rely heavily on a flexible, rule-based approach that can determine any number of attributes from the folder path.

 

1.2      Solution

 

The intended approach is to determine the fact based on logical Regular Expression (RE) patterns. In the case of document management, the RE patterns would interpret folder paths. To support this concept in a flexible way, an RE operation can be developed.

2     Implementation

2.1      RE

The RE operation can use a lookup table where the key is a regular expression.

See the following link for more information about the regular expressions supported by the Java programming language: http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html

The regular expression will be compared to the value of a specified source attribute. This operation is similar to the lookup operation.

At runtime, the RE operation holds the regular expressions and their corresponding values (typically stored in a hash table). The RE operation can iterate through the given expressions looking for a match on the corresponding source attribute. If a regular expression matches the source attribute value, the value stored under that expression in the hash table is used.
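The lookup described above can be sketched as follows. This is a minimal illustration, not the actual RE operation: the class name, method name, and folder paths are hypothetical, and a LinkedHashMap is assumed so that rules are tried in a predictable order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the RE lookup; all names here are illustrative only.
class ReLookupSketch {

    // Returns the value of the first rule whose regular expression matches
    // the whole source attribute value, or null if no rule matches.
    static String lookup(Map<String, String> rules, String sourceValue) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (sourceValue.matches(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so rules are tried in the order added.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put(".*/Contracts/.*", "Contract");
        rules.put(".*/Invoices/.*", "Invoice");

        System.out.println(lookup(rules, "/Finance/Invoices/2009/inv-17.pdf")); // Invoice
        System.out.println(lookup(rules, "/Misc/readme.txt"));                  // null
    }
}
```

Note that String.matches tests the entire value, which is why the sample rules start and end with “.*”.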

2.1.1    Complex Regular Expression (And/Not)

By default, regular expressions do not have an easy way to chain expressions together using AND/NOT logic. The OR logical expression is supported with the character “|”. The RE operation combines regular expressions with a special syntax to support AND/NOT logic.

2.1.2    AND Operation

 

The RE operation can chain expressions together with “AND” logic by separating them with the string “${AND}”. If any of the chained regular expressions does not match, the entire expression is false. For example, the expression “.*USA.*${AND}.*Greece.*” returns true only if the text contains both “USA” and “Greece”.

2.1.3    NOT Operation

The RE operation can support negative logic (NOT) for expressions. This is accomplished by prefixing the expression with “${NOT}”. For example, the expression “${NOT}.*USA.*” returns true only if the text does not contain the word “USA”. Note that multiple “${NOT}” expressions can be chained together with “${AND}” (see table below).

 

Complex RE                                             Value            Matches
---------------------------------------------------    --------------   -------
${NOT}.*USA.*                                          USA and Greece   False
${NOT}.*USA.*                                          USA              False
${NOT}.*USA.*                                          Greece           True
${NOT}.*USA.*                                          Greece USA       False
.*Greece.*${AND}${NOT}.*USA.*${AND}${NOT}.*Turkey.*    Greece Turkey    False
.*Greece.*${AND}${NOT}.*USA.*${AND}${NOT}.*Turkey.*    Greece Africa    True

Sample Java Code

import java.util.regex.Pattern;

// Tokens described in sections 2.1.2 and 2.1.3
private static final String AND = "${AND}";
private static final String NOT = "${NOT}";

/**
 * Supports AND plus NOT operations in regular expressions.
 * @param sourceValue the source value to test
 * @param complexRegularExpression the complex regular expression (uses ${AND} and ${NOT} to chain expressions together)
 * @return true if the source value matches the complex regular expression
 */
public static boolean matches(String sourceValue, String complexRegularExpression)
{
    try
    {
        // test whether the expression contains AND
        if (complexRegularExpression.contains(AND))
        {
            // split on the literal AND token (quoted, because "$", "{" and "}" are regex metacharacters)
            String[] simpleRegExpressions = complexRegularExpression.split(Pattern.quote(AND));
            for (int i = 0; i < simpleRegExpressions.length; i++)
            {
                // recursive call to test each simple expression
                if (!matches(sourceValue, simpleRegExpressions[i]))
                    return false;
            }
            return true; // matched all ANDs
        }
        else if (complexRegularExpression.contains(NOT))
        {
            // strip the NOT token, then return the negated match
            String notRegExp = complexRegularExpression.replace(NOT, "");
            return !matches(sourceValue, notRegExp);
        }
    }
    catch (RuntimeException e)
    {
        // SystemException and Debugger are project classes that are not shown in this document
        throw new SystemException("complexRegularExpression=" + complexRegularExpression + " sourceValue=" + sourceValue + " " + Debugger.stackTrace(e));
    }

    // simple expression: plain regular expression match against the whole value
    return sourceValue.matches(complexRegularExpression);
}// --------------------------------------------

2.1.4    Multiple Matches

By default, the RE operation returns the first matching value for the regular expressions. The RE operation should also support multiple matching values for single-value or repeating attributes.

 

This can be supported by placing the configuration settings in a property file.

 

#Sample properties to support multiple matches

dm_document.keywords.repeating=true

dm_document.keywords.re.multiple=TRUE

 

 

The properties above inform the RE operation that the “keywords” attribute of dm_document is repeating and can match multiple regular expressions.

 

Regular Expression    Value
------------------    ------------------------
null                  Not Available
.*UK.*                United Kingdom
.*USA.*               United States of America

 

The following is a sample input and output, given the configuration rules above:

 

INPUT              OUTPUT
---------------    ------------------------
Jersey             Not Available
UK                 United Kingdom
UK, Jersey, USA    United Kingdom
                   United States of America
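The input and output above can be reproduced with a short sketch. It assumes that the “null” rule acts as a default that is used only when no other rule matches; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a multiple-match lookup; names are illustrative only.
class MultiMatchSketch {

    // Collects the value of every rule whose expression matches the input.
    // If nothing matches, the default value (the "null" rule) is returned alone.
    static List<String> lookupAll(Map<String, String> rules, String defaultValue, String input) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (input.matches(rule.getKey())) {
                values.add(rule.getValue());
            }
        }
        if (values.isEmpty()) {
            values.add(defaultValue);
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put(".*UK.*", "United Kingdom");
        rules.put(".*USA.*", "United States of America");

        System.out.println(lookupAll(rules, "Not Available", "Jersey"));
        System.out.println(lookupAll(rules, "Not Available", "UK"));
        System.out.println(lookupAll(rules, "Not Available", "UK, Jersey, USA"));
    }
}
```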

 

2.1.4.1       Single Value Attribute Multiple Matches

The RE operation can use a text separator if multiple matches are supported for a single value attribute. By default, it assumes that all attributes are single value.

 

The following informs the RE operation that the “title” attribute of dm_document is non-repeating and supports multiple matches separated by “ and ”. Note that “\” is used to escape the leading whitespace in the property file value.

 

#Sample properties to support multiple matches

dm_document.title.repeating=false

dm_document.title.re.multiple=TRUE

dm_document.title.re.format.separator=\ and 
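As a rough sketch of how the separator property could be applied, the following loads the setting with java.util.Properties and joins the matched values. The class name, method names, and the fallback separator “, ” are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of joining multiple matches for a single value attribute.
class SeparatorSketch {

    // Parses property-file text; in a real system this would be read from a file.
    static Properties load(String text) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    // Joins the matched values with the separator configured for the attribute.
    // The fallback ", " is an assumption of this sketch.
    static String join(Properties config, String attribute, List<String> matches) {
        String separator = config.getProperty(attribute + ".re.format.separator", ", ");
        return String.join(separator, matches);
    }

    public static void main(String[] args) {
        // "\\ and " in Java source is the property-file text "\ and " (escaped leading space).
        Properties config = load("dm_document.title.re.format.separator=\\ and \n");
        List<String> matches = Arrays.asList("United Kingdom", "United States of America");
        System.out.println(join(config, "dm_document.title", matches));
    }
}
```

The backslash matters because Properties.load skips unescaped whitespace at the start of a value, while trailing whitespace is kept as part of the value.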

 

 

 

2.2      Formatting

The RE operation should support template output for values. A template provides text into which certain values can be inserted at runtime.

 

In the following example, the text “${subtype}” will be replaced with the actual “subtype” value at runtime.

 

The subtype is ${subtype}

 

If the value of the subtype is “Strategy” then the resulting text will be “The subtype is Strategy”.
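The substitution can be sketched with java.util.regex as shown below. The class and method names are hypothetical, and leaving unknown placeholders untouched is an assumption of this sketch rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of template formatting; names are illustrative only.
class TemplateSketch {

    // Matches placeholders of the form ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String format(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            // Unknown placeholders are left as-is (an assumption of this sketch).
            String replacement = values.getOrDefault(name, matcher.group());
            // quoteReplacement prevents "$" or "\" in the value from being interpreted.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("subtype", "Strategy");
        System.out.println(format("The subtype is ${subtype}", values)); // The subtype is Strategy
    }
}
```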

2.2.1    RE Operation Formatting

The following property can inform the RE operation that formatting should be used when mapping the “PATH” attribute of dm_document.

 

dm_document.PATH.re.format=TRUE

 

2.3      Multiple Linking Issue

The RELookup operation attempts to address the following issues and scenarios related to documents linked across multiple cabinets.

2.3.1    Prioritize Matches

The RE operation should use a map sorted by key to look up its regular expressions. Java’s TreeMap data structure guarantees that the map’s key set (which stores the regular expressions) is retrieved in ascending key order, which for String keys is lexicographic. The configuration can prefix each regular expression with a dummy numeric value that maintains the correct sort order for the rules.

 

In the following example, the first rule states that the value “MOST IMPORTMENT” will be used if the text contains any digit ([0-9]) followed by “_” or “. ” and then the word “Introduction” or “INTRODUCTION”.

 

The second rule states that the value “IMPORTMENT” will be used if the PATH contains any digit followed by “_” or “. ” and then the word “G-SOLUTIONS”.

 

Key                                                    Value
---------------------------------------------------    ---------------
(001)|(.*[0-9](_|\. )(Introduction|INTRODUCTION).*)    MOST IMPORTMENT
(002)|(.*[0-9](_|\. )(G-SOLUTIONS).*)                  IMPORTMENT

 

The prefixes of these two rules, (001) and (002), are also regular expressions. The “or” operator “|” separates the prefix from the actual rule. A path with a value of “001” or “002” will not exist, but the prefix allows the first rule to precede the second.
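The effect of the numeric prefixes can be illustrated with a small sketch. The two rules are taken from the table above; the class name, method name, and sample paths are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of prioritized matching; names and paths are illustrative only.
class PrioritySketch {

    // TreeMap iterates its keys in ascending order, so the "(001)" rule
    // is always tested before the "(002)" rule.
    static String firstMatch(TreeMap<String, String> rules, String path) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (path.matches(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TreeMap<String, String> rules = new TreeMap<>();
        // Inserted in reverse order on purpose; the sort order of the keys decides priority.
        rules.put("(002)|(.*[0-9](_|\\. )(G-SOLUTIONS).*)", "IMPORTMENT");
        rules.put("(001)|(.*[0-9](_|\\. )(Introduction|INTRODUCTION).*)", "MOST IMPORTMENT");

        // This path satisfies both rules, so the higher-priority value is used.
        System.out.println(firstMatch(rules, "/Cabinet/1_Introduction/2_G-SOLUTIONS.doc"));
        // This path satisfies only the second rule.
        System.out.println(firstMatch(rules, "/Cabinet/2. G-SOLUTIONS overview.doc"));
    }
}
```

Zero-padded prefixes such as “001” keep the lexicographic order aligned with the numeric order once there are ten or more rules.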

 

 

2.4      Multiple Matches paths

The RE operation can support multiple match functionality, which maps one or more rules to values. In this example, a path will be set for each matching regular expression. In cases of conflicting rules, the value of the first matching rule is used. “NOT” expressions can be used to distinguish between overlapping rules.

Reg Exp (PATH)                   VALUE
-----------------------------    -------------
(001)|(.*A *)                    HAS A
(002)|(.*A*)${AND}${NOT}.*B.*    HAS and NOT B