Many of the custom reports we develop perform some type of data validation, whether we're checking data delivered to clients, looking for data-entry errors, or preparing data for publishing. Validation usually involves multiple processes, but often the initial issue is to determine whether the data even conforms to the required format.
For this, Regular Expressions can be a really useful tool, but they're also a little cryptic and confusing. Regular Expressions have a variety of uses, but for data validation they let us define a format for data (text strings) and then to check our data for matches or errors.
Recently we were designing a report to validate the Item Tags of PipeRuns and of course the initial issue was determining whether the format of the Item Tag conformed to the project's standard. The correct format is as follows:
LX-100-789-GN12-BL1XY1-ISN
It's possible to write some complicated SQL using SubStrings to pull apart the text, but Regular Expressions are more concise, easier to modify, and easier to test.
Using APSR, we use a short Script to initialise a Regular Expression, and then test each data item against it. For example, to test the Line Number format above, we'd use the following Regular Expression:
^[A-Z]{2}-\d{2,3}-\d{3}-[A-Z]{2}\d{2}-BL\d{1}[A-Z ]{2}\d{1}(-[A-Z]{1,2})?$
That looks confusing, but breaking it down, it all makes sense:
^
Start of the String
[A-Z]{2}
Characters must be between A and Z, exactly 2 of them
\d{2,3}
Between 2 or 3 digits
\d{3}
Exactly 3 digits
[A-Z]{2}
Characters must be between A and Z, exactly 2 of them
\d{2}
Exactly 2 digits
BL
Literal text (the characters B & L)
\d{1}
Exactly 1 digit
[A-Z ]{2}
Characters must be between A and Z, exactly 2 of them
\d{1}
Exactly 1 digit
(-[A-Z]{1,3})?
Between 1 and 3 Characters from A-Z, but they're optional.
$
End of the String
In the first Validation phase we check the PipeRun ItemTag against the Regular Expression. For ItemTag's that fail this first test, there's not much more we can do, but if we get a match, then usually we'd move on to checking the individual components of the ItemTag (Nominal Diameter, Fluid Code, etc).
The Script below takes the "Lines" table created by the Query, adds a new Boolean column to indicate the result of the Regular Expression match, and then iterates through all of the entries in the Lines table (unique PipeRun ItemTags in this case).
This Script actually checks two different Regular Expressions, checking each ItemTag against first one, and if it fails then checks the other. If either Regular Expression is a Match, then "FormatMatch" is set to True (otherwise False).
The remaining Validation checks are then implemented in the Grid window (the Regular Expression can also be implemented in the Grid window, but we chose to use a Script for performance reasons).
In the sample report, we construct a "Derived ItemTag" which is a Calculated Grid Field composed of the various data fields that make up the Item Tag, in the correct format. Ideally, the Derived Item Tag, and the actual Item Tag field will match, but if they don't then someone needs to check why.
The Grid is filtered to only show PipeRuns with errors, and to highlight PipeRuns with invalid Item Tags in yellow, and mismatched Derived Item Tags in pink. Note that we could have configured the Grid to also compare the constituent parts of the Item Tag and highlight the specific sections in error. (apologies for the blurred text, but we can't show our Client's data).
If you'd like to use this report on your own SPPID system, or to modify it for your requirements, a copy of the final APSR report is available (note that you will need APS Reporter/SP installed).
By the way, if you're creating your own Regular Expressions, I'd recommend a copy of Expresso. It's a great tool for building regular expressions, and you can also provide test data to confirm you're finding the right errors.
Regards,
Brian
No comments:
Post a Comment