
Reverse Engineer Undocumented Databases with Dynamic AI

By Bob Cergol

This blog is for any developer who is now, or has ever been, faced with developing an application based on an established database for which little or no documentation exists. Maybe you’ve been challenged to “modernize” a legacy application for which no source code or documentation is available. Maybe the project is to build a suite of reports, but no one knows where the data lives or what it is called. Maybe the project is to integrate your application with some other, entrenched “foreign” proprietary application for which neither documentation nor support is available.

Even when some documentation is available, it is rarely of the technical kind a developer needs in order to understand how the application modeled its data and how users interacted with it. End users are typically unable to describe more than the most obvious, superficial data flows, let alone the business rules for enforcing data integrity, and have no clue how the data is structured in the database. Perhaps the project is for you to create all that missing documentation! That will most certainly require you to gain a deep understanding of the data structures and data flows. I have found Dynamic AI to be an indispensable tool in dealing with all of the above.

Dynamic AI requires an SQL data source, so for any project where the database comes to me as non-SQL, such as a native DataFlex embedded database, the very first thing I do is migrate all the data to a Microsoft SQL Server database. This step alone usually uncovers bad date values and duplicate key values the user was unaware existed. It sometimes also uncovers nonsensical index definitions, where the same column was specified in more than one index segment, either accidentally or unwittingly by using a DataFlex overlap field that itself contains the same column used elsewhere in the same index.

1) Using Metadata to Learn About a Database

A universal feature of an SQL database is that it contains a catalog of all of its data. This metadata is actually just another database, though perhaps somewhat hidden or exposed differently from all the other databases under the control of the DBMS server. In this blog I will be referring exclusively to Microsoft SQL Server and the SQL scripts shown are adapted from sources found on the internet. One of the best, certainly most in-depth articles I’ve seen on exposing this metadata can be found here: http://www.sqlservercentral.com/articles/Stairway+Series/143968

Sure, you can use DataFlex DBExplorer, MS SQL Server Management Studio, or other similar tools (even Dynamic AI has a database-explorer style of menu) to browse your actual database, see a list of tables, expose the columns, and so on. But what I have in mind requires far more agility, and far less time and effort, to glean more varied and useful information from my investigations.

For starters, besides just seeing all of the tables in a database, I want to see them ordered by the number of rows of data they contain. When a database contains a very large number of tables, or when the naming convention is cryptic yet some naming pattern is apparent, I want to be able to search for all the table names containing a given string and see how many rows of data each contains relative to the others.

I then want to be able to instantly expose the structure of a table, browse its data, and quickly determine the data patterns within it: What are all the unique values in a column labeled something-“Type”? What are the max and min values in various columns? Date ranges? Row counts? Other totals by various dimensions? I want to find all the tables containing a column whose name contains a string of characters such as “ID”, “CUST”, or “JOB”.
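
When I’m probing a single table by hand, the queries behind those questions are trivial; a sketch, with hypothetical table and column names:

-- Unique values (and their frequency) in a "type"-style column
SELECT OrderType, COUNT(*) AS RowCnt
FROM dbo.Orders
GROUP BY OrderType
ORDER BY RowCnt DESC;

-- Value ranges and overall row count in one pass
SELECT MIN(OrderDate) AS FirstDate, MAX(OrderDate) AS LastDate,
       MIN(Amount)    AS MinAmount, MAX(Amount)    AS MaxAmount,
       COUNT(*)       AS RowCnt
FROM dbo.Orders;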

This tells me instantly which tables are the most central transactional tables and which are the more static header, parent, master, or other system-configuration types of tables. It reveals the columns that are the likely basis for relationships, and which columns are likely to be selection lists, etc. It also allows me to very quickly confirm hypotheses about how data in one table flows into another, and to validate result sets and various totals.

So I’ll leverage Dynamic AI capabilities on top of the backend server’s metadata about the database itself to make some interactive reports: a sort of “BI” on the metadata. Besides being a platform for generating business intelligence, Dynamic AI provides a robust and flexible presentation layer that requires no programming to deliver attractive output to a browser, and provides efficient navigation of, and interactivity with, the information presented.

2) Database Structure

Let’s start with a super simple example (all examples use a defined database connection that specifies the server and database being queried). Using the trivial SQL Select statement shown below as the data source view, I made this simple Dynamic AI report, which shows that 80 tables and 3,666 columns exist in this database, and how many columns each table contains. I can use the “free search” to do a sub-string search on table or column names and display data for any table with a mouse click. Drilling down displays the table’s column details, and clicking on a formula column I added in the report design creates a scrollable, list-style report in a new tab where the actual data can be explored in detail.


SELECT * FROM INFORMATION_SCHEMA.COLUMNS
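
That SELECT * is all the report needs, since Dynamic AI does the grouping. If you want the same per-table summary straight from the catalog, a minimal equivalent query is:

-- Column count per table, straight from the catalog
SELECT TABLE_SCHEMA, TABLE_NAME, COUNT(*) AS ColumnCnt
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY TABLE_SCHEMA, TABLE_NAME
ORDER BY ColumnCnt DESC;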

DAI_Blog_BI_Dev_1_Table_Columns

If my investigations suggest that a column named Product is important, I can search across all 3,666 columns in the database to instantly find the tables containing columns that might be associated with product, and then investigate those tables whose names suggest some higher relevance.
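
Outside the report, that search is a single catalog query; a sketch:

-- Every table with a column whose name suggests "product"
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%prod%'
ORDER BY TABLE_NAME, ORDINAL_POSITION;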

DAI_Blog_BI_Dev_2_Table_Columns

I can easily search for columns of specific data type(s) and size from any subset of tables. Let’s say I want to know what “flag” columns might exist that control logic or define some other attribute of the data in a given row, across all the tables I’ve determined comprise the Accounts Receivable sub-system.
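
The underlying query just filters the catalog on data type and length; a sketch (restricting to the Accounts Receivable tables would be an extra TABLE_NAME filter):

-- Candidate "flag" columns: bit columns, or very short character columns
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'bit'
   OR (DATA_TYPE IN ('char', 'varchar') AND CHARACTER_MAXIMUM_LENGTH <= 2)
ORDER BY TABLE_NAME, COLUMN_NAME;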

DAI_Blog_BI_Dev_3_Table_Columns

Sometimes I also want to know what indices exist, because they too are often a strong clue as to which columns play a more important role in the application’s logic. I made two reports to let me review indices: one showing a concise list, the other offering more verbose details about each index. Either one allows me to do a free search on all character columns, so I can find all the indices across all the tables that use any column whose name contains the string I am searching for.

You can find the SQL script vTableIndexesConcise, used in the dynamic view that is the data source for this report, along with all the other scripts for the reports mentioned in this blog, in the attached SQL source file.
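
The attached script is the real thing; the heart of a concise index listing is a join of the sys.indexes and sys.index_columns catalog views, something like this minimal sketch:

-- Concise index list: one row per index, with its key columns concatenated
SELECT t.name AS TableName,
       i.name AS IndexName,
       i.type_desc AS IndexType,
       i.is_unique AS IsUnique,
       STUFF((SELECT ', ' + c.name
              FROM sys.index_columns ic
              JOIN sys.columns c
                ON c.object_id = ic.object_id AND c.column_id = ic.column_id
              WHERE ic.object_id = i.object_id AND ic.index_id = i.index_id
              ORDER BY ic.key_ordinal
              FOR XML PATH('')), 1, 2, '') AS KeyColumns
FROM sys.indexes i
JOIN sys.tables t ON t.object_id = i.object_id
WHERE i.index_id > 0               -- skip heaps
ORDER BY t.name, i.index_id;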

DAI_Blog_BI_Dev_3_Table_Indices

Here’s the verbose, detailed index report; the view it uses is built from the SQL script labeled vTableIndexesVerbose.

DAI_Blog_BI_Dev_5_Table_Indices

3) Data Distribution

Knowing which tables store most of the data is extremely useful, just as it is knowing which tables have very few rows. I use this report more than any other in my investigations. It also lets me drill into the table structure to see the columns, and display the first 50 columns in a scrollable list report, which can be the basis for extremely quick ad-hoc design to uncover the data patterns within a single table. It’s worth mentioning that the report’s filters allow finding table names containing any substring, such as “prod”, or filtering based on row count. You can use >, >=, <, or <= as a prefix to any number to qualify the row count criteria you’re searching for. You can even use “..” for between, e.g. “100..500” would show only tables with a row count between 100 and 500 rows. The SQL script used in the dynamic view that is the data source for the Table Row Counts report is vTableRowCounts.
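
The attached vTableRowCounts script is what the report actually uses; its essence is to read row counts from partition metadata rather than scanning the tables, something like:

-- Row count per table from partition metadata (no table scans)
SELECT t.name AS TableName, SUM(p.rows) AS RowCnt
FROM sys.tables t
JOIN sys.partitions p
  ON p.object_id = t.object_id
 AND p.index_id IN (0, 1)          -- 0 = heap, 1 = clustered index
GROUP BY t.name
ORDER BY RowCnt DESC;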

DAI_Blog_BI_Dev_5_Table_RowCnt

4) Data Patterns

I often use the following features in Dynamic AI to quickly uncover data patterns:

  • Automated Intelligence
  • On-screen, drag-drop report design
  • Column Actions

AUTOMATED INTELLIGENCE

I’ll use the JCTRNS01 table, with over one million rows, to demonstrate what the “AI” function can do. On my ThinkPad laptop running 64-bit SQL Server 2014, it generates in about half a second a dashboard-like presentation of numerous report grids, each presenting a single total grouped by other columns. I can select from among the time dimensions and numeric columns in the drop-downs and refresh all the reports. I usually start by selecting the “count” total that Dynamic AI appends to the list of numeric columns it found. This shows me instantly how the data is distributed and which columns have special importance.

The report grids can be closed, moved, and even merged with other grids to create additional row or column groupings within any given report. Drill-down to the detail rows is available, and charts can be added too. I can save the AI dashboard as is, or open any single report alone to work on its design in more depth and save it. This is actually a good way to learn how reports are designed in Dynamic AI, because you can use the design it automatically creates as the starting point for experimentation.

DAI_Blog_BI_Dev_6_DataPatterns_AI

Anyway, I can see from this dashboard that the columns it chose in the JCTRNS01 table have extra significance within the application logic that created this data. These columns are likely used in table relationships, or as selection criteria, or as break points in reporting. I can instantly see the full set of distinct values that exist for these various columns, and which are most heavily used.
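
Under the hood, each grid in that dashboard amounts to a grouped count. Reproducing one by hand is a one-liner; a sketch using the Type column from the dashboard (the bracketed identifier and the dbo schema are my assumptions):

-- Distinct values and usage counts for one column of JCTRNS01
SELECT [Type], COUNT(*) AS RowCnt
FROM dbo.JCTRNS01
GROUP BY [Type]
ORDER BY RowCnt DESC;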

I can also click on any grouper description and it will then be applied as a filter to all of the reports in the dashboard. This can be very useful when trying to uncover correlations between different data columns, for example: I see the “Type” value “NC” has the highest row count among the various types, other than blank, and I’m wondering how that type might correlate to “Dept” and “Cntr” values. Clicking on “NC” in the type report generates the screen below.

DAI_Blog_BI_Dev_8_DataPatterns_AI

Much to my surprise, there wasn’t really any correlation between data for type “NC” and the “Dept” or “Cntr” values; rather, I learned that Component ID “ZZ” is used almost exclusively for that type value. That is an important clue about the application logic and will lead to pertinent questions to ask the end-users.
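
That finding is easy to double-check directly in SQL; a sketch, where [Type] and [Component_ID] are my guesses at the actual identifiers:

-- Distribution of component IDs within type "NC"
SELECT [Component_ID], COUNT(*) AS RowCnt
FROM dbo.JCTRNS01
WHERE [Type] = 'NC'
GROUP BY [Component_ID]
ORDER BY RowCnt DESC;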

5) Data Auditing

In the above “AI” dashboard, suspect, bad data is visible at a glance:

  • 2 records out of a million+ that are not posted.
  • 3 that are missing an “ent method” value.
  • 75 that have a blank “component id”.

By clicking on those suspicious totals, I can instantly see the full detail rows, and from there a single record presented in a form.
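
The same suspect rows can be pulled directly with a few filters; a sketch, where every column name and flag value is an assumption read off the grid labels:

SELECT * FROM dbo.JCTRNS01 WHERE Posted = 'N';        -- the 2 unposted records (flag value assumed)
SELECT * FROM dbo.JCTRNS01 WHERE Ent_Method = '';     -- the 3 with no entry method
SELECT * FROM dbo.JCTRNS01 WHERE Component_ID = '';   -- the 75 with a blank component id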

DAI_Blog_BI_Dev_9_DataPatterns_AI

Other correlations between data columns can be discovered by using drag-and-drop to merge reports to make new combinations of groupings. This next one shows what correlation exists between the “Type” and “Ent Method” columns in the table.
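
In SQL terms, merging two grids is just grouping by both columns at once; a sketch with assumed identifiers:

-- Cross-tabulate Type against entry method to look for correlation
SELECT [Type], [Ent_Method], COUNT(*) AS RowCnt
FROM dbo.JCTRNS01
GROUP BY [Type], [Ent_Method]
ORDER BY [Type], RowCnt DESC;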

DAI_Blog_BI_Dev_10_DataPatterns_AI

Tables are rarely used in isolation. With the above techniques, particularly searching for common strings in column names and indices across all the tables, it doesn’t take long to make educated guesses about how tables relate.

I then use Dynamic AI’s “Join Builder” and pay close attention to the row count returned by the join. If the join returns more or fewer records than the count shown for the main table, it could indicate orphaned records, parent records without any children, or duplicate parent records.

I tend to favor placing the most transactional table on the left and using left joins, because that way I know all of the data in the left-most table will be returned. I will then change the join type to “inner / enforced” to expose orphans.
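
The classic check behind that technique compares the two join types and then lists the mismatches; a sketch with hypothetical header and detail tables:

-- If this count differs from the inner-join count, something is off:
-- either orphaned detail rows or duplicate header rows
SELECT COUNT(*) AS LeftJoinCnt
FROM dbo.JCDETAIL d
LEFT JOIN dbo.JCHEADER h ON h.JobNo = d.JobNo;

-- List the orphans explicitly: detail rows with no matching header
SELECT d.*
FROM dbo.JCDETAIL d
LEFT JOIN dbo.JCHEADER h ON h.JobNo = d.JobNo
WHERE h.JobNo IS NULL;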

The example below shows the results of using all left joins versus inner joins. The result set row counts match! There’s no internal inconsistency in this set of tables, as expected, since I wrote the application that created the data! But I can tell you, I rarely see a database that does not have problems: bad data, lack of referential integrity, and internal inconsistencies where totals in header records do not balance with values in the detail records, and so on.

Using Dynamic AI, I am amazed at how quickly I can learn a large database and the application design and logic that created its data, and how quickly I can discover problems in that data.

DAI_Blog_BI_Dev_11_DataAudits

Automated Intelligence can be applied to a dynamic join, or any dynamic view, just the same as to any single table or backend server view.

Finally, once I have some tables joined, I will use column actions to quickly create various filters to investigate the data patterns further. I’ll use drag-and-drop to quickly build a report with groupings and totals, very often including a count total to see how many rows feed the various totals. Watch a very short video of what that looks like.

See our video page for other demonstrations of Automated Intelligence, Join Builder and building reports and dashboards fast - with no programming!

View the SQL script referenced in the blog.