In every enterprise, a wealth of invaluable information exists in unstructured textual form, but organizations have found it difficult or impossible to access and utilize most of it. Although the number varies greatly by industry, it is generally accepted that about one third of vital corporate data is unstructured. (In some of the more service-related industries it is recognized that unstructured information makes up 80 percent of the enterprise’s data.) Business intelligence experts are in uniform agreement that without the contextual information that is provided by unstructured and semi-structured data, a complete view of corporate performance will not be possible.
Although corporate databases and analytical systems may consist of highly available data, with a supreme amount of integrity and metadata, the complete picture of corporate data will most likely not be attained if large swaths of text and free-form information are not expediently primed and tapped. The best possible key performance indicators (KPIs) will be free to evolve only when blocks of unstructured text - much of it being created outside the boundaries of the enterprise (from such things as customer correspondence and satisfaction surveys) - become embedded in the BI solutions stack. Unfortunately, bringing an inchoate mass of information, devoid of structure, into the structured world continues to be a mission full of peril and complexity. The biggest initial hurdle organizations face in dealing with and deriving value from unstructured data is their inability to properly leverage existing IT infrastructure and data architecture.
Structured data always behaves in an orderly manner and steadfastly conforms to rules; it is represented by a clearly defined fabric of numbers, tuples, columns, labels, tags, style sheets, etc., and has a well-formed disposition which easily lends itself to repeatable analysis and presupposition. Unstructured data, on the other hand, has no readily discernible format or anatomy and can be found just about anywhere: in spreadsheets, transcribed phone or meeting conversations, guest rating cards, bills of materials, purchase orders, design specifications, traffic tickets, patent applications, etc. Adding to this complexity, this data is often intertwined and fused with data of a highly definable structure, with little uniformity or consistency in how the two types of data interact with or depend on each other.
While the focus on unstructured data usually revolves around text-based information - such as legal contracts, warranties, spreadsheets, brochures, marketing materials, annual reports - there is often a wealth of important knowledge that resides in non-textual formats that consist of pixels and sounds. Phone conversations, voice mails, digital photographs and static pictures are all examples of this type of information, which will always be much more difficult to work with than static text. In many cases, only extremely customized software will be able to render or transform this information into a useful format for business analysis and corporate intelligence. The narrative found on medical X-rays and scientific photographs (such as geological surveys) will pose problems far beyond large volumes and cryptic formatting; it will most often contain information that can only be understood with a firm grasp of a specialized lexicon or field-specific terminology. Such information can make up the very foundation for a particular industry and will form the basis for much of its compliance and legal operations.
Achieving a degree of success in harnessing the potential of unstructured data is something that continues to bedevil many IT managers. Enterprises need to better understand the nature of the beast that they are trying to tame, and must not treat these projects in the same manner as normal data integration missions. Innovative methodologies are required, with project plans that break down into distinct phases:
- Plan – It is vital to properly assess current architecture and devise a plan for sourcing and discovering the unstructured data that will best leverage the current IT infrastructure. What tools can be used? What tools must be purchased or developed? What hidden costs may arise in the forms of architectural additions or business interruptions? Companies will need to answer some intricate questions before they attempt to identify the silos of information that will provide the most ROI to their cause. And after examining specific unstructured data for each business segment, business process or application, they will have to understand the full value chain and projected supply chain of the data. For example: Does the data easily lend itself to metrics, KPIs and forms of quality assessment and management?
- Discover – Once a methodology is in place and there is agreement on what business segments and processes hold the most valuable unstructured data, a detailed inventory and discovery of data points can commence at a specific silo or business-unit level. Here data can be tagged as “data of interest and value” and be better positioned for extraction operations that will ultimately render it in a more structured form. A process of pre-classifying unstructured data will be necessary to achieve optimal performance in transposing that data to a new and more normalized format. Because the discovery phase will give organizations a better understanding, not just of the data landscape, but also of the corresponding processes that use and consume this data, opportunities for business process reengineering and process innovation will invariably come to light.
- Extract, Transform and Translate – Once unstructured data is identified and visible, the heavy lifting can start, assuming that issues in the previous two phases have been resolved - most importantly that a proper data platform and toolset are in place and resources for extraction have been identified and are readily available. Techniques for mining and revamping the data will vary greatly and depend upon the ontology and heritage of the unstructured data in question. Every kind of unstructured data will have its own idiosyncratic behaviors and character traits and thus have different needs with respect to storage and formatting. There will be wildly divergent degrees of difficulty in reading and comprehending each silo of data, both from a manual (everyday human reading) and electronically enabled systems perspective. There is often little commonality or semblance of behavior, storage, structure or format between silos of unstructured data. Unstructured data will be as diverse as each and every business process that created or consumed the piece of data. Because of this, a very robust process control wrapper will be needed to trap a potentially large number of exceptions or anomalies that occur during any text reading/translation process.
- Store – After unstructured data has morphed into an orderly and well-behaved piece of information, it must be stored and archived in a format where it can be easily retrieved and reintroduced into multiple corporate business streams, both operational and analytical. No matter what format the data will ultimately be persisted in (a relational database management system (RDBMS), an electronic document management system, XML format, etc.) the challenges for storing data will be numerous. But with these challenges come unprecedented opportunities. For example, companies may find themselves integrating their newly transformed (and now structured) data with repositories of unstructured data that share a similar origin or business process. This newly structured data will almost always need to be classified and identified by an original set of metadata. Transforming unstructured and unclassifiable data into a structured format without a clear metadata strategy will be a recipe for failure or future disasters.
- Analyze – Business intelligence and corporate quality management initiatives are frequent drivers of projects which aim to explore and leverage unstructured data elements. In order to ensure well-rounded project success, decisions must be made on the analysis framework before the first piece of unstructured data undergoes its metamorphosis. While today’s business dashboards and BI platforms can be augmented with text-mining applications and functional tool sets that can operate directly (and in real time) on federated unstructured text, experience has shown that the most powerful analysis paradigms result only from deep scrubbing and transformation of unstructured data that occurs in multiple steps and transitions, through various staging areas where metadata can be applied and exceptions can be efficiently handled.
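As a rough illustration of the phases above, the following Python sketch shows one way a "process control wrapper" might pass documents through successive stages, record each completed stage as metadata, and trap exceptions rather than abort the run. The names `Record`, `run_stages`, `extract` and `classify`, as well as the keyword rule used for classification, are hypothetical and chosen only for illustration; a real project would substitute its own extraction and classification tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One piece of source text plus the metadata it gathers in staging."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def run_stages(records, stages):
    """Run every record through each (name, function) stage in order.

    A failure in any stage is trapped and the record is set aside with the
    stage name and error message, so one bad document cannot stop the batch.
    """
    passed, rejected = [], []
    for record in records:
        try:
            for name, stage in stages:
                record = stage(record)
                # Keep an audit trail of the stages this record has cleared.
                record.metadata.setdefault("stages", []).append(name)
            passed.append(record)
        except Exception as exc:
            rejected.append((record, name, str(exc)))
    return passed, rejected


def extract(record):
    """Hypothetical extraction step: trim whitespace, reject empty text."""
    record.text = record.text.strip()
    if not record.text:
        raise ValueError("empty document")
    return record


def classify(record):
    """Hypothetical pre-classification step based on a simple keyword rule."""
    lowered = record.text.lower()
    is_complaint = "refund" in lowered or "broken" in lowered
    record.metadata["category"] = "complaint" if is_complaint else "general"
    return record


if __name__ == "__main__":
    batch = [
        Record("survey-1", "  The unit arrived broken.  "),
        Record("survey-2", "   "),
    ]
    ok, failed = run_stages(batch, [("extract", extract), ("classify", classify)])
    for rec in ok:
        print(rec.source, rec.metadata)
    for rec, stage_name, error in failed:
        print(rec.source, "rejected at", stage_name, "-", error)
```

Rejected records are kept alongside the stage that failed, which mirrors the idea of staging areas where exceptions can be reviewed and reprocessed in a later iteration.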
For analytical processing against unstructured textual data, it is necessary to overcome or address several obstacles. Some of these challenges follow:
- Because no text scanning software is perfect, the process of converting data from paper or voice to e-format may involve a good deal of manual intervention and manual gap analysis in order to assure success. After all, intelligibility can vary greatly depending on an interlocutor’s inflection, vocabulary and speech/writing style. IT staff will need to know the implicit shortcomings of their software, such as how it handles permutations and similarities of words, including misspellings, capitalization and punctuation. Multiple iterations of data transformation runs may be in order, along with multiple staging areas to hold the data, as it goes from poor to good.
- For companies with a global footprint, special attention must be paid to the effects of foreign languages. Unstructured text will often lurk in languages other than English, such as Chinese, Japanese and Arabic. This will result in extra layers of complexity for ETL (extract, transform and load) tasks. Because of this, there will often be extra steps in all transformation processes: In addition to the primary data extraction job, a routine that translates the foreign language in question to English will be needed.
- Care must be taken not to get tripped up by terminology and semantics. Pieces of information will always mean different things to different people; in the unstructured data world, the words and language used to describe a granular piece of data will inevitably differ among business units. You will struggle with many lexical and semantic rationalization issues that may be far more convoluted than expected.
- Pay close attention to the dependencies of unstructured data on operational business processes. Unstructured data can have tangled interdependencies that must be understood in order to best unlock the business potential of the data. These dependencies will, more often than not, exist at the field level on various forms and documents.
- By nature, unstructured information tends to be more open and less subject to the types of security controls we associate with electronic data. When inventorying and discovering areas of unstructured data, there will inevitably be opportunities to better examine the security deficiencies of important business documents. You will be able to glean an improved perspective on how unstructured data is handled and distributed throughout the enterprise, and subsequently propose a more secure means of storage and access.
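To make the first challenge above more concrete, here is a minimal Python sketch of how variations in capitalization, punctuation and spelling might be reconciled against a controlled vocabulary. It uses `difflib.get_close_matches` from the Python standard library; the vocabulary, the 0.8 similarity cutoff and the function names are assumptions made for illustration and are not drawn from any particular product.

```python
import difflib
import string

# Hypothetical controlled vocabulary agreed upon by the business units.
VOCABULARY = ["warranty", "invoice", "refund", "delivery"]

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_token(token):
    """Remove punctuation and fold case so surface variants compare equal."""
    return token.translate(_STRIP_PUNCTUATION).casefold()


def match_terms(text, vocabulary=VOCABULARY, cutoff=0.8):
    """Map each word in the text to its closest vocabulary term, if any.

    Returns (matched, unmatched): matched pairs each raw word with a
    vocabulary term; unmatched lists raw words left for manual review.
    """
    matched, unmatched = [], []
    for raw in text.split():
        token = normalize_token(raw)
        if not token:
            continue  # the word consisted only of punctuation
        hits = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if hits:
            matched.append((raw, hits[0]))
        else:
            unmatched.append(raw)
    return matched, unmatched


if __name__ == "__main__":
    found, review = match_terms("Warrenty, REFUND! please")
    print(found)
    print(review)
```

Words that fall below the cutoff are returned separately rather than discarded, reflecting the point that several transformation passes and some manual gap analysis are usually needed. Raising or lowering the cutoff trades missed matches against false ones and would need tuning on real data.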
The world of information will not become less complex any time soon. The more we communicate with one another in a business or social context, the more unstructured data becomes intertwined with and dependent on structured data and processes. Some industries will have more unstructured data than others; likewise, there will be a huge variance in the amount of unstructured data between different business segments and functional units inside an organization. Thus, it is extremely difficult to generalize how to approach unstructured data or agree on how to best discover, transform, store and analyze this information.
One certain fact, though, is that in almost all organizations, there is a tremendous need to better understand how to incrementally leverage existing technology investments (which operate on structured data sets) so that they can process their universe of unstructured data - from discovery, to sourcing, to transformation, to storage, to analytics and dashboard business intelligence solutions. Raw unstructured data and non-rationalized text cannot be hastily placed into a structured world and expected to retain their initial meaning or usefulness.
To be effective, unstructured information must undergo both transformation and integration processes that go far beyond what is normally employed in the enterprise. This will usually not be an inexpensive thing to accomplish; it will require a sophisticated technical architecture and resources that have experience in bridging the chasm between the structured and unstructured domains and textual analytics. Unstructured data projects will be a true cross-business undertaking and require a strong leader who can drive the project from both a data architecture and functional business perspective.
About the Author
William Laurent is one of the world's leading experts in information strategy and governance. For 20 years, he has advised numerous businesses and governments on technology strategy, performance management, and best practices across all market sectors. William currently runs an independent consulting company that bears his name. In addition, he frequently teaches classes, publishes books and magazine articles, and lectures on various technology and business topics worldwide. As Senior Contributing Author for Dashboard Insight, he would enjoy your comments at firstname.lastname@example.org
Copyright 2009 - Dashboard Insight - All Rights Reserved.