Everyone knows the 80-20 rule – that fantastic rule of thumb that guides many of us through our daily decision-making lives. One way to interpret the rule is that relatively few things (the 20%) are important and that if we focus on understanding or managing those things, we’ll get 80% of the value. When it comes to information management, common wisdom suggests that the 20% of enterprise data that is structured represents 80% of the value. Yet with the explosion of information from new sources such as the Web, structured data alone is not sufficient. With new techniques for managing unstructured content (such as text, Word, PDF, XML, image, and audio files) and semi-structured content (unstructured content with some meta-data describing the document’s content), it’s becoming clear that the value of this information can be accessed by all organizations.
Structured Data – Essential, but Not Enough
Most corporations rely on their enterprise applications, such as an ERP or CRM system, or their data warehouse to supply information to decision makers for daily or strategic decisions. Generally speaking, the information captured in these systems is structured data – it was generated to fit into a pre-determined data model or was entered into an electronic form that had fields, such as city and state, which characterized the data elements being entered. There was a time when this information drove efficiencies and helped differentiate organizations, but today these systems are present in almost every large organization and, while this type of information is essential to running the business, it is becoming an undifferentiating commodity.
If you look across just about any business function or industry, to understand what really happened last period, what is happening at the moment, or what might happen next, an organization must focus its energies on the finer points. Consider these examples:
- A restaurant chain sees same-store sales drop in one location, but rise in other locations in the region. The data from its financial system shows the sales trend, but doesn’t explain why. It turns out that a number of customers have questioned the cleanliness of the restaurant on an influential local Web blog.
- A manufacturer wants to understand what is leading to higher-than-expected warranty complaints for its new product line. The narrative of the customer complaint and the free-form notes from the distributor describe the issue in the customer’s and distributor’s own words, identifying a problem that had not appeared before and was not anticipated.
- Regional contact centers for a financial services company field questions about the company’s new website trading tools. The number of calls, the reason for each call, and the detailed customer feedback are captured in the company’s CRM system, but the weekly customer service report doesn’t aggregate the varied customer product suggestions. It could be months before the company realizes customers have recommended a number of easy-to-implement improvements that would lead to even more trades.
It’s clear from these examples that unstructured content (i.e., the text on the Web blog, the narrative of the customer complaint, and the customer suggestion entered into the CRM system) completes the story and leads the decision maker from equivocal evidence to a firm conclusion. In fact, while the value of unstructured content is rising on its own, its real value is found when it is combined with the core structured data of the organization to provide a comprehensive view of the business situation. For example, the realization that customer sentiment is negative is even more valuable when combined with recent sales transactions and customer information to understand how that sentiment has affected the demographics of the audience or the mix of products and services that are purchased.
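To make the restaurant example concrete, the following is a minimal sketch of combining the two kinds of data. The store names, sales figures, and sentiment scores are all hypothetical, and the sentiment scores are assumed to come from a separate text analysis step.

```python
from collections import defaultdict

# Hypothetical structured data: weekly sales per store from a financial system.
weekly_sales = {
    "store_12": [41000, 40500, 33000],
    "store_15": [38000, 39500, 41000],
}

# Hypothetical text analysis output: (store, sentiment score) per blog post,
# where a negative score means negative sentiment.
post_sentiment = [
    ("store_12", -0.8),
    ("store_12", -0.6),
    ("store_15", 0.4),
]


def average_sentiment(posts):
    """Average the sentiment scores recorded for each store."""
    scores = defaultdict(list)
    for store, score in posts:
        scores[store].append(score)
    return {store: sum(vals) / len(vals) for store, vals in scores.items()}


def stores_needing_attention(sales, posts):
    """Return stores whose sales fell while average sentiment was negative."""
    sentiment = average_sentiment(posts)
    flagged = []
    for store, series in sales.items():
        sales_fell = series[-1] < series[0]
        if sales_fell and sentiment.get(store, 0.0) < 0:
            flagged.append(store)
    return flagged


print(stores_needing_attention(weekly_sales, post_sentiment))
```

Neither data set alone flags the store: the sales figures show a drop without a cause, and the blog posts show displeasure without a measurable business effect.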
Historically, few organizations have made efficient use of unstructured content. It is not that the data wasn’t captured. It was clear that there was some value in this type of information and, therefore, it was recorded and stored. But the ability to find the information you needed and make decisions based on it was extremely limited. The tools that extracted value from the raw information were only somewhat effective, and while simple search tools allowed users to find the raw information, they left decision makers with a mountain of information to sift through. The result was inaction. Storing the data was somewhat like creating a time capsule – organizations put the information in, didn’t touch it for a long time, and hoped that somewhere down the road technology would unlock it and make sense of what was there.
New Generation of Tools Helps Manage Unstructured Content
To manage unstructured content, you need several core tools and technologies, including:
- Data acquisition tools – a variety of web crawlers, file system crawlers, and connectors for content management systems and databases. These tools find a wide variety of unstructured content stored in various systems across the enterprise and the Web and open the content so that it can be enhanced by text analysis tools.
- Text analysis tools – these tools make unstructured content more valuable by extracting the meaning of the text automatically and adding structure to the content so it can be classified, aggregated, and analyzed.
- Database management system – a database that can store and manage unstructured content and structured data fields equally well. This system makes unstructured content available to people doing the analysis.
- Search technologies – this collection of technologies, often packaged as a single search engine or incorporated into a specialized analytic application, helps users find the desired content held in the database management system.
Over the past few years, a number of innovations in text analysis, database management systems, and search technologies have made managing unstructured content feasible and more valuable. These include:
Entity Extraction – Text analysis tools have greatly improved their ability to identify entities (such as people, products, places, or organizations) included in the unstructured content and make them available as meta-data that describes the content. This makes the content “searchable” so that the warranty engineer investigating complaints resulting in a “battery” replacement will find all the unstructured content that discusses batteries. In particular, the addition of grammatical analysis, which accurately identifies the entity behind a pronoun-based reference, makes more content available for analysis. Grammatical analysis unlocks the value in a customer statement such as, “I had a problem with that product too,” by making it clear which product the statement concerns.
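As a rough illustration of the idea, the sketch below tags messages with entities from a small dictionary and resolves a vague reference such as “that product” to the most recently mentioned entity. The term list, the reference pattern, and the function names are assumptions made for this example; commercial text analysis tools rely on far more sophisticated grammatical models.

```python
import re

# Hypothetical entity dictionary; a real tool would use far richer models.
PRODUCT_TERMS = {"battery", "charger", "screen"}
REFERENCE_PATTERN = re.compile(r"\b(that|this) (product|part)\b|\bit\b")


def extract_entities(text):
    """Return the known product terms mentioned in the text, in order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in PRODUCT_TERMS]


def tag_thread(messages):
    """Tag each message with entities, resolving vague references naively.

    A message with no explicit entity but a phrase such as "that product"
    inherits the most recent entity mentioned earlier in the thread.
    """
    tagged = []
    last_entity = None
    for message in messages:
        entities = extract_entities(message)
        if entities:
            last_entity = entities[-1]
        elif last_entity and REFERENCE_PATTERN.search(message.lower()):
            entities = [last_entity]
        tagged.append({"text": message, "entities": entities})
    return tagged


thread = [
    "The battery died after two weeks.",
    "I had a problem with that product too.",
    "Shipping was fast.",
]
for row in tag_thread(thread):
    print(row["entities"], "-", row["text"])
```

With the reference resolved, a search for “battery” also returns the second message, even though the word never appears in it.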
Sentiment Analysis – One of the most significant improvements has been in the area of sentiment analysis, where the opinion (e.g., positive, negative, or neutral) expressed in the content can be automatically identified. The most important change has been the addition of entity-level sentiment, which can identify the tone or satisfaction of a customer or customer group with a company, product, or other entity. This has made Web content, customer service logs, and other unstructured sources valuable for identifying the voice of the customer. Before entity-level sentiment reached the market, document-level sentiment was the state of the art, but that limited its use to the analysis of customer surveys or other documents that were themselves focused on a single entity.
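The difference between the two levels can be seen in a small sketch. The word lists and entity names below are hypothetical, and the scoring (positive words minus negative words, attributed to entities named in the same sentence) is a deliberate simplification of what commercial engines do.

```python
import re

# Tiny hypothetical lexicon and entity list, for illustration only.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "confusing", "terrible"}
ENTITIES = {"website", "app", "support"}


def score(words):
    """Count positive words minus negative words."""
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def document_sentiment(text):
    """One score for the whole document."""
    return score(re.findall(r"[a-z]+", text.lower()))


def entity_sentiment(text):
    """One score per entity, based on the sentences that mention it."""
    scores = {}
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        for entity in ENTITIES.intersection(words):
            scores[entity] = scores.get(entity, 0) + score(words)
    return scores


review = "The website is slow and confusing. Support was fast and helpful."
print(document_sentiment(review))
print(entity_sentiment(review))
```

For this review the document-level score nets out to neutral, which hides the fact that the customer is unhappy with the website and pleased with support. The entity-level scores keep those two opinions separate.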
Fine-grained Data Storage and Self-describing Data Models – The term “unstructured content” implies that the content doesn’t fit neatly into the relational, object-oriented, or other structured data models of standard database management systems, making it difficult to unify unstructured content and structured data in a traditional database. In these traditional database management systems, unstructured content is simply stored in a text field. As the need for more flexibility has increased, a number of database management systems have emerged that handle unstructured content and structured data fields equally well. These systems can store fine-grained unstructured content (i.e., unstructured content classified into detailed entities and described by diverse dimensions) alongside highly dimensioned structured data. These systems also include a self-describing data model that is flexible enough to create a data model and/or taxonomy automatically based on the meta-data, structured dimensions, and the results of text and audio/video analysis tools.
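One way to picture a self-describing data model is a store that derives its schema from the records it receives rather than requiring one up front. The sketch below is only an illustration of that idea; the field names and records are hypothetical and do not represent any particular product’s design.

```python
def infer_schema(records):
    """Build a field -> set of type names mapping from the records themselves."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical records: structured fields alongside text analysis output.
records = [
    {"claim_id": 101, "region": "West", "narrative": "Battery failed.",
     "entities": ["battery"], "sentiment": -1},
    {"claim_id": 102, "region": "East", "narrative": "Works well now.",
     "sentiment": 1, "dealer_notes": "Replaced charger."},
]

schema = infer_schema(records)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

The second record introduces a field the first one lacks, and the schema simply grows to include it. No table definition has to be changed before the new content can be stored.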
Advanced Search – Search engines go far beyond a simple search box today, helping the user conduct a more accurate search on mountains of unstructured information. The best search engines have configurable rankings so that search results can be ordered based on a broad range of factors such as term frequency, word positions, word proximity, document date, and document popularity. In addition, search engines today provide the user with insight into the data to help them fix “close, but not right” searches and guide them to the information they need. This is accomplished by enhancements such as spell checking (corrects mistyped search terms), stemming (allows for word form variations), concept search (allows for synonyms), automatic phrasing (searches for phrases as well as the individual search terms), and context (shows not just the keyword, but the surrounding terms in the source document).
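Two of these ideas, stemming and configurable ranking, can be sketched in a few lines. The stemmer below is deliberately naive, the weights and popularity values are hypothetical, and only two ranking factors are modeled; it is an illustration of the concept rather than how any specific search engine works.

```python
import re

# Naive suffix rules, applied in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("s", "")]


def stem(word):
    """Reduce a word to a crude stem so word form variations match."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def rank(query, documents, weights):
    """Order documents by a weighted sum of simple, configurable factors."""
    terms = set(tokenize(query))
    results = []
    for doc in documents:
        tokens = tokenize(doc["text"])
        term_frequency = sum(tokens.count(t) for t in terms)
        if term_frequency == 0:
            continue  # skip documents that match no query term
        total = (weights["term_frequency"] * term_frequency
                 + weights["popularity"] * doc["popularity"])
        results.append((total, doc["id"]))
    return [doc_id for total, doc_id in sorted(results, reverse=True)]


documents = [
    {"id": "a", "text": "Battery replaced twice. Batteries keep failing.", "popularity": 1},
    {"id": "b", "text": "The battery is fine.", "popularity": 10},
    {"id": "c", "text": "Screen cracked on arrival.", "popularity": 50},
]

print(rank("batteries", documents, {"term_frequency": 1.0, "popularity": 0.0}))
print(rank("batteries", documents, {"term_frequency": 1.0, "popularity": 0.5}))
```

Because of stemming, the query “batteries” matches documents that say “battery.” Changing the weights reorders the same matches, which is what a configurable ranking lets an administrator tune.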
Faceted Navigation – This technique for presenting search results classifies the information according to entities extracted from the content and other meta-data about the content. The facets reveal what information is available in the data set and the possible analytical dimensions. In so doing, faceted navigation enhances the decision process by guiding the user to the information they need.
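A minimal sketch of faceted navigation follows: count how many results carry each facet value, then narrow the result set when the user selects one. The result records and facet names are hypothetical and stand in for meta-data that entity extraction and sentiment analysis would supply.

```python
from collections import Counter


def as_list(value):
    """Treat a single value and a list of values uniformly."""
    return value if isinstance(value, list) else [value]


def facet_counts(results, facet_fields):
    """Count how many results fall under each value of each facet."""
    counts = {field: Counter() for field in facet_fields}
    for item in results:
        for field in facet_fields:
            counts[field].update(as_list(item.get(field, [])))
    return counts


def refine(results, field, value):
    """Keep only the results that carry the selected facet value."""
    return [item for item in results if value in as_list(item.get(field, []))]


# Hypothetical search results with extracted meta-data.
results = [
    {"id": 1, "product": ["battery"], "sentiment": "negative", "region": "West"},
    {"id": 2, "product": ["battery", "charger"], "sentiment": "negative", "region": "East"},
    {"id": 3, "product": ["screen"], "sentiment": "positive", "region": "West"},
]

counts = facet_counts(results, ["product", "sentiment"])
print(dict(counts["product"]))
print([item["id"] for item in refine(results, "product", "battery")])
```

The counts are what a user sees beside each facet value, and they show at a glance which products and opinions dominate the result set before a single document is opened.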
While text analysis, database management systems, and search technologies are presented here as separate core technologies, they are usually combined into a single business intelligence or data integration platform. These platforms eliminate the complexity inherent in prior-generation systems, where a number of technology pieces had to be stitched together and sometimes combined in a multi-staged process that made keeping up with rapidly changing data nearly impossible. The platform unifies these technologies with other essential tools for managing unstructured content such as web crawlers and file system crawlers, which are able to find documents and other pools of unstructured content. Organizations can, therefore, manage the entire discovery, extraction, and analysis process from a single set of integrated interfaces.
Managing Never-ending Data Streams
How can an organization use these solutions effectively as the growth of unstructured content skyrockets?
The good news is that the current platforms for managing unstructured content have proven they can handle the load and can help identify the important needles in the data haystacks. As Jeff Catlin, CEO of text analysis vendor Lexalytics, describes, “There may be millions of pieces of information out there, but the most valuable information is the really good stuff and the really bad stuff. That’s what affects you. That’s the sort of thing that the text analytics engines are really good at and the systems can handle large amounts of data.”
The major concern for managing unstructured content, therefore, is not whether you can manage the volume, but what processes you should design to manage the volume. Platforms allow organizations to configure data integration processes that run automatically to bring in unstructured content, process it to make it searchable, and eliminate the unusable portions. The pace of that feed can be customized by the organization based on the nature of the information and the user’s ability to consume the information. For example, the VP of Marketing may want to evaluate 90 million tweets each day to see if consumers expressed a negative view of the company’s new logo, but an engineering manager may only want to evaluate the narrative comments on 10,000 warranty claims per year to see if she should change any components in the next design cycle.
Unstructured content comes in a variety of forms and from a variety of sources, and managing the content means identifying each source and determining the frequency at which that information should be fed to decision makers. Some processes may look like traditional batch processes that are initiated on a weekly or monthly basis, but other processes can be essentially continuous feeds that supply information on an hourly or daily basis.
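The idea of giving each source its own refresh frequency can be sketched as a small schedule check. The source names, intervals, and timestamps below are hypothetical; an actual platform would expose this through its own configuration rather than code like this.

```python
from datetime import datetime, timedelta

# Hypothetical sources and refresh intervals; names are illustrative only.
FEEDS = {
    "social_media": timedelta(hours=1),
    "contact_center_notes": timedelta(days=1),
    "warranty_claims": timedelta(weeks=1),
}


def feeds_due(last_run, now):
    """Return the feeds whose refresh interval has elapsed since their last run."""
    due = []
    for name, interval in FEEDS.items():
        previous = last_run.get(name)
        if previous is None or now - previous >= interval:
            due.append(name)
    return sorted(due)


now = datetime(2024, 1, 8, 12, 0)
last_run = {
    "social_media": datetime(2024, 1, 8, 10, 30),
    "contact_center_notes": datetime(2024, 1, 8, 6, 0),
    "warranty_claims": datetime(2024, 1, 1, 9, 0),
}
print(feeds_due(last_run, now))
```

In this example the hourly and weekly feeds are due while the daily feed is not, so near-continuous and batch-style sources share one process without sharing one pace.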
The New 80-20: Unstructured Content Made More Accessible
Beyond the technology enhancements that have taken place over the past few years, probably the biggest change related to the management of unstructured content and semi-structured data has been the number of cases where unstructured content has had a major business impact. These cases have not only proven that the technology works, but also given potential users important benchmarks for the value of this newfound knowledge and information.
With unstructured content within the reach of decision makers and a greater value being placed on this information, it’s clear that the 80-20 dynamic is changing when it comes to data management. Specific business processes and departments, especially customer service and marketing functions, are beginning to view unstructured content as a vital ingredient in numerous analyses.
While managing unstructured content might sound like rocket science, it is not out of the reach of most organizations. Unifying unstructured content with structured data can usually be accomplished through incremental changes that are neither difficult nor expensive. Organizations can choose to either enhance their existing data integration or BI platform or augment their structured data infrastructure with a special-purpose platform for handling unstructured content.
About the Author
David Caruso is Vice President, Supply Chain Product Management & Marketing, for Endeca Technologies, Inc., a search applications company.