The use of enterprise search in business intelligence has created a general interest in more advanced forms of collecting and analyzing unstructured data. Although organizations have implemented text mining applications in the past, the new focus on unstructured data within the world of BI allows organizations to look beyond searching text strings towards the identification of patterns within both internal and external information sources. As data warehouses become more sophisticated and as data warehousing professionals develop strategies to deal with unstructured data, the possibilities of advanced text mining within the BI framework become increasingly viable.
The more advanced uses of unstructured data involve identifying patterns and market trends through the analysis of text-based information. For instance, patterns found in web-based sources such as blogs, chat rooms, and online magazines can show how a product is perceived by its customers, or by the media in general. Additionally, email and other organizational correspondence can be mined to identify potentially fraudulent activities or gaps in customer satisfaction levels.
Organizations considering this technology should first gain an understanding of how text mining works. This helps them understand how best to access text-based information and use it to analyze trends and patterns. Once organizations understand how text mining works, they can also determine whether it is relevant to them, and if so, how to apply it within their current BI framework.
Demystifying text mining
Applications of text mining are expansive. Information from blogs, chat rooms, magazines, etc. gives management an overall picture of their respective industry as well as a view into what is happening within their own organization. Gathered information is transformed into relevant analysis by locating patterns in text-based data that resides within the organization, on the Web, or within the public domain.
Although the concept of text mining may seem complicated, the process is easy to understand when it is broken down step by step. A step-by-step view offers insight into how text mining works and helps organizations identify which areas best suit their requirements and how best to apply text mining within their current processes.
Text mining is generally defined as the process of deriving high-quality information from text (http://en.wikipedia.org/wiki/Text_mining). The following steps identify how this occurs:
Step 1: Data Acquisition
Data acquisition entails collecting information from desired locations, either within the organization or external to it. Web crawlers and similar collection tools search for and gather content that matches the themes or topics the organization has defined as relevant. The information is gathered from document management systems, web pages, news groups and chat rooms, as well as feeds from newsletters, magazines and general news agencies.
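As a rough illustration of the acquisition step, the sketch below strips markup from pages that have already been fetched and keeps only those that mention a topic of interest. It is a minimal sketch, not a crawler: the names TextExtractor, extract_text and acquire are hypothetical, the pages are assumed to be supplied as a dictionary of URL to HTML, and simple substring matching stands in for whatever topic definition an organization actually uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def acquire(pages, topics):
    """Keep pages whose text mentions at least one topic (case-insensitive).

    pages is assumed to map a URL to already-downloaded HTML.
    """
    collected = []
    for url, html in pages.items():
        text = extract_text(html)
        lowered = text.lower()
        if any(topic.lower() in lowered for topic in topics):
            collected.append({"source": url, "text": text})
    return collected
```

Calling acquire with a small dictionary of pages and a list such as ["widget"] returns one record per matching page, each holding the source URL and the extracted text. A production crawler would also need fetching, politeness rules and deduplication, none of which are shown here.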
Step 2: Normalization/Data Transformation
Once the information is collected, the text is “normalized”, or transformed into a standard format to help with later analysis. Currently, many applications use XML to standardize the information so that it can be utilized by databases. The normalized data identifies what resides within the body of text and captures metadata about the information, such as author name, date and data source. Metadata helps organizations classify the information into categories, or alternatively, aids in gathering more information about the subject or author.
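To make the normalization step concrete, the following sketch wraps one collected document in a small XML record that separates the body from its metadata. The element names (document, metadata, author, date, source, body) and the function name normalize are assumptions chosen for illustration; they do not follow any particular product's schema.

```python
import xml.etree.ElementTree as ET


def normalize(doc):
    """Wrap one collected document in a small XML record with metadata.

    doc is assumed to be a dict with a "text" key and optional
    "author", "date" and "source" keys.
    """
    record = ET.Element("document")
    meta = ET.SubElement(record, "metadata")
    for field in ("author", "date", "source"):
        # Missing metadata is recorded explicitly rather than dropped.
        ET.SubElement(meta, field).text = doc.get(field, "unknown")
    # Collapse runs of whitespace so later steps see uniform text.
    ET.SubElement(record, "body").text = " ".join(doc["text"].split())
    return ET.tostring(record, encoding="unicode")
```

Because every record has the same shape, later steps can read the author, date and source without knowing where the text originally came from.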
Step 3: Filtering/Classification
After the text is captured and structured, it undergoes a classification of sorts, known as filtering. Filtering identifies potentially relevant information based on the defined criteria. This is accomplished by applying statistical classification methods that go through the information to identify likely matches. Once the data has been captured and put into a structured format, mining can occur.
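One common statistical classification method is naive Bayes, and the sketch below uses it as an example of how filtering might work. The article does not name a specific method, so this is a stand-in rather than the method any given tool uses. The class name NaiveBayesFilter, the two labels and the tokenizer are all hypothetical, and the sketch assumes at least one training example has been supplied for each label.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[tok] + 1) / (total + vocab_size))
        return score

    def is_relevant(self, text):
        tokens = tokenize(text)
        vocab = set()
        for label in self.LABELS:
            vocab.update(self.word_counts[label])
        relevant = self._log_score(tokens, "relevant", len(vocab))
        irrelevant = self._log_score(tokens, "irrelevant", len(vocab))
        return relevant > irrelevant
```

After training on a handful of labeled examples, is_relevant returns True for text that looks more like the relevant examples than the irrelevant ones. Real deployments would need far more training data and evaluation than this sketch suggests.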
Step 4: Mining/Pattern Recognition
Mining is the identification and pattern recognition phase of the process. The collected data is analyzed using criteria such as relevant topics. The text is then compared to pre-defined concepts associated with the entity of interest, such as a product or a company. Finally, the relationships between entities are compared to identify recurring patterns and repeated text usage.
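The comparison against pre-defined concepts, and the search for recurring relationships between them, can be sketched as follows. The CONCEPTS dictionary is entirely hypothetical, as are the function names concepts_in and mine. Counting how often two concepts appear in the same document is only one simple way to approximate a relationship pattern.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical concept definitions: each concept is a set of trigger words.
CONCEPTS = {
    "price": {"price", "expensive", "cheap", "cost"},
    "quality": {"quality", "broke", "durable", "faulty"},
    "support": {"support", "helpdesk", "service"},
}


def concepts_in(text):
    """Return the sorted names of the concepts mentioned in one document."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(name for name, terms in CONCEPTS.items() if words & terms)


def mine(documents):
    """Count concept mentions and how often two concepts co-occur."""
    concept_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        found = concepts_in(text)
        concept_counts.update(found)
        # Each unordered pair of concepts in a document counts once.
        pair_counts.update(combinations(found, 2))
    return concept_counts, pair_counts
```

A high count for a pair such as ("price", "quality") would suggest that customers tend to discuss the two together, which is the kind of recurrence this step is meant to surface.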
Step 5: Data Analysis
The mining phase produces output in a structured data format that the organization can use for analysis. Text-based information in its natural format, whether within documents, magazines, blogs or web pages, cannot readily be analyzed for patterns. Analysis can only occur once the information has been processed into a structured format.
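As an illustration of what a structured format might look like, the sketch below rolls mined results up into rows of month, concept and mention count, and writes them as CSV so they could be loaded into a database or spreadsheet. The input shape (pairs of an ISO date string and a concept name) and the function names to_rows and to_csv are assumptions made for this example.

```python
import csv
import io
from collections import Counter


def to_rows(mined):
    """Aggregate (date, concept) pairs into one row per month and concept.

    Dates are assumed to be strings in YYYY-MM-DD form.
    """
    counts = Counter((date[:7], concept) for date, concept in mined)
    return [
        {"month": month, "concept": concept, "mentions": n}
        for (month, concept), n in sorted(counts.items())
    ]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["month", "concept", "mentions"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Once the results are in rows like these, they can be queried, joined and charted in the same way as any other tabular data.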
Step 6: Data Visualization
Data output, or data visualization, is the final step. At this point, a visualization tool, such as charts or graphs, or a digital dashboard, is used to show results and help organizations identify patterns over time, as well as identify emerging trends.
Text mining in BI
Business intelligence transforms data into valuable information that can be accessed to help decision makers develop strategic, tactical and operational planning initiatives. The identification and capture of information enables organizations to identify the right data to help with the decision-making process. In addition to traditional data found in databases, other sources of information found in non-traditional sources can be equally, if not more, valuable. For instance, organizations can pool customer sentiment information from CRM applications and gather relevant information from external sources. The combination of pattern identification from a wide range of data sources can offer organizations more insight into how customers really feel about the products and services offered. Alternatively, organizations can mine the Internet to identify product perception, draw comparisons to current sales figures, and forecast sales based on the convergence of both analyses.
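One simple way to compare mined product perception with sales figures is to measure how closely the two series move together. The sketch below computes a Pearson correlation between, say, a monthly sentiment score and monthly sales. The function name pearson is hypothetical, both series are assumed to be aligned by period, and a correlation on its own is not a forecast or evidence of cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)
```

A value near 1 would indicate that sentiment and sales tend to rise and fall together, a value near -1 that they move in opposite directions, and a value near 0 that there is no clear linear relationship.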
As organizations increasingly mine text-based information, presenting it in a form that can be analyzed becomes more important. Business intelligence tools offer a wide array of visualization options for presenting mined text results in a user-friendly way, helping organizations identify trends and report on findings to benefit planning. Beyond the actual presentation of this data, business intelligence allows organizations to store the data in a data warehouse. There it can be combined with other data to draw an overall picture that spans both structured and unstructured data.
Organizations’ progress in developing text mining capabilities within their business intelligence environment depends on the value they see in pattern identification and the rewards they expect from unstructured data. As businesses come to understand how text mining can be employed and what benefits it brings, adoption becomes quite realistic. Forward-thinking organizations are now at the threshold of identifying how text mining fits within their current business intelligence framework.
(Copyright 2007 - Dashboard Insight - All rights reserved.)
About the Author
Lyndsay Wise is a senior research analyst for the business intelligence and business performance management space. For more than seven years, she has assisted clients in business systems analysis, software selection and implementation of enterprise applications. She is a monthly columnist for DMReview and writes reviews of leading technologies, products and vendors in business intelligence, data integration, business performance management and customer data integration.