"I've learned that mistakes can often be as good a teacher as success. "
Working with Unstructured Data
What is unstructured data?
In general, any kind of data can be classified into one of the following three categories:
- Structured Data
- Semi-Structured Data
- Unstructured Data
Structured data is any data kept in an electronic record where each piece of information has an assigned format and meaning.
Semi-structured data is a form of structured data that does not conform to the formal structure of tables and data models associated with databases, but nonetheless contains tags or other markers that separate semantic elements and define hierarchies of records and fields within the data.
Unstructured data (or unstructured information) is (usually computerized) information that either does not have a data model or has one that is not easily usable by a computer program. Put differently, it is any data stored outside a formatted database of numbers and letters. This can include e-mail messages, complicated reports, presentations, voice mail, still images, and video.
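The difference between the three categories can be illustrated with a minimal Python sketch. The sample records, field names and the regular expression below are made up for illustration and do not come from any real system. The point is only that a program can address structured and semi-structured values by name, while it has to guess where a fact sits in free text.

```python
import csv
import io
import json
import re

# Structured: every field has a fixed position, name and format.
structured = "customer_id,amount,due_date\n1042,250.00,2024-03-01\n"
row = next(csv.DictReader(io.StringIO(structured)))
amount = float(row["amount"])

# Semi-structured: no fixed table schema, but keys mark what each value means.
semi_structured = '{"customer": {"id": 1042, "notes": ["called twice"]}}'
record = json.loads(semi_structured)
customer_id = record["customer"]["id"]

# Unstructured: free text; the pattern below is a guess about how an amount is written.
unstructured = "Customer 1042 called and said a payment of $250 will be late."
match = re.search(r"\$(\d+(?:\.\d+)?)", unstructured)
amount_from_text = float(match.group(1)) if match else None

print(amount, customer_id, amount_from_text)
```

The regular expression only works as long as amounts are written with a dollar sign. That fragility is the reason unstructured data needs the mining techniques discussed in this article.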
Information as a prerequisite for the decision-making process
Managers face a number of decisions that have to be made on a daily basis. The major input for any kind of decision is information, no matter whether the decision is strategic, tactical or operational, important or less important.
Getting information from structured data is a relatively easy and almost straightforward process. The problem is that only 15% of the data in an organization is structured. What about the remaining 85%: how much relevant information is stored in this chunk of data? How many high-quality decisions can be made by leveraging only 15% of the information potential?
The following list shows some examples of unstructured data across the different functions in an organization:
- Marketing: Ads, spreadsheets, targets, accounts, forecasts, webinars, seminars, conferences, booth notes, feedback, customer contact notes
- Operations: Manufacturing runs, defective products, reservations, claims processing, precious goods store, delivery notes, scheduling notes
- Sales: Sales leads, sales calls, sales meetings, sales forecasts, spreadsheets, performance evaluations, customer meetings
- Shipping: Delivery directions, fragile specifications, cooling temperature specifications, time of delivery specifications, speed of delivery specifications, tracking
- Accounting: Spreadsheets, notes, Word documents, audit trails, account descriptions
- Call center: Conversations, notes, replies
- Engineering: Bill of material, engineering changes, production archives, design specs
- Finance: Notes, annual reports
- Human Resources: Emails, letters, hiring offers, termination documentation, evaluations, job specifications, employee manuals, holidays, policies
- Legal: Agreements, amendments, proposals, contracts, meeting notes, telephone transcripts, patents, trademarks, nondisclosure agreements
Therefore, the choice is clear. There are two options:
- You can utilize the information potential of unstructured data (which is not easy) and maximize the input base for your decisions by using both sources, structured and unstructured data
- You can base your decisions on structured data only
This article considers the first option: how to utilize the information potential stored in different types of unstructured data using data mining techniques (the techniques most commonly used for dealing with unstructured data).
Text mining
In reality, a huge portion of data is stored in text databases such as news articles, research papers, books, digital libraries, email messages and Web pages. Nowadays most of the information in government, industry, business and other institutions is stored electronically in the form of text databases. Traditional information retrieval techniques are inadequate for the increasingly vast amounts of text data. If we don't know what is in a document, we cannot formulate effective queries for analyzing it and extracting useful information. Therefore, text mining has become an increasingly popular and essential theme in data mining.
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is linking the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation. The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts.
There are many approaches to text mining, which can be classified from different perspectives based on the inputs to the mining process and the tasks to be performed. In general, there are three major approaches:
- The keyword-based approach, where the input is a set of keywords or terms
- The tagging approach, where the input is a set of tags
- The information-extraction approach, where the input is semantic information such as events or facts
How does it work?
For instance, the keyword-based approach collects sets of keywords or terms that frequently occur together and then finds association or correlation relationships among them. In a document database, each document can be viewed as a transaction, while the set of keywords in it can be considered the set of items in that transaction.
For example, you can extract all the names of people and companies that occur in news text surrounding the topic of wireless technology to try to infer who the players in that field are.
To take another example, if a customer says, "I can't pay because a tree fell on my house," it is suddenly clear that this is not a "bad" delinquency but rather a sales opportunity for a home loan.
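The keyword-based approach described above, with documents as transactions and keywords as items, can be sketched in a few lines of Python. This is a minimal illustration and not a production algorithm: the toy corpus, the function names `frequent_pairs` and `confidence`, and the support threshold are all assumptions made for the example. It counts keyword pairs only rather than running a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is reduced to its set of keywords,
# so a document acts as a transaction and a keyword as an item.
documents = [
    {"wireless", "network", "security"},
    {"wireless", "network", "router"},
    {"wireless", "security"},
    {"database", "security"},
    {"wireless", "network"},
]

def frequent_pairs(docs, min_support):
    """Return keyword pairs whose support (share of documents) is >= min_support."""
    pair_counts = Counter()
    for keywords in docs:
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    return {pair: count / len(docs)
            for pair, count in pair_counts.items()
            if count / len(docs) >= min_support}

def confidence(docs, antecedent, consequent):
    """Share of the documents containing `antecedent` that also contain `consequent`."""
    with_antecedent = [d for d in docs if antecedent in d]
    if not with_antecedent:
        return 0.0
    return sum(consequent in d for d in with_antecedent) / len(with_antecedent)

pairs = frequent_pairs(documents, min_support=0.4)
for (a, b), support in sorted(pairs.items()):
    print(f"{a} & {b}: support={support:.2f}, "
          f"confidence({a} -> {b})={confidence(documents, a, b):.2f}")
```

In this toy corpus the pair "network" and "wireless" appears in three of five documents, and every document that mentions "network" also mentions "wireless". That is the kind of association the keyword-based approach looks for.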
Mining the World Wide Web
The World Wide Web (WWW) is a huge global information service center for news, advertisements, consumer information, financial management, education, and many other information services. When we consider the WWW as an information source, we should be aware of the following points:
- The WWW is too huge for effective data warehousing and data mining
- Web pages are more complex than traditional text documents
- The Web is a highly dynamic information source
- Only a small portion of the information on the Web is truly useful (99% of the information on the Web is useless to 99% of users)
There are many Web search engines based on keyword indexing that help users find the information they need. However, this is not sufficient for effective web search and discovery. More advanced approaches use data mining techniques that search for Web structures, rank the importance of Web contents, and mine Web access patterns. In general, web mining tasks can be classified into three categories:
- Web content mining: the process of discovering useful information from text, image, audio or video data on the web. Web content mining is sometimes called web text mining, because text content is the most widely researched area. The technologies normally used in web content mining are NLP (natural language processing) and IR (information retrieval).
- Web structure mining: the process of using graph theory to analyze the node and connection structure of a web site.
- Web usage mining: the process of finding out what users are looking for on the internet.
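To make web structure mining more concrete, the sketch below treats pages as nodes and hyperlinks as edges and ranks pages by importance. It uses a simplified PageRank-style iteration, which is one well-known way to rank pages from link structure. The article itself does not name a specific algorithm, so this choice is an assumption, and so are the link graph, the damping factor and the iteration count.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home": ["products", "blog"],
    "products": ["home"],
    "blog": ["home", "products"],
    "orphan": ["home"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank-style scores by repeatedly passing rank along links."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

scores = pagerank(links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

The sketch assumes that every link target also appears as a key in the graph. In this example "home" ends up with the highest score because every other page links to it, while "orphan" has no incoming links and keeps only the baseline share.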
The biggest benefits of web mining are in e-commerce and personalized marketing. Moreover, information obtained from web mining can be used in the prevention of criminal activities and terrorism. However, one issue with this technology is the invasion of privacy: information concerning an individual can be obtained, used, or disseminated without their knowledge.
Multimedia Mining
There are three types of multimedia data: audio data, image data and video data. Three main pattern discovery approaches appear to have been used for automatic annotation in multimedia data mining. These approaches differ primarily in how external knowledge is provided to mine concepts. The first approach assigns keywords to the data or classifies it. The second approach works through clustering: multimedia documents are clustered first, and the resulting clusters are then assigned keywords by an annotator. The third approach does not rely on a manual annotator; it tries to mine concepts from contextual information.
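The second approach, clustering first and annotating the clusters afterwards, can be sketched as follows. This is only an illustration under several assumptions: the two-dimensional feature vectors are invented, plain k-means stands in for whatever clustering method a real system would use, and the keywords "sky" and "night" play the role of the human annotator's labels.

```python
import math

# Hypothetical 2-D feature vectors, e.g. (average brightness, share of blue pixels),
# assumed to have been extracted from six images beforehand.
features = [
    (0.90, 0.80), (0.85, 0.75), (0.95, 0.70),   # bright, blue images
    (0.20, 0.10), (0.15, 0.20), (0.25, 0.15),   # dark images
]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment

# Fixed starting centroids keep the sketch deterministic.
clusters = kmeans(features, centroids=[(1.0, 1.0), (0.0, 0.0)])

# The annotator labels each cluster once; every member inherits the keyword.
cluster_keywords = {0: "sky", 1: "night"}
annotations = [cluster_keywords[c] for c in clusters]
print(annotations)
```

The benefit of this approach is that the annotator labels two clusters instead of six individual images. With a real collection the saving grows with the number of documents per cluster.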
Multimedia Data Mining (MDM) is a part of multimedia technology, which covers the following areas:
- Media compression and storage.
- Delivering streaming media over networks with required quality of service.
- Media restoration, transformation, and editing.
- Media indexing, summarization, search, and retrieval.
- Creating interactive multimedia systems for learning/training and creative art production.
- Creating multimodal user interfaces.
Conclusion
- Roughly 85% of the total information potential is in the form of unstructured data
- Discovering valuable information in unstructured data is a complex task that requires special techniques (data mining is the dominant technique for dealing with unstructured data)
- The benefit of using all forms of available data (structured, semi-structured and unstructured) in the decision-making process is better decisions that leverage the full information potential