Best of DBPD: Data Mining: The AI Metamorphosis


Not many of us get a second chance at a first impression. But with data mining, AI returns smarter and wiser.
Data Mining: The AI Metamorphosis
by H.P. Newquist
A decade ago, artificial intelligence (AI) was one of the hottest terms in high technology. Today, however, you could get laughed out of your career for suggesting the use of AI. A decade ago, data mining didn't exist. Today, it's one of high tech's hottest terms. But I've got news for you: Data mining is artificial intelligence.
Once you get over the blasphemy of such a statement, it's quite an easy notion to come to terms with. Not that data mining and AI are equivalent; AI is actually much broader in scope and involves a number of different technologies, such as neural networks, expert systems, voice recognition, and knowledge management. But as a practical endeavor, data mining is by and large rooted quite firmly in the field of AI, relying on techniques that were pioneered and even commercialized by the long forgotten and oft disparaged researchers and entrepreneurs of the 1980s.
The truth is that while plenty of AI technologies lived up to some of the hype surrounding their introduction way back when, they were overshadowed by the spectacular failures of gee-whiz startups and dedicated hardware. It wasn't the technologies that were incapable but the people who were selling them, which, strangely enough, brings us to data mining. Odds are that 99.9 percent of the computer industry had never heard of data mining until as recently as two years ago. If you work in a big corporation, however, it's third on the list of computer hot buttons, right after the Internet and data warehousing. And, in actuality, data mining will make data warehousing an effective tool for corporate use. Without data mining, data warehouses are little more than huge repositories of connected databases that large groups of users can access. And, as many data mining proponents claim, data mining turns databases into knowledge bases, which were the fundamental components of expert systems.
INTO THE SHAFT
Before digging into data mining's practical and applied uses, I should point out that we're talking about databases with data points numbering in the thousands and millions. Sure, you can use data mining on your personal contact list of 50 names, but that's like using a bulldozer to plant a flower garden. A shovel would be much more efficient, and in the case of your data, a simple query system would be sufficient.
Data mining is important to large systems because it finds things in large data repositories that you didn't know existed. The simplest metaphor I can think of is finding the two needles out of three that match in a haystack when you didn't even know any needles were in there. In this case, the haystack is your database, the individual lengths of hay represent your data fields, and the needles represent data fields with a relationship worth more to you than all the hay put together.
Let's get down to specifics. Fundamentally, data mining does two things with data: It finds relationships and makes forecasts. Within these two categories, data mining is good at producing the following six information types:
Classes (sometimes referred to as "classifications"). This information type consists of shared characteristics, such as how many or what percentage of all people over the age of 50 have both checking and savings accounts but no investments in mutual funds. A data mining tool must use pattern recognition to create these classes. Classes are the most common form of data mining.
Clusters (or categories). Clusters are a form of class (and thus a subset), but they consist of patterns and relationships that haven't been predefined or were "hidden." These arcane relationships could be valuable once uncovered. For example, a data mining tool might search randomly through data on credit card use and uncover the fact that 80 percent of married women use credit cards only for home furnishings or clothes and not in restaurants, while 98 percent of unmarried women use credit cards primarily in restaurants. The tool finds these relationships even though the user isn't specifically looking for them.
Associations. Unlike the previous two information types, associations are event-driven. That is, an association exists between two occurrences in an event such that the completion of one occurrence implies the existence of the next. Data miners tend to use retail analogies for this process. For example, when people buy beer, 60 percent of the time they also buy some form of snack, unless they are buying beer in case quantity, in which case more than 70 percent buy some form of bread, such as hot dog or hamburger rolls.
Sequences. Like associations, sequences are events, but they are linked over time and are relevant to a specific instance. For example, credit card holders who have requested an increase in their credit limits will usually make a larger-than-normal purchase within two weeks of that approval. Or, 87 percent of people renting a new apartment will purchase a shower curtain before they purchase a new piece of furniture and drapes, while 70 percent of new home buyers will purchase draperies before anything else. Sequences lead us directly from the relationship category to the forecasting category.
Forecasts. Just as they sound, forecasts involve predicting the future based on current and ongoing data. Forecasts are applicable to almost any corporate situation, from predicting product sales to ordering inventory, to plans for hiring personnel, to estimating corporate growth. Data mining supports forecasting by extracting all relevant data--including data that might not seem relevant to a human forecaster--and applying it, together with relevant fluctuations, to a comprehensive forecast.
Similar sequences. Although not as common as the five previous information types, similar sequences extend the concept of sequences by combining it conceptually with classes. For example, after discovering a sequence in a particular time, a user might want to find other sequences occurring at the same time (such as during stock market trading) or search for similar sequences over time.
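To make the "associations" type above concrete, here is a minimal sketch of how a tool might measure one. The basket data, the function name association_rules, and both thresholds are invented for illustration; none of them come from any vendor's product. The sketch counts how often pairs of items share a transaction, then reports "support" (the share of all transactions containing both items) and "confidence" (of the transactions containing the first item, the share that also contain the second).

```python
from collections import Counter
from itertools import permutations

# Hypothetical market-basket data: each set is one checkout transaction.
transactions = [
    {"beer", "snack"},
    {"beer", "snack", "bread"},
    {"beer", "bread"},
    {"beer", "snack"},
    {"beer"},
    {"bread", "milk"},
    {"snack", "milk"},
    {"milk"},
]

def association_rules(transactions, min_support=0.2, min_confidence=0.5):
    """Return sorted (antecedent, consequent, support, confidence) tuples
    for every ordered item pair that clears both thresholds."""
    n = len(transactions)
    item_counts = Counter()   # transactions containing each item
    pair_counts = Counter()   # transactions containing each ordered pair
    for basket in transactions:
        item_counts.update(basket)
        pair_counts.update(permutations(sorted(basket), 2))

    rules = []
    for (a, b), both in pair_counts.items():
        support = both / n                  # P(a and b)
        confidence = both / item_counts[a]  # P(b given a)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)

for a, b, support, confidence in association_rules(transactions):
    print(f"{a} -> {b}: support {support:.0%}, confidence {confidence:.0%}")
```

On this toy data, the rule "beer -> snack" comes out with 60 percent confidence, the same shape of statement as the retail example above. Commercial tools have to handle larger item sets and millions of transactions, which this pairwise count does not attempt.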
SAME GAME, NEW NAME
It's no coincidence that most data mining vendors are either brand-new startups or were associated with the AI business in a former life. The association between data mining and AI exists because the most popular forms of data mining use neural networks, rules of induction, or decision trees, all of which were first productized during AI's go-go years. HNC Software Inc., one of the largest marketers of data mining solutions, began life in the 1980s as Hecht-Nielsen Neurocomputer, a developer of neural network tools. Today, HNC markets the Database Mining Marksman, a combination software/card system that uses HNC's neural net technology for predictive modeling. The company also sells the Database Mining Workstation, a stand-alone system that interacts with corporate data as a node on a network.
Proving itself one of the most foresighted data mining vendors (after all, it survived the collapse of the AI industry), HNC has also made its first foray into that most extensive and amorphous of all databases, the Internet, with its new Convectis tool. A $100,000 piece of software that runs on Solaris workstations, Convectis will process data and text not only to create relationships among data but also to discern the context and content of Web-resident data. HNC calls this approach "content mining" and claims that it can categorize Web data (including graphics) just as data mining tools do with more structured, numerically based databases. Convectis will be marketed by Aptex, a new division of HNC. InfoSeek Corp., creator of content search engines, already plans to incorporate Convectis into its InfoSeek product to help automate the creation of topic directories.
NeuralWare Inc. is another AI veteran in the data mining field. A decade-old vendor of neural networks, it now sells NeuralWorks Predict, a neural net-based program for data mining on PC and Unix workstations. Pilot Software Inc. got its core group of developers from the group that created the Darwin data mining tool at former AI behemoth Thinking Machines Corp. Likewise, the founders of Information Discovery Inc., which sells a product called IDIS Predictive Modeler, came from an old AI company called IntelligenceWare Inc., which sold PC-based expert systems. IBM Corp., which has traditionally entered advanced technology markets late in the game despite its pioneering research in such fields as expert systems and neural networks, now has a data mining group dedicated to developing and marketing its highly touted Intelligent Miner. Other companies on this AI veteran list include Logic Programming Associates Ltd., Integral Solutions Inc., and Attar Software Inc.
Existing firms such as SAS Institute Inc., Silicon Graphics Inc., and NCR Corp. also have data mining products ready for the commercial market. Then there are the dozen or so startup data mining companies. DataMind Corp. is less than two years old and has just introduced its DataMind Professional Edition and DataCruncher. Ultragem was incorporated this year (by former AI researchers) and is pitching its genetic data mining--based on genetic algorithm techniques--as the best way to dig into your data. We can expect more startups to form at about the same rate rabbits breed.
MINING IN EXISTING FIELDS
The rank-and-file database companies, especially those whose main business is in query and reporting, have joined with mining vendors to add increased functionality to their products. Business Objects Inc. plans to link its tools with IBM's Intelligent Miner, SPSS Inc.'s statistical analysis software, Angoss Software International Ltd.'s KnowledgeSeeker, DataMind Professional, 1Soft Corp.'s AC2, Right Information Systems Inc.'s 4Thought, and Silicon Graphics Inc.'s MineSet visualization tool. Cognos Corp. is doing the same thing with Angoss and Right.
More importantly, database vendor Red Brick Systems recently announced that the newest version of its Red Brick Warehouse will incorporate DataMind's data mining software. This development makes Red Brick the first relational database tool vendor to incorporate data mining features directly into a database product, allowing users to augment traditional queries with analytical and statistical queries without switching to another application. (IntelliCorp Inc., founded as an expert system company and now selling object-oriented database tools, proposed something akin to data mining with a product called IntelliScope, which would apply expert system techniques to corporate databases. But IntelliScope died before it ever saw the commercial light of day.)
Many of data mining's early users are reluctant to talk about their efforts and successes for the very same reasons AI's first users were wary of discussing theirs (in some cases they are the same companies): They use this technology to gain a competitive advantage and have little interest in sharing the secrets of their success. This is especially true in the financial market, where software tools that can squeeze even infinitesimal amounts of new information from existing data can produce results that translate into thousands--even millions--of dollars.
Data mining users tend to be traditional early adopters: large corporations--such as aerospace firms, financial institutions, large computer/electronics concerns, and multinational manufacturers--that have lots of money and are willing to spend it on advanced technology. However, you can add retailers to the list when it comes to data mining. In fact, retail and finance (including banking, insurance, and trading) are currently the two largest application areas for data mining. American Express, General Motors, GTE, MasterCard, AT&T, Coopers & Lybrand, and DuPont number among the tight-lipped organizations that already have extensive data mining operations in place. Perhaps the most publicized data mining endeavor is that of Pilot Software in cooperation with its parent company, Dun & Bradstreet Corp., and various subsidiaries, including A.C. Nielsen Co. and Moody's Investors Service. Of special note is Pilot's work with A.C. Nielsen, the foremost gatherer of media-related consumer data. Pilot hopes to take Nielsen data and get even more use out of it. (As if finding out what people watch on TV in the privacy of their own homes isn't enough!)
In other applications, Chase Manhattan Bank uses the HNC Database Mining Workstation to model its customer base by market segment, something it felt existing statistical analysis tools couldn't accomplish comprehensively or efficiently. Star Bank also uses HNC's data mining products to predict which customers might be inclined to leave the bank for another institution at a given time. This application is valuable because it focuses on maintaining an existing client base, which costs far less than attracting and signing new customers. Andersen Consulting used an early version of DataMind Professional with a major auto manufacturer to detect fraudulent warranty claims from car dealers. Check-processing giant Automatic Data Processing Inc. has also developed applications with DataMind, and Hewlett-Packard Co. has signed on with DataMind's Open Data Warehouse initiative. Customer Focus International Inc. is using Information Discovery's Predictive Modeler to sort through customer data and determine which types of customers tend to buy certain products.
Although this race to the data mine is just heating up, it's already expanding beyond the realm of normal database operations. Another startup company, Hyper-Parallel Inc., plans to introduce data mining tools for massively parallel and multiprocessing systems, which tend to have databases larger than the state of Montana. With efforts of this scope underway, database use will never be the same again. And with proper planning, realistic expectations, minimal hype, and well-managed projects, vendors and users may get more out of databases than was ever thought possible. When all is said and done, maybe, just maybe, data mining will avoid becoming the canary in the coal mine that AI became.
H. P. Newquist is author of The Brain Makers (Sams Publishing, 1994), a history of the intelligent systems industry. He recently became publisher of Database Programming & Design! He is also founder of The Relayer Group, a New York City-based marketing analysis and research firm. He can be reached via e-mail at 70400.1100@compuserve.com.
Copyright 1997 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.