Data Mining on the Web

There's Gold in that Mountain of Data

By Dan R. Greening

(Copyright Web Techniques, 1999. To appear in January 2000 issue. Reproduction is prohibited.)

When visitors interact with your site, they provide information about themselves and how they respond to your content: which links visitors click, where they spend most of their time, which search terms they use, and when they browse. Some visitors may even fill out a lifestyle survey or provide names and addresses. Complex content also contains important information, such as words in articles, job descriptions and resumes, and features of competitive or complementary products. All this information is often stored in a database.

As a result, you have a lot of information on your Web visitors and content, but you probably aren't making the best use of it. Data warehouse reporting systems, such as those provided by traffic analyzers, aggregate and report facts over different dimensions. (See my article titled "Tracking Users," Web Techniques, July 1999.)

These warehouse reporting systems are commonly called online analytic processing (OLAP) systems. OLAP systems can report only on directly observed and easily correlated information. They rely on you to discover patterns and decide what to do with them. OLAP systems won't tell you that people frequently buy potato chips, onion soup mix, and sour cream at the same time, and they won't discover that some people love any movie that contains an explosion. The data is simply too complex for humans to discover such patterns using an OLAP system.

To solve this problem, marketers and business analysts use data-mining techniques. These are machine learning algorithms that find buried patterns in databases, and report or act on those findings. There are many data-mining techniques, and it's difficult for one person to understand the entire field. The best we can do in one article is provide an introduction to the problems that data-mining techniques can solve, mention the techniques usually applied to those problems, and give some insight into vendors offering solutions.

Know Your Visitor

To use data mining on your Web site, you have to establish and record visitor and item characteristics, and visitor interactions.

Visitor characteristics include demographics, psychographics, and technographics. Demographics are tangible attributes such as home address, income, purchasing responsibility, or recreational equipment ownership. Psychographics are personality types that might be revealed in a psychological survey, such as highly protective feelings toward children (commonly called "gatekeeper moms"), impulse-buying tendencies, early technology interest, and so on. Technographics are attributes of the visitor's system, such as operating system, browser, domain, and modem speed. If you have a phone number or address, you can sometimes obtain household demographic or psychographic information through direct marketing service providers, such as Webcraft or Acxiom. Business demographics are available through Dun & Bradstreet.

Item characteristics include Web content information, such as media type, content category, and URL, as well as product information, such as SKU (stock-keeping unit, basically a product number), product category, color, size, price, margin, available quantities, promotion level, and so on.

Visitor statistics accumulate when visitors interact with items, the Web site, or the company. Visitor-item interactions include purchase history, advertising history, and preference information. Purchase history is a list of products and purchase dates. Advertising history indicates which items were shown to a visitor. Preference information refers to item ratings provided by a visitor. Click-stream information is a history of hyperlinks that a visitor has clicked on. Link opportunities are hyperlinks that have been presented to a visitor.

Visitor-site statistics are typically per-session characteristics, such as total time, pages viewed, revenue, and profit per session with a visitor. Visitor-company information might include total number of customer referrals from a visitor, total profit, total page views, number of visits per month, last visit, and so on. Visitor-company information can include brand measurements. Brand associations, for example, are lists of positive or negative concepts a visitor associates with the brand, which can be measured by surveying visitors periodically. Permissions are attributes that a visitor provides indicating how marketing information contributed by the visitor can be used, such as permission to send email, to share information with marketing partners, and so on.
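One way to picture the three kinds of records described above is as a small set of data types. The following is only a sketch: the class names, field names, and the may_email helper are hypothetical, chosen for illustration, and a real site would map them onto its own database schema.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Visitor:
    visitor_id: str
    demographics: dict = field(default_factory=dict)    # e.g. {"income": 60000}
    psychographics: dict = field(default_factory=dict)  # e.g. {"impulse_buyer": True}
    technographics: dict = field(default_factory=dict)  # e.g. {"browser": "Netscape"}
    permissions: set = field(default_factory=set)       # e.g. {"email"}


@dataclass
class Item:
    item_id: str          # a SKU or a URL
    category: str
    price: float = 0.0


@dataclass
class Interaction:
    visitor_id: str
    item_id: str
    kind: str             # "shown", "click", "purchase", or "rating"
    when: datetime
    value: float = 0.0    # revenue for a purchase, score for a rating


def may_email(visitor: Visitor) -> bool:
    """Check the visitor's recorded permission before sending marketing email."""
    return "email" in visitor.permissions
```

Keeping permissions on the visitor record makes it easy to check them at the point where marketing information is actually used.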

If you do nothing else in response to this article, I urge you to do two things: First, decide how you might use information recorded about your site's visitors, write a privacy statement, and make that statement available on your Web site. See www.truste.org for assistance. Think about privacy from the visitor's point of view. Visitors prefer to view products and pages that interest them, so they usually share information for that purpose. However, they typically want you to ask for permission before sending them marketing email, or sharing their contact information with partner companies. If you provide a privacy statement documenting your intended uses, and give visitors an email address for comments, your visitors will let you know whether the policy is acceptable.

Second, record the data now, even if you do not have a data-mining process in place. You will find most data-mining tool vendors allow for an initialization step in which they incorporate historical data into your data-mining system.

List Your Goals

The great advantage of Web marketing is that you can measure visitor interactions more effectively than in brick-and-mortar stores or direct mail. Data mining works best when you have clear, measurable goals. The following are some goals you might consider:

Increase average page views per session;
Increase average profit per checkout;
Decrease products returned;
Increase number of referred customers;
Increase brand awareness;
Increase retention rate (such as number of visitors that have returned within 30 days);
Reduce clicks-to-close (average page views to accomplish a purchase or obtain desired information);
Increase conversion rate (checkouts per visit).
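Several of these goals reduce to simple arithmetic over session records. The sketch below assumes a hypothetical list of sessions with made-up field names (page_views, checked_out); it is meant only to show how the definitions given in the list translate into measurements.

```python
# Hypothetical session log; field names are assumptions for illustration.
sessions = [
    {"visitor": "a", "page_views": 6, "checked_out": True},
    {"visitor": "b", "page_views": 3, "checked_out": False},
    {"visitor": "a", "page_views": 4, "checked_out": True},
    {"visitor": "c", "page_views": 7, "checked_out": False},
]


def avg_page_views(sessions):
    """Average page views per session."""
    return sum(s["page_views"] for s in sessions) / len(sessions)


def conversion_rate(sessions):
    """Checkouts per visit."""
    return sum(s["checked_out"] for s in sessions) / len(sessions)


def clicks_to_close(sessions):
    """Average page views in sessions that ended in a checkout."""
    closed = [s["page_views"] for s in sessions if s["checked_out"]]
    return sum(closed) / len(closed) if closed else None


print(avg_page_views(sessions), conversion_rate(sessions), clicks_to_close(sessions))
```

Measuring these numbers before and after a change gives you the baseline that any data-mining effort will be judged against.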

If you've instrumented your site to record the visitor, content, and interaction characteristics, and you've determined a set of measurable marketing goals, congratulations! You are farther along than most marketers. Now you can gain value from data mining.

Understand Your Problem

The first step to solving a problem is articulating the problem clearly. Common problems Web marketers want to solve are how to target advertisements, personalize Web pages, create Web pages that show products often bought together, classify articles automatically, characterize groups of similar visitors, estimate missing data, and predict future behavior. All involve discovering and leveraging different kinds of hidden patterns.

Targeting. Marketers use targeting to select the people receiving a fixed advertisement, to increase profit, brand recognition, or other measurable outcome. Targeting on the Web must account for differing ad space costs. Web sites with valuable visitors typically charge more for ad space.

On sites where visitors register, advertisers can target on the basis of demographics. For example, people living in different parts of the country or visiting different Web sites may have differing propensities to purchase sports-team-branded apparel, gay travel tours, or discount car parts. Therefore, if you target the people most likely to purchase your product, you can reduce your cost for an ad campaign and increase the total profit.

Some sites let you target ads on the basis of IP address, under the theory that DNS registration information or surveys provide the physical location of the IP address. However, because national dial-up ISPs often share a pool of IP addresses, this is not a reliable method. As we say in the business, "Half the U.S. population lives in Vienna, Virginia" (AOL's corporate address).

Data mining can help you select the targeting criteria for an ad campaign. Web publications have a set of variables by which they can target advertisements. By performing a test ad using "run-of-site" (that is, untargeted) ad space you can associate demographic variables with conversion. People "convert" when they accomplish the marketing goal, such as performing a click-through, purchase, registration, and so on. Data mining can identify the combination of criteria that maximizes the profit. For example, data mining might discover that targeting based on the logical expression

(java-consultant) or (software-engineer and purchasing-authority < 10,000)

will increase the click-through on a JavaBean banner ad.
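To make the idea concrete, here is a sketch that applies the expression above to the results of a run-of-site test ad and compares click-through rates. The impression data, the field names, and the matches helper are invented for illustration; a data-mining tool would search for the rule, whereas this code only evaluates one.

```python
def matches(profile):
    """The targeting rule from the text, applied to one visitor profile."""
    return profile["job"] == "java-consultant" or (
        profile["job"] == "software-engineer"
        and profile["purchasing_authority"] < 10_000
    )


# Hypothetical test-ad results: (visitor profile, clicked the banner?)
impressions = [
    ({"job": "java-consultant", "purchasing_authority": 50_000}, True),
    ({"job": "java-consultant", "purchasing_authority": 0}, False),
    ({"job": "software-engineer", "purchasing_authority": 5_000}, True),
    ({"job": "software-engineer", "purchasing_authority": 20_000}, False),
    ({"job": "marketer", "purchasing_authority": 1_000}, False),
    ({"job": "software-engineer", "purchasing_authority": 2_000}, True),
    ({"job": "marketer", "purchasing_authority": 0}, False),
    ({"job": "student", "purchasing_authority": 0}, False),
]


def click_through(rows):
    """Fraction of impressions that were clicked."""
    return sum(clicked for _, clicked in rows) / len(rows)


run_of_site = click_through(impressions)
targeted = click_through([row for row in impressions if matches(row[0])])
print(f"run-of-site: {run_of_site:.3f}, targeted: {targeted:.3f}")
```

With this toy data the rule selects half the audience while keeping every click, which is the kind of trade-off a targeting tool tries to find automatically.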

There is a huge variety of data-mining tools that support targeting, because targeting is extensively used in direct mail marketing.

Personalization. Marketers use personalization to select the advertisements to send to a person, to maximize some measurable outcome. Here we use "advertisement" loosely to refer to any recommendation or item offered by a site. Even a simple hyperlink in a menu or an article could be considered an advertisement.

Personalization is the converse of targeting. Targeting optimizes the types of people that will see an advertisement, reducing cost because you avoid showing the advertisement to more people in a broader campaign. It is most useful for prospecting (finding people who haven't visited your site yet) because there's a cost to advertising on outside Web sites. But targeting is pointless on your own site, where advertisements are free. Why would you not show your products to a person visiting your own site?

In contrast, personalization optimizes the advertisements that a person sees, raising revenue because the person sees more interesting stuff. Personalization can be used for external advertising, but you're more likely to use it on your own site. External sites don't usually give you enough information about individual visitors to do good personalization.

Some personalization systems, such as Broadvision One-to-One, rely on the marketer to write rules for tailoring advertisements to visitors. These are "rules-based personalization systems." If you have historical information, you can buy data-mining tools from a third party to generate the rules. Rules-based personalization systems are usually deployed in situations where there are limited products or services offered, such as insurance and financial institutions, where human marketers can write a small number of rules and walk away.

Other personalization systems, such as Andromedia LikeMinds, emphasize automatic realtime selection of items to be offered or suggested. Systems that use the idea that "people like you make good predictors for what you will do" are called "collaborative filters." These systems are usually deployed in situations where there are many items offered, such as clothing, entertainment, office supplies, and consumer goods. Human marketers go insane trying to determine what to offer to whom, when there are thousands of items to offer. As a result, automatic systems are usually more effective in these environments. Personalizing from large inventories is complex and unintuitive, and it requires processing huge amounts of data.

Association. Also called market-basket analysis, association identifies items that are likely to be purchased or viewed in the same session. If you place references to these items together on the same page in a Web catalog, you may remind your visitor to purchase or view something otherwise forgotten. If you hold a promotion on one item in an association group, you're likely to increase purchases of other items in that group.

Association can be deployed even in situations where you have static catalog pages. In this case, you rely on the visitor to select the first catalog page to view, and then serve up related items as cross-sells. Association is the data-mining solution Amazon uses when it says, "Customers who bought The Grapes of Wrath also bought The Great Gatsby."
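At its simplest, market-basket analysis counts how often items appear together in a session. The sketch below uses invented baskets and two hypothetical helper functions; production tools use more scalable algorithms, but the quantities they report are of this kind.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase sessions.
baskets = [
    {"chips", "soup mix", "sour cream"},
    {"chips", "sour cream"},
    {"chips", "soda"},
    {"soup mix", "sour cream"},
]


def pair_counts(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, a, b):
    """Share of baskets containing item a that also contain item b."""
    with_a = [basket for basket in baskets if a in basket]
    return sum(b in basket for basket in with_a) / len(with_a) if with_a else 0.0


print(pair_counts(baskets).most_common(2))
print(confidence(baskets, "soup mix", "sour cream"))
```

A high confidence for a pair suggests placing the second item as a cross-sell on the first item's catalog page.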

Knowledge Management. These systems seek to identify and leverage patterns in natural language documents. A more specific term is "text analysis," since the vast majority operate on text. The first step is associating words and context with high-level concepts. This can be done in a directed way by training a system with documents that have been tagged by a human with the relevant concepts. The system then builds a pattern matcher for each concept. When presented with a new document, the pattern matcher decides how strongly the document relates to the concept.
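The train-then-match cycle can be illustrated with a deliberately simple stand-in: a word-count profile per concept, compared to a new document by cosine similarity. This is an assumption made for the sake of a short example, not the method any vendor named in this article uses, and the tagged documents are invented.

```python
import math
from collections import Counter


def words(text):
    return text.lower().split()


def train(tagged_docs):
    """Build one word-count profile per concept from human-tagged documents."""
    profiles = {}
    for concept, text in tagged_docs:
        profiles.setdefault(concept, Counter()).update(words(text))
    return profiles


def score(profile, text):
    """Cosine similarity between a concept profile and a new document."""
    doc = Counter(words(text))
    dot = sum(profile[w] * n for w, n in doc.items())
    norm = math.sqrt(sum(v * v for v in profile.values())) * math.sqrt(
        sum(v * v for v in doc.values())
    )
    return dot / norm if norm else 0.0


def classify(profiles, text):
    """Pick the concept whose profile matches the document most strongly."""
    return max(profiles, key=lambda concept: score(profiles[concept], text))


profiles = train([
    ("sports", "team wins game in final inning"),
    ("sports", "coach praises team after game"),
    ("finance", "stock market falls as bank rates rise"),
    ("finance", "bank reports profit and stock climbs"),
])
print(classify(profiles, "the team lost the game"))
```

The score function plays the role of the pattern matcher: it returns a strength between 0 and 1 rather than a yes-or-no answer.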

This approach can be used to sort incoming documents into predefined categories. Companies use this approach to build automatic site indices for visitors. News and portal sites use this to reduce the cost of categorizing and selecting news from syndicators. Some systems also provide automatic summaries of key points, and cross-reference documents to related material.

Knowledge management systems can be used to personalize online publications. Imagine a pattern matcher for the "what Dan Greening likes" concept. This system would find new documents that contain words and context also contained in articles that I've read before. Products in this area include Autonomy and HNC SelectResponse.

Knowledge management systems can assist in creating automatic responses to help requests. For example, inbound requests to a customer-support email address can be categorized, and an automatic response can be sent from a library of FAQs. Vendors in this area include Kana and eGain. (See the box "Knowledge Management" in the November 1999 article "You Asked For It: Solving the Customer Support Dilemma.")

One of the most interesting applications in this area is Abuzz Beehive, which creates a "knowledge network" within a community of experts. If you send a question to Beehive, it first tries to find a good answer in its archive. If it doesn't have a good answer, it redirects the question to an expert it thinks can properly respond. If the expert does respond, it squirrels the response away in case the question is asked again. In this way, it builds up a permanent, adapting knowledge base.

Abuzz has created something I find both exciting and spooky: a more informative organism bred from machine and human. Beehive is a computer broker that brings together human experts with different specializations. Students of biology will note this parallels important evolutionary events, such as the aggregation and differentiation of single-celled organisms into more effective multicelled organisms.

Clustering. Sometimes called segmentation, clustering identifies people who share common characteristics, and averages those characteristics to form a "characteristic vector" or "centroid." Clustering systems usually let you specify how many clusters to identify within a group of profiles, and then try to find the set of clusters that best represents the most profiles.
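A common way to do this is the k-means procedure: assign each profile to its nearest centroid, recompute each centroid as the average of its members, and repeat. The sketch below is a minimal version with invented two-number profiles (say, pages per session and purchases per month); seeding the centroids with the first k profiles is a simplification, and the article does not say which clustering method any particular vendor uses.

```python
import math


def centroid(points):
    """Average each characteristic across a group of profiles."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, rounds=20):
    """Group profiles into k clusters; returns (centroids, groups)."""
    centroids = list(points[:k])  # naive seeding, an assumption of this sketch
    groups = []
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        updated = [centroid(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups


# Hypothetical visitor profiles: (pages per session, purchases per month)
profiles = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, groups = kmeans(profiles, k=2)
print(centroids)
```

Here the number of clusters, k, is the value the marketer specifies up front, as described above.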

Clustering is used directly by some vendors to provide reports on general characteristics of different visitor groups. These techniques require training, and suffer from drift on Web sites with dynamic Web pages. (Again, see the article "Tracking Users," Web Techniques, July 1999.)

Estimation and Prediction. Estimation guesses an unknown value, such as income, when you know other things about a person. Prediction guesses a future value, such as the probability of buying a car next year, when a person hasn't done it yet, or the expected number of stocks that a person will trade in the coming year. The same algorithms can perform estimation and prediction.

Estimation is often used in demographics to fill in the blanks. If you don't know what income a person has, an estimator can identify other variables that correlate well with income (such as location, car preference, and job title), then find other people with similar traits and use them to estimate income along with a confidence value.
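A nearest-neighbor estimate is one simple way to carry this out. In the sketch below, the known profiles, the trait names, and the choice of three neighbors are all invented for illustration, and the spread of the neighbors' incomes is used as a rough stand-in for a confidence value (a small spread means the neighbors agree).

```python
from statistics import mean, pstdev

# Hypothetical visitors whose income is already known.
known = [
    {"location": "SF", "car": "sedan", "title": "engineer", "income": 90_000},
    {"location": "SF", "car": "sedan", "title": "engineer", "income": 100_000},
    {"location": "SF", "car": "coupe", "title": "engineer", "income": 110_000},
    {"location": "Austin", "car": "truck", "title": "teacher", "income": 50_000},
    {"location": "Austin", "car": "truck", "title": "nurse", "income": 60_000},
]


def similarity(a, b, traits):
    """Number of traits on which two people match."""
    return sum(a[t] == b[t] for t in traits)


def estimate_income(person, known, traits=("location", "car", "title"), k=3):
    """Average the incomes of the k most similar people; also return the spread."""
    ranked = sorted(known, key=lambda other: similarity(person, other, traits), reverse=True)
    incomes = [other["income"] for other in ranked[:k]]
    return mean(incomes), pstdev(incomes)


person = {"location": "SF", "car": "sedan", "title": "engineer"}
estimate, spread = estimate_income(person, known)
print(estimate, round(spread))
```

The same function could estimate any other missing numeric attribute by swapping the field that is averaged.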

Prediction can compute important future attributes of a person (such as lifetime monetary value, next visit interval, learning speed, promotion susceptibility, and so on) based on the same approach. These values can be used in personalization applications.

Marketers often aggregate information to understand groups of customers. Even adding up or averaging past events over different dimensions, such as visitor category, content category, referrer, and time, can provide useful information. This simple aggregation is called OLAP, online analytic processing: online because the marketer uses an online reporting engine to interactively move through the data; analytic because the marketer is passively looking through past data, not trying to change it.
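The aggregation itself is straightforward, as the following sketch suggests. The event records, field names, and the rollup function are hypothetical; a real OLAP engine adds interactive navigation and precomputed summaries on top of this basic operation.

```python
from collections import defaultdict

# Hypothetical past events, one record per purchase.
events = [
    {"referrer": "search", "category": "books", "revenue": 20.0},
    {"referrer": "search", "category": "music", "revenue": 10.0},
    {"referrer": "banner", "category": "books", "revenue": 5.0},
    {"referrer": "search", "category": "books", "revenue": 15.0},
]


def rollup(events, dimensions, measure):
    """Total a measure over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += event[measure]
    return dict(totals)


print(rollup(events, ["referrer"], "revenue"))
print(rollup(events, ["referrer", "category"], "revenue"))
```

Adding or removing a dimension from the list corresponds to drilling down or rolling up in a reporting tool.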

Prediction can be applied in combination with OLAP techniques to generalize properties of groups of people visiting a Web site. This can help a marketer to slice and dice the data to find which item attributes or site characteristics appeal to the most valuable customers.

Decision Trees. A decision tree is essentially a flow chart of questions or data points that ultimately leads to a decision. For example, a car-buying decision tree might start by asking whether you want a 1999 or 2000 model year car, then ask what type of car, then ask whether you prefer power or economy, and so on, until it determines what might be the best car for you. Decision tree systems try to create optimized paths, ordering the questions so a decision can be made in the fewest steps.
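One standard way to order the questions is to ask first the one with the highest information gain, that is, the one whose answer narrows down the outcome the most. The sketch below applies that measure to an invented table of past car choices; the article does not state which criterion commercial decision tree systems use, so treat this as one plausible approach.

```python
import math
from collections import Counter


def entropy(labels):
    """Uncertainty, in bits, about which label applies."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, question, target):
    """How much asking the question reduces uncertainty about the target."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for answer in {r[question] for r in rows}:
        subset = [r[target] for r in rows if r[question] == answer]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_first_question(rows, questions, target):
    return max(questions, key=lambda q: information_gain(rows, q, target))


# Hypothetical past decisions.
rows = [
    {"year": 1999, "type": "sedan", "priority": "economy", "car": "compact"},
    {"year": 2000, "type": "sedan", "priority": "power", "car": "sport sedan"},
    {"year": 1999, "type": "truck", "priority": "power", "car": "pickup"},
    {"year": 2000, "type": "truck", "priority": "power", "car": "pickup"},
]
print(best_first_question(rows, ["year", "type", "priority"], "car"))
```

With this toy data, asking about the type of car first settles the decision for truck buyers in a single step, whereas asking about the model year first settles nothing.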

Decision tree systems are incorporated in product-selection systems offered by many vendors. They're great for situations in which a visitor comes to a Web site with a particular need. But once the decision has been made, the answers to the questions contribute little to targeting or personalization for that visitor in the future.

For example, decision trees are used in the "paper clip" office assistant in Microsoft Office: It watches what you click on, and observes your mistakes. It may decide you need help and bring up a help page with more information. Some of us find the paper clip helpful. Others wish we could strangle it.*

Picking a Solution

Data mining isn't for the faint of heart. You face three major problems. First, many good data-mining professionals are serious nerds who speak the foreign language of statistics. Second, there are few plug-and-play solutions. And third, everything useful is expensive.

I wrote this article to strengthen your resolve.

The previous sections showed you how to determine the data you should collect, the metrics you hope to improve, and the framework of the problem. If you know these things, you can communicate more fluently with data-mining professionals.

Use caution when listening to traditional offline data-mining professionals. It's likely that your Web site operates at a faster rate, involves more data, and is more mission critical than anything they've done. Traditionalists are familiar with a more relaxed world: where data mining is used once per month, rather than once per click; where data accumulates in gigabytes per year rather than gigabytes per month; and where a crashed application needs to be fixed in the morning, rather than instantly by redundant machines and fail-safe rollover.

Data-mining algorithms overlap in the problems they can solve, but for a given problem there's usually a "best algorithm." When you buy a product, make sure the algorithm it uses is appropriate for the task you're trying to perform. The box titled "Picks, Pans, & Dynamite" discusses the most common data-mining techniques used on the Web.

Though data-mining applications are expensive, everything is relative. Andromedia's LikeMinds personalization system increased average spend rate on the Levi-Strauss online store by 33 percent and increased repeat visitation by 225 percent. This adds up to a lot of revenue.

The world of Web data mining is simultaneously a minefield and a gold mine. By saving data associated with visitors, content, and interactions, you can at least ensure you'll be able to use it later. Despite the difficulties, you might consider evaluating and incorporating data-mining applications now. The sooner you start learning from your data, the sooner you can leave your competitors in the dust.

Dan holds a Ph.D. in computer science from UCLA, emphasizing parallel statistical optimization. He is currently chief technology officer at Andromedia. He can be reached at dan@greening.name.

* Editor's Note: No office supplies were harmed in the process of writing this article.

Mining Camps

Data-Mining Vendors

Many vendors include data mining in their Web products. This list offers some representative vendors in different fields.

Abuzz

www.abuzz.com

Creates a multihuman knowledge network for answering complex questions. Uses neural networks. (For explanations of terms, see the box titled "Picks, Pans, & Dynamite.")

Andromedia

www.andromedia.com

Realtime personalization based on click-stream, preference profiles, purchase history, and demographics (LikeMinds). Uses collaborative filtering. Ties in with realtime Web marketing analysis and reporting system (Aria).

Autonomy

www.autonomy.com

Knowledge management for categorization and text-portal personalization. Uses neural networks.

DataSage

www.datasage.com

Hybrid prediction/OLAP system reports on visitor and item characteristics associated with predicted market value. Uses clustering and neural networks.

eGain

www.egain.com

Knowledge management system for email customer support. Uses neural networks.

HNC

www.hncs.com

Targeting system for advertisements (SelectCast). Knowledge management system for email customer service (SelectResponse). Uses neural networks.

Kana

www.kana.com

Knowledge management system for email customer support. Uses neural networks.

Personalogic

www.personalogic.com

Decision tree system for selecting products by desired features. Uses Bayesian networks.

Personify

www.personify.com

Clustering system lets sites analyze general characteristics of groups of visitors.

SAS

www.sas.com/software

Provides traditional (non-realtime) data-mining tools that must be assembled for a particular application. Uses clustering, neural networks. -DG

Picks, Pans, & Dynamite

Data-Mining Algorithms

The data-mining algorithms used on the Web fall into several general categories.

Neural networks work something like your brain. When patterns are presented to you, your brain eventually figures out that certain patterns are associated with other desired outcomes. This can be applied to targeting, estimation, prediction, and knowledge management. Neural networks must be trained, sometimes taking hours of CPU time. They don't adapt to new patterns until trained again, and they need to be carefully tuned by a human.

Collaborative filters organize profile data by person, then use this logic: People who have done things you have done are good predictors for what you will do. In a sense, they are a restricted type of neural network, with the input data in a regular form. This restriction gives collaborative filters three great advantages: They adapt rapidly to new behavior patterns. They can predict for thousands of data points simultaneously. And they don't need to be tuned. This makes collaborative filters ideal for realtime personalization applications.
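The logic can be sketched in a few lines: weight each other person's rating of an item by how closely that person's past ratings match yours. The ratings, names, and the cosine similarity measure below are assumptions chosen for illustration and are not a description of how LikeMinds or any other product works internally.

```python
import math

# Hypothetical preference profiles, organized by person: item -> rating (1 to 5).
ratings = {
    "ann": {"Alien": 5, "Die Hard": 4, "Emma": 1},
    "bob": {"Alien": 4, "Die Hard": 5, "Speed": 5},
    "cat": {"Emma": 5, "Alien": 1, "Speed": 1},
}


def similarity(a, b):
    """Cosine similarity over the items both people have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, person, item):
    """Similarity-weighted average of other people's ratings for the item."""
    mine = ratings[person]
    weighted = total = 0.0
    for other, theirs in ratings.items():
        if other == person or item not in theirs:
            continue
        weight = similarity(mine, theirs)
        weighted += weight * theirs[item]
        total += weight
    return weighted / total if total else None


print(predict(ratings, "ann", "Speed"))
```

Because there is no separate training step in this sketch, a newly recorded rating influences the very next prediction, which is one way to read the "adapt rapidly" advantage described above.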

Bayesian networks build a directed graph of conditional probabilities. As a visitor provides more information about himself or herself, a Bayesian network adjusts the probabilities of each possible end result. This allows a Web system to accelerate the visitor's experience by bringing the most likely things to the visitor's attention as soon as possible. Bayesian networks are most appropriate to help satisfy short-term visitor goals, such as answering customer support questions, diagnosing problems, or selecting an appliance. However, training a Bayesian network is often extremely slow. -DG
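The probability adjustment can be illustrated with the simplest possible case: one set of end results and one piece of evidence at a time, updated with Bayes' rule. A full Bayesian network chains many such conditional probabilities in a graph; the outcomes and numbers below are invented, and the update function is a hypothetical helper.

```python
def update(prior, likelihoods, answer):
    """Return P(outcome | answer) from P(outcome) and P(answer | outcome)."""
    unnormalized = {
        outcome: p * likelihoods[outcome].get(answer, 0.0)
        for outcome, p in prior.items()
    }
    total = sum(unnormalized.values())
    if not total:
        return dict(prior)  # the answer tells us nothing we can use
    return {outcome: v / total for outcome, v in unnormalized.items()}


# Hypothetical appliance selector: before any answers, all are equally likely.
prior = {"washer": 1 / 3, "dryer": 1 / 3, "dishwasher": 1 / 3}

# Assumed chance that a visitor wanting each appliance mentions dishes.
likelihoods = {
    "washer": {"mentions dishes": 0.05},
    "dryer": {"mentions dishes": 0.05},
    "dishwasher": {"mentions dishes": 0.90},
}

posterior = update(prior, likelihoods, "mentions dishes")
print(max(posterior, key=posterior.get), round(posterior["dishwasher"], 2))
```

After each answer, the site can show the currently most probable result first, then feed the posterior back in as the prior for the next answer.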
