A Survey of Web Information Extraction Techniques, Part 2

Source: Baidu Wenku. Edited by: 神马文学网. Date: 2024/04/30 14:51:32

Chapter 6          Summary and Discussion
Section 6.1.         Summary
Section 6.2.         Discussion
Section 6.1.                Summary
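Information extraction is a field that has emerged over the past decade. International forums such as MUC (the Message Understanding Conference series) have given it close attention, proposed methods for evaluating such systems, and defined a framework of evaluation metrics.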
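The text above does not name the metrics. As an illustration only, the sketch below assumes the precision, recall, and F-measure commonly reported in MUC-style evaluations. The function name `score_extraction` and the sample slot values are hypothetical.

```python
def score_extraction(extracted, gold):
    """Return (precision, recall, f1) for extracted items against a gold set.

    Assumption: an extracted item counts as correct only if it matches a
    gold item exactly; real evaluations often allow partial credit.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (slot, value) pairs for one document.
gold = {("name", "Acme"), ("price", "19.99"), ("city", "Oslo")}
extracted = {("name", "Acme"), ("price", "19.99"), ("city", "Bergen"), ("zip", "0150")}

precision, recall, f1 = score_extraction(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four extracted items are correct and two of the three gold items are found, so precision and recall differ; the F-measure combines them into one figure.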
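Information extraction research covers structured, semistructured, and free-text documents. Free-text documents are mostly handled with natural language processing methods, whereas the other two kinds are mostly handled with delimiter-based methods.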
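Web pages are one of the main targets of information extraction research. A wrapper is typically used to extract information from one particular website. A set of wrappers, each handling a different site, makes it possible to represent the data uniformly and to obtain the relationships among the data.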
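Building a wrapper by hand is usually time-consuming and laborious, and it requires specialist knowledge. Because web pages also change dynamically, maintaining wrappers is costly. How to construct wrappers automatically has therefore become the central problem. The usual approaches include machine learning methods based on inductive learning.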
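A number of research systems have been developed. They use machine learning algorithms to generate extraction rules for online information sources. The wrappers generated by ShopBot, WIEN, SoftMealy, and STALKER are delimiter-based and can handle highly structured websites. RAPIER, WHISK, and SRV can handle less structured sources. Their extraction methods descend directly from traditional IE methods, and their learning algorithms are mostly relational learning.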
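To make the idea of a delimiter-based wrapper and its induction concrete, here is a minimal sketch under strong assumptions. It learns one left and one right delimiter per attribute from a labeled page and then applies them to a new page. It is written in the spirit of left/right-delimiter wrappers and is not the actual algorithm of WIEN or any other system named above. The function names (`induce_lr`, `extract_lr`) and the sample pages are hypothetical.

```python
from os.path import commonprefix


def common_suffix(strings):
    """Longest string that every item in `strings` ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def induce_lr(page, labeled_rows):
    """Learn one (left, right) delimiter pair per attribute from labeled rows.

    Simplification: each value is located by searching forward from the end
    of the previous value, so values must appear in reading order.
    """
    spans = []  # (attribute index, start offset, end offset)
    pos = 0
    for row in labeled_rows:
        for k, value in enumerate(row):
            start = page.index(value, pos)
            pos = start + len(value)
            spans.append((k, start, pos))

    n_attrs = len(labeled_rows[0])
    before = [[] for _ in range(n_attrs)]  # text preceding each value
    after = [[] for _ in range(n_attrs)]   # text following each value
    for i, (k, start, end) in enumerate(spans):
        prev_end = spans[i - 1][2] if i > 0 else 0
        next_start = spans[i + 1][1] if i + 1 < len(spans) else len(page)
        before[k].append(page[prev_end:start])
        after[k].append(page[end:next_start])

    wrapper = [(common_suffix(before[k]), commonprefix(after[k]))
               for k in range(n_attrs)]
    if any(not left or not right for left, right in wrapper):
        raise ValueError("examples do not share usable delimiters")
    return wrapper


def extract_lr(page, wrapper):
    """Apply the learned delimiters repeatedly until no further row is found."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in wrapper:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end  # keep `right` unconsumed: it may overlap the next left delimiter
        rows.append(tuple(row))


train_page = ("<html>Country codes <b>Congo</b> <i>242</i><br>"
              "<b>Egypt</b> <i>20</i><br></html>")
wrapper = induce_lr(train_page, [("Congo", "242"), ("Egypt", "20")])

new_page = ("<html>Dial codes <b>Spain</b> <i>34</i><br>"
            "<b>Japan</b> <i>81</i><br><b>Chile</b> <i>56</i><br></html>")
print(wrapper)
print(extract_lr(new_page, wrapper))
```

The learner only generalizes as far as the labeled examples share surrounding text, which is why such wrappers suit highly structured pages and break when a site changes its layout.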
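Web information extraction and wrapper generation can be useful in a range of application areas. So far only comparison shopping has seen real commercial success, and the most notable systems include Jango, Junglee, and MySimon.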
Section 6.2.                 Discussion
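Current search engines cannot collect the information held in online databases. Given a user's query, a search engine can find relevant pages but cannot extract the information on them. The "hidden web" keeps growing, so tools are needed to extract and gather the relevant information from web pages.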
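Because integrating online information is increasingly important, research on web information extraction, though relatively new, will keep developing. Machine learning will remain the mainstream approach, since handling large volumes of dynamic information requires highly automated techniques. [52] suggests that combining different types of methods to build highly adaptable systems is a promising direction. [36] also proposes a method that mixes linguistic knowledge with syntactic features.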
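Most systems described in this survey target HTML documents. XML is expected to come into widespread use over the next few years. HTML describes how a document is presented; it is a formatting language. XML, by contrast, can convey what the document means: it defines content rather than only form. This makes wrapper generation simpler, but it does not remove the need for wrappers.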
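The contrast can be illustrated with a small sketch. The document, its element names (`catalog`, `product`, `name`, `price`), and the target field names (`title`, `cost`) are all hypothetical. The final mapping step reflects one plausible reason a wrapper is still needed, namely that each source uses its own vocabulary; this reason is an assumption, not a claim made in the text.

```python
import xml.etree.ElementTree as ET

# In HTML the tags say how to render the row, not what the cells mean.
html_row = "<tr><td><b>Widget</b></td><td>19.99</td></tr>"

# In XML the element names carry the meaning of each field.
xml_doc = """<catalog>
  <product><name>Widget</name><price currency="USD">19.99</price></product>
  <product><name>Gadget</name><price currency="USD">5.50</price></product>
</catalog>"""


def read_products(xml_text):
    """Read (name, price) pairs by element name; no delimiters to learn."""
    root = ET.fromstring(xml_text)
    return [(p.findtext("name"), float(p.findtext("price")))
            for p in root.iter("product")]


def to_common_schema(products):
    """Map this source's vocabulary onto the integrator's field names."""
    return [{"title": name, "cost": price} for name, price in products]


products = read_products(xml_doc)
print(to_common_schema(products))
```

Extracting the same fields from `html_row` would require knowing that the first cell is a name and the second a price, which is exactly the knowledge a delimiter-based wrapper encodes.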
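The challenge ahead is to build flexible and scalable systems for automatic wrapper induction that can keep pace with a growing and dynamic web.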
References
[1] S. Abiteboul.
Querying Semistructured Data.
Proceedings of the International Conference on Database Theory (ICDT),
January 1997.
[2] B. Adelberg.
NoDoSE - A tool for Semi-Automatically Extracting Semistructured Data from Text
Documents.
Proceedings ACM SIGMOD International Conference on Management of Data,
Seattle, June 1998.
[3] D. E. Appelt, D. J. Israel.
Introduction to Information Extraction Technology.
Tutorial for IJCAI-99, August 1999.
[4] N. Ashish, C. A. Knoblock.
Semi-automatic Wrapper Generation for Internet Information Sources.
Second IFCIS Conference on Cooperative Information Systems (CoopIS),
olina, June 1997.
[5] N. Ashish, C. A. Knoblock.
Wrapper Generation for semistructured Internet Sources.
SIGMOD Record, Vol. 26, No. 4, pp. 8--15, December 1997.
[6] P. Atzeni, G. Mecca.
Cut & Paste.
Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems (PODS'97), May 1997.
[7] M. Bauer, D. Dengler.
TrIAs - An Architecture for Trainable Information Assistants.
Workshop on AI and Information Integration, in conjunction with the 15th National
Conference on Artificial Intelligence (AAAI-98), July 1998.
[8] P. Berka.
Intelligent Systems on the Internet.
http://lisp.vse.cz/ berka/ai-inet.htm, Laboratory of Intelligent Systems, University
of Economics,
[9] L. Bright, J. R. Gruser, L. Raschid, M. E. Vidal.
A Wrapper Generation Toolkit to Specify and Construct Wrappers for Web Accessible
Data Sources (WebSources).
Computer Systems Special Issue on Semantics on the WWW, Vol. 14 No. 2, March
1999.
[10] S. Brin.
Extracting Patterns and Relations from the World Wide Web.
International Workshop on the Web and Databases (WebDB'98), March 1998.
[11] M. E. Califf, R. J. Mooney.
Relational Learning of Pattern-Match Rules for Information Extraction.
Proceedings of the ACL Workshop on Natural Language , July 1997.
[12] M. E. Califf.
Relational Learning Techniques for Natural Language Information Extraction.
Ph.D. thesis, Department of Computer Sciences, August
1998. Technical Report AI98-276.
[13] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J.
Ullman, J. Widom.
The TSIMMIS Project: Integration of Heterogeneous Information Sources.
In Proceedings of IPSJ Conference, pp. 7--18, Japan, October 1994.
[14] B. Chidlovskii, U. M. Borghoff, P-Y. Chevalier.
Towards Sophisticated Wrapping of Web-based Information Repositories.
Proceedings of the 5th International RIAO Conference, June 1997.
[15] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery.
Learning to Extract Symbolic Knowledge from the World Wide Web.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98),
July 1998.
[16] M. Craven, S. Slattery, K. Nigam.
First-Order Learning for Web Mining.
Proceedings of the 10th European Conference on Machine , April
1998.
[17] R. B. Doorenbos, O. Etzioni, D. S. Weld.
A Scalable Comparison-Shopping Agent for the World Wide Web.
Technical report UW-CSE-, 1996.
[18] R. B. Doorenbos, O. Etzioni, D. S. Weld.
A Scalable Comparison-Shopping Agent for the World-Wide-Web.
Proceedings of the First International Conference on Autonomous Agents,
February 1997.
[19] O. Etzioni.
Moving up the Information Food Chain: Deploying Softbots on the World Wide Web.
AI Magazine, 18(2):11-18, 1997.
[20] D. Florescu, A. Levy, A. Mendelzon.
Database Techniques for the World Wide Web: A Survey.
ACM SIGMOD Record, Vol. 27, No. 3, September 1998.
[21] D. Freitag.
Information Extraction from HTML: Application of a General Machine Learning
Approach.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98),
July 1998.
[22] D. Freitag.
Machine Learning for Information Extraction in Informal Domains.
Ph.D. dissertation, November 1998.
[23] D. Freitag.
Multistrategy Learning for Information Extraction.
Proceedings of the 15th International Conference on Machine Learning (ICML-98),
July 1998.
[24] R. Gaizauskas, Y. Wilks.
Information Extraction: Beyond Document Retrieval.
Computational Linguistics and Chinese Language Processing, vol. 3, no. 2, pp. 17--60,
August 1998.
[25] H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J.
Widom.
Integrating and Accessing Heterogeneous Information Sources in TSIMMIS.
In Proceedings of the AAAI Symposium on Information Gathering, pp. 61--64,
Stanford, March 1995.
[26] S. Grumbach and G. Mecca.
In Search of the Lost Schema.
Proceedings of the International Conference on Database Theory (ICDT'99),
January 1999.
[27] J-R. Gruser, L. Raschid, M. E. Vidal, L. Bright.
Wrapper Generation for Web Accessible Data Source.
Proceedings of the 3rd IFCIS International Conference on Cooperative Information
Systems (CoopIS-98), New York, August 1998.
[28] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo.
Extracting Semistructured Information from Web.
Proceedings of the Workshop on Management of Semistructured Data,
Arizona, May 1997.
[29] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, V. Vassalos.
Template-Based Wrappers in the TSIMMIS System.
Proceedings of the 26th SIGMOD International Conference on Management of Data,
May 1997.
[30] C-H. Hsu.
Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers
and Contextual Rules.
Workshop on AI and Information Integration, in conjunction with the 15th National
Conference on Artificial Intelligence (AAAI-98), July 1998.
[31] C-H. Hsu, M-T. Dung.
Generating Finite-State Transducers for Semistructured Data Extraction from the
Web.
Information Systems, Vol. 23, No. 8, pp. 521--538, 1998.
[32] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G.
Philpot, S. Tejada.
Modeling Web Sources for Information Integration.
Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98),
July 1998.
[33] N. Kushmerick, D. S. Weld, R. Doorenbos.
Wrapper Induction for Information Extraction.
15th International Joint Conference on Artificial Intelligence (IJCAI-97),
August 1997.
[34] N. Kushmerick.
Wrapper Induction for Information Extraction.
Ph.D. Dissertation. Technical Report UW-CSE-,
1997.
[35] N. Kushmerick.
Wrapper induction: Efficiency and expressiveness.
Workshop on AI and Information Integration, in conjunction with the 15th National
Conference on Artificial Intelligence (AAAI-98), July 1998.
[36] N. Kushmerick.
Gleaning the Web.
IEEE Intelligent Systems, 14(2), March/April 1999.
[37] S. Lawrence, C. L. Giles.
Searching the World Wide Web.
Science magazine, v. 280, pp. 98--100, April 1998.
[38] A. Y. Levy, A. Rajaraman, J. J. Ordille.
Querying Heterogeneous Information Sources Using Source Descriptions.
Proceedings 22nd VLDB Conference, September 1996.
[39] S. Muggleton, C. Feng.
Efficient Induction of Logic Programs.
Proceedings of the First Conference on Algorithmic Learning Theory,
1990.
[40]
Extraction Patterns: From Information Extraction to Wrapper Induction.
Information Sciences Institute, 1998.
[41]
Extraction Patterns for Information Extraction Tasks: A Survey.
Workshop on Machine Learning for Information Extraction, July 1999.
[42] Muslea, S. Minton, C. Knoblock.
STALKER: Learning Extraction Rules for Semistructured, Web-based Information
Sources.
Workshop on AI and Information Integration, in conjunction with the 15th National
Conference on Artificial Intelligence (AAAI-98), July 1998.
[43] Muslea, S. Minton, C. Knoblock.
Wrapper Induction for Semistructured Web-based Information Sources.
Proceedings of the Conference on Automatic Learning and Discovery CONALD-98,
June 1998.
[44] Muslea, S. Minton, C. Knoblock.
A Hierarchical Approach to Wrapper Induction.
Third International Conference on Autonomous Agents (Agents'99), Seattle, May
1999.
[45] S. Nestorov, S. Abiteboul, R. Motwani.
Inferring Structure in Semistructured Data.
Proceedings of the 13th International Conference on Data Engineering (ICDE'97),
April 1997.
[46] STS Prasad, A. Rajaraman.
Virtual Database Technology, XML, and the Evolution of the Web.
Data Engineering, Vol. 21, No. 2, June 1998.
[47] J.R. Quinlan, R. M. Cameron-Jones.
FOIL: A Midterm Report.
European Conference on Machine Learning, 1993.
[48] A. Rajaraman.
Transforming the Internet into a Database.
Workshop on Reuse of Web information, in conjunction with WWW7, Brisbane, April
1998.
[49] A. Sahuguet, F. Azavant.
WysiWyg Web Wrapper Factory (W
http://cheops.cis.upenn.edu/ sahuguet/WAPI/wapi.ps.gz,
nia, August 1998.
[50] D. Smith, M. Lopez.
Information Extraction for Semistructured Documents.
Proceedings of the Workshop on Management of Semistructured Data, in conjunction
with PODS/SIGMOD, May 1997.
[51] S. Soderland.
Learning to Extract Text-based Information from the World Wide Web.
Proceedings of the 3rd International Conference on Knowledge Discovery and Data
Mining (KDD), August 1997.
[52] S. Soderland.
Learning Information Extraction Rules for Semistructured and Free Text.
Machine Learning, 1999.
[53] K. Zechner.
A Literature Survey on Information Extraction and Text Summarization.
Term paper, 1997.
[54] About mySimon.
http://www.mysimon.com/about mysimon/company/backgrounder.anml
