2 Lexical Mining and Entity Mining
2020-03-01
- 1.Knowledge Graphs: Concepts and Techniques. Lecture 2: Lexical Mining and Entity Mining. Xiang Ren, University of Southern California
- 2.Mining Structures from Massive Text Data. Unstructured text data (accounting for ~80% of all data in organizations) → Machine? Human? → Knowledge & insights (Chakraborty, 2016). Icons thanks to flaticon.com
- 3.Knowledge Graph: entities, relations, attributes. http://searchengineland.com/laymans-visual-guide-googles-knowledge-graph-search-api-2419350
- 4.Structure Mining: from a massive text corpus to entities, relations, and attribute names & values. Example text: "The Mona Lisa is a half-length portrait painting by the Italian Renaissance artist Leonardo da Vinci that has …"; "Leonardo di ser Piero da Vinci (15 April 1452 – 2 May 1519) …". Entity: Mona Lisa. Relation: paint. Attribute names and attribute values: …
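As a rough illustration of the target output of structure mining, the Mona Lisa example can be written down as entities, one relation, and attribute name/value pairs. This is only a sketch: the class names `Entity` and `Relation` and the attribute names `type`, `born`, and `died` are hypothetical choices made here, not a schema given in the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity with a name and attribute name -> attribute value pairs."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A typed, directed relation between two entity names."""
    head: str
    label: str
    tail: str


# Attribute names below are illustrative assumptions, not from the slides.
mona_lisa = Entity("Mona Lisa", {"type": "half-length portrait painting"})
leonardo = Entity(
    "Leonardo da Vinci",
    {"born": "15 April 1452", "died": "2 May 1519"},
)
paint = Relation(head=leonardo.name, label="paint", tail=mona_lisa.name)

print(paint)
print(leonardo.attributes)
```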
- 5.A Product Use Case: Finding "Interesting Hotel Collections" (technology transfer to TripAdvisor). Hotels are grouped based on structured facts extracted from the review text. Features for the "Catch a Show" collection: (1) broadway shows, (2) beacon theater, (3) broadway dance center, (4) broadway plays, (5) david letterman show, (6) radio city music hall, (7) theatre shows. Features for the "Near The High Line" collection: (1) high line park, (2) chelsea market, (3) highline walkway, (4) elevated park, (5) meatpacking district, (6) west side, (7) old railway. http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
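One simple way to picture the grouping step is to assign a hotel to a collection whenever enough of that collection's feature phrases occur in its reviews. The sketch below is an assumption-laden toy, not TripAdvisor's actual method: the function `assign_collections`, the `min_hits` threshold, and the two sample reviews are all invented for illustration, and only a few of the feature phrases from the slide are used.

```python
# A subset of the feature phrases listed on the slide.
COLLECTION_FEATURES = {
    "Catch a Show": ["broadway shows", "beacon theater", "radio city music hall"],
    "Near The High Line": ["high line park", "chelsea market", "meatpacking district"],
}


def assign_collections(reviews_by_hotel, features=COLLECTION_FEATURES, min_hits=1):
    """Map each hotel to the collections whose feature phrases appear in its reviews."""
    result = {}
    for hotel, reviews in reviews_by_hotel.items():
        text = " ".join(reviews).lower()
        result[hotel] = [
            name
            for name, phrases in features.items()
            # Count how many distinct feature phrases occur in the review text.
            if sum(phrase in text for phrase in phrases) >= min_hits
        ]
    return result


# Invented reviews, for illustration only.
reviews = {
    "Hotel A": ["Walked to Radio City Music Hall and saw two Broadway shows."],
    "Hotel B": ["Steps from Chelsea Market and High Line Park."],
}
print(assign_collections(reviews))
```

Plain substring matching is the crudest possible choice here; it stands in for the structured facts that a phrase mining pipeline would extract from the reviews.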
- 6.Why Text to Structure? Structured search & exploration; facet taxonomy construction; graph mining & network analysis; structured feature generation
- 7.Prior Art: Extracting Structure with Domain Expert Effort. Text corpus: news, reviews, scientific papers, … (e.g., NYTimes news, Yelp reviews, PubMed papers). Domain experts provide extraction rules and machine-learning models to obtain entities, relations, and attribute names & values. Examples: Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, … • Models for the same task may require different labeled data in different domains
- 8.This Lecture: Automatic Structure Mining from Massive Text Corpora. Text corpus: news, reviews, scientific papers, … (e.g., NYTimes news, Yelp reviews, PubMed papers). Public knowledge bases. Methods: AutoPhrase, CoType, MetaPAD, … Extraction rules; machine-learning models. • Enables quick development of applications in various domains. • Extracts complex structures without introducing additional human effort.
- 9."Automatic" Definition. Automatic = minimal human effort: using only existing general knowledge bases without any other human effort. Knowledge bases (see the number of Wikipedia articles): • Rapidly growing! • Freely available! • Cover common knowledge, life sciences, art, … ERA structures: entity names, entity types, typed relationships ... That's it? Problem solved? Everything can be found in KBs? https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
- 10.Automatic Structure Mining: Methodology. Inputs: massive text corpus and knowledge bases. Automatic phrase mining methods (SIGMOD'15, arXiv'17) yield entity names & context units. Typing and relation extraction methods (KDD'15, KDD'16, EMNLP'16, WWW'17) yield typed entities & relations. Meta-pattern-driven attribute name & value discovery methods (KDD'17) yield attribute names & values and more general relations. Other tasks and applications (ECMLPKDD'17, KDD'17, WWW'16)
- 11.Lecture Outline: Introduction; Part I: Entity Extraction through Phrase Mining; Part II: Entity Typing
- 12.Part I: Entity Extraction through Phrase Mining
- 13.Automatic Structure Mining: Methodology. Inputs: massive text corpus and knowledge bases. Automatic phrase mining methods (SIGMOD'15, arXiv'17) yield entity names & context units. Typing and relation extraction methods (KDD'15, KDD'16, EMNLP'16, WWW'17) yield typed entities & relations. Meta-pattern-driven attribute name & value discovery methods (KDD'17) yield attribute names & values and more general relations. Other tasks and applications (ECMLPKDD'17, KDD'17, WWW'16)
- 14.Definition: Quality Phrase Mining. Quality phrase mining seeks to extract a ranked list of phrases with decreasing quality from a large collection of documents. Examples: for scientific papers, the expected results are: data mining, machine learning, information retrieval, …, support vector machine, …, the paper, … For news articles, the expected results are: US President, Anderson Cooper, Barack Obama, …, Obama administration, …, a town, …
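To make the idea of a ranked phrase list concrete, here is a minimal baseline sketch that ranks candidate n-grams by raw frequency alone. It is not one of the methods covered in this lecture; the function name `rank_candidates`, its parameters, and the three toy documents are assumptions made for illustration.

```python
from collections import Counter


def rank_candidates(documents, max_len=3, min_count=2):
    """Rank word n-grams (2..max_len words) by raw frequency, highest first."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # Keep only candidates that occur at least min_count times.
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


# Toy documents, invented for illustration.
docs = [
    "the paper studies data mining",
    "the paper applies data mining",
    "data mining helps machine learning",
]
print(rank_candidates(docs))
```

On this toy input the frequent but uninformative candidate "the paper" survives the frequency cutoff, which is why a quality ranking needs criteria beyond frequency.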
- 15.Why Phrase Mining? Without phrase mining: • What is "united"? • Which Dao? With phrase mining: • United Airlines! • David Dao! Applications in NLP, IR, and text mining: document analysis; indexing in search engines; keyphrases for topic modeling; summarization
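The contrast between word-level and phrase-level units can be sketched with a greedy longest-match segmenter over a known phrase list. This is a simplified stand-in, assuming the phrase list already exists; the function `segment`, the `max_len` parameter, and the example sentence are hypothetical.

```python
def segment(tokens, phrases, max_len=3):
    """Greedy longest-match segmentation against a set of lowercase phrases."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest window first; a single token always matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate.lower() in phrases:
                out.append(candidate)
                i += n
                break
    return out


phrases = {"united airlines", "david dao"}  # assumed output of phrase mining
tokens = "United Airlines removed David Dao".split()  # invented sentence

print(tokens)                    # word-level units: "United" and "Dao" are ambiguous
print(segment(tokens, phrases))  # phrase-level units
```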
- 16.What Kind of Phrases Are of "High Quality"? Popularity: frequency, e.g., "information retrieval" vs. "cross-language information retrieval". Concordance: a sequence of words that occurs more frequently than expected, e.g., "powerful tea" vs. "strong tea"; "active learning" vs. "learning classification". Concordance can be measured using many statistical measures, e.g., significance score, mutual information, t-test, z-test, chi-squared test, likelihood ratio, … Informativeness: e.g., "this paper" (frequent but not discriminative, not informative). Completeness: e.g., "vector machine" vs. "support vector machine"
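Of the concordance measures listed, pointwise mutual information is perhaps the easiest to sketch: it compares how often two words occur together with how often they would co-occur if they were independent. The code below is a minimal illustration under simplifying assumptions (whitespace tokenization, maximum-likelihood probability estimates, no smoothing); the function `bigram_pmi` and the toy corpus are invented here.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        # Probability of the pair if the two words were independent.
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores


# Toy corpus, invented so that "strong tea" is the habitual collocation.
corpus = (
    ["i like strong tea"] * 3
    + ["i like powerful tea"]
    + ["i like powerful computers"] * 3
)
scores = bigram_pmi(corpus)
print(scores[("strong", "tea")] > scores[("powerful", "tea")])
```

PMI addresses concordance only; popularity, informativeness, and completeness each need their own signal.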
- 17.Our Recent Efforts on Phrase Mining. Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han, "Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents", SIAM Data Mining Conf. (SDM), 2014. ToPMine (2014-2015): code package downloadable at http://elkishk2.web.engr.illinois.edu; Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", 2015 Int. Conf. on Very Large Data Bases (VLDB'15). SegPhrase (2015): GitHub source: