ChengXiang Zhai (翟成祥) - Analysis and Mining of Big Text Data: Opportunities, Challenges, and Applications

2020-02-27

  • 1.Analysis and Mining of Big Text Data: Opportunities, Challenges, and Applications. ChengXiang Zhai (翟成祥), Department of Computer Science, University of Illinois at Urbana-Champaign, USA
  • 2.Text data cover all kinds of topics. Topics: People, Events, Products, Services, … Sources: Blogs, Microblogs, Forums, Reviews, … Scale figures shown on the slide: 45M reviews, 53M blogs, 65M msgs/day, 1307M posts, 115M users, 10M groups, …
  • 3.Humans as Subjective & Intelligent “Sensors”. A sensor senses the real world and reports data: a thermometer reports weather (3°C, 15°F, …), a geo sensor reports locations (41°N and 120°W, …), and a network sensor reports network data (01000100011100). A “human sensor” perceives the real world and expresses what it perceives.
  • 4.Unique Value of Text Data • Useful to all big data applications • Especially useful for mining knowledge about people’s behavior, attitudes, and opinions • Directly expresses knowledge about our world, so it is high-quality data • Small text data are also useful! Data → Information → Knowledge (text data)
  • 5.Opportunities of Text Mining Applications. The observer perceives the Real World from a perspective, forming an Observed World, and expresses it in a language (English) as Text Data. • 1. Mining knowledge about language • 2. Mining content of text data • 3. Mining knowledge about the observer • 4. Infer other real-world variables (predictive analytics), using text data + non-text data + context
  • 6.Challenges in Understanding Text Data (NLP). Example sentence: “A dog is chasing a boy on the playground.” • Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun • Syntactic analysis (parsing): “A dog” is a Noun Phrase, “is chasing” a Complex Verb, “a boy” and “the playground” Noun Phrases; “on the playground” is a Prep Phrase; these combine into Verb Phrases and then a Sentence • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). • Inference: adding the rule Scared(x) if Chasing(_,x,_) yields Scared(b1) • Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back.
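The stages on slide 6 can be made concrete with a small sketch. The Python below is only an illustration: the tags and facts are copied by hand from the slide rather than produced by any tagger or parser, and the names `tagged`, `facts`, and `infer_scared` are hypothetical helpers introduced here, not part of the talk.

```python
# Hand-coded analysis of the slide's example sentence; no NLP library is used.
sentence = "A dog is chasing a boy on the playground".split()

# Lexical analysis: the part-of-speech tags listed on the slide, in word order.
pos_tags = ["Det", "Noun", "Aux", "Verb", "Det", "Noun", "Prep", "Det", "Noun"]
tagged = list(zip(sentence, pos_tags))

# Semantic analysis: the slide's facts as (predicate, arguments) pairs.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(known_facts):
    """Apply the slide's rule: Scared(x) if Chasing(_, x, _)."""
    derived = set()
    for predicate, args in known_facts:
        if predicate == "Chasing" and len(args) == 3:
            # The second argument of Chasing is the one being chased.
            derived.add(("Scared", (args[1],)))
    return derived


new_facts = infer_scared(facts)
print(tagged)
print(new_facts)
```

Even in this toy form, the rule that links chasing to being scared has to be supplied from outside the sentence, which is the kind of common sense knowledge the talk says text leaves unstated.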
  • 7.NLP is hard! • Natural language is designed to make human communication efficient. As a result, – we omit a lot of common sense knowledge, which we assume the hearer/reader possesses. – we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve. • This makes EVERY step in NLP hard – Ambiguity is a killer! – Common sense reasoning is a prerequisite.
  • 8.Examples of Challenges • Word-level ambiguity: