Analysis and Mining of Big Text Data: Opportunities, Challenges, and Applications
2020-02-27
- 1. Analysis and Mining of Big Text Data: Opportunities, Challenges, and Applications. ChengXiang Zhai (翟成祥), Department of Computer Science, University of Illinois at Urbana-Champaign, USA
- 2. Text data cover all kinds of topics. Topics: People, Events, Products, Services, … Sources: Blogs, Microblogs, Forums, Reviews, … Scale figures shown on the slide: 45M reviews, 53M blogs, 65M msgs/day, 1307M posts, 115M users, 10M groups, …
- 3. Humans as Subjective & Intelligent “Sensors”. [Figure: a sensor senses the real world and reports data. Weather: Thermometer: 3C, 15F, … Locations: Geo Sensor: 41°N and 120°W, … Networks: Network Sensor: 01000100011100. A “Human Sensor” perceives the real world and expresses what it observes.]
- 4. Unique Value of Text Data • Useful to all big data applications • Especially useful for mining knowledge about people’s behavior, attitude, and opinions • Directly express knowledge about our world (high-quality data), so small text data are also useful! [Figure: Data, Information, Knowledge; Text Data]
- 5. Opportunities of Text Mining Applications • 1. Mining knowledge about language • 2. Mining content of text data • 3. Mining knowledge about the observer • 4. Infer other real-world variables (predictive analytics), using text data + non-text data + context [Figure: Real World, Perceive (Perspective), Observed World, Express (English), Text Data]
- 6. Challenges in Understanding Text Data (NLP). Example sentence: “A dog is chasing a boy on the playground.” • Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun • Syntactic analysis (parsing): Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). • Inference: adding the rule “Scared(x) if Chasing(_,x,_)” yields Scared(b1). • Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back.
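The semantic-analysis and inference steps on slide 6 can be illustrated with a minimal Python sketch. It assumes the facts have already been extracted from the sentence; the names `Fact` and `infer_scared` are hypothetical and appear only here, not in the slides. The single rule applied is the one shown on the slide, Scared(x) if Chasing(_, x, _).

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A ground fact such as Chasing(d1, b1, p1). Hypothetical helper type."""
    predicate: str
    args: tuple


# Assumed output of semantic analysis for
# "A dog is chasing a boy on the playground".
facts = {
    Fact("Dog", ("d1",)),
    Fact("Boy", ("b1",)),
    Fact("Playground", ("p1",)),
    Fact("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(known):
    """Apply the slide's rule: Scared(x) if Chasing(_, x, _)."""
    derived = set()
    for fact in known:
        if fact.predicate == "Chasing" and len(fact.args) == 3:
            # Only the second argument (the one being chased) matters.
            _, chased, _ = fact.args
            derived.add(Fact("Scared", (chased,)))
    return derived


new_facts = infer_scared(facts)
print(new_facts)
```

The rule is common-sense knowledge that the sentence itself never states, which is why it has to be supplied separately from the extracted facts.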
- 7. NLP is hard! • Natural language is designed to make human communication efficient. As a result, – we omit a lot of common-sense knowledge, which we assume the hearer/reader possesses. – we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve. • This makes EVERY step in NLP hard – Ambiguity is a killer! – Common-sense reasoning is a prerequisite.
- 8. Examples of Challenges • Word-level ambiguity: