SpaCy parenthesis tokenization: an (LRB, RRB) pair is not tokenized correctly
If the word following the RRB is not separated from it by a space, the closing parenthesis is recognized as part of that word.
In [34]: nlp("Indonesia (CNN)AirAsia ")
Out[34]: Indonesia (CNN)AirAsia
In [35]: d=nlp("Indonesia (CNN)AirAsia ")
In [36]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[36]:
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'),
('(', '(', 'PUNCT', '-LRB-'),
('CNN)AirAsia', 'CNN)AirAsia', 'PROPN', 'NNP')]
In [39]: d=nlp("(CNN)Police")
In [40]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[40]: [('(', '(', 'PUNCT', '-LRB-'), ('CNN)Police', 'cnn)police', 'VERB', 'VB')]
The expected result is:
In [37]: d=nlp("(CNN) Police")
In [38]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]
Out[38]:
[('(', '(', 'PUNCT', '-LRB-'),
('CNN', 'CNN', 'PROPN', 'NNP'),
(')', ')', 'PUNCT', '-RRB-'),
('Police', 'Police', 'NOUN', 'NNS')]
Is this a bug? Are there any suggestions for working around it?
-
Add the r'\b\)\b' rule to the infixes using a custom tokenizer (see the regex demo). The regex matches a ) that is preceded by any word char (a letter, digit, _, and some other rare chars in Python 3) and followed by the same type of char. You can customize this regex further; it largely depends on the contexts in which you want to match the ). See the full Python demo:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(nlp):
    # Prepend the new rule so a ')' between word chars becomes an infix split point,
    # then keep all of spaCy's default infix patterns
    infixes = tuple([r"\b\)\b"]) + nlp.Defaults.infixes
    infix_re = compile_infix_regex(infixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Indonesia (CNN)AirAsia ")
print([(t.text, t.lemma_, t.pos_, t.tag_) for t in doc])
Output:
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'), ('(', '(', 'PUNCT', '-LRB-'), ('CNN', 'CNN', 'PROPN', 'NNP'), (')', ')', 'PUNCT', '-RRB-'), ('AirAsia', 'AirAsia', 'PROPN', 'NNP')]
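As a quick sanity check of the infix pattern itself (this standalone snippet is my illustration, using only the stdlib re module, not part of the answer above), r'\b\)\b' matches only a ) squeezed between word characters, so already well-tokenized text is left alone:

```python
import re

# The infix rule added above: ')' with a word char on both sides
infix = re.compile(r"\b\)\b")

print(bool(infix.search("CNN)Police")))    # ')' between 'N' and 'P' -> True
print(bool(infix.search("(CNN) Police")))  # ')' followed by a space -> False
print(bool(infix.search("end.) Next")))    # ')' preceded by '.'     -> False
```

This is why the custom tokenizer splits "(CNN)AirAsia" into (, CNN, ), AirAsia without affecting sentences where the parenthesis is already surrounded by punctuation or whitespace.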