NLTK: chunk grammar does not split on commas
Posted on 2021-01-29 14:59:40
import nltk  # needed for nltk.RegexpParser below
from nltk.chunk.util import tagstr2tree
from nltk import word_tokenize, pos_tag
text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
# Whitespace splitting leaves punctuation attached to the words
tagged_text = pos_tag(text.split())
# Chunk one or more consecutive proper nouns into an NP
grammar = "NP:{<NNP>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged_text)
print(result)
Output:
(S
(NP John/NNP Rose/NNP Center/NNP)
is/VBZ
very/RB
beautiful/JJ
place/NN
and/CC
i/NN
want/VBP
to/TO
go/VB
there/RB
with/IN
(NP Barbara/NNP Palvin./NNP)
Also/RB
there/EX
are/VBP
stores/NNS
like/IN
(NP Adidas/NNP ,Nike/NNP ,Reebok/NNP Center./NNP))
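The grammar I use for chunking only matches NNP tags, but when words are separated by commas they still end up on the same line, that is, in the same NP chunk. The output I want is: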
(S
(NP John/NNP Rose/NNP Center/NNP)
is/VBZ
very/RB
beautiful/JJ
place/NN
and/CC
i/NN
want/VBP
to/TO
go/VB
there/RB
with/IN
(NP Barbara/NNP Palvin./NNP)
Also/RB
there/EX
are/VBP
stores/NNS
like/IN
(NP Adidas,/NNP)
(NP Nike,/NNP)
(NP Reebok/NNP Center./NNP))
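What should I write in "grammar =", or can I edit the output so that it looks like the above? As you can see, I am only parsing proper nouns for my named-entity project. Please help.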
1 answer
Use word_tokenize(string) instead of string.split():
>>> import nltk
>>> from nltk.chunk.util import tagstr2tree
>>> from nltk import word_tokenize, pos_tag
>>> text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
>>> tagged_text = pos_tag(word_tokenize(text))
>>> grammar = "NP:{<NNP>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(tagged_text)
>>> print(result)
(S
(NP John/NNP Rose/NNP Center/NNP)
is/VBZ
very/RB
beautiful/JJ
place/NN
and/CC
i/NN
want/VBP
to/TO
go/VB
there/RB
with/IN
(NP Barbara/NNP Palvin/NNP)
./.
Also/RB
there/EX
are/VBP
stores/NNS
like/IN
(NP Adidas/NNP)
,/,
(NP Nike/NNP)
,/,
(NP Reebok/NNP Center/NNP)
./.)
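To see why the tokenizer matters here, the following minimal sketch compares whitespace splitting with a punctuation-aware split on the last sentence of the example text. The regular expression is only a rough stand-in chosen for illustration and is not how word_tokenize works internally; it is used so the snippet runs without NLTK or its data files.

```python
import re

text = "Also there are stores like Adidas ,Nike ,Reebok Center."

# str.split() splits on whitespace only, so commas and periods stay
# glued to the neighbouring words (",Nike", "Center.").
whitespace_tokens = text.split()

# Rough stand-in for a punctuation-aware tokenizer (an assumption for
# illustration, not NLTK's algorithm): match either a run of word
# characters or a single non-space punctuation character.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_aware_tokens)
```

With whitespace splitting, the tagger receives tokens such as ",Nike" and tags them NNP, so the run of NNP tags is never interrupted and <NNP>+ merges them into one chunk. Once the commas are separate tokens, they get their own "," tag, which ends each NNP run.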