Python

如何使用正则表达式删除推文的主题标签，@ user，链接

发布于 2021-01-29 15:10:24

我需要使用Python预处理推文。现在我想知道分别删除所有标签，@ user和tweet链接的正则表达式是什么？

例如，

original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
- 已处理的推文： I really love that shirt at Macy
原始推文： @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
- 已处理的推文： Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
原始推文： I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
- 已处理的推文： I am at Starbucks 7419 3rd ave at 75th Brooklyn

我只需要每个推文中有意义的词即可。我不需要用户名，任何链接或标点符号。

关注者

被浏览

1 个回答

面试哥 2021-01-29

为面试而生，有面试问题，就找面试哥。

以下示例是一个近似的例子。不幸的是，仅通过正则表达式没有正确的方法。以下正则表达式仅去除URL（不只是http），任何标点，用户名或任何非字母数字字符。它还将单词分隔为单个空格。如果您想按预期分析推文，则系统中需要更多智能。考虑到没有标准tweet提要格式的一些认知性自我学习算法。

这是我的建议。

' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

这是你的例子的结果

>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>

这是一些不完美的例子

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>

知识点

Python

面圈网VIP题库全新上线，海量真题题库资源。 90大类考试，超10万份考试真题开放下载啦

去下载看看