How to use regular expressions to remove hashtags, @user mentions, and links from tweets

Posted on 2021-01-29 15:10:24

I need to preprocess tweets with Python. What regular expressions would remove all hashtags, @user mentions, and links from a tweet?

For example,

  1. Original tweet: @peter I really love that shirt at #Macy. http://bit.ly//WjdiW4
    • Processed tweet: I really love that shirt at Macy
  2. Original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx
    • Processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
  3. Original tweet: I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
    • Processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I only need the meaningful words in each tweet. I do not need usernames, links, or punctuation.

1 answer
  • 面试哥 2021-01-29

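    The example below is a close approximation. Unfortunately, there is no fully correct way to do this with a regular expression alone. The regex strips URLs (any scheme://, not just http), @usernames, and every remaining non-alphanumeric character, punctuation included. The surrounding split and join then leave the words separated by single spaces. To parse tweets the way you intend, the system needs more intelligence than this, perhaps some kind of self-learning algorithm, because there is no standard tweet feed format.

    Here is what I suggest.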

    import re
    ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
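
The one-liner above can be unpacked to show what each alternative does. The following is only a sketch of the same pattern in verbose form; the names `TWEET_NOISE` and `clean_tweet` are made up for illustration and are not part of the original answer.

```python
import re

# Same three alternatives as the one-liner, written with re.VERBOSE so each
# branch can carry a comment. Whitespace inside a character class is kept.
TWEET_NOISE = re.compile(
    r"""
      (@[A-Za-z0-9]+)      # @user mentions made of ASCII letters and digits
    | ([^0-9A-Za-z \t])    # anything that is not a letter, digit, space, or tab
    | (\w+://\S+)          # scheme://... links such as http://, https://, ftp://
    """,
    re.VERBOSE,
)


def clean_tweet(text: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


print(clean_tweet("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at Macy
```

The link branch still wins on a URL even though it is listed last: the scan reaches the `h` of `http` first, and at that position neither of the first two branches can match.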
    

    Here are the results for your examples:

    >>> import re
    >>> x = "@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'I really love that shirt at Macy'
    >>> x = "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
    >>> x = "I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
    >>>
    

    Here are some examples where the result is not perfect:

    >>> x = "I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'I c RT that s my excited face and my regular face The expression never changes'
    >>> x = "RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
    >>> # Adding # to the first character class also drops the hashtag word, which reads a bit better
    >>> ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
    >>> x = "New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
    >>> ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", x).split())
    'New comment by diego bosca Re Re wrong regular expression'
    >>> # See how miserably it performed?
    >>>
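
The hashtag tweak shown in the transcript can be made a switch if the cleaning is split into separate passes. The sketch below is a variant, not the answer's original one-liner: the function name `clean_tweet_passes` and the flag `drop_hashtag_words` are hypothetical. It also assumes that mentions should match `@\w+`, so that an underscore in a username does not leave a fragment behind.

```python
import re

URL = re.compile(r"\w+://\S+")            # scheme://... links
MENTION = re.compile(r"@\w+")             # @user, underscores included (assumption)
HASHTAG = re.compile(r"#\w+")             # '#' together with the tagged word
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")  # remaining punctuation and symbols


def clean_tweet_passes(text: str, drop_hashtag_words: bool = False) -> str:
    """Strip links, mentions, optionally hashtag words, then punctuation."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    if drop_hashtag_words:
        text = HASHTAG.sub(" ", text)
    text = NON_ALNUM.sub(" ", text)
    # Collapse whatever whitespace the substitutions left behind.
    return " ".join(text.split())


x = "RT @AstrologyForYou: #Gemini recharges through regular contact"
print(clean_tweet_passes(x))
# RT Gemini recharges through regular contact
print(clean_tweet_passes(x, drop_hashtag_words=True))
# RT recharges through regular contact
```

Running the passes in a fixed order makes it explicit that links and mentions are removed before the catch-all punctuation pass can break them apart. It still shares the limits noted above, such as splitting `diego.bosca` into two words.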
    

