Python: UserWarning: This pattern has match groups. To actually get the groups, use str.extract

Posted on 2021-01-29 17:43:06

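I have a DataFrame, and I am trying to select the rows where one column contains any of a set of strings. The DataFrame looks like this: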

member_id,event_path,event_time,event_duration
30595,"2016-03-30 12:27:33",yandex.ru/,1
30595,"2016-03-30 12:31:42",yandex.ru/,0
30595,"2016-03-30 12:31:43",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:44",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:45",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:46",yandex.ru/search/?lr=10738&msid=22901.25826.1459330364.89548&text=%D1%84%D0%B8%D0%BB%D1%8C%D0%BC%D1%8B+%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD&suggest_reqid=168542624144922467267026838391360&csg=3381%2C3938%2C2%2C3%2C1%2C0%2C0,0
30595,"2016-03-30 12:31:49",kinogo.co/,1
30595,"2016-03-30 12:32:11",kinogo.co/melodramy/,0

and another DataFrame with URL patterns:

url
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_bq_phoenix
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnyj_telefon_fly_
003\.ru\/sonyxperia
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony
003\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/mobilnye_telefony_smartfony\/brands5D5Bbr_23
1click\.ru\/sonyxperia
1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola

I use:

urls = pd.read_csv('relevant_url1.csv', error_bad_lines=False)
substr = urls.url.values.tolist()
data = pd.read_csv('data_nts2.csv', error_bad_lines=False, chunksize=50000)
result = pd.DataFrame()
for i, df in enumerate(data):
    res = df[df['event_time'].str.contains('|'.join(substr), regex=True)]

but it gives me:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

How can I fix this?

1 Answer
  • 面试哥 2021-01-29

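    At least one of the regex patterns in urls must use a capturing group. str.contains only returns True or False for each row in
    df['event_time']; it does not make use of the capturing group. So the UserWarning is telling you that the regex uses a capturing group but the captured match is never used.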

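    If you want to get rid of the UserWarning, you can find and remove the capturing group(s) from the regex pattern(s). They do not appear in the patterns you posted, but they must be present in your actual file. Look for parentheses outside of the character classes.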
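
One way to locate the offending patterns is to ask Python's `re` module how many capturing groups each one has. The following is a minimal sketch, not part of the original answer: the helper name `patterns_with_groups` and the third pattern (`example\.ru...`) are hypothetical and exist only for illustration.

```python
import re

def patterns_with_groups(patterns):
    """Return (index, pattern, group_count) for every pattern that captures."""
    flagged = []
    for i, pat in enumerate(patterns):
        # Pattern.groups is the number of capturing groups in the regex.
        n_groups = re.compile(pat).groups
        if n_groups:
            flagged.append((i, pat, n_groups))
    return flagged

patterns = [
    r"003\.ru\/sonyxperia",
    # Parentheses inside a character class are literals, not a group.
    r"1click\.ru\/[a-zA-Z0-9-_%$#?.:+=|()]+\/chasy-motorola",
    # Hypothetical pattern with a real capturing group.
    r"example\.ru\/(phones|tablets)\/",
]

flagged = patterns_with_groups(patterns)
for index, pattern, count in flagged:
    print(f"row {index}: {count} capturing group(s) in {pattern!r}")
```

In the question's setup, the list would presumably come from `urls.url.values.tolist()` instead of the hard-coded `patterns`.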

    Alternatively, you can suppress this particular UserWarning by placing

    import warnings
    warnings.filterwarnings("ignore", 'This pattern has match groups')
    

    before the call to str.contains.
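
If a global filter feels too broad, the suppression can be limited to a single call with `warnings.catch_warnings()`. This is a rough sketch under stated assumptions: `call_without_group_warning` is a hypothetical helper, and `contains_stub` is a hypothetical stand-in that emits the same message as in the question so the example runs without pandas. The filter matches on "match groups" rather than the full message prefix because, as far as I know, the exact wording of this warning differs between pandas versions.

```python
import warnings

def call_without_group_warning(func, *args, **kwargs):
    """Call func, ignoring only the 'match groups' UserWarning while it runs."""
    with warnings.catch_warnings():  # previous filters are restored on exit
        warnings.filterwarnings(
            "ignore",
            message=r".*match groups",  # regex matched from the start of the message
            category=UserWarning,
        )
        return func(*args, **kwargs)

def contains_stub(text):
    """Hypothetical stand-in for str.contains: emits the same UserWarning."""
    warnings.warn(
        "This pattern has match groups. To actually get the groups, use str.extract.",
        UserWarning,
    )
    return text.startswith("g")

# No warning is shown for this call; other warnings are unaffected.
result = call_without_group_warning(contains_stub, "gouda")
```

With pandas, the wrapped call would be something like `df['event_time'].str.contains(pattern, regex=True)` in place of `contains_stub`.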


    Here is a simple example that demonstrates the problem (and the solution):

    # import warnings
    # warnings.filterwarnings("ignore", 'This pattern has match groups') # uncomment to suppress the UserWarning
    
    import pandas as pd
    
    df = pd.DataFrame({ 'event_time': ['gouda', 'stilton', 'gruyere']})
    
    urls = pd.DataFrame({'url': ['g(.*)']})   # With a capturing group, there is a UserWarning
    # urls = pd.DataFrame({'url': ['g.*']})   # Without a capturing group, there is no UserWarning. Uncommenting this line avoids the UserWarning.
    
    substr = urls.url.values.tolist()
    df[df['event_time'].str.contains('|'.join(substr), regex=True)]
    

    prints

      script.py:10: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
      df[df['event_time'].str.contains('|'.join(substr), regex=True)]
    

    Removing the capturing group from the regex pattern:

    urls = pd.DataFrame({'url': ['g.*']})
    

    avoids the UserWarning.
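
When the parentheses are needed for alternation and cannot simply be deleted, a non-capturing group `(?:...)` is another option. The sketch below uses only the `re` module and hypothetical patterns; it assumes that pandas decides whether to warn based on the number of capturing groups in the compiled pattern, which is my understanding rather than something stated in the answer.

```python
import re

cheeses = ["gouda", "stilton", "gruyere"]

capturing = r"g(ou|ru)"        # one capturing group
non_capturing = r"g(?:ou|ru)"  # same alternation, nothing captured

# Pattern.groups counts capturing groups only.
n_capturing = re.compile(capturing).groups
n_non_capturing = re.compile(non_capturing).groups

# Both patterns select exactly the same strings.
matches_capturing = [bool(re.search(capturing, c)) for c in cheeses]
matches_non_capturing = [bool(re.search(non_capturing, c)) for c in cheeses]

print(n_capturing, n_non_capturing)
print(matches_capturing == matches_non_capturing)
```

The matching behaviour is unchanged; only the captured sub-match, which str.contains never uses, is dropped.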


