Need to compare very large files (around 1.5 GB) in Python

Posted on 2021-01-29 16:03:02

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"

The above is sample data. The data is sorted by email address, and the file is very large, around 1.5 GB.

I want the output in another CSV file to look something like this:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days

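That is, if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on. In other words, I need to count the number of occurrences of each email address in the file, and if an email appears twice or more I want the difference between the dates. Keep in mind that the dates are not sorted, so they also have to be sorted within each email address. I am looking for a Python solution using numpy, pandas, or any other library that can handle data of this size without raising an out-of-memory exception. I have a dual-core processor running CentOS 6.3 with 4 GB of RAM.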

1 answer
  • 面试哥 2021-01-29

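    Another possible (sysadmin) approach, which avoids a database and SQL queries as well as heavy demands on runtime processes and hardware resources.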

    Update 20/04: added more code and a simplified approach:

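    1. Convert the timestamp to seconds since the epoch and run UNIX sort on the email and this new field (that is: sort -t, -k2,2 -k4,4n < converted_input_file > output_file).
    2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
    3. Iterate over each line. If a new email is encountered, append "1,0 days". Update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
    4. Next line: 3 possible cases
      • a) Same email, different timestamp: calculate the difference in days, increment COUNT by 1, update PREV_TIME, append "COUNT,Difference_in_days".
      • b) Same email, same timestamp: increment COUNT, append "COUNT,0 days".
      • c) New email: start again from step 3.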

    An alternative to step 1 is to add a new TIMESTAMP field and remove it again when printing out the line.

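    Note: if 1.5 GB is too large to sort in one go, split the file into smaller chunks, using the email as the split point. You can process these chunks in parallel on different machines.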
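Because the input is already sorted by email, the split described in the note can be done in a single streaming pass. The following is only a minimal sketch, assuming Python 3.6 or later; the function name `split_by_email`, the chunk file naming and the `max_rows` threshold are hypothetical choices for illustration, not part of the original answer. It starts a new chunk only when the email changes, so all rows for one address stay in the same file.

```python
import csv


def split_by_email(in_path, out_prefix, max_rows=1_000_000):
    """Split a CSV that is sorted by email (column 2) into chunk files.

    A new chunk is opened only at an email boundary, so a run of rows
    sharing one email is never spread over two files.
    Returns the list of chunk paths that were written.
    """
    paths = []
    out = None
    writer = None
    rows_in_chunk = 0
    prev_email = None
    with open(in_path, newline="") as src:
        for row in csv.reader(src):
            email = row[1]
            # Open a new chunk for the first row, or once the current chunk
            # is full AND the email differs from the previous row.
            if out is None or (rows_in_chunk >= max_rows and email != prev_email):
                if out is not None:
                    out.close()
                path = f"{out_prefix}{len(paths):04d}.csv"
                paths.append(path)
                out = open(path, "w", newline="")
                # Quote every field so chunks look like the input file.
                writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1
            prev_email = email
    if out is not None:
        out.close()
    return paths
```

Only one row is held in memory at a time, so the 4 GB machine from the question should not be a constraint for this step. A chunk can exceed `max_rows` when a single email has many rows, which is the price of never cutting inside an email group.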

    /usr/bin/gawk -F'","' '
    BEGIN {
        # Build the month-name lookup once instead of once per input line.
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
        for (i=1; i<=12; i++) mdigit[month[i]]=i;
    }
    {
        # $4 is DDMONYYYY; mktime expects "YYYY MM DD HH MM SS".
        print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
    }' < input.txt | /usr/bin/sort -t, -k2,2 -k7,7n > output_file.txt
    

    output_file.txt:

    "DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
    "DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
    "DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
    "DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
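Since the question asked for Python, the conversion half of the gawk command above could also be sketched in Python. This is an illustrative equivalent rather than the answer's own code: the names `to_epoch` and `append_timestamps` are hypothetical, and it assumes the date is always the fourth field in DDMONYYYY form and that midnight UTC is the intended reference time. The sorting step would still be left to UNIX sort.

```python
import calendar
import csv

# Month abbreviation -> month number, mirroring the lookup in the gawk script.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def to_epoch(ddmonyyyy):
    """Convert a date such as '26JUL2010' to epoch seconds at midnight UTC."""
    day = int(ddmonyyyy[:2])
    month = MONTHS[ddmonyyyy[2:5].upper()]
    year = int(ddmonyyyy[5:9])
    return calendar.timegm((year, month, day, 0, 0, 0))


def append_timestamps(src, dst):
    """Copy each CSV line from src to dst with the epoch time appended."""
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Parse only to locate the date (4th field); the original line is
        # written back untouched so its quoting is preserved.
        date_field = next(csv.reader([line]))[3]
        dst.write(f"{line},{to_epoch(date_field)}\n")
```

Calling `append_timestamps(sys.stdin, sys.stdout)` from a small script (after `import sys`) would let it take the place of the gawk stage in the pipeline. An explicit month table is used instead of `strptime("%b")` so the result does not depend on the system locale.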

    You then pipe the output into a Perl, Python, or AWK script to handle steps 2 to 4.
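Steps 2 to 4 could look roughly like the following in Python. It is a sketch under stated assumptions: the input is already sorted by email and then by timestamp, the epoch timestamp is the last comma-separated field as in output_file.txt, and the function name `annotate` is hypothetical. It keeps only the three state variables from step 2 in memory.

```python
import csv

SECONDS_PER_DAY = 86400


def annotate(src, dst):
    """Append a per-email counter and the gap in days to the previous row."""
    email = None      # step 2: EMAIL
    prev_time = None  # step 2: PREV_TIME
    count = 0         # step 2: COUNT
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = next(csv.reader([line]))
        row_email = fields[1]
        timestamp = int(fields[-1])
        if row_email != email:
            # Step 3 / case c: a new email resets the state.
            email = row_email
            count = 1
            days = 0
        else:
            # Cases a and b: same email. The gap is 0 when the timestamps
            # are equal, so both cases share one branch.
            count += 1
            days = round((timestamp - prev_time) / SECONDS_PER_DAY)
        prev_time = timestamp
        # Drop the helper timestamp field again before printing.
        original = line.rsplit(",", 1)[0]
        dst.write(f"{original},{count},{days} days\n")
```

With `annotate(sys.stdin, sys.stdout)` in a script, the whole job becomes a pipeline of convert, sort, and annotate stages, each of which streams line by line. For the three 0001HARISH@GMAIL.COM rows shown in output_file.txt this yields the suffixes `1,0 days`, `2,3 days` and `3,10 days`.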


