Iteratively parsing a large XML file without using the DOM approach

Problem description

I have an XML file.

I want to read the XML file one email tag at a time: read the email with id=1 and extract its body, then read the email with id=2 and extract its body, and so on.

I tried DOM-based XML parsing, but because the file is 100 GB, that approach does not work. I then tried:

  from xml.etree import ElementTree as ET
  tree = ET.parse('myfile.xml')
  root = ET.parse('myfile.xml').getroot()
  for i in root.findall('email/'):
      print i.get('body')

Once I have the root, I do not understand why my code fails to parse the file.

When I use iterparse, the code throws the following error:

 "UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"

Can anyone help?

Recommended answer

An example using iterparse:

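import cStringIO
from xml.etree.ElementTree import iterparse

# Stand-in for the real file. The root name and attribute values here are
# placeholders; only the 'email' tag and its 'id'/'body' attributes matter.
fakefile = cStringIO.StringIO("""<emails>
  <email id="1" body="abc"/>
  <email id="2" body="def"/>
  <email id="3" body="ghi"/>
</emails>
""")
# iterparse yields each element once its end tag has been read.
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['body']
    elem.clear()  # release the element's contents to keep memory low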

Just replace fakefile with your real file. Also read this for further details.
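
For a file as large as the 100 GB one in the question, it may also help to detach finished elements from the root, since cleared children otherwise stay attached to it as empty shells. The following is a minimal Python 3 sketch of that idea, not the answerer's code. It assumes each email element is a direct child of the root and carries id and body attributes. The function name iter_email_bodies, the emails root tag, and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_email_bodies(source):
    """Yield (id, body) for each <email> element, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "email":
            yield elem.get("id"), elem.get("body")
            # Detach the finished element from the root so that processed
            # emails do not accumulate in memory.
            root.clear()


# Hypothetical sample data standing in for the real file (UTF-8 bytes;
# the second body contains the euro sign, U+20AC).
sample = io.BytesIO(
    b'<emails>'
    b'<email id="1" body="hello"/>'
    b'<email id="2" body="price: \xe2\x82\xac5"/>'
    b'</emails>'
)

# Collected into a list only because the sample is tiny; for a real file,
# iterate over the generator directly.
results = list(iter_email_bodies(sample))
for email_id, body in results:
    print(email_id, len(body))
```

With a real file, a path such as 'myfile.xml' can be passed in place of sample. If the bodies are written out, opening the output with an explicit encoding, for example open(path, "w", encoding="utf-8"), avoids depending on an ASCII default, which is the likely cause of the UnicodeEncodeError quoted in the question.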