Problem description
I have an XML file:
. .
I want to read the XML file one email tag at a time. That is, I want to read the email with id=1 and extract the body from it, then read the email with id=2 and extract the body from it, and so on.
I tried to do this using the DOM model for XML parsing, but since my file is 100 GB, that approach does not work. I then tried using:
    from xml.etree import ElementTree as ET

    tree = ET.parse('myfile.xml')
    root = ET.parse('myfile.xml').getroot()
    for i in root.findall('email/'):
        print i.get('body')
Now, once I get the root, I don't understand why my code is not able to parse the file.
When I use iterparse, the code throws the following error:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"
Can anyone help?
Recommended answer
An example using iterparse:
    import cStringIO
    from xml.etree.ElementTree import iterparse

    fakefile = cStringIO.StringIO("""""")
    for _, elem in iterparse(fakefile):
        if elem.tag == 'email':
            print elem.attrib['id'], elem.attrib['body']
        elem.clear()
Just replace fakefile with your real file. Also read this for further details.
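For a file this large, it may also help to empty the root element as you go and to encode the output explicitly. The following is a minimal sketch in Python 3 (the code above is Python 2), not a definitive implementation. It assumes each email looks like `<email id="..." body="..."/>`, as the `elem.attrib` lookups above suggest. The names `iter_emails`, `write_bodies` and `SAMPLE_XML` are made up for illustration. The UnicodeEncodeError in the question most likely comes from printing text that contains the euro sign (U+20AC) to an ASCII output stream, so the sketch writes UTF-8 bytes instead of relying on the console encoding.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the real 100 GB file.
SAMPLE_XML = (
    '<emails>'
    '<email id="1" body="price: \u20ac5"/>'
    '<email id="2" body="hello"/>'
    '</emails>'
).encode("utf-8")


def iter_emails(source):
    """Yield (id, body) for each <email> element, one at a time.

    `source` is a file path or a binary file object. The attribute
    names "id" and "body" are assumptions taken from the answer's code.
    """
    # Request "start" events as well, so the root element can be captured.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "email":
            yield elem.get("id"), elem.get("body")
            # Drop finished children from the root so memory use stays flat.
            # This assumes <email> elements are direct children of the root.
            root.clear()


def write_bodies(emails, out):
    """Write one tab-separated line per email to a binary stream as UTF-8."""
    for email_id, body in emails:
        line = "%s\t%s\n" % (email_id, body)
        out.write(line.encode("utf-8"))


# Usage with in-memory streams; for the real file, pass its path to
# iter_emails and open the output with open("bodies.txt", "wb").
buf = io.BytesIO()
write_bodies(iter_emails(io.BytesIO(SAMPLE_XML)), buf)
```

Because `iter_emails` is a generator, only one email is held in memory at a time, which is the point of using iterparse instead of `ET.parse` on a file that does not fit in RAM.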