I have a large YAML file that contains a lot of data I don't need. When I load it with yaml.load(), memory consumption exceeds the physical memory of our machine, so I can't read it at all. Is there a way to read only the parts I need into a Python dict? Is there a library or some code that solves this problem?

user545424
Yuhao Fu

2 Answers

Using PyYaml, you can do something like this:

import yaml

with open("file.yaml", "r") as handle:
    for event in yaml.parse(handle):
        print(event)  # handle the event here

This processes the YAML file event by event instead of loading it all into one data structure. You then have to reconstruct the structure from the event stream yourself, but in exchange you can skip the parts of the data you don't need.

You can look at PyYAML's Composer implementation to see how it builds a node tree from events (the Constructor then turns those nodes into Python objects), and what structure it expects from the event stream.
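As a rough illustration of the manual parsing described above, the sketch below keeps only selected top-level keys and skips everything else without building it. The helper names `build`, `skip`, and `extract` are hypothetical, not part of PyYAML. It assumes the top level is a mapping with scalar keys and that the kept values contain no aliases, and it leaves every scalar as a string instead of resolving types.

```python
import yaml


def build(events, first):
    """Build a plain Python value from `first` and the events that follow it.

    Assumption: scalars are kept as strings (no int/bool resolution).
    """
    if isinstance(first, yaml.ScalarEvent):
        return first.value
    if isinstance(first, yaml.SequenceStartEvent):
        items = []
        for ev in events:
            if isinstance(ev, yaml.SequenceEndEvent):
                return items
            items.append(build(events, ev))
    if isinstance(first, yaml.MappingStartEvent):
        result = {}
        for ev in events:
            if isinstance(ev, yaml.MappingEndEvent):
                return result
            key = build(events, ev)  # assumes hashable (scalar) keys
            result[key] = build(events, next(events))
    raise ValueError("unsupported or truncated event stream: %r" % (first,))


def skip(events, first):
    """Consume one complete value starting at `first` without building it."""
    depth = 0
    ev = first
    while True:
        if isinstance(ev, (yaml.MappingStartEvent, yaml.SequenceStartEvent)):
            depth += 1
        elif isinstance(ev, (yaml.MappingEndEvent, yaml.SequenceEndEvent)):
            depth -= 1
        if depth == 0:
            return
        ev = next(events)


def extract(stream, wanted):
    """Return {key: value} for the top-level keys listed in `wanted`."""
    events = yaml.parse(stream)
    # Advance to the start of the top-level mapping.
    for ev in events:
        if isinstance(ev, yaml.MappingStartEvent):
            break
    else:
        return {}
    result = {}
    for key_event in events:
        if isinstance(key_event, yaml.MappingEndEvent):
            break
        value_start = next(events)
        if isinstance(key_event, yaml.ScalarEvent) and key_event.value in wanted:
            result[key_event.value] = build(events, value_start)
        else:
            skip(events, value_start)
    return result
```

Calling `extract(handle, {"keep"})` on an open file would then return a dict holding only the `keep` entry. Type resolution and anchors are left out to keep the sketch short.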

flyx

Here is another technique I found useful when you have control over the format of the YAML output. Instead of having the data be a single structure, you can split it up into separate YAML documents by using the "---" separator. For example, instead of

- foo: 1
  bar: 2
- foo: 2
  bar: 10

you can write this as:

foo: 1
bar: 2
---
foo: 2
bar: 10

and then use the following Python code to parse it:

import yaml

with open("really_big_file.yaml") as f:
    # safe_load_all yields one document at a time instead of loading them all
    for item in yaml.safe_load_all(f):
        print(item)
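If you also control the code that writes the file, something along these lines might work for producing the multi-document layout and then keeping only the records you care about. This is a minimal sketch: `write_documents` and `read_matching` are hypothetical helper names, and the `foo` filter is just an example condition.

```python
import io
import yaml


def write_documents(items, stream):
    """Write each item as its own YAML document, separated by '---'."""
    yaml.safe_dump_all(items, stream)


def read_matching(stream, predicate):
    """Yield only the documents for which predicate(document) is true."""
    for document in yaml.safe_load_all(stream):
        if predicate(document):
            yield document


records = [{"foo": 1, "bar": 2}, {"foo": 2, "bar": 10}]

buffer = io.StringIO()  # stands in for a real file opened for writing
write_documents(records, buffer)
buffer.seek(0)

# Only one document is held at a time; unwanted ones are dropped right away.
kept = list(read_matching(buffer, lambda doc: doc["foo"] > 1))
```

Because `safe_load_all` is a generator, each document can be garbage-collected once the loop moves on, so peak memory depends on the largest single document and not on the whole file.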
user545424