
I'm very new to Python and I'm having trouble with an assignment that basically goes like this:

#Read a WARC file line by line to identify string1.

#When string1 is found, add part of the string as a key to a dictionary.

#Then continue reading the file to identify string2, and add part of string2 as the value for the previous key.

#Keep going through the file, doing the same, to build the dictionary.

I can't import anything, so it's causing me a bit of trouble, especially adding the key, leaving the value empty, and then continuing through the file to find string2 to use as the value.

My idea so far is to save the key to an intermediate variable, then go on to identify the value, store it in another intermediate variable, and finally build the dictionary.

def main():
    # open the file
    file = open("warc_file.warc", "rb")
    filetxt = file.read().decode('ascii','ignore')
    filedata = filetxt.split("\r\n")
    dictionary = dict()
    while line in filedata:
        for line in filedata:
            if "WARC-Type: response" in line:
                break
        for line in filedata:
            if "WARC-Target-URI: " in line:
                urlkey = line.strip("WARC-Target-URI: ")
geo47
  • Welcome to Stack Overflow. To get a good answer, please edit your question to add the code you have got so far (see https://stackoverflow.com/help/how-to-ask). In the meantime, note that adding an empty string (`""`) as an initial value might help solve part of your problem. – Matthew Strawbridge Sep 30 '20 at 12:53
  • You might want a function that parses a line, instead of checking a bunch of if statements. This may help: https://docs.python.org/3.8/library/stdtypes.html#str.split – Kenny Ostrom Sep 30 '20 at 13:12
  • What would be an example of a key and value you want to put into your dictionary? What is the end goal? – Robot Mind Sep 30 '20 at 13:43
  • Why not use a WARC parsing library, for example [warcio](https://pypi.org/project/warcio/)? WARC files are usually quite big and may include binary content as record payload (PDF documents, images, etc.). In addition, the keywords you're looking for ("WARC-Type: response") could be included as part of the payload. Just imagine stackoverflow is crawled and this page is archived in a WARC file. – Sebastian Nagel Sep 30 '20 at 18:07
  • @SebastianNagel OP says he can't import anything, so I assume external libs are out. Lots of teachers think it's good to make kids re-invent the wheel :-) – n1c9 Sep 30 '20 at 19:36

2 Answers

It's not entirely clear what you're trying to do, but I'll have a go at answering.

Suppose you have a WARC file like this:

WARC-Type: response
WARC-Target-URI: http://example.example
something
WARC-IP-Address: 88.88.88.88

WARC-Type: response
WARC-Target-URI: http://example2.example2
something else
WARC-IP-Address: 99.99.99.99

Then you could create a dictionary that maps the target URIs to the IP addresses like this:

dictionary = dict()

uri_prefix = b"WARC-Target-URI: "
ip_prefix = b"WARC-IP-Address: "

with open("warc_file.warc", "rb") as file:
  urlkey = None  # key waiting for its value

  for line in file:
    if line.startswith(uri_prefix):
      assert urlkey is None
      # Slice off the header name; strip() removes a set of characters, not a prefix
      urlkey = line[len(uri_prefix):].rstrip(b"\r\n").decode("ascii")

    if line.startswith(ip_prefix):
      assert urlkey is not None
      value = line[len(ip_prefix):].rstrip(b"\r\n").decode("ascii")

      dictionary[urlkey] = value
      urlkey = None

print(dictionary)

This prints the following result:

{'http://example.example': '88.88.88.88', 'http://example2.example2': '99.99.99.99'}

Note that this approach only loads one line of the file into memory at a time, which might be significant if the file is very large.
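The question also checks for `WARC-Type: response`, so here is a rough sketch of how the same loop could skip other record types. It rests on an assumption that may not hold for every file: that the `WARC-Type` line comes before the other headers of its record. The function name `map_uri_to_ip` and the sample lines are made up for illustration.

```python
def map_uri_to_ip(lines):
    """Map each response record's target URI to its IP address.

    `lines` is any iterable of bytes lines, e.g. a file opened in "rb" mode.
    """
    uri_prefix = b"WARC-Target-URI: "
    ip_prefix = b"WARC-IP-Address: "
    type_prefix = b"WARC-Type: "

    result = {}
    in_response = False  # True while inside a record whose type is "response"
    urlkey = None        # key waiting for its value

    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if line.startswith(type_prefix):
            # Assumption: WARC-Type precedes the other headers of its record
            in_response = line[len(type_prefix):] == b"response"
            urlkey = None
        elif in_response and line.startswith(uri_prefix):
            urlkey = line[len(uri_prefix):].decode("ascii")
        elif in_response and urlkey is not None and line.startswith(ip_prefix):
            result[urlkey] = line[len(ip_prefix):].decode("ascii")
            urlkey = None

    return result


# Hypothetical sample: one request record (ignored) and one response record
sample = [
    b"WARC-Type: request\r\n",
    b"WARC-Target-URI: http://ignored.example\r\n",
    b"WARC-IP-Address: 11.11.11.11\r\n",
    b"\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.example\r\n",
    b"WARC-IP-Address: 88.88.88.88\r\n",
]
print(map_uri_to_ip(sample))  # {'http://example.example': '88.88.88.88'}
```

Because the function accepts any iterable of bytes lines, it can be given the open file object directly, which keeps the one-line-at-a-time memory behaviour.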

Matthew Strawbridge

Your idea of storing the key in an intermediate variable is good.

I also suggest using the following snippet to iterate over the lines.

filename = "warc_file.warc"

with open(filename, "rb") as file:
    lines = file.readlines()  # reads the whole file into a list of lines
    for line in lines:
        print(line)

To create dictionary entries in Python, the dict.update() method can be used. It allows you to create new keys or update values if the key already exists.

d = dict() # create empty dict
d.update({"key" : None}) # create entry without value
d.update({"key" : 123}) # update the value
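Applied to the question, the placeholder idea could look roughly like this. It is only a sketch: the `lines` list stands in for the decoded lines of the file, the header names are taken from the other answer's sample, and `last_key` is a name chosen for illustration. Plain item assignment is used because `d[key] = value` has the same effect as `d.update({key: value})` for a single entry.

```python
# Stand-in for the decoded lines of the WARC file (hypothetical sample data)
lines = [
    "WARC-Target-URI: http://example.example",
    "something",
    "WARC-IP-Address: 88.88.88.88",
]

uri_prefix = "WARC-Target-URI: "
ip_prefix = "WARC-IP-Address: "

d = dict()
last_key = None  # remembers which key still needs a value

for line in lines:
    if line.startswith(uri_prefix):
        last_key = line[len(uri_prefix):]
        d[last_key] = None  # create the entry without a value yet
    elif line.startswith(ip_prefix) and last_key is not None:
        d[last_key] = line[len(ip_prefix):]  # fill in the placeholder
        last_key = None

print(d)  # {'http://example.example': '88.88.88.88'}
```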
akaessens