Want to scrape image URLs from JSON data

Question

I'm trying to scrape the image URLs of each product, only those with a .jpg extension, along with the name available in "alt". The JSON is structured as shown below: "attributes" > "media_map" > ("b", "c", "d", "e", whichever are available) > "src", and then "medium", "lg", "xl", "xxl".

              "a218": {
                "label": "Shape",
                "field_type": "button_select",
                "value_order": [
                  "v766",
                  "v767"
                ],
                "values": {
                  "v766": {
                    "label": "Round",
                    "value": "S6CBRO",
                    "price": 35
                  },
                  "v767": {
                    "label": "Rectangle",
                    "value": "S6CBRE",
                    "price": 35,
                    "hypotheticalPrice": 24.5
                  }
                }
              }
            },
            "inventory": {
              "stock": 0,
              "sold": 0,
              "total": 0
            },
            "optional": {},
            "media_map": {
              "b": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/xVx1vleu7iLcR79ZkRZKqQiSzZE/w_125/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "lg": "https://ctl.s6img.com/society6/img/W-ESMqUtC_oOEUjx-1E_SyIdueI/w_550/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xl": "https://ctl.s6img.com/society6/img/z90VlaYwd8cxCqbrZ1ttAxINpaY/w_700/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xxl": null
                },
                "type": "image",
                "alt": "I'M NOT ALWAYS A BITCH (Red) Cutting Board",
                "meta": null
              },
              "c": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/KQJbb4jG0gBHcqQiOCivLUbKMxI/w_125/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "lg": "https://ctl.s6img.com/society6/img/ztGrxSpA7FC1LfzM3UldiQkEi7g/w_550/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xl": "https://ctl.s6img.com/society6/img/PHjp9jDic2NGUrpq8k0aaxsYZr4/w_700/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xxl": "https://ctl.s6img.com/society6/img/m-1HhSM5CIGl6DY9ukCVxSmVDIw/w_1500/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg"
Below is my code. I'm able to access "media_map", but I don't know how to get the URLs with a .jpg extension.

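```
import csv
import json
from urllib.request import urlopen

from bs4 import BeautifulSoup

contents = []
with open('urls.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each row (URL in the first column) to contents

newlist = []
for url in contents:
    try:
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Take the 8th <script> tag and skip its first 24 characters before the JSON
        scripts = soup.find_all('script')[7].text.strip()[24:]
        data = json.loads(scripts)
        link = data['product']['response']['product']['data']['attributes']['media_map']
    except Exception as e:  # Report the failing URL and move on to the next one
        print(f'Failed to process {url[0]}: {e}')
```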

Every product has "b", "c", "d" or "b", "c", "d", "e", "f",
and some products have only "b" and "c".
I'm new to scraping and I'm stuck at this point.

Driftr95 · Answer 1 · 2022-10-28T03:27:36.703

Instead of

link = data['product']['response']['product']['data']['attributes']['media_map']

have

mediaMap = data['product']['response']['product']['data']['attributes']['media_map']

Then you can extract what you want from mediaMap.
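
If some product pages lack part of that key path, the chained lookup raises a KeyError. Here is a minimal sketch of a more tolerant lookup. `dig` is a hypothetical helper name (not from the question or any library), and the `data` dict is only a tiny stand-in for the parsed page JSON:

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Tiny stand-in for the parsed page JSON (hypothetical data)
data = {'product': {'response': {'product': {'data': {'attributes': {
    'media_map': {'b': {'alt': 'Example', 'src': {}}}
}}}}}}

mediaMap = dig(data, 'product', 'response', 'product', 'data',
               'attributes', 'media_map', default={})
print(list(mediaMap))  # ['b']
```

With `default={}`, the later `mediaMap.values()` loops simply see no entries when a page has no media_map.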

If you want the alts:

mediaAlts = [m['alt'] for m in mediaMap.values() if 'alt' in m]

(just get mediaAlts[0] if you only want the first one)

Or if you want only the image alts:

imgAlts = [
    m['alt'] for m in mediaMap.values() if 'alt' in m 
    and 'type' in m and m['type'] == 'image'
]

If you want all the src links in the first object in media_map:

m1srcs = list(list(mediaMap.values())[0]['src'].values())

To filter down to only jpg:

m1srcs = [s for s in m1srcs if type(s) == str and s.endswith('.jpg')]
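
One caveat: `s.endswith('.jpg')` would miss a URL that carries a query string or an upper-case or `.jpeg` extension. The sample URLs in the question have none of these, so this is only a precaution. A sketch that checks the URL path instead follows; `is_jpg` is a hypothetical helper name and the URLs are made up:

```python
from urllib.parse import urlparse


def is_jpg(url):
    """True if url is a string whose path ends in .jpg or .jpeg (any case)."""
    if not isinstance(url, str):
        return False  # skips null sizes such as "xxl": null
    path = urlparse(url).path.lower()
    return path.endswith(('.jpg', '.jpeg'))


m1srcs = [
    'https://example.com/a/photo.JPG?width=550',  # made-up URL with a query string
    'https://example.com/a/18613683_5971445',     # no extension
    None,
]
print([s for s in m1srcs if is_jpg(s)])
```

Only the first entry passes the check.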

EDIT:

For all jpgs of images with alts:

altJpgs = [
    src for srcs in [[
            s for s in mv['src'].values()
            if type(s) == str and s.endswith('.jpg')
        ] for mv in mediaMap.values()
        if type(mv) == dict and 'src' in mv
        and 'alt' in mv # has alt
        and 'type' in mv and mv['type'] == 'image' # has type listed as image 
    ] for src in srcs
]

or maybe for-loops are more readable than list comprehension in this case:

altJpgs = []

for mv in mediaMap.values():
    if type(mv) != dict or 'src' not in mv: continue 
    if 'alt' not in mv: continue 
    if 'type' not in mv or mv['type'] != 'image': continue

    for s in mv['src'].values():
        if type(s) == str and s.endswith('.jpg'):
            altJpgs.append(s)

(Edit or remove any of the if... lines to adjust the filter)
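
Since the question also asks for the name in "alt", here is a sketch that keeps each alt together with its jpg URLs and can optionally keep only certain size keys. The function name `jpgs_by_alt` and its `sizes` parameter are made up for illustration, and the sample dict is an abbreviated stand-in for media_map with placeholder URLs:

```python
def jpgs_by_alt(media_map, sizes=None):
    """Map each image's alt text to its list of .jpg URLs.

    sizes: optional collection of size keys to keep, e.g. {'lg', 'xl', 'xxl'}.
    """
    result = {}
    for mv in media_map.values():
        if not isinstance(mv, dict) or mv.get('type') != 'image':
            continue
        alt, srcs = mv.get('alt'), mv.get('src')
        if not alt or not isinstance(srcs, dict):
            continue
        jpgs = [
            url for size, url in srcs.items()
            if (sizes is None or size in sizes)
            and isinstance(url, str) and url.endswith('.jpg')
        ]
        if jpgs:
            # Several entries can share one alt, so extend rather than overwrite
            result.setdefault(alt, []).extend(jpgs)
    return result


# Abbreviated stand-in for media_map; the URLs are placeholders
mediaMap = {
    'b': {'type': 'image', 'alt': 'Cutting Board',
          'src': {'xs': 'https://example.com/w_125/artwork', 'xxl': None}},
    'c': {'type': 'image', 'alt': 'Cutting Board',
          'src': {'xs': 'https://example.com/w_125/board.jpg',
                  'lg': 'https://example.com/w_550/board.jpg'}},
}
print(jpgs_by_alt(mediaMap, sizes={'lg', 'xl', 'xxl'}))
# {'Cutting Board': ['https://example.com/w_550/board.jpg']}
```

Calling `jpgs_by_alt(mediaMap)` without `sizes` keeps every jpg, whichever keys ("b", "c", "AZ", ...) a product happens to have.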

Thanks. `m1srcs = list(list(mediaMap.values())[0]['src'].values())` only returns the values for "b", not every src in the nested JSON. I actually want all the jpg URLs from every "src". — babar akhter, Oct 27 '22 at 10:10
Some products have the jpg in "c", and some even have it in "AZ", so I will have to grab all the jpg files from every src. — babar akhter, Oct 27 '22 at 10:11
(I did mention that m**1**srcs was just for the *first* object...) I've added an edit with how you might get srcs from all the objects, but I should probably warn you that I typed it on my phone - so watch out for typos/syntax errors — Driftr95, Oct 28 '22 at 03:33
Extremely thankful. It's working and saved a lot of time. — babar akhter, Nov 03 '22 at 08:06
