I'm trying to scrape the image URLs of each product (only those with a .jpg extension), along with the name available in "alt", from a JSON structure like the one below. The path is "attributes" > "media_map" > ("b", "c", "d", "e", whichever are available) > "src", and then "medium", "lg", "xl", "xxl".

              "a218": {
                "label": "Shape",
                "field_type": "button_select",
                "value_order": [
                  "v766",
                  "v767"
                ],
                "values": {
                  "v766": {
                    "label": "Round",
                    "value": "S6CBRO",
                    "price": 35
                  },
                  "v767": {
                    "label": "Rectangle",
                    "value": "S6CBRE",
                    "price": 35,
                    "hypotheticalPrice": 24.5
                  }
                }
              }
            },
            "inventory": {
              "stock": 0,
              "sold": 0,
              "total": 0
            },
            "optional": {},
            "media_map": {
              "b": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/xVx1vleu7iLcR79ZkRZKqQiSzZE/w_125/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "lg": "https://ctl.s6img.com/society6/img/W-ESMqUtC_oOEUjx-1E_SyIdueI/w_550/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xl": "https://ctl.s6img.com/society6/img/z90VlaYwd8cxCqbrZ1ttAxINpaY/w_700/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xxl": null
                },
                "type": "image",
                "alt": "I'M NOT ALWAYS A BITCH (Red) Cutting Board",
                "meta": null
              },
              "c": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/KQJbb4jG0gBHcqQiOCivLUbKMxI/w_125/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "lg": "https://ctl.s6img.com/society6/img/ztGrxSpA7FC1LfzM3UldiQkEi7g/w_550/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xl": "https://ctl.s6img.com/society6/img/PHjp9jDic2NGUrpq8k0aaxsYZr4/w_700/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xxl": "https://ctl.s6img.com/society6/img/m-1HhSM5CIGl6DY9ukCVxSmVDIw/w_1500/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg"

Below is my code. I'm able to access "media_map", but I don't know how to get the URLs with a .jpg extension.

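```
import csv
import json
from urllib.request import urlopen

from bs4 import BeautifulSoup

with open('urls.csv', 'r') as csvf:  # Open file in read mode
    urls = [row[0] for row in csv.reader(csvf) if row]  # First column holds the url

for url in urls:  # Visit each url once
    try:
        page = urlopen(url).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Take the 8th <script> tag and drop its first 24 characters, leaving the JSON text
        scripts = soup.find_all('script')[7].text.strip()[24:]
        data = json.loads(scripts)
        link = data['product']['response']['product']['data']['attributes']['media_map']
    except Exception as e:  # Report the failing url and move on
        print(f'Failed on {url}: {e}')
```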

Every product has "b", "c", "d" or "b", "c", "d", "e", "f", and some products have only "b" and "c".
I'm new to scraping and am stuck here.

1 Answer

Instead of

link = data['product']['response']['product']['data']['attributes']['media_map']

have

mediaMap = data['product']['response']['product']['data']['attributes']['media_map']

Then you can extract what you want from mediaMap.

If you want the alts:

mediaAlts = [m['alt'] for m in mediaMap.values() if 'alt' in m]

(just get mediaAlts[0] if you only want the first one)

Or if you want only the image alts:

imgAlts = [
    m['alt'] for m in mediaMap.values() if 'alt' in m 
    and 'type' in m and m['type'] == 'image'
]

If you want all the src links in the first object in media_map:

m1srcs = list(list(mediaMap.values())[0]['src'].values())

To filter down to only jpg:

m1srcs = [s for s in m1srcs if type(s) == str and s.endswith('.jpg')]
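
The string check in that filter matters because a JSON `null` (such as `"xxl": null` in the question) becomes `None` after `json.loads`, and `None` has no `endswith` method. Here is a minimal sketch on a trimmed, made-up `src` object. The helper name `is_jpg` and the example.com URLs are hypothetical, and comparing only the URL path (so a query string or an upper-case extension does not hide a jpg) is an assumption about URLs that the question does not show:

```python
import json
from urllib.parse import urlparse


def is_jpg(value):
    """Return True for string URLs whose path ends in .jpg (case-insensitive)."""
    if not isinstance(value, str):  # skips None (JSON null) and other non-strings
        return False
    return urlparse(value).path.lower().endswith('.jpg')


# Hypothetical sample shaped like one "src" object from media_map
src = json.loads('''{
    "xs": "https://example.com/w_125/artwork/18613683",
    "lg": "https://example.com/w_550/board.jpg",
    "xl": "https://example.com/w_700/board.JPG?v=2",
    "xxl": null
}''')

jpgs = [s for s in src.values() if is_jpg(s)]
print(jpgs)
```

If the URLs never carry query strings, the plain `s.endswith('.jpg')` test above is enough.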


EDIT:

For all jpgs of images with alts:

altJpgs = [
    src for srcs in [[
            s for s in mv['src'].values()
            if type(s) == str and s.endswith('.jpg')
        ] for mv in mediaMap.values()
        if type(mv) == dict and 'src' in mv
        and 'alt' in mv # has alt
        and 'type' in mv and mv['type'] == 'image' # has type listed as image 
    ] for src in srcs
]

or maybe for-loops are more readable than list comprehension in this case:

altJpgs = []

for mv in mediaMap.values():
    if type(mv) != dict or 'src' not in mv: continue 
    if 'alt' not in mv: continue 
    if 'type' not in mv or mv['type'] != 'image': continue 

    for s in mv['src'].values():
        if type(s) == str and s.endswith('.jpg'):
            altJpgs.append(s)

(Edit or remove any of the if... lines to adjust the filter)
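
Since the question also asks for the name in "alt", the loop above can be adapted to keep each jpg together with its alt text. This is only a sketch: the function name `jpgs_by_alt` and the sample `media_map` with example.com URLs are hypothetical stand-ins for the real data, and it assumes entries without an alt should be skipped:

```python
def jpgs_by_alt(media_map):
    """Return a list of (alt, url) pairs for every .jpg URL in media_map."""
    pairs = []
    for entry in media_map.values():
        if not isinstance(entry, dict):
            continue
        alt = entry.get('alt')
        srcs = entry.get('src')
        if not alt or not isinstance(srcs, dict):
            continue  # no alt text or no src object: nothing to pair
        for url in srcs.values():
            # url may be None when the JSON value was null
            if isinstance(url, str) and url.endswith('.jpg'):
                pairs.append((alt, url))
    return pairs


# Hypothetical sample shaped like the media_map in the question
media_map = {
    "b": {
        "src": {"xs": "https://example.com/w_125/artwork/1", "xxl": None},
        "type": "image",
        "alt": "Cutting Board",
    },
    "c": {
        "src": {
            "lg": "https://example.com/w_550/board.jpg",
            "xxl": "https://example.com/w_1500/board.jpg",
        },
        "type": "image",
        "alt": "Cutting Board",
    },
}

pairs = jpgs_by_alt(media_map)
for alt, url in pairs:
    print(alt, url)
```

Because it iterates over `media_map.values()`, it does not matter whether a product has only "b" and "c" or keys up to "f" and beyond.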

Driftr95
  • Thanks. ```m1srcs = list(list(mediaMap.values())[0]['src'].values())``` only returns the values for "b", not every src in the nested JSON. I actually want all the jpgs from all the "src" objects. – babar akhter Oct 27 '22 at 10:10
  • Some products have jpgs in "c", and some even have them in "AZ", so I will have to grab all jpg files from every src. – babar akhter Oct 27 '22 at 10:11
  • (I did mention that m**1**srcs was just for the *first* object...) I've added an edit with how you might get srcs from all the objects, but I should probably warn you that I typed it on my phone - so watch out for typos/syntax errors – Driftr95 Oct 28 '22 at 03:33
  • Extremely thankful. It's working and saved a lot of time. – babar akhter Nov 03 '22 at 08:06