Recipes for webscraping

Bulk download images

class ImgDownloader

ImgDownloader(n_imgs=5, valid_exts=['jpg', 'jpeg', 'png'], save_path='./images')

Methods

Sorry, the docs for this are broken until nbdev fixes the @typedispatch problem upstream.

ImgDownloader.get_imgs

ImgDownloader.get_imgs()

Download all images.

  • For query:str, download one image.
  • For query:list, download multiple images in chunks.
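
The str-versus-list behaviour described above can be illustrated with the standard library's functools.singledispatchmethod. This is only a sketch and not ImgDownloader's actual implementation (which, per the note above, relies on @typedispatch). The class name DispatchSketch, the chunk_size parameter and the "fetch:" strings are hypothetical, and nothing is downloaded.

```python
from functools import singledispatchmethod


class DispatchSketch:
    """Hypothetical stand-in for a downloader; it only records what it would fetch."""

    def __init__(self, chunk_size=2):
        self.chunk_size = chunk_size  # assumed parameter, not part of ImgDownloader

    @singledispatchmethod
    def get_imgs(self, query):
        # Fallback for unsupported argument types.
        raise TypeError(f"unsupported query type: {type(query).__name__}")

    @get_imgs.register
    def _(self, query: str):
        # A single query string is handled on its own.
        return [f"fetch:{query}"]

    @get_imgs.register
    def _(self, query: list):
        # A list of queries is split into chunks; each chunk is handled in turn.
        results = []
        for start in range(0, len(query), self.chunk_size):
            chunk = query[start:start + self.chunk_size]
            for q in chunk:
                results.extend(self.get_imgs(q))
        return results


sketch = DispatchSketch(chunk_size=2)
print(sketch.get_imgs("Pikachu"))
print(sketch.get_imgs(["Abra", "Mew", "Onix"]))
```

Dispatching on the argument type keeps a single public method name while letting the list variant reuse the single-query variant.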

ImgDownloader.show_samples

ImgDownloader.show_samples(thing:str=None, path:Path=None, n:int=10)

Show samples of downloaded files with name thing from save location path.

If no path is specified, default to the downloader's self.save_path. If no thing is specified, default to the first item in the path.
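
The two defaults described above could be resolved roughly as follows. This is a sketch under stated assumptions, not the library's code: resolve_sample_dir is a hypothetical helper, and "first item" is taken to mean the first entry in sorted order, which the documentation does not specify.

```python
from pathlib import Path
import tempfile


def resolve_sample_dir(save_path, thing=None, path=None):
    """Hypothetical helper mirroring the documented defaults of show_samples."""
    # Fall back to the downloader's save location when no path is given.
    base = Path(path) if path is not None else Path(save_path)
    if thing is None:
        # Assumption: "first item" means the first entry in sorted order.
        entries = sorted(p.name for p in base.iterdir())
        if not entries:
            raise FileNotFoundError(f"no items found in {base}")
        thing = entries[0]
    return base / thing


# Build a throwaway directory layout to exercise both defaults.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Charmander", "Abra"):
        (Path(tmp) / name).mkdir()
    default_dir = resolve_sample_dir(save_path=tmp)
    named_dir = resolve_sample_dir(save_path=tmp, thing="Charmander")
    print(default_dir.name, named_dir.name)
```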

Example: downloading Pokémon pictures

Catch em all (gen 1-7)
From: https://en.wikipedia.org/wiki/List_of_generation_I_Pok%C3%A9mon

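from itertools import chain
from pathlib import Path

import nest_asyncio
import pandas as pd

# Patch asyncio so an event loop can be started from inside the notebook's running loop.
nest_asyncio.apply()

# Only generation I is enabled; extend the list to fetch generations I-VII.
gens = ['I']  # , 'II', 'III', 'IV', 'V', 'VI', 'VII']
sources = [f'https://en.wikipedia.org/wiki/List_of_generation_{gen}_Pok%C3%A9mon' for gen in gens]
# From each page, take the second table, drop its last row, and keep the first column (the names).
pokemon_names = [pd.read_html(source)[1].iloc[:-1, 0].tolist() for source in sources]
pokemon_names = list(set(chain(*pokemon_names)))  # flatten and de-duplicate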
save_path = Path('../../data/pokemon')
downloader = ImgDownloader(save_path=save_path)
downloader.get_imgs(pokemon_names)
!ls {save_path}
 Abra	      Dragonite     Hitmonchan	 Marowak     Pinsir	 Starmie
 Aerodactyl   Dratini	    Hitmonlee	 Meowth      Poliwag	 Staryu
 Alakazam     Drowzee	    Horsea	 Metapod     Poliwhirl	 Tangela
 Arbok	      Dugtrio	    Hypno	 Mew	     Poliwrath	 Tauros
 Arcanine     Eevee	    Ivysaur	 Mewtwo      Ponyta	 Tentacool
 Articuno     Ekans	    Jigglypuff	 Moltres     Porygon	 Tentacruel
 Beedrill     Electabuzz    Jolteon	'Mr. Mime'   Primeape	 Vaporeon
 Bellsprout   Electrode     Jynx	 Muk	     Psyduck	 Venomoth
 Blastoise    Exeggcute     Kabuto	 Nidoking    Raichu	 Venonat
 Bulbasaur    Exeggutor     Kabutops	 Nidoqueen   Rapidash	 Venusaur
 Butterfree  "Farfetch'd"   Kadabra	 Nidoran♀    Raticate	 Victreebel
 Caterpie     Fearow	    Kakuna	 Nidoran♂    Rattata	 Vileplume
 Chansey      Flareon	    Kangaskhan	 Nidorina    Rhydon	 Voltorb
 Charizard    Gastly	    Kingler	 Nidorino    Rhyhorn	 Vulpix
 Charmander   Gengar	    Koffing	 Ninetales   Sandshrew	 Wartortle
 Charmeleon   Geodude	    Krabby	 Oddish      Sandslash	 Weedle
 Clefable     Gloom	    Lapras	 Omanyte     Scyther	 Weepinbell
 Clefairy     Golbat	    Lickitung	 Omastar     Seadra	 Weezing
 Cloyster     Goldeen	    Machamp	 Onix	     Seaking	 Wigglytuff
 Cubone       Golduck	    Machoke	 Paras	     Seel	 Zapdos
 Dewgong      Golem	    Machop	 Parasect    Shellder	 Zubat
 Diglett      Graveler	    Magikarp	 Persian     Slowbro
 Ditto	      Grimer	    Magmar	 Pidgeot     Slowpoke
 Dodrio       Growlithe     Magnemite	 Pidgeotto   Snorlax
 Doduo	      Gyarados	    Magneton	 Pidgey      Spearow
 Dragonair    Haunter	    Mankey	 Pikachu     Squirtle
downloader.show_samples('Charmander');
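
The name-gathering step in the example above (one list of names per source page, flattened and de-duplicated) can be tried offline. The sketch below uses small hypothetical lists in place of the Wikipedia tables, with a duplicate added on purpose. sorted() is an addition here to make the order deterministic, since a set does not preserve order.

```python
from itertools import chain

# Hypothetical per-page name lists; "Pikachu" is repeated on purpose
# to show de-duplication, not to reflect the real tables.
per_page = [
    ["Bulbasaur", "Ivysaur", "Pikachu"],
    ["Chikorita", "Pikachu"],
]

# chain(*lists) flattens one level, set() drops duplicates, sorted() fixes the order.
flat_unique = sorted(set(chain(*per_page)))
print(flat_unique)
```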

Scrapy