This just sleeps for 8 to 12 seconds between page loads, so it could be very fragile. It also makes the whole process longer, roughly 1-2 minutes per title.
I use Moviedb to get the picture and IMDB Selenium to get the data.
Tests:
Witches' Well 2024
https://www.imdb.com/title/tt29793692/ - it took 1 min 55 sec
The Matrix 1999
https://www.imdb.com/title/tt0133093/ - it took 2 min 10 sec
I limited the tags to 300, because above 500 it crashed the database and I had to edit it manually with DBeaver. The rest is in the attached pictures.
I run PVD in a Win10 VM, because in Win11 I can't get it to download any data.
When I first got the AWS pages instead of the data pages, I thought I had been IP banned by IMDB, so I tried routing my connection through a proxy and a VPN, with no success. I even copied the VM to my computer at work to test, with the same result.
Then I looked into why I was getting those pages, and the results pointed to the fact that I appeared to be a bot, fetching page after page with no "human" pause between them, so I added the sleep.
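For reference, the pause I describe could be sketched roughly like this. It is only a minimal sketch, not the exact code in the script: the name human_pause and the injectable sleep parameter are my own placeholders, and the 8 to 12 second range is the one mentioned above.

```python
import random
import time


def human_pause(min_s=8.0, max_s=12.0, sleep=time.sleep):
    """Sleep for a random duration between min_s and max_s seconds.

    Hypothetical helper: the name and the injectable `sleep` argument are
    illustrative only. Returns the delay that was used so it can be logged.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Calling human_pause() after each page load gives a different delay every time, so the requests are not evenly spaced. The range can be narrowed to see how short the pause can get before the AWS pages come back.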
I found other solutions but have not tested them:
change: chrome_options = build_chrome_options(headed=False)
to this: headed_mode = "keywords" in download_url
chrome_options = build_chrome_options(headed=headed_mode)
Also this, but it seemed to take longer:
add this after page load:
if "challenge.js" in driver.page_source or "AwsWafIntegration" in driver.page_source:
    logging.warning("AWS WAF detected - retrying with longer delay")
    time.sleep(15)
    driver.refresh()
    time.sleep(8)
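The check above only retries once. A bounded retry loop around the same idea might look like the sketch below. I have not tested this against IMDB. The function names, the retry count and the injectable sleep parameter are assumptions of mine; the only Selenium calls used are driver.page_source and driver.refresh(), and the two marker strings are the ones from the snippet above.

```python
import logging
import time

# Marker strings taken from the snippet above; they may change on IMDB's side.
WAF_MARKERS = ("challenge.js", "AwsWafIntegration")


def is_waf_challenge(page_source):
    """Return True if the page looks like the AWS WAF challenge page."""
    return any(marker in page_source for marker in WAF_MARKERS)


def reload_until_clear(driver, max_retries=3, wait_before=15, wait_after=8,
                       sleep=time.sleep):
    """Refresh the page until the challenge markers are gone.

    Hypothetical helper: returns True if a normal page was reached,
    False if the challenge is still there after max_retries refreshes.
    """
    for attempt in range(1, max_retries + 1):
        if not is_waf_challenge(driver.page_source):
            return True
        logging.warning("AWS WAF detected, retry %d of %d", attempt, max_retries)
        sleep(wait_before)   # back off before asking again
        driver.refresh()
        sleep(wait_after)    # give the reloaded page time to settle
    return not is_waf_challenge(driver.page_source)
```

If it returns False, the script could skip that title instead of parsing the challenge page as if it were data.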
Regards