
Virtual, Headless, and Distributed (Oh My!)

Zecca J. Lehn
5 min read · Mar 19, 2019


Fearless Web Scraping with Python in DataLab Notebooks

This post gives the Pythonista a complete framework for exploring the world of data on the internet: requests run in fast, parallelized batches from behind randomized proxy servers, which keeps your company’s static IP address away from curious eyes and other potential trolls. The reader is asked to take due care with these newly acquired ghost-ninja skills: do not abuse the privilege, and do not tax any service inappropriately or unethically. The user takes all responsibility for implementation (of course) and for all risks associated with running the attached code.

In a connected post, I walked through an efficient setup on Google Cloud Platform (GCP), with all the firewall rules and permissions required to get going securely with a VM-based DataLab notebook. With it we’ll explore Beautiful Soup (an HTML parsing and DOM traversal library), proxies (remote servers that relay requests under their own IP addresses), and Selenium (for JavaScript-rendered data that is inaccessible without launching a browser). We’ll also install Chromium and ChromeDriver on Linux, and run parallel reads from a headless browser (here, the browser runs alongside your remote DataLab notebook).
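To make the moving parts concrete, here is a minimal sketch of the proxy-plus-parallel-reads idea. It is not the notebook's code. It assumes the third-party `requests` and `beautifulsoup4` packages, and the proxy addresses and the helper names (`pick_proxy`, `fetch_title`, `scrape_all`) are hypothetical placeholders. It leaves Selenium out and only handles pages that need no JavaScript rendering.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder proxies (documentation-range addresses);
# replace them with your own working proxy list.
PROXIES = ["203.0.113.10:8080", "198.51.100.22:3128", "192.0.2.45:8000"]


def pick_proxy(proxies, rng=random):
    """Return a requests-style proxies mapping for one randomly chosen proxy."""
    address = rng.choice(proxies)
    return {"http": f"http://{address}", "https": f"http://{address}"}


def fetch_title(url, proxies):
    """Fetch a page through the given proxy and return its <title> text."""
    # Imported lazily so the helpers above work without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.title.string if soup.title else None


def scrape_all(urls, proxies=PROXIES, fetch=fetch_title, max_workers=4):
    """Fetch every URL in parallel threads, each through a random proxy."""

    def task(url):
        try:
            return url, fetch(url, pick_proxy(proxies))
        except Exception as exc:  # one dead proxy should not stop the batch
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, urls))
```

Calling `scrape_all(["https://example.com"])` returns a dictionary that maps each URL to its page title, or to the exception raised for that URL. Threads suit this workload because each request spends most of its time waiting on the network. Keep `max_workers` small so the target site is not overloaded.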

The code for this post (here in a Google Colab notebook), along with the automated installation %bash script, can easily be applied to your own proxy list, if you’re…

Written by Zecca J. Lehn

GP at Responsibly Ventures / Host @ posi2ive
