Web Scraping with Python
4.5/5
()
About this ebook
Related to Web Scraping with Python
Related ebooks
Python Web Scraping - Second Edition Rating: 5 out of 5 stars5/5Python 3 Object-oriented Programming - Second Edition Rating: 4 out of 5 stars4/5Python Machine Learning By Example Rating: 4 out of 5 stars4/5Python Data Structures and Algorithms Rating: 5 out of 5 stars5/5Python Data Analysis Rating: 4 out of 5 stars4/5Mastering Python Regular Expressions Rating: 5 out of 5 stars5/5Python Data Science Essentials - Second Edition Rating: 4 out of 5 stars4/5Hands-On Data Analysis with Pandas: Efficiently perform data collection, wrangling, analysis, and visualization using Python Rating: 0 out of 5 stars0 ratingsPython Data Analysis - Second Edition Rating: 0 out of 5 stars0 ratingsNumPy Essentials Rating: 0 out of 5 stars0 ratingsInteractive Applications Using Matplotlib Rating: 0 out of 5 stars0 ratingsMastering Social Media Mining with Python Rating: 5 out of 5 stars5/5Learning Data Mining with Python - Second Edition Rating: 0 out of 5 stars0 ratingsMastering Python Design Patterns Rating: 0 out of 5 stars0 ratingsModular Programming with Python Rating: 0 out of 5 stars0 ratingsGetting Started with Python Data Analysis Rating: 0 out of 5 stars0 ratingsArtificial Intelligence with Python - Second Edition: Your complete guide to building intelligent apps using Python 3.x, 2nd Edition Rating: 0 out of 5 stars0 ratingsR for Data Science Rating: 5 out of 5 stars5/5Mastering Python Data Analysis Rating: 0 out of 5 stars0 ratingsLearning IPython for Interactive Computing and Data Visualization - Second Edition Rating: 2 out of 5 stars2/5Parallel Programming with Python Rating: 0 out of 5 stars0 ratingsPython for Google App Engine Rating: 0 out of 5 stars0 ratingsPython Unlocked Rating: 0 out of 5 stars0 ratings
Programming For You
Game Development with Unreal Engine 5: Learn the Basics of Game Development in Unreal Engine 5 (English Edition) Rating: 0 out of 5 stars0 ratingsPython: Learn Python in 24 Hours Rating: 4 out of 5 stars4/5Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps Rating: 4 out of 5 stars4/5Python: For Beginners A Crash Course Guide To Learn Python in 1 Week Rating: 4 out of 5 stars4/5Modern C++ for Absolute Beginners: A Friendly Introduction to C++ Programming Language and C++11 to C++20 Standards Rating: 0 out of 5 stars0 ratingsSQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days Rating: 5 out of 5 stars5/5The Unofficial Guide to Open Broadcaster Software: OBS: The World's Most Popular Free Live-Streaming Application Rating: 0 out of 5 stars0 ratingsSQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL Rating: 4 out of 5 stars4/5HTML & CSS: Learn the Fundaments in 7 Days Rating: 4 out of 5 stars4/5Linux: Learn in 24 Hours Rating: 5 out of 5 stars5/5Coding All-in-One For Dummies Rating: 4 out of 5 stars4/5Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS Rating: 0 out of 5 stars0 ratingsExcel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1 Rating: 5 out of 5 stars5/5SQL All-in-One For Dummies Rating: 3 out of 5 stars3/5PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project Rating: 5 out of 5 stars5/5Web Designer's Idea Book, Volume 4: Inspiration from the Best Web Design Trends, Themes and Styles Rating: 4 out of 5 stars4/5Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer. Rating: 5 out of 5 stars5/5Beginning Programming with Python For Dummies Rating: 3 out of 5 stars3/5Learn JavaScript in 24 Hours Rating: 3 out of 5 stars3/5Problem Solving in C and Python: Programming Exercises and Solutions, Part 1 Rating: 5 out of 5 stars5/5
Reviews for Web Scraping with Python
4 ratings0 reviews
Book preview
Web Scraping with Python - Richard Lawson
Table of Contents
Web Scraping with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawler
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Summary
2. Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors
Comparing performance
Scraping results
Overview
Adding a scrape callback to the link crawler
Summary
3. Caching Downloads
Adding cache support to the link crawler
Disk cache
Implementation
Testing the cache
Saving disk space
Expiring stale data
Drawbacks
Database cache
What is NoSQL?
Installing MongoDB
Overview of MongoDB
MongoDB cache implementation
Compression
Testing the cache
Summary
4. Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementation
Cross-process crawler
Performance
Summary
5. Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Summary
6. Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with the Mechanize module
Summary
7. Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical Character Recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
9kw CAPTCHA API
Integrating with registration
Summary
8. Scrapy
Installation
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Visual scraping with Portia
Installation
Annotation
Tuning a spider
Checking results
Automated scraping with Scrapely
Summary
9. Overview
Google search engine
The website
The API
Gap
BMW
Summary
Index
Web Scraping with Python
Web Scraping with Python
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2015
Production reference: 1231015
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-436-4
www.packtpub.com
Credits
Author
Richard Lawson
Reviewers
Martin Burch
Christopher Davis
William Sankey
Ayush Tiwari
Acquisition Editor
Rebecca Youé
Content Development Editor
Akashdeep Kundu
Technical Editors
Novina Kewalramani
Shruti Rawool
Copy Editor
Sonia Cheema
Project Coordinator
Milton Dsouza
Proofreader
Safis Editing
Indexer
Mariammal Chettiar
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
About the Author
Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing at web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational at Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.
I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.
About the Reviewers
Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.
I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.
William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects range from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.
I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.
Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.
He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj).
He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee.
Besides backend development, he loves to work on computational Python/data analysis using Python libraries, such as NumPy, SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.
He loves trekking through Himalayan valleys and participates in several treks every year, adding this to his list of interests, besides playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.
I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me.
Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
The Internet contains the most useful set of data ever assembled, which is largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be extracted to be useful. This process of extracting data from web pages is known as web scraping and is becoming increasingly useful as ever more information is available online.
What this book covers
Chapter 1, Introduction to Web Scraping, introduces web scraping and explains ways to crawl a website.
Chapter 2, Scraping the Data, shows you how to extract data from web pages.
Chapter 3, Caching Downloads, teaches you how to avoid redownloading by caching results.
Chapter 4, Concurrent Downloading, helps you to scrape data faster by downloading in parallel.
Chapter 5, Dynamic Content, shows you how to extract data from dynamic websites.
Chapter 6, Interacting with Forms, shows you how to work with forms to access the data you are after.
Chapter 7, Solving CAPTCHA, elaborates how to access data that is protected by CAPTCHA images.
Chapter 8, Scrapy, teaches you how to use the popular high-level Scrapy framework.
Chapter 9, Overview, is an overview of web scraping techniques that have been covered.
What you need for this book
All the code used in this book has been tested with Python 2.7, and is available for download at http://bitbucket.org/wswp/code. Ideally, in a future version of this book, the examples will be ported to Python 3. However, for now, many of the libraries required (such as Scrapy/Twisted, Mechanize, and Ghost) are only available for Python 2. To help illustrate the crawling examples, we created a sample website at http://example.webscraping.com. This website limits how fast you can download content, so if you prefer to host this yourself the source code and installation instructions are available at http://bitbucket.org/wswp/places.
We decided to build a custom website for many of the examples used in this book instead of scraping live websites, so that we have full control over the environment. This provides us stability—live websites are updated more often than books, and by the time you try a scraping example, it may no longer work. Also, a custom website allows us to craft examples that illustrate specific skills and avoid distractions. Finally, a