Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you’ll learn how to use Python scripts and web APIs to gather and process data from thousands—or even millions—of web pages at once.
Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
Learn how to parse complicated HTML pages
Traverse multiple pages and sites
Get a general overview of APIs and how they work
Learn several methods for storing the data you scrape
Download, read, and extract data from documents
Use tools and techniques to clean badly formatted data
Read and write natural languages
Crawl through forms and logins
Understand how to scrape JavaScript
Learn image processing and text recognition
Since I started the semester, I have been reading internet scraping and network security books. All the books use the example of two arbitrary people, Alice and Bob, exchanging information. And these examples have been getting better and funnier and weirder. Somehow, I don't know why, but maybe it's because I love reading books or I love fiction, my mind has been looking for patterns in these books between Bob and Alice. My conclusion is that these two are government spies who are knee-deep in cover and are trying to get important information out without cluing in their marks. Also, while typing that, I lmao'd like a hundred times because I'm saying such BS.
But this book was brilliant. The information was spot-on and wasn't repetitive. It was one of the most helpful books around.
If you ever want to collect large amounts of data off the Internet through web scraping, please read this book. If you have done some web scraping, this book provides extremely useful nuggets of information to further enhance your web scraping capabilities. Faced some web scraping blocker practices? This book has a great section on how to make your scraper look more "human"!
To balance things out, the author even included a section on the ethics of web scraping, which is something that every web scraper should understand!
I rarely give 5 stars, but this book really took it all the way there. Truly a beautiful soup book!
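As a rough illustration of the "make your scraper look more human" idea mentioned in this review, here is a minimal sketch using only the Python standard library. It is not code from the book: the header values and the helper names `build_request`, `polite_delay`, and `fetch` are assumptions made up for this example. It shows two common ideas: sending browser-like request headers, and pausing a random interval between requests.

```python
import random
import time
import urllib.request

# Hypothetical header values for illustration; a real browser sends
# its own, current strings.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build a Request that carries browser-like headers (no network access)."""
    return urllib.request.Request(url, headers=headers)


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def fetch(url):
    """Wait a little, then download the page body as bytes."""
    polite_delay()
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Calling `fetch("https://example.com/")` in a loop would space requests one to three seconds apart, which also keeps the load on the target server low.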
This is a great text spanning most of the tools, methods and philosophies underpinning web scraping.
Its main problem is a lack of identity: is it teaching web scraping to those with one or two simple tasks, looking to just dip their toe in, or to those looking to build production-quality web scrapers for large-scale tasks? As such, it jumps to and fro in the tools it suggests. The start of the book seems lightweight, and much of it is replaced by recommendations later in the text. This could be made much clearer from the start.
Having said that, Mitchell's textbook is fairly thorough on the topic, and rewards those who persevere through the start with the more nuanced sides of web scraping (multithreading/processing, solving captchas, finding APIs).
The book gives a good general introduction to BeautifulSoup (which is used for web scraping). However, the focus is too heavily skewed towards less important topics. I would have loved to get more details on BeautifulSoup functions rather than on importing data to CSV and the like, since most readers would already have some experience with these sorts of tasks.
I decided to read this book after watching R. Mitchell's mini-course on scraping on LinkedIn Learning. The video course is simply wonderful but rather short, so I wanted to deepen my knowledge. Overall, the book was more of a disappointment. The first four chapters were good, and then a lot went downhill. There are two main problems, but they affect almost all of the subsequent chapters. First, the author dwells on topics only indirectly related to scraping, such as nltk and processing PDF and DOC files. Second, many genuinely interesting and necessary topics are covered only in broad strokes, leaving the reader to study them further on their own. In the end, I found no more than a hundred lines of code in the whole book useful to me. The rest will have to be picked up somewhere else.
Good introductory book on web scraping, but needs an update.
This book does a really good job describing the main techniques and strategies for web crawling and web scraping. Unfortunately, most of the technologies and libraries used in this book are quite outdated today, so if you want to follow the exercises you will need to use different libraries (which might not necessarily be a bad thing).
A solid overview of web scraping with Python. Python is currently the most widely used language for web scraping, and this book gives an overview of how to do it. There are minor errors throughout the text, but the author stated she will fix them in the next edition. If you want a book to read through on scraping rather than exercising your Google search skills, this is the book to get.
Excellent book, thorough and well explained. I think it can be a good introduction to scraping for anyone with a little knowledge of Python. I was surprised that the topics it covers were almost exactly the ones I had been running into on my own while trying to solve the problems that came up when searching for information on the internet. It would have been a great help to start here, although perhaps I wouldn't have understood anything if I had.
The book's own website and the GitHub repository with the code are a great help.
It does a bit of self-promotion for other O'Reilly books, but they really do seem worth it. I'm left wanting to read more about scraping and big data, and to get into some machine learning (maybe even deep learning someday). It also encourages me to read a book on VBA straight through sometime. I also still need more NLP and a better understanding of MySQL. In any case, it was a good overview.
I think I was missing concrete ideas about where to apply what I learned. But I think the next time I face a real problem I'll be better positioned, with more ideas to start from.
This book contains wisdom and methods that have been refined by the author after having to webscrape for what might be years. The starting few chapters of the book, while introducing new things, can often feel like a cookbook, which the author finds is a concise way to write code to minimise the work. While those snippets of code can be a boon for some, for me, they took away the creativity of coding. But I will go back to see them once I have had years of experience in scraping to realise what value they hold.
The second half of the book deals with topics I had never imagined could be in a web scraping book. They are amazing and open your mind to the range of ways you can obtain the data you want. I think this book would have been perfect if there were code exercises to solve after all relevant chapters.
Genuinely useful book that can still teach basic HTML webscraping, the underlying healthy practices and serve as an introduction to more advanced topics. So it's still worth picking up. However, since its release it's become annoyingly outdated.
PhantomJS was discontinued in 2017, so Selenium (covered and used with PhantomJS in this book) no longer supports it, and to download it one must jump through a few more hoops. Personally, I just switched over to the headless Chrome driver, which seems to have emerged since the release of this book.
The syntax for Selenium has also changed, so the examples involving it won't work without modification - which defeats the purpose of learning it from this book - because by the time you've learnt the correct syntax for Selenium you wouldn't need the text anyway.
A useful book that describes how to efficiently extract data from web pages. For data extraction, the author focuses mainly on the Python libraries Beautiful Soup and Selenium. Along the way we learn about regular expressions and ways of connecting to, writing to, and otherwise working with a MySQL database. At the end, the legality of data extraction is described in a very interesting way. I will treat the book as an aid in my projects.
Format: Book. Language: English. I read this book as part of expanding my Python skills. As a textbook seen through that lens, the book is probably not very relevant, but for broadening one's horizons on website scraping tools and technologies it is quite interesting. I don't regret the time spent on it, although I'm not yet sure whether the new knowledge will be useful to me in practice. Re-reading: possibly, if I need to refresh my knowledge or recall details about some of the libraries mentioned in the book.
For someone with Python skills but a limited understanding of and skill in web scraping, this is a fantastic book. It covers the basics of a huge range of techniques (HTML wrangling, web APIs, headless browsers, testing) and also comes with some thoughtful discussions, such as the ethics of web scraping. Highly recommend.
A good introduction, or rather good along-the-way reading. It seems like the author would share much more, but maybe in a YouTube video or Substack post, where freedom of speech is more applicable. Here they have to remind us too often what good boys and girls we should be while scraping. I was reading it alongside web scraping by Chapagain.
A nice introduction to the basics of scraping. Reading this before your first scraping project will probably save you a lot of time and frustration--it's basically a compendium of the basics plus everything you wouldn't know how to search Stack Overflow for. It covers the basics (just grabbing simple HTML and parsing with BeautifulSoup) and touches on more advanced topics (using a headless browser like PhantomJS to parse modern, AJAX-y pages).
If you're more experienced, I'd recommend flipping through it quickly to see if you spot anything you didn't already know. It filled in a few gaps for me.
Probably the best book on web scraping currently available. It not only covers how to handle HTML, but also binary formats like PDF and Word. There are many cautions on how not to shoot yourself in the foot with an automated script, which will help you a lot.
I read it while doing a project, and it really gave me new perspectives and insights that helped me tweak my scrapers as I read more. I recommend this book, especially if you have a bit of knowledge about the tools you are using but have never done any medium to large projects.
Well-written, hands-on analysis of how the web works and how to extract information from it--even when it appears in multiple sites and multiple forms. Very insightful!
A really good introduction to web scraping with Python. This book saved me a lot of time writing my first scraping project. (Also, loved the War and Peace references.)