Levelling Up Your Web Scraping Game - Ian Littman - PHP UK 2022

Develop your web scraping expertise with Ian Littman's talk at PHP UK 2022, covering topics from Base64 encoded JSON to terminal emulation, and sharing insights on how to navigate common challenges and obstacles.

Key takeaways
  • Base64-encoded JSON can be a sign of JWT authentication.
  • Certain frameworks try to fingerprint users, including those browsing with Firefox.
  • Compare a successful login with a failed login; Chrome is slower than Firefox.
  • A CSRF token is present after certain requests.
  • Client to client protection exists, but it’s not perfect.
  • Puppeteer provides a JavaScript interface for controlling Chrome.
  • The data transfer protocol is a modified version of MessagePack.
  • JavaScript extraction can be useful for auto-generated PDFs.
  • Rate center information can be extracted.
  • Insurance carrier information can be obtained from websites.
  • Firefox's developer tools can be used for reverse engineering.
  • Nathaniel Schutta’s presentation slides are still available for download.
  • Axios is a library that can be used to make HTTP requests.
  • IP addresses can be blocked for scraping.
  • Sites may block scraping or have CAPTCHAs.
  • An HTTP browser can catch requests such as preflight checks.
  • Preprocessing is important before scraping.
  • Cookies can be used to maintain session state.
  • DomCrawler can be used to extract data.
  • HTTP requests can be made using Axios or mike Meeting.
  • Terminal emulation is a technique that can be used for web scraping.
  • Rate center information can be used to make requests.
  • The speaker mentions cases where web scraping is legal.
  • The speaker works for an insurance company.
  • Encrypted Excel files can be decrypted.
  • Generally, you don’t need to know how a website works.
  • You can use libraries like PyPDF2 to read PDF files.
  • Data can be extracted from PDFs using PyPDF2.
  • JSON files can be parsed using Python's json module.
  • State can be used to remember information.
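
The first takeaway (Base64-encoded JSON hinting at JWT authentication) can be illustrated with a minimal sketch. It is not taken from the talk. It assumes the value found in a header or cookie has the usual three dot-separated JWT parts, and it only decodes the payload without verifying any signature. The function names and the demo claims are hypothetical. Compact JSON objects encoded this way often begin with "eyJ", which is a handy visual cue when reading captured requests.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the payload of a JWT-shaped string WITHOUT verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT-shaped token")
    payload_b64 = parts[1]
    # JWTs use URL-safe Base64 with the padding stripped; restore it first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def make_demo_token(payload: dict) -> str:
    """Build an unsigned, JWT-shaped demo string (hypothetical data only)."""

    def b64(obj: dict) -> str:
        raw = json.dumps(obj, separators=(",", ":")).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    # The third part is a placeholder, not a real signature.
    return ".".join([b64({"alg": "none", "typ": "JWT"}), b64(payload), "signature"])


if __name__ == "__main__":
    token = make_demo_token({"sub": "demo-user", "exp": 1700000000})
    print(token)
    print(decode_jwt_payload(token))
```

Reading the payload this way is only for inspection while reverse engineering a site; claims such as an expiry time can suggest when a scraper needs to log in again.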