{"id":1629,"date":"2019-03-01T04:53:13","date_gmt":"2019-03-01T04:53:13","guid":{"rendered":"https:\/\/blog.hassler.ec\/wp\/?p=1629"},"modified":"2019-02-28T02:59:01","modified_gmt":"2019-02-28T02:59:01","slug":"web-scraping-for-web-developers-a-concise-summary","status":"publish","type":"post","link":"https:\/\/blog.hassler.ec\/wp\/2019\/03\/01\/web-scraping-for-web-developers-a-concise-summary\/","title":{"rendered":"Web scraping for web developers: a concise\u00a0summary"},"content":{"rendered":"<figure id=\"7f0b\" class=\"graf graf--figure graf-after--h3\">\n<div class=\"aspectRatioPlaceholder is-locked\">\n<div class=\"progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded\" data-image-id=\"1*QYXgeKvQq5M0lMGMRFXJvA.jpeg\" data-width=\"4000\" data-height=\"2575\" data-is-featured=\"true\" data-action=\"zoom\" data-action-value=\"1*QYXgeKvQq5M0lMGMRFXJvA.jpeg\" data-scroll=\"native\"><img decoding=\"async\" class=\"progressiveMedia-image js-progressiveMedia-image\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*QYXgeKvQq5M0lMGMRFXJvA.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*QYXgeKvQq5M0lMGMRFXJvA.jpeg\"><\/div>\n<\/div><figcaption class=\"imageCaption\">Photo by&nbsp;<a class=\"markup--anchor markup--figure-anchor\" href=\"https:\/\/unsplash.com\/photos\/pgxZAv-bYkM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/unsplash.com\/photos\/pgxZAv-bYkM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">rawpixel<\/a>&nbsp;on&nbsp;<a class=\"markup--anchor markup--figure-anchor\" href=\"https:\/\/unsplash.com\/search\/photos\/fishing-net?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/unsplash.com\/search\/photos\/fishing-net?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/figcaption><\/figure>\n<p id=\"5587\" class=\"graf graf--p graf-after--figure\">Knowing one approach to web scraping may solve your problem in the short term, but all methods have their own strengths and weaknesses. Being aware of this can save you time and help you to solve a task more efficiently.<\/p>\n<p id=\"0ac6\" class=\"graf graf--p graf-after--p\">Numerous resources exist, which will show you a single technique for extracting data from a web page. 
<p>What are your options for programmatically extracting data from a web page?</p>

<p>What are the pros and cons of each approach?</p>

<p>How can you use cloud services to increase the degree of automation?</p>

<p><strong>This guide is meant to answer these questions.</strong></p>

<p>I assume you have a basic understanding of browsers in general, <strong>HTTP</strong> requests, the <strong>DOM</strong> (Document Object Model), <strong>HTML</strong>, <strong>CSS selectors</strong>, and <strong>async JavaScript</strong>.</p>

<p>If these phrases sound unfamiliar, I suggest checking out those topics before continuing. The examples are implemented in Node.js, but hopefully you can transfer the theory to other languages if needed.</p>

<h3>Static content</h3>

<h4>HTML source</h4>

<p>Let's start with the simplest approach.</p>

<p>If you are planning to scrape a web page, this is the first method to try. It requires a negligible amount of computing power and the least time to implement.</p>

<p>However, it <strong>only works if the HTML source code contains the data</strong> you are targeting. To check that in Chrome, right-click the page and choose <em>View page source</em>. Now you should see the HTML source code.</p>

<p>It's important to note here that you won't see the same code by using Chrome's inspect tool, because it shows the HTML structure related to the current state of the page, which is not necessarily the same as the source HTML document that you can get from the server.</p>
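<p>If you would rather check this programmatically than eyeball the page source, a quick way is to fetch the raw HTML and search it for a piece of the data you expect. This is a minimal sketch; the URL and the search string are placeholders:</p>

<pre>const fetch = require('node-fetch');

// Fetch the raw HTML document exactly as the server returns it,
// and check whether the text we are looking for is already present.
fetch('https://example.com/')
  .then(res => res.text())
  .then(html => {
    // true means the data is in the static source, so this simple approach is enough
    console.log(html.includes('Example Domain'));
  });</pre>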
<p>Once you find the data in the source, write a <a href="https://www.w3schools.com/cssref/css_selectors.asp">CSS selector</a> belonging to the wrapping element, so you have a reference later on.</p>

<p>To implement this, you can send an HTTP GET request to the URL of the page, and you will get back the HTML source code.</p>

<p>In <strong>Node</strong>, you can use a tool called <a href="https://github.com/cheeriojs/cheerio">CheerioJS</a> to parse this raw HTML and extract the data using a selector. The code looks something like this:</p>

<pre>const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://example.com/';
const selector = '.example';

fetch(url)
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const data = $(selector);
    console.log(data.text());
  });</pre>

<h3>Dynamic content</h3>

<p>In many cases, you can't access the information from the raw HTML code, because the DOM is manipulated by some JavaScript executed in the background. A typical example of that is a SPA (Single Page Application), where the HTML document contains a minimal amount of information and the JavaScript populates it at runtime.</p>

<p>In this situation, a solution is to build the DOM and execute the scripts located in the HTML source code, just like a browser does. After that, the data can be extracted from this object with selectors.</p>

<h4>Headless browsers</h4>

<p>This can be achieved by using a headless browser. A headless browser is almost the same thing as the normal one you are probably using every day, just without a user interface. It runs in the background, and you can control it programmatically instead of clicking with your mouse and typing with a keyboard.</p>

<p>A popular choice for a headless browser is <a href="https://github.com/GoogleChrome/puppeteer">Puppeteer</a>. It is an easy-to-use Node library which provides a high-level API to control Chrome in headless mode. It can be configured to run non-headless, which comes in handy during development. The following code does the same thing as before, but it will work with dynamic pages as well:</p>
<pre>const puppeteer = require('puppeteer');

async function getData(url, selector){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const data = await page.evaluate(selector => {
    return document.querySelector(selector).innerText;
  }, selector);
  await browser.close();
  return data;
}

const url = 'https://example.com';
const selector = '.example';
getData(url, selector)
  .then(result => console.log(result));</pre>

<p>Of course, you can do more interesting things with Puppeteer, so it is worth checking out the <a href="https://pptr.dev/">documentation</a>. Here is a code snippet which navigates to a URL, takes a screenshot and saves it:</p>

<pre>const puppeteer = require('puppeteer');

async function takeScreenshot(url, path){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({path: path});
  await browser.close();
}

const url = 'https://example.com';
const path = 'example.png';
takeScreenshot(url, path);</pre>

<p>As you can imagine, running a browser requires much more computing power than sending a simple GET request and parsing the response. Therefore, execution is relatively costly and slow. Not only that, but including a browser as a dependency makes the deployment package massive.</p>

<p>On the upside, this method is highly flexible. You can use it for navigating around pages, simulating clicks, mouse moves and keyboard events, filling out forms, taking screenshots or generating PDFs of pages, executing commands in the console, and selecting elements to extract their text content. Basically, everything that you can do manually in a browser can be done this way as well.</p>
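<p>To give a feel for that flexibility, here is a small sketch that fills out a form, submits it, waits for the result, and saves the rendered page as a PDF. The URL and the selectors are made-up placeholders; you would replace them with the ones belonging to your target page:</p>

<pre>const puppeteer = require('puppeteer');

async function searchAndSave(){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/search'); // placeholder URL

  // Type a query into a (hypothetical) search field and submit the form
  await page.type('#query', 'web scraping');
  await page.click('#submit');

  // Wait until the results container appears in the DOM
  await page.waitForSelector('.results');

  // Save the rendered page as a PDF (works in headless mode)
  await page.pdf({ path: 'results.pdf' });

  await browser.close();
}

searchAndSave();</pre>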
<h4>Building just the DOM</h4>

<p>You may think it's a little bit of overkill to simulate a whole browser just for building a DOM. Actually, it is, at least under certain circumstances.</p>

<p>There is a Node library called <a href="https://github.com/jsdom/jsdom">Jsdom</a>, which will parse the HTML you pass it, just like a browser does. However, it isn't a browser, but <strong>a tool for building a DOM from a given HTML source code</strong>, while also executing the JavaScript code within that HTML.</p>

<p>Thanks to this abstraction, Jsdom is able to run faster than a headless browser. If it's faster, why not use it instead of headless browsers all the time?</p>

<p>Quote from the documentation:</p>

<blockquote><p>People often have trouble with asynchronous script loading when using jsdom. Many pages load scripts asynchronously, but there is no way to tell when they're done doing so, and thus when it's a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation.</p></blockquote>

<blockquote><p>… This can be worked around by polling for the presence of a specific element.</p></blockquote>

<p>This solution is shown in the example below. It checks every 100 ms whether the element has appeared or the operation has timed out (after 2 seconds).</p>

<p>Jsdom also often throws nasty error messages when some browser feature used by the page is not implemented, such as "<em>Error: Not implemented: window.alert…</em>" or "<em>Error: Not implemented: window.scrollTo…</em>". This issue can also be solved with some workarounds (<a href="https://github.com/jsdom/jsdom#virtual-consoles">virtual consoles</a>).</p>

<p>Generally, it's a lower-level API than Puppeteer, so you need to implement certain things yourself.</p>

<p>These things make it a little messier to use, as you will see in the example. Puppeteer solves all of this for you behind the scenes and makes it extremely easy to use. In exchange for this extra work, Jsdom offers a fast and lean solution.</p>
<p>Let's see the same example as previously, but with Jsdom:</p>

<pre>const jsdom = require("jsdom");
const { JSDOM } = jsdom;

async function getData(url, selector, timeout) {
  const virtualConsole = new jsdom.VirtualConsole();
  virtualConsole.sendTo(console, { omitJSDOMErrors: true });
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously",
    resources: "usable",
    virtualConsole
  });
  const data = await new Promise((res, rej) => {
    const started = Date.now();
    const timer = setInterval(() => {
      const element = dom.window.document.querySelector(selector);
      if (element) {
        res(element.textContent);
        clearInterval(timer);
      }
      else if (Date.now() - started > timeout) {
        rej("Timed out");
        clearInterval(timer);
      }
    }, 100);
  });
  dom.window.close();
  return data;
}

const url = "https://example.com/";
const selector = ".example";
getData(url, selector, 2000).then(result => console.log(result));</pre>

<h4>Reverse engineering</h4>

<p>Jsdom is a fast and lightweight solution, but it's possible to simplify things even further.</p>

<p>Do we even need to simulate the DOM?</p>

<p>Generally speaking, the webpage that you want to scrape consists of the same HTML and the same JavaScript, built on technologies you already know. So, <strong>if you find the piece of code from which the targeted data was derived, you can repeat the same operation to get the same result.</strong></p>

<p>If we <strong>oversimplify</strong> things, the data you're looking for can be:</p>

<ul>
<li>part of the HTML source code (as we saw in the first section),</li>
<li>part of a static file referenced in the HTML document (for example, a string in a JavaScript file),</li>
<li>a response to a network request (for example, some JavaScript code sent an AJAX request to a server, which responded with a JSON string).</li>
</ul>

<p><strong>All of these data sources can be accessed with network requests.</strong> From our perspective, it doesn't matter if the webpage uses HTTP, WebSockets or any other communication protocol, because all of them are reproducible in theory.</p>

<p>Once you locate the resource housing the data, you can send a similar network request to the same server as the original page does. As a result, you get the response containing the targeted data, which can easily be extracted with regular expressions, string methods, <em>JSON.parse</em> and so on.</p>
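<p>As an illustration, suppose you discover, by watching the page's network traffic, that it fills its content from a JSON endpoint. You can then call that endpoint directly and parse the response yourself. The URL, the header and the response shape below are made-up assumptions, not a real API:</p>

<pre>const fetch = require('node-fetch');

// Hypothetical endpoint spotted in the browser's network traffic
const apiUrl = 'https://example.com/api/products?page=1';

fetch(apiUrl, {
  // Some servers only answer requests that look like they come from the page itself,
  // so it can help to copy headers such as User-Agent or Referer from the original request.
  headers: { 'User-Agent': 'Mozilla/5.0' }
})
  .then(res => res.json()) // equivalent to JSON.parse on the response body
  .then(data => {
    // Assuming the response is an object with an "items" array
    console.log(data.items.map(item => item.name));
  });</pre>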
<p>Put simply, you can just take the resource where the data is located, instead of processing and loading the whole page. This way, the problem shown in the previous examples can be solved with a single HTTP request instead of controlling a browser or a complex JavaScript object.</p>

<p>This solution seems easy in theory, but most of the time it can be <strong>really time-consuming</strong> to carry out, and it requires some experience of working with web pages and servers.</p>

<p>A possible place to start researching is to observe the network traffic. A great tool for that is the <a href="https://developers.google.com/web/tools/chrome-devtools/network-performance/">Network tab in Chrome DevTools</a>. You will see all outgoing requests with the responses (including static files, AJAX requests, etc.), so you can iterate through them and look for the data.</p>

<p>This can be even more sluggish if the response is modified by some code before being rendered on the screen. In that case, you have to find that piece of code and understand what's going on.</p>

<p>As you can see, this solution may require far more work than the methods featured so far. On the other hand, once it's implemented, it provides the best performance.</p>

<p>This chart shows the required execution time and the package size of reverse engineering compared to Jsdom and Puppeteer:</p>

<figure><img src="https://cdn-images-1.medium.com/max/1600/1*36D8phqv-iUx6SVmrqhJcQ.jpeg" alt="Chart comparing execution time and package size of reverse engineering, Jsdom and Puppeteer"></figure>

<p>These results aren't based on precise measurements and can vary in every situation, but they show the approximate difference between these techniques well.</p>

<h3>Cloud service integration</h3>

<p>Let's say you have implemented one of the solutions listed so far. One way to execute your script is to power on your computer, open a terminal and run it manually.</p>
<p>This can become annoying and inefficient very quickly, so it would be better if we could just upload the script to a server and have it executed on a regular basis, depending on how it's configured.</p>

<p>This can be done by running an actual server and configuring some rules for when to execute the script. Servers shine when you keep observing an element on a page. In other cases, a cloud function is probably a simpler way to go.</p>

<p>Cloud functions are basically containers intended to execute the uploaded code when a triggering event occurs. This means you don't have to manage servers; that is done automatically by the cloud provider of your choice.</p>

<p>A possible trigger can be a schedule, a network request, and numerous other events. You can save the collected data in a database, write it to a <a href="https://developers.google.com/sheets/api/">Google Sheet</a> or send it in an <a href="https://www.w3schools.com/nodejs/nodejs_email.asp">email</a>. It all depends on your creativity. A minimal sketch of such a function is shown after the provider list below.</p>

<p>Popular cloud providers are <a href="https://aws.amazon.com/">Amazon Web Services</a> (AWS), <a href="https://cloud.google.com/">Google Cloud Platform</a> (GCP) and <a href="https://azure.microsoft.com/">Microsoft Azure</a>, and all of them have a function service:</p>

<ul>
<li><a href="https://aws.amazon.com/lambda/">AWS Lambda</a></li>
<li><a href="https://cloud.google.com/functions/">GCP Cloud Functions</a></li>
<li><a href="https://azure.microsoft.com/services/functions/">Azure Functions</a></li>
</ul>

<p>They offer some amount of free usage every month, which your single script probably won't exceed except in extreme cases, but <strong>please check the pricing before use</strong>.</p>
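<p>As a rough illustration of how little glue code is needed, here is a sketch of an HTTP-triggered function in the style of GCP Cloud Functions. It reuses the <em>getData</em> function from the Puppeteer example above (repeated here so the file is self-contained); the exported function name and the query parameters are just placeholders:</p>

<pre>// index.js of a (hypothetical) Cloud Function deployment
const puppeteer = require('puppeteer');

async function getData(url, selector){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const data = await page.evaluate(selector => {
    return document.querySelector(selector).innerText;
  }, selector);
  await browser.close();
  return data;
}

// Exported HTTP handler, triggered by a request such as
// https://REGION-PROJECT.cloudfunctions.net/scrape?url=...&selector=...
exports.scrape = async (req, res) => {
  try {
    const data = await getData(req.query.url, req.query.selector);
    res.status(200).send(data);
  } catch (err) {
    res.status(500).send(err.toString());
  }
};</pre>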
graf-after--p\">If you are using Puppeteer, Google\u2019s C<em class=\"markup--em markup--p-em\">loud Functions<\/em>&nbsp;is the simplest solution. Headless Chrome\u2019s zipped package size (~130MB) exceeds AWS Lambda\u2019s limit of maximum zipped size (50MB). There are some techniques to make it work with Lambda, but GCP functions&nbsp;<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/cloud.google.com\/blog\/products\/gcp\/introducing-headless-chrome-support-in-cloud-functions-and-app-engine\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/cloud.google.com\/blog\/products\/gcp\/introducing-headless-chrome-support-in-cloud-functions-and-app-engine\">support headless Chrome by default<\/a>, you just need to include Puppeteer as a dependency in&nbsp;<em class=\"markup--em markup--p-em\">package.json<\/em>.<\/p>\n<p id=\"5bc6\" class=\"graf graf--p graf-after--p\">If you want to learn more about cloud functions in general, do some research on serverless architectures. Many great guides have already been written on this topic and most providers have an easy to follow documentation.<\/p>\n<h3 id=\"855c\" class=\"graf graf--h3 graf-after--p\">Summary<\/h3>\n<p id=\"389d\" class=\"graf graf--p graf-after--h3\">I know that every topic was a bit compressed. You probably can\u2019t implement every solution just with this knowledge, but with the documentation and some custom research, it shouldn\u2019t be a problem.<\/p>\n<p id=\"7b8b\" class=\"graf graf--p graf-after--p graf--trailing\">Hopefully, now you have a high-level overview of techniques used for collecting data from the web, so you can dive deeper into each topic accordingly.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>Source:&nbsp;<a href=\"https:\/\/medium.freecodecamp.org\/web-scraping-for-web-developers-a-concise-summary-3af3d0ca4069\"><strong>https:\/\/medium.freecodecamp.org\/web-scraping-for-web-developers-a-concise-summary-3af3d0ca4069<\/strong><\/a><\/p>\n<p>Written by<\/p>\n<div class=\"u-tableCell\">\n<div class=\"u-relative u-inlineBlock u-flex0\"><img decoding=\"async\" class=\"avatar-image avatar-image--small alignleft\" src=\"https:\/\/cdn-images-1.medium.com\/fit\/c\/120\/120\/1*WJx-rbLvSTek_8tU715ixg.jpeg\" alt=\"Go to the profile of David Karolyi\"><\/p>\n<div class=\"avatar-halo u-absolute u-textColorGreenNormal svgIcon\"><\/div>\n<\/div>\n<\/div>\n<div class=\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\">\n<h3 class=\"ui-h3 u-fontSize18 u-lineHeightTighter\"><a class=\"link link--primary u-accentColor--hoverTextNormal\" dir=\"auto\" title=\"Go to the profile of David Karolyi\" href=\"https:\/\/medium.freecodecamp.org\/@davidkarolyi\" rel=\"author cc:attributionUrl\" aria-label=\"Go to the profile of David Karolyi\" data-user-id=\"38927691045a\" data-collection-slug=\"free-code-camp\">David Karolyi<\/a><\/h3>\n<div class=\"ui-caption u-textColorGreenNormal u-fontSize13 u-tintSpectrum u-accentColor--textNormal u-marginBottom7\">Medium member since Jan 2019<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<div class=\"u-tableCell \"><a class=\"link u-baseColor--link avatar avatar--roundedRectangle\" title=\"Go to freeCodeCamp.org\" href=\"https:\/\/medium.freecodecamp.org\/?source=footer_card\" aria-label=\"Go to freeCodeCamp.org\" data-action-source=\"footer_card\" data-collection-slug=\"free-code-camp\"><img decoding=\"async\" class=\"avatar-image u-size60x60 alignleft\" src=\"https:\/\/cdn-images-1.medium.com\/fit\/c\/120\/120\/1*MotlWcSa2n6FrOx3ul89kw.png\" 
alt=\"freeCodeCamp.org\"><\/a><\/div>\n<div class=\"u-tableCell u-verticalAlignMiddle u-breakWord u-paddingLeft15\">\n<h3 class=\"ui-h3 u-fontSize18 u-lineHeightTighter u-marginBottom4\"><a class=\"link link--primary u-accentColor--hoverTextNormal\" href=\"https:\/\/medium.freecodecamp.org\/?source=footer_card\" rel=\"collection\" data-action-source=\"footer_card\" data-collection-slug=\"free-code-camp\">freeCodeCamp.org<\/a><\/h3>\n<p class=\"ui-body u-fontSize14 u-lineHeightBaseSans u-textColorDark u-marginBottom4\">Stories worth reading about programming and technology from our open source community.<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Photo by&nbsp;rawpixel&nbsp;on&nbsp;Unsplash Knowing one approach to web scraping may solve your problem in the short term, but all methods have [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":1631,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[49,12,118,44,47,112,29],"tags":[],"class_list":["post-1629","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-back-end","category-bloghassler-ec","category-internet","category-javascript","category-medium","category-nodejs","category-programacion"],"_links":{"self":[{"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/posts\/1629","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/comments?post=1629"}],"version-history":[{"count":2,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/posts\/1629\/revisions"}],"predecessor-version":[{"id":1633,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/posts\/1629\/revisions\/1633"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/media\/1631"}],"wp:attachment":[{"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/media?parent=1629"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/categories?post=1629"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.hassler.ec\/wp\/wp-json\/wp\/v2\/tags?post=1629"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}