Serverless Scraping Light

A detailed, step-by-step guide to building a REST API in Node.js with AWS Lambda, API Gateway, DynamoDB, and the Serverless Framework, with deployment on AWS

Overview

We will build a function app (a REST API) in Node.js using the Serverless Framework, while covering the concepts behind developing and deploying serverless applications.

Prerequisites

  • Node.js
  • Basic understanding of AWS Cloud
  • Bash CLI

What is Serverless?

Serverless (also called function apps) is a cloud-native development model that allows developers to build and run applications without having to manage servers. So technically it is not server-less; it is just that the runtime and servers are fully managed and scaled by the cloud hosting provider.

Serverless vs. Traditional Infrastructure

What is the Serverless Framework?

The Serverless Framework is a free and open-source web framework written in Node.js for building and deploying function apps (primarily on AWS). It comes with a great set of advantages (as summarized by Mathijs):

  • Provides a serverless.yml configuration to orchestrate functions across regions and tiers
  • Abstracts cloud specifications
  • Event-driven configuration + capabilities
  • Powerful CLI to configure, generate and deploy cloud functions
  • Multi-stage deployment (integration with CI/CD pipelines)
  • Cloud platform agnostic (deploy to any cloud account)
  • Many official and community plugins for common solutions
  1. Setup

    Let's start with some hands-on work. The Serverless Framework provides an out-of-the-box CLI to facilitate the setup, running, and deployment of a function app. Run the command below in the terminal to install the Serverless CLI:

    # node: v16.17
    # npm: v6.14
    # use nvm to install the latest LTS version of Node on your machine.
     
    npm install -g serverless
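
    You can confirm the CLI is available afterwards:

    # prints the installed framework version
    serverless --version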
  2. Creating a new project

    Run the command below to set up an aws-node-http-api-typescript-dynamodb project. You can check out all example projects here.

    # Create a service in a new folder 'my-service'
    serverless create --template-url https://github.com/serverless/examples/tree/v3/aws-node-http-api-typescript-dynamodb --path my-service
    
    # ✔ Project successfully created in "my-service" from "aws-node-http-api-typescript-dynamodb" template (11s)

    It will create a file structure as below:

    my-service
    ├── README.md
    ├── package.json
    ├── serverless.yml
    ├── todos
    │   ├── create.ts
    │   ├── get.ts
    │   ├── list.ts
    │   └── update.ts
    ├── tsconfig.json
    └── tslint.json
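
    Step into the folder and install the dependencies before running anything:

    cd my-service
    npm install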

    The function flow will be as below:

    Serverless Function Flow

  3. Understanding the serverless.yml
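
    The serverless.yml file is the heart of the project: it declares the cloud provider and runtime, the functions with their HTTP event triggers, the IAM permissions, and the DynamoDB table as a CloudFormation resource. The trimmed sketch below shows the general shape of the template's file; the names and values are illustrative, so check the generated file for the exact contents.

    service: my-service

    provider:
      name: aws
      runtime: nodejs16.x
      environment:
        DYNAMODB_TABLE: ${self:service}-${sls:stage}
      iam:
        role:
          statements:
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:Scan
              Resource: "arn:aws:dynamodb:${aws:region}:*:table/${self:provider.environment.DYNAMODB_TABLE}"

    functions:
      create:
        handler: todos/create.create
        events:
          - httpApi:
              path: /todos
              method: post
      get:
        handler: todos/get.get
        events:
          - httpApi:
              path: /todos/{id}
              method: get

    resources:
      Resources:
        TodosTable:
          Type: AWS::DynamoDB::Table
          Properties:
            TableName: ${self:provider.environment.DYNAMODB_TABLE}
            BillingMode: PAY_PER_REQUEST
            AttributeDefinitions:
              - AttributeName: id
                AttributeType: S
            KeySchema:
              - AttributeName: id
                KeyType: HASH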

  4. Running locally

    As I said earlier, the Serverless Framework comes with a great set of official and community plugins, and there is one that comes in handy for running the function app locally: serverless-offline.

    Simply run the command below to add it to the project:

    npm install serverless-offline --save-dev

    And add the below to serverless.yml:

    plugins:
      - serverless-offline

    Then run the app locally with the command below. For the complete file, refer to serverless.yml in openwishlist/preview.
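
    # serves the HTTP endpoints on http://localhost:3000 by default
    serverless offline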

  5. Connecting DynamoDB locally

    You can also add serverless-dynamodb-local to run a local DynamoDB instance during offline runs.

    npm install serverless-dynamodb-local --save-dev

    And add the below to serverless.yml:

    plugins:
      - serverless-dynamodb-local
    
    custom:
      dynamodb:
        stages:
          - dev
          - live
        start:
          migrate: true

    Then run the commands below; serverless offline start will start a local DynamoDB and migrate the schema automatically. For the complete file, refer to serverless.yml in openwishlist/preview.
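
    # download the local DynamoDB (Java) distribution, per the plugin's docs
    serverless dynamodb install

    # run the app with the local DynamoDB and auto-migrated schema
    serverless offline start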

  6. Building a scraping service

    Finally, let's focus on building the scraping service. Scraping is a cat-and-mouse game: website owners adopt newer and better bot-blocking services, while scrapers keep finding evasions to avoid detection.

    The scraping service I built is part of a larger project and is completely open source. Feel free to explore the code and contribute at openwishlist/preview.

    Also, you can check out the detailed design of the service below and continue reading this article:

    Serverless Scraping Service Flow

    As per the current design, the service depends on a curl-style request and headless Chrome to get the HTML content of the page. The Lambda function first tries to fetch the page with a plain HTTP request (which almost always gets blocked by top e-commerce stores like Amazon and Walmart) and then falls back to headless Chrome, driven by Puppeteer, to load the page and grab the HTML content.
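
    In outline, the fallback logic looks roughly like the sketch below. The function and helper names here are illustrative, not the actual ones; the real implementation lives in src/lib/scraper.ts and src/lib/browser.ts.

    import axios from "axios";
    // Hypothetical helper that launches headless Chrome via Puppeteer and
    // returns the rendered HTML (the real logic lives in src/lib/browser.ts).
    import { getHtmlWithBrowser } from "./browser";

    export async function getHtml(url: string): Promise<string> {
      try {
        // First attempt: a plain HTTP GET, cheap and fast when it works.
        // axios rejects on non-2xx statuses by default, so a 403/429/503
        // from a bot blocker lands in the catch block below.
        const response = await axios.get<string>(url, {
          headers: { "User-Agent": "Mozilla/5.0" }, // minimal disguise
          timeout: 10_000,
        });
        return response.data;
      } catch {
        // Blocked or failed: fall back to rendering in headless Chrome.
        return getHtmlWithBrowser(url);
      }
    }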

    The detailed directory tree can be seen below:

    # tree -I "node_modules|test*|spec"
    .
    ├── README.md
    ├── config
    │   └── default.json
    ├── jest-dynamodb-config.js
    ├── jest.config.ts
    ├── layers
    │   └── chromium-v109.0.6-layer.zip
    ├── package-lock.json
    ├── package.json
    ├── renovate.json
    ├── serverless.yml
    ├── src
    │   ├── api
    │   │   └── preview
    │   │       └── get.ts
    │   ├── lib
    │   │   ├── browser.ts
    │   │   ├── parsers
    │   │   │   ├── amazon.ts
    │   │   │   ├── manual.ts
    │   │   │   ├── parser.ts
    │   │   │   └── schema.ts
    │   │   └── scraper.ts
    │   ├── models
    │   │   └── result.ts
    │   └── utils
    │       └── common.ts
    ├── tsconfig.json
    └── tslint.json
    • The src/api/preview/get.ts file is the primary API request handler, mapped to https://preview.api.play.adapttive.com/preview/<base64_encoded_url>
    • The src/lib/scraper.ts file holds the logic to load the requested URL and grab the HTML content using the most efficient parser under src/lib/parsers. It first tries to load the URL with an axios GET request and, on error or when blocked, falls back to src/lib/browser.ts.
    • browser.ts: Running an axios request on AWS Lambda is a pretty easy job, but launching headless Chrome takes a lot of effort and experimentation, because there are several blockers:

      • AWS Lambda package size limit (~50MB): this is resolved using a Lambda layer containing Chromium. The layers/chromium-v109.0.6-layer.zip is pushed to Lambda layers during the build and linked to the Lambda function. The mapping is done in serverless.yml as below:

        layers:
          HeadlessChrome:
            name: HeadlessChrome
            compatibleRuntimes:
              - nodejs16.x
            description: Required for headless chrome
            package:
              artifact: layers/chromium-v109.0.6-layer.zip

      • Headless Chrome getting detected as a robot browser: many firewall rules and security protocols are in place to identify the requester of an HTML page as a robot, like the CAPTCHA verification used by Cloudflare. I have used puppeteer-extra with the puppeteer-extra-plugin-stealth plugin to make the headless browser behave like a real user.
      • Differences between local and AWS Lambda environments: testing headless Chrome locally can be a little difficult, as your local Chrome can be a different version. So, always verify that your local Chrome version is compatible with your Puppeteer version. Also, puppeteer-extra does not work properly in a Windows development environment.
    • parsers: I am using the cheerio library for parsing and traversing the HTML to get the required data, but the HTML is not the same across e-commerce stores, which led me to create three different types of HTML parsers (a sketch of the first one follows this list):

      • schema.ts: For most e-commerce stores the HTML head contains an application/ld+json block, which is the standard product-data JSON specified by schema.org. cheerio finds this JSON, and the result is prepared from it.
      • manual.ts: The manual parser traverses the HTML and uses head->title and head->description to prepare the result.
      • amazon.ts: The Amazon parser works only if the domain matches an applicable Amazon store in any region. It extracts the required values from the page data and prepares the result.
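
    For illustration, the schema.org path can be sketched roughly as below. The function name and result shape are illustrative; see src/lib/parsers/schema.ts and src/models/result.ts for the real versions.

    import * as cheerio from "cheerio";

    // Illustrative result shape; the real one lives in src/models/result.ts.
    interface PreviewResult {
      title?: string;
      description?: string;
      image?: string;
      price?: string;
    }

    export function parseSchemaOrg(html: string): PreviewResult | null {
      const $ = cheerio.load(html);
      // Product data is usually embedded as JSON-LD in the page head.
      const raw = $('script[type="application/ld+json"]').first().html();
      if (!raw) return null;

      try {
        const data = JSON.parse(raw);
        // Some pages wrap the product in an array of JSON-LD nodes.
        const product = Array.isArray(data)
          ? data.find((node) => node["@type"] === "Product")
          : data;
        if (!product || product["@type"] !== "Product") return null;

        return {
          title: product.name,
          description: product.description,
          image: Array.isArray(product.image) ? product.image[0] : product.image,
          price: product.offers?.price,
        };
      } catch {
        return null; // malformed JSON-LD: let another parser handle the page
      }
    }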
  7. How to debug?

    There could be many scenarios causing the curl or headless Chrome request to be blocked or to fail. You can try the steps below to fix them:

    1. Run serverless locally with verbose logging and check the error logs:

      serverless offline start --verbose
    2. Check Puppeteer and Chrome version compatibility on the official Puppeteer support page: https://pptr.dev/chromium-support. Both versions should match, or it will fail to work properly.
    3. Comment out puppeteer-extra entirely, if used.
    4. Comment out all puppeteer-extra plugins, if used.
    5. If using puppeteer-extra-plugin-stealth, disable the stealth evasions one by one to find the one that breaks:

        import StealthPlugin from 'puppeteer-extra-plugin-stealth'

        const stealth = StealthPlugin();
        // stealth.enabledEvasions.delete('iframe.contentWindow')
        // stealth.enabledEvasions.delete('chrome.app')
        // stealth.enabledEvasions.delete('chrome.csi')
        // stealth.enabledEvasions.delete('chrome.loadTimes')
        // stealth.enabledEvasions.delete('chrome.runtime')
        // stealth.enabledEvasions.delete('defaultArgs')
        // stealth.enabledEvasions.delete('media.codecs')
        // stealth.enabledEvasions.delete('navigator.hardwareConcurrency')
        // stealth.enabledEvasions.delete('navigator.languages')
        // stealth.enabledEvasions.delete('navigator.permissions')
        // stealth.enabledEvasions.delete('navigator.plugins')
        // stealth.enabledEvasions.delete('navigator.webdriver')
        // stealth.enabledEvasions.delete('sourceurl')
        stealth.enabledEvasions.delete('user-agent-override') // this one was blocking the chrome launch
        // stealth.enabledEvasions.delete('webgl.vendor')
        // stealth.enabledEvasions.delete('window.outerdimensions')
        console.log(stealth.enabledEvasions)
    6. Set the DEBUG environment variable to print debug logs in the terminal while running a test script:

      DEBUG=puppeteer-extra,puppeteer-extra-plugin:* npx ts-node browser.ts

    Logs with DEBUG enabled

    7. Enable the remote debugging args and inspect via chrome://inspect:

        // assuming '@sparticuz/chromium' for the Lambda-compatible Chromium
        // build and 'puppeteer-extra' as the Puppeteer entry point
        import chromium from '@sparticuz/chromium';
        import puppeteer from 'puppeteer-extra';

        const options = {
          args: chromium.args,
          defaultViewport: chromium.defaultViewport,
          headless: chromium.headless,
          executablePath: await chromium.executablePath(),
          ignoreHTTPSErrors: true
        };

        options.args.push('--remote-debugging-port=9222');
        options.args.push('--remote-debugging-address=0.0.0.0');

        const browser = await puppeteer.launch(options);
    8. Join the Discord for community support at https://extra.community
  8. Deployment
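
    Deployment is a single command: the Serverless Framework packages the functions, pushes the Chromium layer, and provisions everything through CloudFormation. A typical flow looks like the commands below (the stage and function names are illustrative):

    # deploy the full stack to the dev stage
    serverless deploy --stage dev

    # redeploy a single function after a quick change (skips CloudFormation)
    serverless deploy function --function preview --stage dev

    # tail the logs of a deployed function
    serverless logs --function preview --stage dev --tail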
