A detailed step-by-step guide to building a REST API in Node.js with AWS Lambda, API Gateway, DynamoDB, and the Serverless Framework, with deployment on AWS
Overview
We will build a function app (REST API) in Node.js using the Serverless Framework, while covering the concepts of developing and deploying serverless applications.
Prerequisites
- Node.js
- Basic understanding of AWS Cloud
- Bash CLI
What is Serverless?
Serverless (also called function apps) is a cloud-native development model that allows developers to build and run applications without having to manage servers. So technically it's not serverless; it's just that the runtime/servers are fully managed and scaled by the cloud hosting provider.
What is the Serverless Framework?
Serverless Framework is a free and open-source web framework written in Node.js for building and deploying function apps (primarily on AWS). It comes with a great set of advantages (as said by Mathijs):
- Provides `serverless.yaml` configuration to orchestrate functions into regions and tiers
- Abstracts cloud specifications
- Event-driven configuration and capabilities
- Powerful CLI to configure, generate, and deploy cloud functions
- Multi-stage deployment (integration with CI/CD pipelines)
- Cloud platform agnostic (deploy to any cloud account)
- Many official and community plugins for common solutions
Setup
Let's start with some hands-on work. The Serverless Framework provides an out-of-the-box CLI to facilitate the setup, running, and deployment of a function app. Run the command below in the terminal to install the Serverless CLI:

```bash
# node: v16.17
# npm: v6.14
# use nvm to install the latest LTS version of Node on your machine.
npm install -g serverless
```
Creating a new project
Run the commands below to set up an aws-node-http-api-typescript-dynamodb project. You can check out all the example projects here.

```bash
# Create a service in a new folder 'my-service'
serverless create --template-url https://github.com/serverless/examples/tree/v3/aws-node-http-api-typescript-dynamodb --path my-service
# ✔ Project successfully created in "my-service" from "aws-node-http-api-typescript-dynamodb" template (11s)
```

It will create a file structure as below:

```
my-service
├── README.md
├── package.json
├── serverless.yml
├── todos
│   ├── create.ts
│   ├── get.ts
│   ├── list.ts
│   └── update.ts
├── tsconfig.json
└── tslint.json
```
The function flow will be as below:
Understanding the serverless.yml
`service` is the service name used to name the functions/lambdas (suffixed with the environment/stage), and `frameworkVersion` pins the Serverless Framework version; the latest is 3.

```yaml
service: serverless-http-api-typescript-dynamodb
frameworkVersion: '3'
```

`provider` contains all the details of the cloud provider; the default is `aws`. It also holds the `runtime` like `nodejs18.x`, `environment` variables, and `iam` roles. For example, with stage `dev`, `${self:service}-${sls:stage}` resolves the table name to `serverless-http-api-typescript-dynamodb-dev`. Refer to: Infrastructure Providers

```yaml
provider:
  name: aws
  runtime: nodejs18.x
  lambdaHashingVersion: '20201221'
  environment:
    DYNAMODB_TABLE: ${self:service}-${sls:stage}
  httpApi:
    cors: true
  iam:
    role:
      statements:
        - Effect: Allow
          Action:
            - dynamodb:Query
            - dynamodb:Scan
            - dynamodb:GetItem
            - dynamodb:PutItem
            - dynamodb:UpdateItem
            - dynamodb:DeleteItem
          Resource: "arn:aws:dynamodb:${aws:region}:*:table/${self:provider.environment.DYNAMODB_TABLE}"
```
`functions` holds the mapping of API endpoints and request types to the handler/implementation files. You can see that the `create` path is mapped to the `create` function exported from the `todos/create.ts` file (handler `todos/create.create`). Refer: Lambda Functions

```yaml
functions:
  create:
    handler: todos/create.create
    events:
      - httpApi:
          path: /todos
          method: post
  list:
    handler: todos/list.list
    events:
      - httpApi:
          path: /todos
          method: get
  get:
    handler: todos/get.get
    events:
      - httpApi:
          path: /todos/{id}
          method: get
  update:
    handler: todos/update.update
    events:
      - httpApi:
          path: /todos/{id}
          method: put
```
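To make that mapping concrete, below is a minimal sketch of what a handler such as `todos/create.ts` could look like. It assumes the AWS SDK v2 that ships with the Lambda Node.js runtime and the `uuid` package; the template's actual implementation may differ:

```typescript
// todos/create.ts — a minimal sketch, not the template's exact code.
import { APIGatewayProxyHandler } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';
import { v4 as uuid } from 'uuid';

const dynamoDb = new DynamoDB.DocumentClient();

export const create: APIGatewayProxyHandler = async (event) => {
  const data = JSON.parse(event.body || '{}');

  const item = {
    id: uuid(), // partition key defined for TodosDynamoDbTable
    text: data.text,
    checked: false,
    createdAt: new Date().toISOString(),
  };

  // DYNAMODB_TABLE is injected via provider.environment in serverless.yml
  await dynamoDb
    .put({ TableName: process.env.DYNAMODB_TABLE!, Item: item })
    .promise();

  return { statusCode: 200, body: JSON.stringify(item) };
};
```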
`resources` holds all the cloud resources, such as the schema of the database table `TodosDynamoDbTable`. You can add all the required cloud resources here, like `S3::Bucket`, `Events::Rule`, `SNS::Subscription`, etc. For more details refer: Infrastructure Resources

```yaml
resources:
  Resources:
    TodosDynamoDbTable:
      Type: 'AWS::DynamoDB::Table'
      DeletionPolicy: Retain
      Properties:
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        BillingMode: PAY_PER_REQUEST
```
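To tie the table schema back to the endpoints, here is a similarly hedged sketch of a `todos/get.ts` handler that looks an item up by the `id` hash key taken from the `/todos/{id}` path parameter:

```typescript
// todos/get.ts — a sketch for illustration; the template's real code may differ.
import { APIGatewayProxyHandler } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';

const dynamoDb = new DynamoDB.DocumentClient();

export const get: APIGatewayProxyHandler = async (event) => {
  const id = event.pathParameters?.id;
  if (!id) {
    return { statusCode: 400, body: JSON.stringify({ error: 'missing id' }) };
  }

  // Look the item up by the table's HASH key
  const result = await dynamoDb
    .get({ TableName: process.env.DYNAMODB_TABLE!, Key: { id } })
    .promise();

  if (!result.Item) {
    return { statusCode: 404, body: JSON.stringify({ error: 'not found' }) };
  }
  return { statusCode: 200, body: JSON.stringify(result.Item) };
};
```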
Running on local
As I said earlier, the Serverless Framework comes with a great set of official and community plugins, and there is a plugin that comes in handy for running the function app locally: serverless-offline.
Simply run the command below to add it to the project:

```bash
npm install serverless-offline --save-dev
```

And add the below to `serverless.yaml`:

```yaml
plugins:
  - serverless-offline
```
Then run using `serverless offline`. For the complete file, refer to the serverless.yml in openwishlist/preview.
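By default, serverless-offline serves the emulated API at http://localhost:3000. As a quick sanity check, a hypothetical smoke-test script like the one below (not part of the template; it assumes Node 18+ for the global fetch) can exercise the endpoints:

```typescript
// smoke-test.ts — a hypothetical local check against serverless-offline's default port.
const BASE_URL = 'http://localhost:3000';

async function main() {
  // Create a todo via the POST /todos endpoint
  const created = await fetch(`${BASE_URL}/todos`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: 'try serverless-offline' }),
  }).then((res) => res.json());
  console.log('created:', created);

  // List todos via GET /todos
  const todos = await fetch(`${BASE_URL}/todos`).then((res) => res.json());
  console.log('list:', todos);
}

main().catch(console.error);
```

Run it with `npx ts-node smoke-test.ts` while `serverless offline` is running in another terminal.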
Connecting DynamoDB on local
You can also add serverless-dynamodb-local to run a local `dynamodb` instance during an offline run.

```bash
npm install serverless-dynamodb-local --save-dev
```

And add the below to `serverless.yaml`:

```yaml
plugins:
  - serverless-dynamodb-local
custom:
  dynamodb:
    stages:
      - dev
      - live
    start:
      migrate: true
```
Then run using `serverless offline`, which will start a local DynamoDB and migrate the schema automatically. For the complete file, refer to the serverless.yml in openwishlist/preview.
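One detail to keep in mind: a handler that constructs a plain DocumentClient still points at AWS. A hypothetical helper (named `lib/dynamodb.ts` here purely for illustration; the template may wire this differently) can switch to the local endpoint when running offline:

```typescript
// lib/dynamodb.ts — a hypothetical helper, not part of the template.
import { DynamoDB } from 'aws-sdk';

// serverless-offline sets IS_OFFLINE=true for the handlers it invokes
const isOffline = process.env.IS_OFFLINE === 'true';

export const dynamoDb = isOffline
  ? new DynamoDB.DocumentClient({
      region: 'localhost',
      endpoint: 'http://localhost:8000', // serverless-dynamodb-local default port
    })
  : new DynamoDB.DocumentClient();
```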
Building a scraping service
Finally, let's focus on building a scraping service. Scraping is a cat-and-mouse game: website owners adopt newer and better bot-blocking services, while scrapers keep finding evasions to avoid detection.
The scraping service I built is part of a larger project and is completely open source. Feel free to explore the code and contribute at openwishlist/preview.
Also, you can check out the detailed design of the service below and continue reading this article:
As per the current design, the service depends on a `curl` request and `headless chrome` to get the HTML content of the page. The Lambda function initially tries to get the data with a `curl` request (which almost always gets blocked by top e-commerce stores like Amazon and Walmart) and then falls back to `headless chrome` driven by `puppeteer` to load the page and get the HTML content. The detailed directory tree can be seen below:
```
# tree -I "node_modules|test*|spec"
.
├── README.md
├── config
│   └── default.json
├── jest-dynamodb-config.js
├── jest.config.ts
├── layers
│   └── chromium-v109.0.6-layer.zip
├── package-lock.json
├── package.json
├── renovate.json
├── serverless.yml
├── src
│   ├── api
│   │   └── preview
│   │       └── get.ts
│   ├── lib
│   │   ├── browser.ts
│   │   ├── parsers
│   │   │   ├── amazon.ts
│   │   │   ├── manual.ts
│   │   │   ├── parser.ts
│   │   │   └── schema.ts
│   │   └── scraper.ts
│   ├── models
│   │   └── result.ts
│   └── utils
│       └── common.ts
├── tsconfig.json
└── tslint.json
```
- The `src/api/preview/get.ts` file is the primary API request handler, mapped to https://preview.api.play.adapttive.com/preview/<base64_encoded_url>
- The `src/lib/scraper.ts` file holds all the logic to load the request URL and grab the HTML content using the most efficient parser from `src/lib/parsers`. `scraper.ts` tries to load the URL with an `axios` `GET` request; on error, or in case it gets blocked, it falls back to `src/lib/browser.ts` (see the sketch after this list).
- `browser.ts`: Running an `axios` request on AWS Lambda is a pretty easy job, but launching a headless Chrome takes a lot of effort and experimentation, as there are a lot of blockers:
  - AWS Lambda package size limit (~50MB): this has been resolved using a Lambda Layer for Chromium. The `layers/chromium-v109.0.6-layer.zip` is pushed to Lambda layers during the build and linked to the Lambda function. The mapping is done in `serverless.yaml` as below:

    ```yaml
    layers:
      HeadlessChrome:
        name: HeadlessChrome
        compatibleRuntimes:
          - nodejs16.x
        description: Required for headless chrome
        package:
          artifact: layers/chromium-v109.0.6-layer.zip
    ```

  - Headless Chrome getting detected as a robot browser: there are a lot of firewall rules and security protocols in place to identify the requester of an HTML page as a robot, like the captcha verification used by Cloudflare. I have used a puppeteer library called `puppeteer-extra` with `puppeteer-extra-plugin-stealth` to make the headless browser behave like a real user.
  - Differences between the local and AWS Lambda environments: testing locally can be a little difficult with `headless-chrome`, as your local Chrome can be a different version. So, always verify the local Chrome version against your puppeteer-compatible version. Also, `puppeteer-extra` does not work properly on a Windows development environment.
- `parsers`: I am using the `cheerio` library for parsing and traversing the HTML to get the required data, but the HTML is not the same for all e-commerce stores, which made me create three different types of HTML parsers (a sketch of the `schema.ts` approach follows after this list):
  - `schema.ts`: For most e-commerce stores, the HTML head contains `application/ld+json`, the standard product-data `json` specified by schema.org. `cheerio` finds this `json` and prepares the result from it.
  - `manual.ts`: The manual parser traverses the HTML and uses `head->title` and `head->description` to prepare the result.
  - `amazon.ts`: The Amazon parser works only if the domain matches an applicable Amazon store in any region. It searches the required values from the page data and prepares the result.
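Here is a minimal sketch of the curl-then-browser fallback described above, under the assumption that the puppeteer logic in `src/lib/browser.ts` is exposed through a helper (named `getBrowserHtml` purely for illustration); the real `src/lib/scraper.ts` in openwishlist/preview is more involved:

```typescript
// A sketch of the fallback flow in src/lib/scraper.ts, not the project's exact code.
import axios from 'axios';
import { getBrowserHtml } from './browser'; // hypothetical export name

export async function getHtml(url: string): Promise<string> {
  try {
    // Fast path: a plain GET request, like curl
    const response = await axios.get<string>(url, {
      headers: { 'User-Agent': 'Mozilla/5.0' }, // many stores reject default agents
      timeout: 10_000,
    });
    return response.data;
  } catch (error) {
    // Blocked or failed: fall back to headless Chrome driven by puppeteer
    return getBrowserHtml(url);
  }
}
```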
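And a sketch of the `schema.ts` idea: load the HTML with `cheerio`, find the `application/ld+json` blocks, and pick the schema.org `Product` entity. The `ProductData` shape below is an assumption for illustration, not the project's actual model:

```typescript
// A sketch of the schema.ts approach, assuming cheerio.
import * as cheerio from 'cheerio';

interface ProductData {
  name?: string;
  description?: string;
  image?: string;
  [key: string]: unknown;
}

export function parseSchema(html: string): ProductData | null {
  const $ = cheerio.load(html);
  const blocks = $('script[type="application/ld+json"]');

  for (const element of blocks.toArray()) {
    try {
      const data = JSON.parse($(element).text());
      // ld+json may hold a single object or an array of schema.org entities
      const entities = Array.isArray(data) ? data : [data];
      const product = entities.find((entity) => entity['@type'] === 'Product');
      if (product) return product as ProductData;
    } catch {
      // Ignore malformed JSON blocks and keep scanning
    }
  }
  return null;
}
```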
How to debug?
There could be a lot of scenarios causing the `curl` or `headless-chrome` request to be blocked or fail. You can try the steps below to fix them:

- Run `serverless` locally with verbose logging and verify the error logs:

  ```bash
  serverless offline start --verbose
  ```

- Check puppeteer and Chrome version compatibility on the official puppeteer support page: https://pptr.dev/chromium-support. Both versions should match, or it will fail to work properly.
- Comment out `puppeteer-extra` if used.
- Comment out all `puppeteer-extra` plugins if used.
- If using `puppeteer-extra-plugin-stealth`, disable the stealth evasions one by one to find the one that breaks:

  ```typescript
  const stealth = StealthPlugin();
  // stealth.enabledEvasions.delete('iframe.contentWindow')
  // stealth.enabledEvasions.delete('chrome.app')
  // stealth.enabledEvasions.delete('chrome.csi')
  // stealth.enabledEvasions.delete('chrome.loadTimes')
  // stealth.enabledEvasions.delete('chrome.runtime')
  // stealth.enabledEvasions.delete('defaultArgs')
  // stealth.enabledEvasions.delete('media.codecs')
  // stealth.enabledEvasions.delete('navigator.hardwareConcurrency')
  // stealth.enabledEvasions.delete('navigator.languages')
  // stealth.enabledEvasions.delete('navigator.permissions')
  // stealth.enabledEvasions.delete('navigator.plugins')
  // stealth.enabledEvasions.delete('navigator.webdriver')
  // stealth.enabledEvasions.delete('sourceurl')
  stealth.enabledEvasions.delete('user-agent-override'); // this one was blocking the chrome launch
  // stealth.enabledEvasions.delete('webgl.vendor')
  // stealth.enabledEvasions.delete('window.outerdimensions')
  console.log(stealth.enabledEvasions);
  ```

- Add the `DEBUG` env variable when running the test script to get debug logs on the terminal:

  ```bash
  DEBUG=puppeteer-extra,puppeteer-extra-plugin:* npx ts-node browser.ts
  ```

- Enable the remote-debugging args and inspect via chrome://inspect:

  ```typescript
  const options = {
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    headless: chromium.headless,
    executablePath: await chromium.executablePath(),
    ignoreHTTPSErrors: true,
  };
  options.args.push('--remote-debugging-port=9222');
  options.args.push('--remote-debugging-address=0.0.0.0');
  const browser = await puppeteer.launch(options);
  ```

- Join the Discord for community support at https://extra.community
Deployment
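Once everything works locally, deployment is a single command: run `serverless deploy` (optionally with `--stage <stage>` and `--region <region>`). The Serverless Framework packages the functions, uploads the artifacts (including the Chromium layer), and provisions the API Gateway routes, IAM role, and DynamoDB table defined in serverless.yml. Use `serverless remove` to tear the stack down again.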
References:
- https://github.com/openwishlist/preview
- https://openwishlist.stoplight.io/docs/preview
- https://www.serverless.com/framework/docs/providers/aws/guide/layers
- https://github.com/Sparticuz/chromium
- https://github.com/crespowang/serverless-lambda-chrome
- https://crespowang.medium.com/create-pdf-using-chromium-puppeteer-in-serverless-aws-lambda-685906df62d5
- https://levelup.gitconnected.com/serverless-lambda-layers-d8f8374404e3
- https://www.zenrows.com/blog/puppeteer-avoid-detection
- https://pptr.dev
- https://www.hackages.io/video-tutorials/writing-cloud-platform-independent-lambda-functions-with-serverless-framework