
Disaster Recovery Plan


Effective from April 1, 2012. Last updated January 28, 2021.


I. Introduction

This website is built, maintained and operated by Hotlink Services Private Limited (the "Company"), a private limited company incorporated under The Indian Companies Act, 1956, having its registered office at 207, Essel House, 10 Asaf Ali Road, New Delhi, Delhi, India 110002 and having its corporate office at Kedar Square, L29/5 DLF Phase 2, Gurgaon, Haryana, India 122002.

This electronic document, referred to as the Disaster Recovery Plan ("DRP"), documents the processes, personnel and procedures to be followed in the event of this website becoming inaccessible to all or a significant proportion of its users.

II. Definitions

  • Individual: A natural human being.
  • Legal entity: Any individual or group of individuals recognized by law and having legal rights and obligations.
  • Person: An individual or legal entity.
  • Website: This website.
  • User: A person who uses this website, including you.
  • Service: Any and all features and functions provided and supported by this website.
  • Service availability: The ability of users to access the service provided by this website without interruptions.
  • Service degradation: A reduction in service availability on one or more of its normal availability parameters.
  • Service disruption: A service degradation characterized by a complete lack of availability of the service.
  • Disaster: A natural or man-made event that causes a serious disruption over a prolonged period of time.

III. Scope

The scope of the DRP is to address recovery from an event that causes major service disruption over a prolonged period of time. This document outlines the various criteria for invoking the DRP, the processes and procedures for carrying it out successfully and the personnel responsible for DRP activities.

IV. Service Availability Parameters

A robust DRP is based on objective, measurable parameters about the availability of the service in question. This DRP is based on the following measurable parameters.

a. Uptime

This website has been designed for 99%+ availability, excluding any scheduled downtime for software upgrades, maintenance, patch application, etc. An external website monitoring service checks this website every minute to ascertain whether it is accessible. The outcome of each check is recorded as a simple yes or no. All outcomes for a calendar week (00:00 on Sunday to 23:59 on the following Saturday) are aggregated, and the percentage of yes outcomes among all outcomes is reported as the website availability.
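
As an illustration, the weekly availability figure reduces to a simple percentage over the week's check outcomes. The following Python sketch is a minimal, hypothetical example of that calculation, not the monitoring service's actual code, and the sample data is made up.

    def weekly_availability(outcomes):
        """Percentage of successful checks for one calendar week.

        `outcomes` holds one True/False result per one-minute check,
        from 00:00 on Sunday through 23:59 on the following Saturday.
        """
        if not outcomes:
            return 0.0
        return 100.0 * sum(outcomes) / len(outcomes)

    # A full week of one-minute checks yields 7 * 24 * 60 = 10,080 samples;
    # at the 99% target, roughly 100 failed checks exhaust the weekly budget.
    checks = [True] * 10_000 + [False] * 80
    print(f"{weekly_availability(checks):.2f}%")  # 99.21%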

b. Response time

We strive to serve each webpage in under 4 seconds. Since actual response times vary with the total number of active users on the website at any given point in time, we track average response times to make sure that the majority of user requests are served in under 4 seconds.


A service disruption is assumed if the website becomes inaccessible to more than 10% of the users for over 5 minutes.
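
Expressed as a rule, this disruption threshold is a simple predicate over two measurements. The Python fragment below is an illustrative sketch of that rule only; the function name and inputs are our own for this example.

    def is_service_disruption(inaccessible_fraction, duration_minutes):
        """Disruption rule: more than 10% of users unable to reach the
        website for over 5 continuous minutes."""
        return inaccessible_fraction > 0.10 and duration_minutes > 5

    print(is_service_disruption(0.12, 6))   # True
    print(is_service_disruption(0.12, 3))   # False: too short
    print(is_service_disruption(0.05, 30))  # False: too few users affected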

V. Disaster Prevention

We strive to prevent disasters so that service disruptions are minimized. We have adopted the following steps that help us avoid or reduce disruptions:

  1. Hosting provider: We work with hosting providers that guarantee high availability for their infrastructure. Availability metrics for the provider are reviewed twice a year to ensure that we get 99.9% or higher availability from the provider. We currently use Amazon Web Services as our hosting infrastructure provider.
  2. Redundant infrastructure: All software required to deliver the service is hosted on redundant infrastructure so that common disruptions are avoided. All of our applications make use of redundancy mechanisms such as RAID for storage, clusters for shared-state application servers and load balancers for automatic spreading of user load across the infrastructure.
  3. Independent monitoring: We monitor our hosting infrastructure continuously from outside of the hosting premises to ensure that we get the expected levels of availability; a minimal example of such an external probe is sketched after this list.
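
For illustration, an external availability probe of the kind described in item 3 can be as small as a single HTTP request. The sketch below is a hypothetical Python example; the URL and timeout are placeholders, not our actual monitoring configuration.

    import urllib.request

    SITE_URL = "https://example.com/"  # placeholder; substitute the monitored site
    TIMEOUT_SECONDS = 10               # placeholder timeout

    def check_site():
        """One availability probe: True if the site answers with HTTP 2xx/3xx."""
        try:
            with urllib.request.urlopen(SITE_URL, timeout=TIMEOUT_SECONDS) as resp:
                return 200 <= resp.status < 400
        except Exception:
            return False  # DNS failure, timeout, HTTP 4xx/5xx, etc.

    print("up" if check_site() else "down")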

Our platform availability statistics are published as an easy-to-view chart for the past 52 weeks. Through these statistics, we provide full transparency into platform availability, including any outages experienced on the platform.

VI. Disaster Categories

This website offers critical business functionality to its users, which makes any service disruptions for this website undesirable. Regardless of the amount of planning and preparation put into designing and running computing infrastructure, there are bound to be hardware and/or software failures that lead to a disruption in service. However, the response to a disruption, and the time it takes to recover from it, depend on the nature of the disruption, as some disruptions are easier to recover from than others. We actively plan for and are prepared to respond to the following types of service disruptions.

a. Equipment Outage

A temporary, semi-permanent or permanent unavailability of one or more pieces of computing equipment used for hosting this website could lead to a service outage. Examples of such outages include a fault in a network device, a power failure on a server, a hard disk crash or a server crash.

b. Data Center Outage

Under rare circumstances, a significant portion of the data center(s) used to host this website could become inaccessible. For example, damage to undersea cables sometimes causes loss of Internet connectivity to or from locations serviced by those cables. Such incidents could lead to the data center(s) becoming inaccessible over the Internet until the data center providers can find alternate means to restore access to the equipment hosted there.

c. Location Outage

Under extremely rare circumstances, the physical location at which the data center(s) is(are) located could become inaccessible. This could happen, for example, due to geopolitical or calamitous situations such as civil unrest, government lockdowns, war, epidemic, natural disasters, etc.

VII. Disaster Detection

A disaster will be said to have occurred in one of the following situations lasting for a continuous period of 10 minutes or more (a sketch of the first rule follows the list):

  1. Our monitoring infrastructure detects a service outage, response times more than three times their average values, or more than 25% of user operations failing; or
  2. A majority of users report an outage, very slow response times for user operations or high failure rates for user operations.
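
To make the criteria concrete, the first (monitoring-based) situation can be modelled as a streak detector over per-minute telemetry. This Python sketch is illustrative only; the data layout and threshold encoding are our own assumptions based on the criteria above.

    def minute_is_degraded(up, response_ms, baseline_ms, failure_rate):
        """A minute trips the detector if any criterion from item 1 holds."""
        return (not up) or (response_ms > 3 * baseline_ms) or (failure_rate > 0.25)

    def is_disaster(minutes, baseline_ms, window_minutes=10):
        """Declare a disaster after `window_minutes` consecutive degraded
        minutes. `minutes` is a sequence of (up, response_ms, failure_rate)."""
        streak = 0
        for up, response_ms, failure_rate in minutes:
            if minute_is_degraded(up, response_ms, baseline_ms, failure_rate):
                streak += 1
                if streak >= window_minutes:
                    return True
            else:
                streak = 0
        return False

    # Twelve minutes of slow responses against a 1,000 ms baseline: disaster.
    telemetry = [(True, 3500.0, 0.0)] * 12
    print(is_disaster(telemetry, baseline_ms=1000.0))  # True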

VIII. Disaster Recovery Strategies

Above, we have explained our philosophy of preventing disasters by incorporating practical safeguards in the design and deployment of the platform. However, we do recognize that computing hardware and software are susceptible to failures in unexpected ways. Our teams are prepared to tackle any such unexpected scenarios. The sections below provide details on our internal processes for handling unexpected outages.

a. Equipment Outage

  1. Record the incident in the Incident Management System.
  2. Restart failed hardware/software.
  3. Review monitoring dashboards to ensure normal service health.
  4. Review software service status to ensure service restoration.
  5. Run critical regression tests to ensure normal service function.
  6. Review server and application logs to determine the root cause for the disruption.
  7. Record the data collected from the logs in the Incident Management System.
  8. Deploy mitigation steps to prevent a recurrence of the problem.
  9. Document the mitigation steps in the Incident Management System.
  10. Inform the users about the outage, the corrective actions taken and the mitigation steps deployed.

Equipment restarts usually take no more than 15 minutes.
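
Since the plan names Amazon Web Services as the hosting provider, steps 2-4 could look roughly like the following boto3 sketch for an EC2-hosted server. This is a hypothetical illustration, not our operational tooling; the region and instance ID are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-south-1")  # placeholder region
    INSTANCE_ID = "i-0123456789abcdef0"                  # placeholder instance

    # Step 2: restart the failed server.
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])

    # Steps 3-4: wait until the instance passes its EC2 status checks again,
    # then hand off to dashboard review and regression tests.
    waiter = ec2.get_waiter("instance_status_ok")
    waiter.wait(InstanceIds=[INSTANCE_ID])
    print(f"{INSTANCE_ID} is healthy again")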

b. Local Failover

In some cases, the existing equipment may become permanently inaccessible or unresponsive, and a simple restart may either not be possible or not restore the service to normalcy. In such cases, the services may have to be migrated to new equipment within the same data center, provided the data center remains available.

  1. Record the incident in the Incident Management System.
  2. Restore the application to new equipment from the latest available images/backups/copies.
  3. Restart services on the new equipment.
  4. Review monitoring dashboards to ensure normal service health.
  5. Review software service status to ensure service restoration.
  6. Run critical regression tests to ensure normal service function.
  7. Record the data collected from the logs in the Incident Management System.
  8. If the old equipment becomes available, review hardware and software logs to understand the root cause for the failure.
  9. Contact the hosting provider to investigate the failed equipment, understand the root cause and plan future mitigation.
  10. Document the findings in the Incident Management System.
  11. Inform the users about the outage, the corrective actions taken and the mitigation steps deployed.

A local failover is expected to take about 4 hours.
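
As a rough illustration of step 2 on AWS, a replacement server can be launched from the most recent machine image. The boto3 sketch below is hypothetical; the region, image filter and instance type are placeholders and would differ in practice.

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-south-1")  # placeholder region

    # Step 2: locate the latest available image of the application server.
    images = ec2.describe_images(Owners=["self"])["Images"]
    latest = max(images, key=lambda image: image["CreationDate"])

    # Launch a replacement instance on new equipment in the same region.
    response = ec2.run_instances(
        ImageId=latest["ImageId"],
        InstanceType="t3.large",  # placeholder instance type
        MinCount=1,
        MaxCount=1,
    )
    print("replacement instance:", response["Instances"][0]["InstanceId"])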

c. Remote Failover

In rare cases, the data center may become inaccessible or unavailable for a prolonged period of time. In such cases the application will be migrated to another data center using the same steps as those listed above for local failover.

Remote failovers also take about 4 hours to complete.

IX. Disaster Recovery Personnel

The following personnel are involved with or affected by the disaster recovery plan.

a. Infrastructure team

The infrastructure team is responsible for monitoring the infrastructure and responding to any disaster that may cause service disruption.

b. Support team

The support team is responsible for user communication and for liaising with the infrastructure team: it forwards information received from users to the infrastructure team, keeps the Incident Management System updated, and shares incident updates with the service users.

c. Information Security Manager

If a disaster is caused by an incident that also has security implications, such as a denial-of-service attack, the infrastructure team will keep the Information Security Manager informed at all times, in addition to liaising with the support team.

X. Periodic Reviews

The Infrastructure Lead holds half-yearly reviews with the Company Management to share data about the infrastructure. These reviews include information about any disasters, their impact and the recovery steps taken. The reviews follow the agenda given below:

  • Service availability in the past six months;
  • Availability metrics for the hosting provider in the past six months;
  • Number of service-related incidents encountered in the past six months;
  • Incidents classified as disasters;
  • Disaster recovery steps taken;
  • Disaster prevention steps taken;
  • Recommendations for improving infrastructure availability and robustness.

XI. Change Log

  • January 28, 2021: Changed document visibility to everyone so that all platform users can review this document, without the need for signing up or signing in.
  • October 18, 2020: Reflected change in legal name of the company from INHX Services Private Limited to Hotlink Services Private Limited.
  • June 11, 2018: Changed document visibility from all users to signed-in users only.
  • June 30, 2017: Added link to the platform availability chart under the Disaster Prevention section.
  • June 24, 2017: Added a link to this plan under the Master Agreement.
  • June 10, 2015: Added section on Disaster Prevention.
  • November 21, 2012: Minor corrections in spellings and grammar.
  • October 29, 2012: Added definitions section at the top.
  • April 1, 2012: First version of the plan published.