Huawei CBS V500R005C30 Active-Active Disaster Recovery Solution www.huawei.com
HUAWEI TECHNOLOGIES CO., LTD.
Contents
A-A DR Architecture
Deployment
Data Replication
Solution and RTO for different scenarios in case disaster strikes
Data loss
Impact on Performance and Dimensioning
Active-Active DR Architecture – High Level
[Figure: high-level architecture of Site A (Main Site) and Site B. Site A shows BMP, BMPDB, USRDB, CDRDB, BillDB, CBP1 and CBP2 (each with GMDB), CBPAdapter 1 (with GMDB), the SEE nodes, FEP/GFEP and USAU1. Site B shows the corresponding elements with CBP3, CBP4, CBPAdapter 2 and USAU2. The core network (MSC/STP, GGSN/PGW, SMSC, MMSC) connects over CAP/MAP/INAP, DCC, EMPP and SMPP+. Legend: active app, standby app; single-node, two-node, peer-to-peer cluster; production business flow, DR business flow, data replication.]
If SMSC/MMSC sends DCC messages, then SMSC/MMSC talks to CBPAdapter directly.
Active-Active DR Architecture – Low Level
See next page for the low-level architecture.
Notes:
1. CRM/ESB is assumed to send requests to BMP Cluster 1 in Site A via BMPGateway1.
2. BMP Cluster 1 is active while BMP Cluster 2 is standby.
3. CBP and CBPAdapter run in active-active mode: each site has 100% capacity and carries 50% of the production traffic.
4. OCG (SEE) runs in all-active mode.
Scenarios:
1. Operation & Management
a. BMP of Site A writes the updates to the physical databases (BMPDB, SYSDB, USRDB, etc.). When the transaction is committed, the updates are synchronized to Site B by the replication mechanism of the physical database.
b. At the same time, the updates are synchronized from the physical database of Site A to the GMDB of both Site A and Site B.
2. Calling, Data Usage & Other Services
a. When the CBP updates the GMDB and the updates do not need to be written to the physical database, they are synchronized from the GMDB of Site A to the GMDB of Site B, or vice versa, depending on which site receives and processes the traffic.
b. When the CBP updates the GMDB and the updates need to be written to the physical database, they are first synchronized to the physical database of Site A and then written to Site B, no matter which site processes the traffic.
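The replication rules in the two scenarios above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name, the `"Store@Site"` labels and the parameters are hypothetical and not part of the CBS product. It simply encodes the three cases described in the notes.

```python
from typing import List


def replication_path(source: str, processing_site: str = "A",
                     needs_physical_db: bool = False) -> List[str]:
    """Return the stores an update reaches, following the rules above.

    All names and labels are hypothetical; this is not a product API.
    """
    if processing_site not in ("A", "B"):
        raise ValueError("processing_site must be 'A' or 'B'")
    other_site = "B" if processing_site == "A" else "A"

    if source == "BMP":
        # Scenario 1: BMP writes to the Site A physical DB, the commit is
        # replicated to Site B, and both GMDBs are refreshed from Site A
        # (the slides say this happens "at the same time").
        return ["PhysicalDB@A", "PhysicalDB@B", "GMDB@A", "GMDB@B"]
    if source == "CBP" and not needs_physical_db:
        # Scenario 2a: GMDB-only update, replicated to the peer GMDB.
        return [f"GMDB@{processing_site}", f"GMDB@{other_site}"]
    if source == "CBP":
        # Scenario 2b: the physical DB of Site A is always written first.
        return [f"GMDB@{processing_site}", "PhysicalDB@A", "PhysicalDB@B"]
    raise ValueError(f"unknown source: {source}")


print(replication_path("CBP", processing_site="B"))
print(replication_path("CBP", processing_site="B", needs_physical_db=True))
```

Note that in case 2b the path does not depend on the processing site beyond the local GMDB, which mirrors the statement that the Site A physical database is written first in all cases.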
Acronyms
BMP: Business Management Point. In the Huawei OCS solution, the BMP implements service operation management and provides system management, product management, offer management, resource management, customer management and customer service management. An external CRM/CC system, which provides the GUI for subscribers and telecom operators, can invoke these functions.
CBP: Convergent Billing Point. In the Huawei OCS solution, the CBP implements rating, charging and accounting functions and supports online charging. For online charging, after receiving the charging and authentication request from the OCG, the CBP performs budget and account reservation for the conversation duration and indicates the conversation duration to the OCG. After receiving the charge deduction request from the OCG, the CBP deducts the expense in real time.
GMDB: General Memory Database. A Huawei-developed relational database system designed around physical memory and database industry standards. The GMDB is used by applications that require high-performance database access and real-time processing.
OCG: Online Control and Charging Gateway. The OCG provides online call and charging control and routing services.
SEE: Service Execution Environment
FEP/GFEP: General Front End Processor
USAU: Universal Signaling Access Unit
[Figure: low-level architecture of Site A (Main Site) and Site B. Elements shown: BMPGateway, BMP1 (Site A) and BMP3 (Site B), BMPDB, USRDB, CDRDB, BillDB, CBP1 to CBP4 (each with GMDB), CBPAdapter 1 and 2 (each with GMDB), SEE, Invoicing, Mediation, BillMgmt, Report, I2000, FEP/GFEP, USAU1 and USAU2. Core network: MSC/STP, GGSN, SMSC and MMSC over CAP/MAP/INAP, DCC, EMPP and SMPP+. Legend: w = write, r = read; active app, standby app; single-node, two-node, peer-to-peer cluster; production business flow, DR business flow, data replication.]
If SMSC/MMSC sends DCC messages, then SMSC/MMSC talks to CBPAdapter directly.
DR Deployment Scheme - 1/2
NE Type | Mode | Site 1 (Main Site) | Site 2 | Auto/Manual | Remark
CBPAdapter (incl. GMDB) | A-A | Two-node cluster (100% capacity, 50% production traffic) | Two-node cluster (100% capacity, 50% production traffic) | Both | GMDB in Site 1/2 is divided into two schemes, and each scheme adopts one-way data replication.
CBP (incl. GMDB) | A-A | Two-node cluster (100% capacity, 50% production traffic) | Two-node cluster (100% capacity, 50% production traffic) | Both | GMDB in Site 1/2 is divided into two schemes, and each scheme adopts one-way data replication.
USAU | A-A | Two-node cluster (100% capacity, 50% production traffic) | Two-node cluster (100% capacity, 50% production traffic) | Both | No data replication
FEP/GFEP | A-A | Two-node cluster (100% capacity, 50% production traffic) | Two-node cluster (100% capacity, 50% production traffic) | Both | No data replication. 1. Depends on whether SMSC/MMSC support polling mode.
SEE (i.e. OCG) | A-A | N+1 cluster (load balancing, 100% capacity, 50% production traffic) | N+1 cluster (load balancing, 100% capacity, 50% production traffic) | Both | If N=1, Site 1/2 is deployed with 1+1 boards; if N>=2, Site 1/2 is deployed with N boards. (N is enough because N provides 100% capacity while only 50% of the traffic is handled.)
BMPGateway (SLB) | A-S | Two-node cluster (100% capacity, 100% production traffic) | Single-node system (100% capacity, 0% production traffic) | Both | No data replication
BMP | A-S | N+1 cluster (100% capacity, 100% production traffic) | N+1 cluster (100% capacity, 0% production traffic) | Both | Including UPC/GL/AR/DC/CDRQuery
BMPDB (SYSDB) | A-S | Two-node cluster (100% capacity, 100% production read traffic, 100% production write traffic) | Single-node system (100% capacity, 0% production read traffic, 0% production write traffic) | Both | Oracle Active Data Guard is used for data replication.
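The SEE (OCG) board rule from the remark above is easy to misread, so here is a minimal sketch of it. The function name is hypothetical; the rule itself (1+1 boards when N=1, N boards when N>=2) is taken from the table.

```python
def see_boards_per_site(n: int) -> int:
    """Boards deployed per site for an SEE cluster dimensioned for N boards.

    Hypothetical helper: N=1 is deployed as 1+1 (two boards), and N>=2 is
    deployed as N boards, because N boards provide 100% capacity while each
    site carries only 50% of the production traffic.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 if n == 1 else n


for n in (1, 2, 5):
    print(n, see_boards_per_site(n))
```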
DR Deployment Scheme - 2/2 (End)
NE Type | Mode | Site 1 (Main Site) | Site 2 | Auto/Manual | Remark
USRDB | A-S | Two-node cluster (100% capacity, 100% production read traffic, 100% production write traffic) | Single-node system (100% capacity, 0% production traffic) | Both |
CDRDB, BillDB | A-S | Two-node cluster (100% capacity, 100% production traffic) | Single-node (100% capacity, 0% production traffic) | Both |
SDU | | | | | Currently SDU is deployed on the same board as USRDB, so its mode and deployment are the same as for USRDB. In future, SDU will be replaced by a memory DB (OMDB), and its mode and deployment will be the same as for SEE.
Invoicing | A-S | Two-node cluster (100% capacity, 100% production traffic) | Single-node (100% capacity, 0% production traffic) | Both |
Bill Management | A-S | Single-node (100% capacity, 100% production traffic) | Single-node (100% capacity, 0% production traffic) | Both |
Report | A-S | Single-node (100% capacity, 100% production traffic) | Single-node (100% capacity, 0% production traffic) | Both | By default, DR is not supported/suggested. In case of DR, only reports are replicated; the source files are not replicated to the DR site.
Mediation | A-S | Single-node (100% capacity, 50% production traffic) | Single-node (100% capacity, 50% production traffic) | Both | No data/file replication
I2000 | A-S (Optional) | Two-node cluster or single-node (100% capacity, 100% production traffic) | Single-node (100% capacity, 0% production traffic) | Both | If I2000 is deployed in the main site as a dual-node cluster, DR is not supported. If I2000 is deployed in the main site as a single node, DR can be supported by deploying a single-node I2000 in the DR site.
Oracle GoldenGate Data Replication
[Figure: Site A (Main Site) holds the source Oracle DB with its online redo logs and archived redo log files; Site B holds the target Oracle DB. OGG operations: 1 Extract, 2.1 Send trail files, 2.2 Receive trail files (over the network), 3 Apply.]
1. The Oracle GoldenGate (OGG) instance of the source DB extracts data from the redo log and archive log and writes the data into a local trail file.
2. OGG sends the trail file generated at the source DB to the target DB.
3. The OGG instance of the target DB reads the trail file and applies its content to the target DB to synchronize the data.
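The three steps above (extract, send, apply) can be modelled with a toy pipeline. This is a conceptual sketch in plain Python, not the GoldenGate API: the `Change` record, the function names and the in-memory "trail" lists are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, str]  # hypothetical change record: (row key, new value)


def extract(redo_log: List[Change], start: int) -> List[Change]:
    """Step 1: read changes from the redo log into a local trail."""
    return list(redo_log[start:])


def send(local_trail: List[Change]) -> List[Change]:
    """Step 2: ship the trail to the target site (modelled as a copy)."""
    return list(local_trail)


def apply_trail(remote_trail: List[Change], target_db: Dict[str, str]) -> None:
    """Step 3: replay the trail, in order, on the target database."""
    for key, value in remote_trail:
        target_db[key] = value


redo_log = [
    ("sub:1001", "balance=50"),
    ("sub:1002", "balance=20"),
    ("sub:1001", "balance=45"),
]
target_db: Dict[str, str] = {}
apply_trail(send(extract(redo_log, 0)), target_db)
print(target_db)
```

Replaying the changes in log order is what lets the target converge on the latest value for each row, as the later change to `sub:1001` shows.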
Data Replication for BMPDB and USRDB
[Figure: BMP 1 in Site A (Main Site) and BMP 3 in Site B, each with a cache. BMPDB and USRDB in Site A are in read-write status; BMPDB and USRDB in Site B are in read-only status. CBP1 to CBP4 (each with GMDB) and the SEE nodes hold caches that read from the databases. Legend: w = write, r = read; active app, standby app; single-node, two-node, peer-to-peer cluster; production business flow, DR business flow, data replication.]
For BMPDB/USRDB, Huawei uses Oracle DB, and the replication solution is Oracle GoldenGate. The license fee needs to be considered. BMPGateway + BMP + BMPDB are deployed in one DR switch group.
Data Replication for USRDB – Low Level: Asynchronous Replication
[Figure: Site A (active) and Site B (active). 1. The application updates the local USRDB. 2. The replication engine sends the change to the other site. 3. The USRDB of the other site applies the update (remote change).]
To meet the high performance requirement of real-time rating and charging, Huawei provides asynchronous replication. The related applications include BMPAPP and CBPAPP.
Data Replication for GMDB – High Level
[Figure: GMDB replication between CBP1 and CBP2 with CBPAdapter 1 on one side and CBP3 and CBP4 with CBPAdapter 2 on the other; every CBP and CBPAdapter has its own GMDB.]
The memory DB (GMDB) is developed by Huawei and supports high-performance service processing. For performance reasons, its data replication also adopts asynchronous replication.
Data Replication for GMDB – Low Level: Asynchronous Replication
[Figure: Site A (active) and Site B (active). The GMDB in Site A holds Scheme 1a and Scheme 2b; the GMDB in Site B holds Scheme 1b and Scheme 2a. 1. The application updates the local GMDB. 2. The replication engine sends the change. 3. The GMDB of the other site applies the update (remote change).]
To meet the high performance requirement of real-time rating and charging, Huawei provides asynchronous replication. The related applications include CBPAPP and CBPAdapter.
Routing
The CBPAdapter (GMDB) holds a routing table, so it knows which CBP to route each request to. The routing tables in both sites are identical and hold the full routing data, so when one CBP is down, CBPAdapter can route the request to the other CBP.
CBPAdapter first checks the discrete-number routing table; if the MSISDN is not found there, segment-based routing is applied.
[Figure: CBPAdapter 1 and CBPAdapter 2 (each with GMDB) route number segments such as 135* and 138* to CBP1 to CBP4 (each with GMDB).]
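The two-stage lookup described in this section (discrete numbers first, then number segments, with rerouting when a CBP is down) can be sketched as follows. The table contents are invented for illustration: the slides show the segments 135* and 138* but not which CBP serves them, and only the CBP2 to CBP4 pairing is stated elsewhere in the deck; the other pairings and all function and variable names are assumptions.

```python
from typing import AbstractSet, Dict

# Hypothetical routing data; the real tables live in the CBPAdapter GMDB.
DISCRETE_ROUTES: Dict[str, str] = {"13800001111": "CBP1"}
SEGMENT_ROUTES: Dict[str, str] = {"135": "CBP1", "138": "CBP2"}
# Peer in the other site that takes over when a CBP is faulty (assumed pairs,
# except CBP2 -> CBP4, which is the example used in Scenario 1).
PEER_CBP: Dict[str, str] = {"CBP1": "CBP3", "CBP2": "CBP4",
                            "CBP3": "CBP1", "CBP4": "CBP2"}


def route(msisdn: str, faulty: AbstractSet[str] = frozenset()) -> str:
    """Pick the CBP for an MSISDN: discrete table first, then segments."""
    cbp = DISCRETE_ROUTES.get(msisdn)
    if cbp is None:
        # Segment-based routing: match on the number prefix.
        cbp = next((target for prefix, target in SEGMENT_ROUTES.items()
                    if msisdn.startswith(prefix)), None)
    if cbp is None:
        raise LookupError(f"no route for {msisdn}")
    if cbp in faulty:
        # Both sites keep the full routing data, so the peer can serve it.
        cbp = PEER_CBP[cbp]
    return cbp


print(route("13800001111"))            # discrete entry wins over segment 138
print(route("13812345678"))            # falls back to the 138 segment
print(route("13812345678", {"CBP2"}))  # rerouted to the peer CBP
```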
Virtual GT of OCG(SEE)
The SEE clusters in both sites share the same virtual GT (GT3). The SEE cluster in Site A has a real GT (GT1), and the SEE cluster in Site B has a real GT (GT2).
STP is assumed to support polling.
USAU1 in Site A talks to the SEE cluster in Site A only, while USAU2 in Site B talks to the SEE cluster in Site B only.
[Figure: message flow from MSC via STP to USAU1 or USAU2 and on to SEE. The IDP is addressed to the virtual GT (DGT=GT3). The SEE in Site A answers with RRBE (OGT=GT1) and receives the subsequent ERB (DGT=GT1); the SEE in Site B answers with RRBE (OGT=GT2).]
Scenario 1: When CBP (APP2) is down
[Figure: Site A (Main Site) with BMP 1, BMPDB, USRDB, CDRDB, BillDB, CBP1, CBP2, CBPAdapter 1 and SEE; Site B with BMP 3, BMPDB, USRDB, CDRDB, BillDB, CBP3, CBP4, CBPAdapter 2 and SEE. The connection to the failed CBP is stopped and a connection to its peer is applied. Legend: w = write, r = read; connection stopped, connection applied; production business flow, DR business flow, data replication.]
1. DR software monitors the links between CBPAgent and CBP.
2. When DR software detects that CBP2 is down, it notifies CBPAgent to change the routing, that is, to talk to CBP4 in Site B instead of CBP2 in Site A. The change can be done automatically or manually.
Scenario 1: CBP GDR switchover flow
GDR switchover duration: 3s
RTO/Downtime: 3s
[Figure: time schedule. Single node failure, disaster detection, system switch / dual cluster switch (less than about 3s), service takeover.]
Disaster detection: in case of manual switch, the time depends on the decision mechanism; in case of automatic switch, it is configurable (about 10 minutes).
The GDR software:
1. Checks the CBP data replication link; if it is not stopped, stops it.
2. Sets the CBP2 status to faulty in the system definition table in BMPDB (SYSDB) and updates the status of CBP2 in the cache of the other, normal CBPs and of CBPAdapter. CBPAdapter then changes the routing automatically.
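The two GDR steps above can be sketched as a small state update. Everything here is a hypothetical model for illustration (class names, fields and the string status values are assumptions); the slides do not describe the real GDR interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """A CBP or CBPAdapter with its local cache of element statuses."""
    name: str
    status_cache: Dict[str, str] = field(default_factory=dict)


@dataclass
class GdrContext:
    """Stand-in for the replication link and the SYSDB definition table."""
    replication_link_up: bool = True
    system_definition: Dict[str, str] = field(default_factory=dict)


def handle_cbp_failure(failed: str, ctx: GdrContext,
                       elements: List[Element]) -> None:
    # Step 1: stop the data replication link if it is still running.
    if ctx.replication_link_up:
        ctx.replication_link_up = False
    # Step 2: mark the CBP as faulty in the system definition table, then
    # push the new status into the cache of every other element so that
    # CBPAdapter stops routing to the failed CBP.
    ctx.system_definition[failed] = "faulty"
    for element in elements:
        if element.name != failed:
            element.status_cache[failed] = "faulty"


ctx = GdrContext()
elements = [Element("CBP1"), Element("CBP2"), Element("CBPAdapter1")]
handle_cbp_failure("CBP2", ctx, elements)
print(ctx.system_definition, elements[2].status_cache)
```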
Scenario 2: When CBPAdapter1 is down
[Figure: Site A (Main Site) with CBP1, CBP2 (each with GMDB), CBPAdapter 1 (with GMDB), SEE and FEP/GFEP; Site B with CBP3, CBP4, CBPAdapter 2, SEE and FEP/GFEP. Core network: MSC/STP, GGSN, SMSC and MMSC over DCC, EMPP and SMPP+. Legend: connection stopped, connection applied; production business flow, DR business flow, data replication.]
Scenario 2: CBPAdapter rerouting flow
GDR switchover duration: 10s
RTO/Downtime: 10s
[Figure: time schedule. Single node failure, disaster detection, system switch / dual cluster switch, rerouting (less than about 10s), service takeover.]
Disaster detection: in case of manual switch, the time depends on the decision mechanism; in case of automatic switch, it is configurable (about 10 minutes).
The GDR software:
1. Checks the CBPAdapter data replication link; if it is not stopped, stops it.
2. Sets the CBPAdapter1 status to faulty in the system definition table in BMPDB (SYSDB) and updates the status of CBPAdapter1 in the cache of every CBP and SEE. SEE then changes the routing automatically: it marks CBPAdapter1 as faulty and no longer sends requests to it.
Scenario 3.1.1: When OCG (SEE) is down (i.e. number of faulty SEE <= 2), no need to switch
[Figure: Site A (Main Site) with BMP 1, CBPAdapter (with GMDB), the SEE nodes and USAU1; Site B with BMP 3, CBPAdapter (with GMDB), the SEE nodes and USAU2. MSC/STP sends 50% of the traffic to each site. Legend: connection stopped, connection applied.]
Automatic switchover decision mechanism:
1. As long as the number of faulty SEE nodes does not exceed X, the system does not need to switch.
2. X is configurable; generally it is configured as <= 50% of the number of SEE nodes of Site A.
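The decision rule can be written down in a few lines. This is a sketch under assumptions: the function name and the `threshold_ratio` parameter are hypothetical, X is taken at its upper limit of 50% of the nodes (rounded down), and a site with four SEE nodes is assumed so that the result matches the "<= 2" and "> 2" cases used in the scenario titles.

```python
import math


def needs_site_switch(faulty_nodes: int, total_nodes: int,
                      threshold_ratio: float = 0.5) -> bool:
    """Return True when the number of faulty nodes exceeds the threshold X.

    X is modelled as floor(threshold_ratio * total_nodes); the deck only
    says that X is configurable and generally at most 50% of the nodes.
    """
    if total_nodes < 1 or not 0 <= faulty_nodes <= total_nodes:
        raise ValueError("invalid node counts")
    x = math.floor(threshold_ratio * total_nodes)
    return faulty_nodes > x


# Assumed example: four SEE nodes in Site A, so X = 2.
print(needs_site_switch(2, 4))  # False: remaining nodes carry the traffic
print(needs_site_switch(3, 4))  # True: SEE cluster and USAU1 switch jointly
```

The same rule is applied to BMP nodes in Scenario 4.1, with the BMP node count in place of the SEE node count.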
Scenario 3.1.2: When OCG (SEE) is down (i.e. number of faulty SEE > 2), SEE Cluster + USAU1 will switch jointly
[Figure: Site A (Main Site) with BMP 1, CBPAdapter (with GMDB), the SEE nodes and USAU1; Site B with BMP 3, CBPAdapter (with GMDB), the SEE nodes and USAU2. MSC/STP sends 50% of the traffic to each site. Legend: connection stopped, connection applied.]
USAU + OCG (SEE) are deployed in one DR switch group.
Automatic switchover decision mechanism:
1. As long as the number of faulty SEE nodes does not exceed X, the system does not need to switch; once it exceeds X, the switch is triggered.
2. X is configurable; generally it is configured as <= 50% of the number of SEE nodes of Site A.
Scenario 3.1.2: OCG (SEE) + USAU GDR switchover flow
GDR switchover duration: <13s
[Figure: time schedule. Single node failure, disaster detection, system switch (less than about 3s), signaling/IP links takeover (less than about 10s), system recover.]
Disaster detection: in case of manual switch, the time depends on the decision mechanism; in case of automatic switch, it is configurable (about 10 minutes).
System switch: 1. Notify BMP that SEE is down.
Signaling takeover: depending on the STP's capability, the STP needs to send charging requests to the USAU in Site B, that is, STP-USAU-SEE-CBPAdapter.
Scenario 4.1: When BMP1 in BMP Cluster 1 is down (i.e. number of faulty BMP <= 1), no need to switch
Suppose there are multiple BMPs (e.g. BMP1 to BMP3) in BMP Cluster 1 in the main site and only BMP1 is down. BMP2 and BMP3 can then take over the services, and a GDR switch is not required. The mechanism is similar to that of SEE.
[Figure: Site A (Main Site) and Site B with BMPGateway, BMP 1, BMP 3, BMPDB, USRDB, CDRDB, BillDB, CBP1 to CBP4 (each with GMDB), CBPAdapter 1 and 2 (each with GMDB) and SEE. Legend: w = write, r = read; production business flow, DR business flow, data replication.]
Switchover decision mechanism:
1. As long as the number of faulty BMP nodes does not exceed X, the system does not need to switch.
2. X is configurable; generally it is configured as <= 50% of the number of BMP nodes of Site A.
Scenario 4.2: When BMP Cluster 1 is down, BMP Cluster 1 + BMPDB1 + BMPGateway will switch jointly
Because BMPDB (SYSDB) holds a complete data set and the BMP at each site talks only to the BMPDB (SYSDB) of that site, BMP Cluster 1 and BMPDB (SYSDB) need to switch jointly. BMPGateway is also included in the DR switch group.
[Figure: Site A (Main Site) and Site B with BMPGateway, BMP 1, BMP 3, BMPDB, USRDB, CDRDB, BillDB, CBP1 to CBP4 (each with GMDB), CBPAdapter 1 and 2 (each with GMDB) and SEE. Legend: w = write, r = read; connection stopped, connection applied; production business flow, DR business flow, data replication.]
BMP Cluster 1 + BMPDB1 + BMPGateway are deployed in one DR switch group.
Scenario 4.3: When BMPDB1 is down, BMP Cluster 1 + BMPDB1 + BMPGateway will switch jointly
Because BMPDB (SYSDB) holds a complete data set and the BMP at each site talks only to the BMPDB (SYSDB) of that site, BMP Cluster 1 and BMPDB (SYSDB) need to switch jointly. BMPGateway is also included in the DR switch group.
[Figure: Site A (Main Site) and Site B with BMPGateway, BMP 1, BMP 3, BMPDB, USRDB, CDRDB, BillDB, CBP1 to CBP4 (each with GMDB), CBPAdapter 1 and 2 (each with GMDB) and SEE. Legend: w = write, r = read; connection stopped, connection applied; production business flow, DR business flow, data replication.]
BMP Cluster 1 + BMPDB1 + BMPGateway are deployed in one DR switch group.
Scenario 4.4: When BMPGateway is down, BMP Cluster 1 + BMPDB1 + BMPGateway will switch jointly
CRM/ESB is assumed to send all requests to BMP Cluster 1 in Site A via BMPGateway1. When BMPGateway is down, CRM/ESB needs to change the BMPGateway IP address to that of the BMPGateway of Site B.
[Figure: CRM/ESB sends 100% of its traffic through BMPGateway to BMP Cluster 1 and BMPDB in Site A (Main Site), or to BMP Cluster 2 and BMPDB in Site B after the switch. Legend: w = write, r = read; connection stopped, connection applied; production business flow, DR business flow, data replication.]
BMP Cluster 1 + BMPDB1 + BMPGateway are deployed in one DR switch group.
Scenario 4: BMP Cluster 1 + BMPDB1 + BMPGateway GDR switchover flow
GDR switchover duration: 3m-8m
[Figure: time schedule. Single node failure, disaster detection, system switch / dual cluster switch (less than about 3m), peripheral element switch (less than about 5m), service takeover, system recover.]
Disaster detection: in case of manual switch, the time depends on the decision mechanism; in case of automatic switch, it is configurable (about 10 minutes).
System switch (less than about 3m):
1. Start the application in the DR BMPAPP.
2. Oracle takes over (1-3m).
3. The GDR software notifies the CBPs of Site A to change the BMP Cluster IP address to that of the BMP Cluster of Site B (automatic), and notifies the OCGs of Site A to change the BMP Cluster IP address to that of the BMP Cluster of Site B.
Peripheral element switch (less than about 5m): CRM/ESB needs to change the BMPGateway IP address to that of the BMPGateway of Site B (manual configuration, depending on the CRM/ESB's capability).
Summary Table of Switchover Duration of Different Scenarios
Scenario | Switchover Duration | Remark
Scenario 1: When CBP (APP2) is down | 3s |
Scenario 2: When CBPAdapter1 is down | 10s |
Scenario 3.1.1: When OCG (SEE) is down (<= X) | 0s | No need to switch
Scenario 3.1.2: When OCG (SEE) is down (> X) | 13s |
Scenario 4: BMP Cluster + BMPDB + BMPGateway GDR switchover flow | 3m-8m | CRM/ESB needs to change the BMPGateway IP address to that of the BMPGateway of Site B. The duration depends on the CRM/ESB's capability; we assume it can be finished in 5m.
Data Loss - Oracle Physical DB
The data is replicated to the disk and to the redundancy node in near real time. The latency depends on network efficiency.
[Figure: the BMP production node (active physical DB, read-write) and the BMP redundancy node each hold a redo log and a disk; the redo log is replicated from the production node to the redundancy node.]
Data loss = Data latency * Network efficiency (WAN: 0.4) * Bandwidth (e.g. 10000 Mbps).
Data latency: < 1-2s, generally less than 100 ms.
During switchover, the redo log is uploaded to the DB, so there is no data loss.
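As a worked example of the data loss formula above, the sketch below plugs in the figures quoted on the slide (WAN efficiency 0.4, bandwidth 10000 Mbps, latency of 100 ms or 2 s). The function name and the conversion from megabits to megabytes (division by 8) are additions for illustration; the result is an estimate of the data in flight, not a measured value.

```python
def estimated_data_in_flight_mb(latency_s: float, bandwidth_mbps: float,
                                network_efficiency: float = 0.4) -> float:
    """Data loss = latency * network efficiency * bandwidth, in megabytes."""
    megabits = latency_s * network_efficiency * bandwidth_mbps
    return megabits / 8  # 8 bits per byte


# Typical case from the slide: latency of 100 ms on a 10000 Mbps link.
print(round(estimated_data_in_flight_mb(0.1, 10_000), 1))  # about 50 MB
# Upper end of the quoted latency range: 2 s.
print(round(estimated_data_in_flight_mb(2.0, 10_000), 1))  # about 1000 MB
```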
Data Loss - GMDB (i.e. CBP memory database)
Every 1s or every 1M bytes of data, the buffer sends its data to the standby host / disk / redundancy node. The copy speed depends on network efficiency.
[Figure: the CBP/CBPAdapter production node (active GMDB and standby) and the CBP/CBPAdapter redundancy node (standby) each hold a buffer (log) and a disk.]
Data loss = GMDB to buffer (< 1MB) + buffer to redundancy node.
Buffer to redundancy node = Data latency * Network efficiency (WAN: 0.4) * Bandwidth (e.g. 10000 Mbps).
Data latency: < 1-2s, generally less than 100 ms.
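The "every 1s or 1M bytes" rule above is a common size-or-age flush policy. The following sketch shows one way such a buffer could behave; it is not the GMDB implementation. The class and method names are hypothetical, 1M bytes is taken as 1,000,000 bytes, and time is passed in explicitly so that the example is deterministic.

```python
from typing import List


class ReplicationBuffer:
    """Collects changes and ships them when a size or age limit is reached."""

    def __init__(self, max_bytes: int = 1_000_000,
                 max_age_s: float = 1.0) -> None:
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.pending: List[bytes] = []
        self.pending_bytes = 0
        self.oldest_ts = 0.0
        self.shipped: List[bytes] = []  # batches sent to the redundancy node

    def write(self, payload: bytes, now: float) -> None:
        """Buffer one change, then check whether a flush is due."""
        if not self.pending:
            self.oldest_ts = now
        self.pending.append(payload)
        self.pending_bytes += len(payload)
        self.poll(now)

    def poll(self, now: float) -> None:
        """Flush when either the size or the age threshold is reached."""
        if not self.pending:
            return
        too_big = self.pending_bytes >= self.max_bytes
        too_old = now - self.oldest_ts >= self.max_age_s
        if too_big or too_old:
            self.shipped.append(b"".join(self.pending))
            self.pending.clear()
            self.pending_bytes = 0


buf = ReplicationBuffer(max_bytes=10, max_age_s=1.0)
buf.write(b"abcd", now=0.0)    # below both limits, stays in the buffer
buf.write(b"efghij", now=0.1)  # 10 bytes reached, batch is shipped
buf.write(b"x", now=0.2)
buf.poll(now=1.5)              # older than 1 s, batch is shipped
print(buf.shipped)
```

Whatever is still in `pending` at the moment of a failure corresponds to the "GMDB to buffer (< 1MB)" term of the data loss formula.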
Performance Impact
CBPAdapter: The performance impact of the A-A system is estimated at around 15% compared with a deployment without the DR solution.
CBP: The performance impact of the A-A system is estimated at around 15% compared with a deployment without the DR solution.
OCG (SEE): The performance impact of the A-A system is estimated at around 15% compared with a deployment without the DR solution.
BMP: The performance impact of the A-S system is estimated at around 10% compared with a deployment without the DR solution.
Requirement on Dimensioning
For the A-A DR solution:
1. A license for Oracle GoldenGate is needed for the physical DBs, including BMPDB (SYSDB), USRDB, CDRDB and BillDB.
• This data replication software allows the physical DB in the DR site to be opened read-write while synchronization occurs. Currently, however, the physical DB in the DR site is kept read-only.
• It supports one-way and two-way data replication.
• It provides scheme-based data replication: one DB can be divided into two schemes, and one-way or two-way data replication is adopted for each scheme.
• Huawei CBS 5.5 adopts scheme-based one-way data replication. Currently, however, one DB has just one scheme.
2. The bandwidth needs recalculation.