12/02/2004
France Telecom R&D
Fast convergence project
Nicolas DUBOIS FTR&D/DAC
Benoît FONDEVIOLE FTR&D/DAC
Nicolas MICHEL FTR&D/DAC
- D2 - 12/02/2004France Telecom R&D
Agenda
What is the network context?
How can we improve scheduled maintenance?
How can we improve IS-IS convergence?
Our roadmap
Network context
Services
  Migration of many new services onto IP networks:
    Voice over IP, video over IP
    VPN with SLA (Service Level Agreement)
  New applications are:
    Sensitive in terms of performance
    Critical for the business
Equipment
  Many devices with IP functions:
    Switches (ATM, GE, etc.)
    BAS and NAS (mobility) …
  Many different vendors within France Telecom networks
    Routers: Cisco, Juniper, Redback, Cosine, etc.
Networks
  4 major ASes (3215, 5511, 25186, Equant IGN)
  With various topologies and services
Network context: Hot issues
Problem with L2TP tunnels:
  L2TP is used for Internet aggregation. The IP backbone is fully protected against failures, but during a link or node failure, or during scheduled work, many L2TP sessions are reset. Rerouting must complete in less than 5 s, because the L2TP timer is around 8 s.
Problem with MPLS/VPN:
  Traffic from VPN customers is very sensitive to convergence time because:
    Video applications over intranets are growing fast
    There are many real-time applications: cartography apps, financial apps, etc.
    SNA interconnections with thousands of users
    Deployment of "thin clients" (e.g. Citrix)
  Service convergence should take less than one second
Two improvement areas:
  Improve the scheduled maintenance process so that:
    L2TP sessions get rerouted before the maintenance operations
    MPLS-VPN stability is enhanced (no traffic loss, no BGP session reset)
  Improve IS-IS convergence in the IP backbone (PE-P and P-P) so that:
    Unexpected traffic disruptions become as short as possible
Improve scheduled maintenance (1/2)
Taking scheduled work into account is a priority. Most convergence events are related to scheduled work.
Example 1: A study comparing IS-IS activity during maintenance hours with daytime activity. IS-IS changes are much more frequent during maintenance hours (study over 1 month).
Example 2: A study analyzing all failure and maintenance tickets and their root causes:
  Root cause                      Share
  Maintenance (scheduled work)    76%
  Hardware                        13%
  Software                        9%
  Human errors (operations)       1%
  Unknown                         1%
Improve scheduled maintenance (2/2)
We can anticipate the rerouting for scheduled maintenance. That is of course not possible for an unexpected failure.
Example of an improved engineering rule for a router reload (software upgrade).
Before the reload:
  Step 1: Set the overload bit in the IS-IS configuration
  Step 2: Depending on the router, modify some metrics (next-hop or loopback, etc.)
  Step 3: Wait and verify that the expected rerouting of the traffic has occurred
  Step 4: Reload the router
After the reload:
  Step 5: Don't forget to normalize the router (undo the changes made in steps 1 and 2)
  Step 6: Verify that nominal routing is working again
This is not always efficient and can become very complex.
We need a safe, simple, efficient solution that allows scheduled work without packet loss, in order to reduce operational costs and disruptions.
  Maybe a dedicated reload feature combined with Graceful Restart and Non-Stop Forwarding?
  Maybe other protocol changes in IS-IS and BGP?
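The reload procedure above can be read as an ordered checklist in which the reload must not start until the traffic has been rerouted. The Python sketch below models only that control flow. The step callables are hypothetical stand-ins (nothing here talks to a real router), and stopping at the first failed step is an assumption about how an operator would handle a failed check.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_procedure(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


def demo(traffic_drained: bool) -> Tuple[List[str], Optional[str]]:
    # Hypothetical stubs: each lambda stands in for an operator action or check.
    steps: List[Step] = [
        ("set IS-IS overload bit", lambda: True),
        ("adjust metrics", lambda: True),
        ("verify traffic rerouted", lambda: traffic_drained),
        ("reload router", lambda: True),
        ("normalize router", lambda: True),
        ("verify nominal routing", lambda: True),
    ]
    return run_procedure(steps)


print(demo(traffic_drained=False))
```

With `traffic_drained=False` the run stops at the verification step, so the reload is never attempted. That is the property the manual procedure relies on the operator to enforce.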
Improve the convergence time
[Figure: convergence timeline showing the state of routers and the customer traffic over time. Labels: Break, Detection, Reception, Propagation and trigger, Computation, RIB update, Distributed update, FIB update, with markers T0 to T4. The convergence time runs from the break to the FIB update.]
Analysis of the convergence time:
  1. Detection time
  2. IS-IS flooding and SPF triggering
  3. SPF computation and RIB update
  4. FIB computation and distribution to the line cards
Method to improve the convergence time:
  1. Understand the IS-IS activity
  2. Enhance IS-IS protocol behavior
  3. Improve our test tools
  4. Verify and measure the improvement on the real network
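The four-phase analysis above amounts to splitting one convergence event into consecutive intervals. The sketch below does that split in Python. Treating the five instants T0 to T4 as the boundaries of the four phases is an assumption of this sketch, and the sample timestamps are made up for illustration; they are not measurements.

```python
from typing import Dict, Sequence

PHASES = (
    "detection",
    "flooding and SPF triggering",
    "SPF computation and RIB update",
    "FIB computation and distribution",
)


def phase_durations(timestamps_ms: Sequence[float]) -> Dict[str, float]:
    """Split a convergence event into the four phases.

    timestamps_ms holds five instants T0..T4 in milliseconds, assumed to be
    the boundaries between consecutive phases.
    """
    if len(timestamps_ms) != len(PHASES) + 1:
        raise ValueError("expected five timestamps T0..T4")
    durations: Dict[str, float] = {}
    for i, name in enumerate(PHASES):
        delta = timestamps_ms[i + 1] - timestamps_ms[i]
        if delta < 0:
            raise ValueError("timestamps must be non-decreasing")
        durations[name] = delta
    return durations


def convergence_time(timestamps_ms: Sequence[float]) -> float:
    """Total time from the break (T0) to the end of the FIB update (T4)."""
    return timestamps_ms[-1] - timestamps_ms[0]


# Made-up timestamps, for illustration only.
example = [0.0, 10.0, 25.0, 100.0, 400.0]
print(phase_durations(example))
print(convergence_time(example))
```

Keeping the phases separate makes it easy to see which one dominates before deciding where to optimize.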
Principle for monitoring IS-IS
[Figure: monitoring architecture. Components shown: operational network (PE and P routers), probe, local database, back office with database server and web server (raw format and analyzed data), FT R&D lab network with MRTG replay and Agilent replay, customized audit document.]
Principle: snoop an IS-IS session between two IS-IS routers
  The probe receives and stores all LSPs flooded in the network
  The database server stores and analyzes the LSPs
  The web server displays statistics
Understand the network behavior
Audit network behavior with Sceptre to prove the feasibility of IS-IS optimization:
  Inter-LSP arrival
    [Figures: distribution of the interval between LSPs (ms); distribution of the interval between non-refresh LSPs (ms)]
  Number of SPF and PRC computations per day
  General evolution of IS-IS activity (here over one month)
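The inter-LSP arrival distributions mentioned above can be derived from a list of LSP reception times. The following Python sketch shows one way to do it. It is not the Sceptre tool; the function names, the bin width, and the sample timestamps are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def inter_arrival_ms(arrivals_ms: Sequence[float]) -> List[float]:
    """Intervals between consecutive LSP arrivals, in milliseconds."""
    ordered = sorted(arrivals_ms)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def histogram(intervals_ms: Sequence[float], bin_width_ms: float) -> Dict[float, int]:
    """Count intervals per bin; each key is the lower bound of a bin in ms."""
    counts = Counter(int(iv // bin_width_ms) * bin_width_ms for iv in intervals_ms)
    return dict(sorted(counts.items()))


# Made-up arrival times (ms), for illustration only.
arrivals = [0, 5, 12, 40, 41, 300]
print(histogram(inter_arrival_ms(arrivals), bin_width_ms=10))
```

Filtering out periodic refresh LSPs before calling `inter_arrival_ms` would give the second distribution (non-refresh LSPs only).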
IS-IS optimizations possible today
Optimizing IS-IS reactions to failures
Detection
  No changes possible to the IS-IS hello timers: too dangerous
  POS timer optimization (depending on the layer 2 protection timer)
LSP generation / flooding
  Small back-off LSP generation interval (avoids spacing LSPs too far apart)
SPF and PRC triggering and computation
  Small back-off SPF interval
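The idea behind a small back-off interval can be illustrated with a generic exponential back-off timer: react almost immediately to the first event, then wait progressively longer while events keep arriving, up to a ceiling. The Python sketch below is a generic scheme with hypothetical parameter names and values. It is not a transcription of any vendor's implementation.

```python
from typing import List


def backoff_delays(initial_ms: int, second_ms: int, max_ms: int, events: int) -> List[int]:
    """Delay applied before each of `events` consecutive triggers.

    The first trigger waits initial_ms, the second waits second_ms, and each
    later one doubles the previous delay, capped at max_ms.
    """
    delays: List[int] = []
    for n in range(events):
        if n == 0:
            delay = initial_ms
        elif n == 1:
            delay = second_ms
        else:
            delay = delays[-1] * 2
        delays.append(min(delay, max_ms))
    return delays


# Hypothetical values, chosen only to show the shape of the curve.
print(backoff_delays(initial_ms=1, second_ms=50, max_ms=1000, events=7))
```

A small initial delay keeps the reaction to an isolated failure fast, while the growing delay protects the router's CPU when the network is unstable.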
                                     IS-IS before    IS-IS optimized
Failure detection                    2000 ms         <= 5 ms
Delay before flooding LSP            33 ms           <= 5 ms
Delay before SPF computation         5500 ms         1 ms to max time (avg 20 ms)
SPF computation time & RIB update    40 to 250 ms    40 to 250 ms
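Adding up the rows of the table above gives a rough control-plane budget for each column. The Python sketch below does the arithmetic. It assumes the phases run strictly one after another, uses the quoted 20 ms average for the optimized SPF delay (the table gives no numeric maximum), takes the "<= 5 ms" entries at their upper bound, and leaves out the FIB update, which the table does not cover.

```python
from typing import Dict, Tuple

# Values in ms, copied from the table.
BEFORE: Dict[str, int] = {"detection": 2000, "flooding_delay": 33, "spf_delay": 5500}
OPTIMIZED: Dict[str, int] = {"detection": 5, "flooding_delay": 5, "spf_delay": 20}
SPF_AND_RIB_MS: Tuple[int, int] = (40, 250)  # same range in both columns


def budget(fixed: Dict[str, int], spf_range: Tuple[int, int] = SPF_AND_RIB_MS) -> Tuple[int, int]:
    """Return (low, high) totals in ms for the given fixed delays."""
    base = sum(fixed.values())
    return base + spf_range[0], base + spf_range[1]


print("before:   ", budget(BEFORE))
print("optimized:", budget(OPTIMIZED))
```

Under these assumptions the original settings total roughly 7.6 to 7.8 s, above the 5 s target stated for L2TP, while the optimized settings stay well under one second before the FIB update is counted.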
Tools to measure convergence delay
Lab tests with real hardware are also necessary, and complement control-plane auditing and monitoring:
  To evaluate equipment performance
  To check that new equipment can effectively forward and reroute the traffic
Examples of measurements
  Line card FIB feeding speed can only be measured in the lab
  Evolution of LSP generation mechanisms depending on the software version
  SPF triggering and computation speed
  SPF vs. incremental SPF in realistic network conditions
[Figure: lab test bed. Routers C72VC, C12C, C12F and C12E interconnected by /30 links, with RT for IS-IS and IP traffic and MRT for BGP; a database server and web server (raw format and analyzed data). The database is used to update the IS-IS and BGP simulators.]
IS-IS convergence: FIB update
In some network situations, for some equipment, the main problem is the FIB update:
  This time depends on the failure (ASBR, link in the backbone) and on the number of prefixes that must be modified in each line card.
[Figure: traffic restoration time after a change of output interface without a change of BGP next hop in NC1 (decision based on local pref). X axis: time in ms (0 to 18 000); Y axis: number of updated prefixes (0 to 140 000). T1: link failure; T2: FIB update of the first IPv4 prefix; T3: FIB update of the last IPv4 prefix. FIB update: 25 µs per prefix.]
[Figure: test topology with INTERNET full routing (FR), ASBR1, ASBR2, NR1, NC1, NC2 and the INTERNET customer, showing the nominal and backup paths for the FR. IS-IS convergence and FIB update for the FR.]
The solution is two-stage lookup; Cisco introduced this feature in 12.0.27S.
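The 25 µs per prefix figure quoted on this slide translates directly into a FIB update time once the number of affected prefixes is known. The Python sketch below does the conversion. It assumes a constant per-prefix cost and a strictly serial update, which is a simplification, and the prefix counts in the loop are arbitrary examples.

```python
def fib_update_ms(prefixes: int, per_prefix_us: float = 25.0) -> float:
    """Time in ms to rewrite `prefixes` FIB entries at a constant per-prefix cost."""
    return prefixes * per_prefix_us / 1000.0


# Arbitrary prefix counts, for illustration only.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} prefixes: {fib_update_ms(n):.1f} ms")
```

At this rate a few thousand prefixes cost tens of milliseconds, but a change affecting on the order of 100 000 prefixes takes seconds, which is why the number of prefixes to rewrite matters as much as the control-plane timers.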
Updating a lot of prefixes
But in other failure cases, such as an ASBR failure:
[Figure: established prefixes over time, showing the time to update the FIB for all BGP prefixes (with a new next hop) after an ASBR failure. X axis: time in ms (0 to 30 000); Y axis: number of updated prefixes (0 to 140 000). Fitted line: y = 4.85 x.]
[Figure: test topology with INTERNET full routing (FR), ASBR1, ASBR2, NR1, NC1, NC2 and the INTERNET customer, showing the nominal and backup paths for the FR. BGP convergence and FIB update for the FR.]
What is the hardware or software solution? (Maybe just a better network design? ;-)
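The fitted line y = 4.85 x on this slide's chart can be inverted to get a per-prefix cost and a total update time. The Python sketch below does that, assuming (from the axis labels) that y is the number of established prefixes and x is the time in ms, and that the linear fit holds over the whole range.

```python
SLOPE_PREFIXES_PER_MS = 4.85  # from the fitted line y = 4.85 x


def us_per_prefix(slope: float = SLOPE_PREFIXES_PER_MS) -> float:
    """Average time per prefix in microseconds implied by the slope."""
    return 1000.0 / slope


def time_for_prefixes_ms(prefixes: int, slope: float = SLOPE_PREFIXES_PER_MS) -> float:
    """Time in ms needed to establish `prefixes` prefixes at the fitted rate."""
    return prefixes / slope


print(f"{us_per_prefix():.0f} us per prefix")
print(f"{time_for_prefixes_ms(140_000) / 1000:.1f} s for 140 000 prefixes")
```

This works out to about 206 µs per prefix, roughly eight times the 25 µs per prefix quoted for the IS-IS case, and close to 29 s for 140 000 prefixes, the top of the chart's Y axis.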
Solution for the FIB update issue
Some possible solutions
  Hardware improvement: two-stage lookup at the edge or in the core
  Engineering changes:
    Reduce the number of IS-IS and BGP prefixes (for example: partial routing with a default route)
    Limit the number of next-hop changes
Partial routing performance (case of the collection network)
  It is not a perfect solution, but it provides a good improvement to convergence in the core
  It can stabilize the network by reducing the number of BGP updates
[Figure: number of BGP updates per day over 30 days, comparing full routing and partial routing.]
Summary
For Internet services
  IS-IS optimizations are sufficient to fulfill the service requirements with routers that have two-stage lookup. FT will deploy IS-IS fast convergence.
  BFD will be a good solution to improve the detection time on Gigabit Ethernet
For MPLS-VPN services
  Some additional tests are needed to detail failure cases and discuss the benefits of subsecond IS-IS convergence. There is a strong interaction with LDP.
Outlook
  Interoperability: test the end-to-end results with different routers (Cisco, Juniper)
  Do we need to consider Fast Reroute to decrease the convergence time for new services?
  A problem with updating many prefixes in the FIB remains (in some specific failure cases)
  Improve the inter-AS convergence time