Savvis IDC Networking Issues - Backbone Router(s) Failure
Aug 25th, 2007 by jps
RESOLVED
Update 2:
Excerpt from recent Savvis notice.
Our engineers have determined that during a scheduled maintenance, a router in the DL1 facility crashed. The router did not restore gracefully, and as a result our engineers routed all production traffic away from this node. Emergency Maintenance Activity 488xxxx has been created to reboot this node during our 0001-0500 CDT maintenance window Sunday 8/26/2007. Once the device has restored and verified to be in a healthy state, our engineers will roll-back the previously implemented changes which moved traffic away from this node, thus restoring traffic flow to its original state. At this time, the alternate paths are functioning perfectly and there are no concerns of utilization, latency or packet loss. SAVVIS will send another notification once this maintenance has been completed. Traffic traversing these alternate paths may experience a momentary increase in response time as the trunks route back to their original paths; however this maintenance is expected to be a largely unnoticeable event.
Update:
The LT Savvis IDC location is still seeing intermittent issues with some client routes. If you are experiencing problems connecting to your host(s), please open a support ticket via our support portal located at https://encompass.layeredtech.com or https://support.layeredtech.com/home/. If you are not able to use the support portal, please submit a ticket by emailing support@layeredtech.com from your registered support email account, using a 3rd-party email service (Gmail, Yahoo, etc.) or your primary account if it is available. Provide our support team with the following so it can be sent to Savvis for further investigation.
Client ID:
Server ID(s):
Please provide the following details if possible.
1: Dos or Shell based 300 ping count. Showing ONLY the end results not all 300 pings.
2: Dos or Shell based traceroute to your server from your remote location. Text file format only please. Our upstreams will not accept images. Please do NOT mask your IP address
3: Dos or Shell based traceroute from your server to your remote location. Text file format only please. Our upstreams will not accept images. Please do NOT mask your IP address
4: ‘root’ or administrator logins to your host so we can confirm the network configuration on your host and perform additional tests if needed. (optional)
5: Information about any firewalls or packet filters you might be running
6: Information about the application you are experiencing the latency in.
7: Information on how to possibly replicate the problem.
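For anyone who would rather script items 1 to 3 than run them by hand, here is a minimal sketch in Python. It is only an illustration, not an official LT tool: the function names and the report file name are made up for this example, and it assumes the standard `ping`/`traceroute` commands (or `ping`/`tracert` on Windows) are installed and produce English-language output.

```python
import platform
import subprocess


def build_commands(host, system=None):
    """Return the 300-count ping and the traceroute command for this OS."""
    system = system or platform.system()
    if system == "Windows":
        return {"ping": ["ping", "-n", "300", host],
                "traceroute": ["tracert", host]}
    return {"ping": ["ping", "-c", "300", host],
            "traceroute": ["traceroute", host]}


def ping_summary(output):
    """Keep ONLY the end results: everything from the statistics header on."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        if "statistics" in line.lower():
            return "\n".join(lines[i:]).strip()
    return output.strip()  # no recognizable summary; return everything


def collect(host, path="network-report.txt"):
    """Run both commands and save a plain-text report (hypothetical file name)."""
    cmds = build_commands(host)
    ping = subprocess.run(cmds["ping"], capture_output=True, text=True)
    trace = subprocess.run(cmds["traceroute"], capture_output=True, text=True)
    with open(path, "w") as fh:
        fh.write(ping_summary(ping.stdout) + "\n\n" + trace.stdout)
    return path
```

Calling `collect("your.server.ip")` from the remote location covers items 1 and 2; running it on the server against the remote address covers item 3. The report is plain text and leaves IP addresses unmasked, as requested above.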
Thanks,
Jeremy
Original Post:
Hello,
We are currently experiencing issues upstream from our Savvis IDC. Savvis was performing unscheduled maintenance on a backbone device, which has caused a peering issue and severe or 100% packet loss for most hosts on the Savvis network.
We are working with Savvis now to get the issue resolved as quickly as possible and will update everyone with further details as they are found.
EDIT: 1:47 AM CST
I am still waiting on a call back from Savvis. Will be a few more mins before I get an update and ETA for resolution.
EDIT: 1:49 AM CST
Savvis has responded and is in the process of restoring previous configurations on the affected devices. The devices will be reloaded and brought back online in hopefully the next 15-20 mins.
EDIT: 1:53 AM CST
Some connectivity is responding. Still not 100% resolved. Will report as I learn more.
EDIT: 1:58 AM CST
Connectivity being restored. Heavy packet loss still occurring at this time.
EDIT: 2:02 AM CST
Packet Loss is starting to subside. We are still monitoring the situation and waiting on more details from Savvis as to the cause.
EDIT: 2:08 AM CST
Packet loss is still occurring with intermittent blips of 100% loss. Still waiting on more details from Savvis as to the cause and resolution.
EDIT: 2:20 AM CST
Savvis reports they have an ongoing issue with 3 core backbone routers in the Dallas Fort Worth (DFW / DL1) region. The routers are back online now but are undergoing further inspection to fix any outstanding issues that are found, and are also being rebooted to roll back previously made changes.
Current Savvis ETA: 45mins - 1 hour
I am waiting on a call back from Savvis NOC as more details are known and will update this post with them.
EDIT: 2:39 AM CST
Packet loss appears to be at 0% now. Repeated tests of 200-300+ pings show 0% loss in the past 15 mins.
--- 209.67.208.178 ping statistics ---
300 packets transmitted, 300 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.507/1.692/2.008/0.083 ms
Savvis has not closed out the issue or stated it is 100% resolved so some issues may re-appear until the issue is fully resolved.
EDIT: 2:44 AM CST
Savvis has reported 1 of the 3 routers has been fully resolved and is back online. They are now making changes to the remaining 2 that are showing issues. There will be brief lapses of 100% packet loss or heavy packet loss while the last 2 are being repaired.
I am now showing packet loss again from my remote location
EDIT: 3:04 AM CST
No real updates at this time. I am still seeing packet loss from my remote location and awaiting updates from Savvis on the remaining 2 backbone routers in this region that are having issues.
EDIT: 3:13 AM CST
Savvis is still working on the issues with 2 affected backbone routers. They are stating it will be a new ETA of 1-2 Hours from now. The current issue appears to only be affecting some ranges at the IDC and some inbound peers at this time
Current Savvis ETA: 1 - 2 hours
EDIT: 3:27 AM CST
I am now able to ping from remote hosts. I am monitoring the situation now and waiting on an update from Savvis. The current maint window is still open and work ongoing.
EDIT: 3:35 AM CST
Savvis is now stating the work on the 2 affected routers has been completed. They are monitoring the devices for further issues and will be calling me back in 20 mins for an update on how we are seeing traffic coming in and out from remote sites. At this time I am no longer showing packet loss, but I am seeing extended routes going out through Chicago instead of the local DFW area peers, which is causing additional latency but no packet loss.
EDIT: 3:57 AM CST
Odd routes going through Chicago and other major cities are now being resolved and routing through the proper peers. Packet loss has not re-appeared since my last update.
EDIT: 4:50 AM CST
Packet loss has not returned in the last couple hours. Savvis is reporting this issue as resolved and I am going to close it out now with us.
--- 72.21.34.34 ping statistics ---
2840 packets transmitted, 2836 packets received, 0% packet loss
round-trip min/avg/max/stddev = 8.379/10.635/120.459/4.314 ms
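The percentage in a ping summary is rounded, so a handful of dropped packets in a long run can still display as 0%. As a rough sketch (the function name is made up for illustration, and the pattern only assumes the summary wording shown above), the exact figure can be recomputed from the two packet counts:

```python
import re


def loss_percent(summary_line):
    """Return (packets lost, exact loss percentage) from a ping summary line."""
    m = re.search(r"(\d+) packets transmitted, (\d+) (?:packets )?received",
                  summary_line)
    if not m:
        raise ValueError("not a ping summary line")
    sent, received = int(m.group(1)), int(m.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    lost = sent - received
    return lost, 100.0 * lost / sent


lost, pct = loss_percent(
    "2840 packets transmitted, 2836 packets received, 0% packet loss")
print(lost, round(pct, 2))  # 4 packets lost, about 0.14%
```

For the run above that works out to 4 lost packets out of 2840, roughly 0.14%.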
Anyone still experiencing problems connecting to their host should contact our support team at support@layeredtech.com or via https://encompass.layeredtech.com or https://support.layeredtech.com/home/. Your host may be experiencing a problem outside of this earlier outage.
Thanks,
Jeremy

When did it start? What is the ETA?
I got this from them 15 min ago:
Please be advised that there is maintenance scheduled tonight that is affecting the Dallas datacenter. The maintenance is scheduled from 12:00 AM to 5:00 CST this morning. Please let us know if we can be of further assistance.
Thank you,
Adam Gray
Service Request Analyst
SAVVIS
Built To Respond
Seems to be down from an hour..
keep us updated
Its working now on my side, take care.
Julius
It must be our unlucky day. Server was down for 10+ hours because of hardware issues and now this happens.
Its working on and off for us, keep getting a wake up cell phone text message each time it goes down
It is on and off. It was on now and off again. Now it is on again!
been facing downtime on my service since 2-3 hours. i hope this is due to this problem. coz server has been rock solid since months now!
IP: 72.36.195.250
Just FYI: My monitoring shows: abnormal increase in latency that topped out over 200ms beginning 23:35 CST through 23:55, heavy packet loss began right at midnight CST, and complete loss as of 00:35 CST. Then occasional packets began making it through as of 01:25 CST where there would be connectivity for a short time period followed by complete loss again for a longer period.
At any rate, thank you, Jeremy and crew, for keeping us up-to-date.
I am now again seeing 100% packet loss.
Highly appreciating the transparency and the co-operation of LT staff, even the problem is beyond their limits.
Interesting that Murphy’s laws are applying here too: the issue appeared right when we needed it most (an online preview for a hosted account). However, it turned out to be a success
Thanks guys.
Having servers in both Savvis and Databank turned out to be a very good idea. Thanks Jeremy for the status update.
Still packet loss. What’s going on?
At this moment no connections possible to layeredtech from the Netherlands! Murphy strikes twice.. At this moment I planned to do some maintenance on my servers :-(. I think I’ll just take another cup of coffee.
Thanks to the layeredtech people for their transparency and cooperation and their outstanding service!!!
There has been no real update for 40 minutes, what’s going on with this issue and why are people performing unscheduled “maintenance” on anything.
A friday night outage is certainly not something anyone needs.
still having many problems from Poland
Dear,
we have a problem of connecting to our server
72.36.216.58
please help,
thx,
wael ali: Everyone is having problems. Get in line!
damn..
down again?
Still down for me. Waiting for updates.
Thanks.
Dear Sir,
we loose each min pass about 500$
why unscheduled maintenance
shouldn’t we know before that
you will cause us to loose alot of money
you be responsible for that
please update
I can connect to layeredtech.com but cant connect to my server and encompass.layeredtech.com. Thanks for transparency LT…
What’s happening? First up/down, then up and now completely down?? What is going on?? Someone again decided to do an unscheduled maintenance? Bogus!
wow.. its like a roller coaster
hope LT fixes it by morning….
Two of my servers are with Savvis. Thought they are the best…quite disappointed that this downtime still occurs in today’s time. It is very bad to have my sites down for more than an hour…and still going.
BACK!!!!!!!! WOOHOOO!!!!!!!!!!!!!!
when the problem will be solved?
“why unscheduled maintenance
shouldn’t we know before that”
That’s why it’s called “unscheduled maintenance”.
No use complaining, I reckon, this is a Savvis wide issue; not really LTs fault.
Though, I do expect a detailed report from Savvis on what happened, and what steps they’re taking to prevent it in the future.
Thank’s Jeremy for the updates!
Back to normal for me: thanks for the follow-up.
Jeremy, where is your announcement regarding this maintenance?
I think it is important to avoid this kind of downtime in future. Quite disappointing to have a few hours downtime. Very bad for any online sites.
It seems that Savvis is back and fine now.
“Jeremy, where is your announcement regarding this maintenance?”
Un-scheduled maintenance from Savvis. So hence this would have no announcement.
This kind of unscheduled downtimes are terrible for a web host like us, we’re in Pakistan and the server went offline at about 13:00 which is peak time for online communications etc.
At least we should have a notification.
Hello,
It seems that it is not fully fixed as I am still having problems with the server I am currently on. Anyone know an ETA to have this fully resolved?
Thanks!
Yes I am still having problems also connecting to my servers (from NZ). For the last few years, never a problem - until now. Hopefully they are still looking into the issues - would appreciate any update. Thanks.
Having problems from Australia - is this still continuing?
Hi There;
Same here , we have 100% packet loss , we just get out from hardware issue and now we are in a network issue ,…….
I understand the whole unscheduled thing is not Layered Tech’s fault… this will impact many more companies than just LT. Just wanted to say thanks for keeping us posted… and hope this gets resolved soon.
My server is still on and off and has been since about midnight last night. Could not go to WHM until just a little bit ago about 8 AM (central time) so the problem must not be totally resolved yet. In fact I had problems just submitting to this forum! Kept taking me to an error page! Gosh I sure hope they fix this soon!
This is driving me nuts.
Is there any eta?
While we do have access to our machine, we’re getting like 20-30kb/s.
LT please update us!
Still having issues with many servers on savvis…
I just cant believe how long is taking this time. Remember me databank…
This is a nightmare!
seems to be fixed from the UK for hours now - just done 10,000 pings and only lost 2. Also tried the other ip’s mentioned in posts above, and they all do 100 pings without fail.
I’m still experiencing intermittent connection timeouts and dropped packets.
Here’s a traceroute during a good period. I experienced a minute of 100% packet loss just prior to this:
traceroute to sv-b1 (72.232.89.242), 30 hops max, 46 byte packets
13 216.39.81.54 (216.39.81.54) 47.138 ms 47.094 ms 47.480 ms
14 * * *
15 sv-b1 (72.232.89.242) 46.979 ms 46.920 ms 46.785 ms
I’m having awful routing issues.. Some ips work for me, some don’t both on the same server.. the ones that don’t allow some data through, then lock up.. the IPs that work, act like nothing is wrong..
Having other people test it, the IPs that work for me, don’t for them.. and visa versa.. Except it’s repeatable 100% of the time.
-Jason
I haven’t been able to get on my site since this started, site traceroute and ping is fine, no packet loss, but the site still won’t load and even ssh times out.
I was told 24-72 hours for a network admin to look at it.
GREAT
JasonR,
That is a standard networking reply template. The ticket will be looked at much sooner than 24-72 hours.
Thanks,
Jeremy
I am having the exact same problem as JasonR. I can ssh to the server just fine, but it randomly locks up when I execute certain basic commands like “top”. It’s like it just can’t find a route back to me.
ShaneS,
Please make sure you send in your information regarding that so it can be passed up to Savvis. I do remember another client reporting a similar issue, and they were able to resolve it for them earlier this AM.
Thanks,
Jeremy
I’m still having issues connecting to some servers from Spain; the only way to connect to those servers is from another server at LT (on Savvis DC)
Traceroute from server to my DSL in Spain
[root@242 ~]# traceroute 80.24.202.107
traceroute to 80.24.202.107 (80.24.202.107), 30 hops max, 38 byte packets
1 241 (72.232.60.241) 5.930 ms 0.554 ms 0.518 ms
19 * 214.Red-81-46-52.staticIP.rima-tde.net (81.46.52.214) 160.543 ms *
…
4th hop seems to be the problem
From DSL to server
1
Elmister and others, please do not post traceroutes here, as they are not useful here. Please submit them as requested earlier with the additional details so we can pass your information up to Savvis.
Thanks,
Jeremy
How long should I expect for a reply from support? It’s been 2 hours without any reply. Ticket EQX-47960-659
That’s the thing i most hate from LT, everytime i send a network related ticket, i’m told to wait 24 to 72 hours, and yes, it takes 24 to 72h to be resolved, i think network should be a priority, but it doesn’t seem that LT thinks the same
ReneeB, your ticket was moved to a non-support queue after your earlier request for it to be closed; I have moved it to the network queue. elmister, the network information being submitted is being collected and forwarded to Savvis. You may not see a reply immediately, but it is being used. I am asking the network staff to reply to all tickets with an update.
Thanks,
Jeremy
I sent an e-mail with all my details (pings, traceroutes, etc.) but it was closed immediately and I was told I need to go through the company I rent the machine through (server4sale.com). I sent the details to them as well, but I’ve been dealing with their support all day. I get the impression that they think this is all something on my side since they can access the server and execute commands just fine. They don’t see any issues. I guess I’ll cross my fingers and just keep refreshing this thread.
ShaneS I show that your reseller submitted the same details. The ticket is open with our support staff and being reviewed.
Thanks,
Jeremy
We’re experiencing the same problem with LT servers, that we purchased through ZipServers. Unfortunately they are blaming this on our ISP, which is clearly not the case.
Is this something that is actively being worked on? Comments written above such as “I can ssh to the server just fine, but it randomly locks up when I execute certain basic commands like “top”. It’s like it just can’t find a route back to me.” are precisely the case..
Same issue as Saeven here, SSH would randomly lock up and websites would randomly not load. I show no packet lost but we also have not changed any configurations on the server.
I really hope this gets fixed soon, I have a few customers getting really upset.
This morning ltstatus.com is sometimes not available or it takes a long time to load. Same problems as Bruce and Saeven! I live in the Netherlands
Saeven, Bruce and Jack, please note there is an ongoing maint window right now that is taking place with Savvis to repair the issues they caused last night. Hopefully this will clear up your issues. If not, please submit the details to us as requested earlier, or via your reseller, and ask them to forward the details to us.
The window for this maint should be closed in roughly 2 hours from now.
Thanks,
Jeremy
Hello Ltadmin,
Visitors from Holland and Denmark are experiencing problems, sometimes they can connect the websites sometimes they cant… Sometimes they are connecting a website in my server without a problem, but in same time they cant connect to another website in the same server.
Do we need 24-72 hours to be resolved? If so i can explain the situation to my customers and visitors..
It seems it’s down completly now. Any ETA from Savvis yet ? How will Savvis make up for this problem, some of us are loosing big money because this extended downtime.
Seems like all the problems that were there before the most recent maintenance are still there now. Can you please let us know what was done by Savvis and if they currently are considering the issue resolved or if they are aware there are still problems?
I am still seeing the same exact problem as well.
Any updates? My clients are already angry… I hope this will be resolved today… I do not want to lose any more business.. I’ve already got customers wanting to go elsewhere..
Hello,
Wondering if someone has details if the work was done or not, the clients were having problems before are still with the same problems.
Thank You
HTTP/FTP/SSH/POP3 are still being blocked/dropped on our primary IP address, but only when accessed from certain ISPs. However, our other IP addresses on the same server don’t seem to give anyone any problem.
It appears that the route FROM the server to the ISP unexpectedly varies depending on which of the server’s IP addresses is used for the traceroute. Ticket #KOY-65711-322
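One way to check this kind of observation is to run the traceroute once per source address and see where the two paths first differ. The following is only a rough sketch: the helper names are invented for this example, it assumes a Unix-style `traceroute` that supports the `-n` (numeric) and `-s` (source address) options, and the addresses in the usage note are documentation placeholders, not real hosts.

```python
import re


def traceroute_from(source_ip, destination):
    """Command for a numeric traceroute that uses source_ip as its source."""
    return ["traceroute", "-n", "-s", source_ip, destination]


HOP = re.compile(r"^\s*(\d+)\s+(.*)$")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def hops(output):
    """Map hop number -> first IP address on that line (None for '* * *')."""
    result = {}
    for line in output.splitlines():
        m = HOP.match(line)
        if not m:
            continue  # header line or blank line
        ip = IP.search(m.group(2))
        result[int(m.group(1))] = ip.group(0) if ip else None
    return result


def first_divergence(output_a, output_b):
    """Return the first hop number where the two paths differ, or None."""
    a, b = hops(output_a), hops(output_b)
    for n in sorted(set(a) | set(b)):
        if a.get(n) != b.get(n):
            return n
    return None
```

For example, saving the output of `traceroute_from("192.0.2.10", "198.51.100.7")` and `traceroute_from("192.0.2.11", "198.51.100.7")` as text and passing both to `first_divergence` gives the hop number to mention in the ticket.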
I have noticed that some people can’t reach the IP they were connected to when the issues were happening. They can reach any other IP, even on the same server, but not the one they were connected to, which unfortunately is exactly the one they want to connect to, since it is their web/radio/chat/
Alex
ChrisM that sounds precisely right, thanks for having filed a ticket. Should I set up some forum space where we can all post traceroutes for these guys to look at?
Almost every server we have on SAVVIS are with those problems… If I would go for every server I have to provide all the details you request, we will not finish today. So I guess you need to check all again!
Glad to read we’re not alone at least, hope people keep piping up til its fixed! Good luck with the fixings nonetheless, I hope that it gets resolved over the weekend.
Hello,
I have people saying they cannot reach my server either.
ETA?
RickyG
All,
As we have requested already in the past 48 hours, please submit the details of anything you are seeing to our support team and they will forward the details up to the Savvis group. If your host is via a reseller, then please submit them via the reseller, and if they state the issue is non-existent, please point them to this post and ask them to submit them. Please include the following with all submissions.
Client ID:
Server ID(s):
Please provide the following details if possible.
1: Dos or Shell based 300 ping count. Showing ONLY the end results not all 300 pings.
2: Dos or Shell based traceroute to your server from your remote location. Text file format only please. Our upstreams will not accept images. Please do NOT mask your IP address
3: Dos or Shell based traceroute from your server to your remote location. Text file format only please. Our upstreams will not accept images. Please do NOT mask your IP address
4: ‘root’ or administrator logins to your host so we can confirm the network configuration on your host and perform additional tests if needed. (optional)
5: Information about any firewalls or packet filters you might be running
6: Information about the application you are experiencing the latency in.
7: Information on how to possibly replicate the problem.
7 Cancellation requests, 7 less servers, and counting, i really don’t understand how can i have ONLY 7 cancellations caused by this huge downtime
I’m very lucky, it might be the summer, maybe customers are in the beach, let’s see what happens tomorrow
elmister,
We are doing everything we can with the issue. Savvis is currently making some changes on the backbone to route around some identified oc48 links to see if it resolves the issue. Initial tests look promising but we are waiting on some more feedback.
Thanks,
Jeremy
The main problem is that we can’t ask our customers to run as many tests as you require; it’s incoherent. My customers just want to see their websites opening.
Here some IPs that our customers reported problems:
72.232.109.*
72.36.202.*
72.36.229.*
72.36.225.*
Savvis has now made some changes for 5 class C’s to route around some oc48 circuits that were dropping packets. Initial tests have shown improvement for these ranges and we are waiting on feedback from the clients on them to confirm the issue is resolved with this change. Savvis in the meantime is ensuring they have enough capacity to route around these circuits, and we will then look at routing the larger supernet blocks on our Savvis Network which are showing issues.
I am on the call now with Savvis and will update as I learn more.
Thanks,
Jeremy
My affected primary is 72.232.119.114
I look forward to your updates, and to SAVVIS rectifying the issue. I’m not an LT customer directly, though my server is with you.. Zipservers, with whom I obtain service for the SAVVIS DC through you, have been less than helpful. Your attitude here, is definitely encouraging me to skip the middleman.
Just to note, the issue is still a problem at present time.
Fix it, fix it now, it is your fault, fix it. Now
EDIT: 17:38 CST
We are now in the process of routing our larger supernets out over alternate paths to avoid the oc48 circuits that have been showing packet loss. This has shown to be successful with the smaller /24 blocks we did earlier. There may be some brief service interruptions while this takes place.
Thanks,
Jeremy
All the clients got booted like a minute ago, letting you know hopes is finally the fix of the issues.
Thank You
Hello,
Savvis has now routed all of the LT IP ranges off the current OC48 circuits that were having issues, and they are now using an OC192 circuit.
All hosts that were showing issues before should now re-test and see if they are still experiencing problems since the last change was made. Please ensure you test connectivity with all of your supported protocols, e.g. SSH, HTTP, HTTPS, POP3, SMTP and others.
We will be having another call with Savvis in 1 hour so any updates would be appreciated.
Thanks,
Jeremy
FWIW: my machine is still behaving badly. Pings to it work fine (as they have been for a long time), but the moment I do anything that requires a bit of ssh traffic, that process just hangs. (E.g. a long ls or a vi session is impossible.) Non network related traffic is fine.
Hi,
Still the same issue here, top would freeze SSH and sites will randomly not connect. I found out some new things that might help SAVVIS/LT solve the problem.
1) If i leave it at waiting for domain, eventually (in a few minutes or sometimes longer) it will load the page.
2) One ODD thing is any 404 or server generated pages load quickly with no problems, for example, http://swifthost.net/nonexistantpage will load very quick and give me my custom error page, but any PHP or even basic HTML pages will take a really long time.
(Anyone else with the same issue? Try loading a page that’s non-existant or a server generated page [404, directory listings] )
3) Another thing is the server is running perfectly fine, I know this because any websites I go to via a proxy url (megaproxy.com) they load perfectly fine and QUICK, no connection issues, but if I try to load the page on my own connection directly, it gives me all the same issues as before!)
4) If I RDP into a Windows 2003 server I have running on another host that uses Cogent bandwidth, I can connect perfectly to my server, this includes ALL HTTP connections, sites load quick and fast, SSH works perfectly fine, all commands work without freezing.
From what it looks like, Savvis is STILL experiencing issues and I REALLY hope this gets resolved soon, it has been nearly 48 hours since this problem has started occurring!
My ticket number from LT is #LTTR #MOU-68176-630 and I will be posting this message as a update!
Savvis has made another change on the uplink routing equipment. Can everyone please re-test their servers and services and see if there are any improvements. If you are still seeing an issue, please send us updated traceroute and ping results and the specific protocol you are seeing errors with, and we will continue to investigate the issues.
Thanks,
Jeremy
48 hours down and still no change
ReneeB,
I am able to access your host from a number of sites with no problems. Can you please see my last update to your ticket and test again.
Thanks,
Jeremy
It’s back up now, thanks and sorry for the constant pestering.
Savvis has finished all work for this evening. We have been viewing all reports coming in and most people seem to be fully resolved at this time. We are continuing to monitor until 02:00 AM CST and will close the issue for tonight only. We will be working with Savvis tomorrow to get a more clear understanding of the true cause of the issue and what can be done to prevent any re-occurrence over the next few days and plan to remove the current static routes over the coming 30 days. We will post updates prior to making those changes.
If you are still having problems connecting to your host it is likely NOT related to the earlier problem and should be addressed by our support staff. So please open a *new* ticket and request a reboot or local console check to resolve the issues. If you are still seeing network issues after that please submit the full pings, traceroutes and other details and we can re-open the issue with Savvis tonight and continue. At this time all systems appear to be clean and operating as expected. A full postmortem is still in progress and we will have more details in the coming days of the true cause and sequence of events.
Thanks,
Jeremy