SAVVIS Routing Issues Information UPDATE
Aug 26th, 2007 by Miguel Richards
RESOLVED:
The conference call is still continuing; all LT and Savvis resources are being applied to get the issues resolved. Savvis has now made changes for 5 class C’s to route around some OC48 circuits that were dropping packets. Initial tests have shown improvement for these ranges, and we are waiting on feedback from the affected clients to confirm that this change resolves the issue. In the meantime, Savvis is ensuring they have enough capacity to route around these circuits; we will then look at routing the larger supernet blocks on our Savvis network that are showing issues.
I am on the call now with Savvis and will update as I learn more.
EDIT: 17:38 CST
We are now in the process of routing our larger supernets out over alternate paths to avoid the OC48 circuits that have been showing packet loss. This approach proved successful with the smaller /24 blocks we moved earlier. There may be some brief service interruptions while this takes place.
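Moving the /24 (class C) blocks first and the supernets afterwards works because routers forward on the most specific matching prefix (longest-prefix match): a /24 route overrides a covering supernet route for addresses inside that /24. The sketch below illustrates this with Python's standard `ipaddress` module. The prefixes and the next-hop labels `oc48-path` and `alternate-path` are hypothetical and are not LT’s or Savvis’ actual routing table.

```python
import ipaddress
from typing import List, Optional, Tuple

# Hypothetical routing table for illustration only; these are NOT the real
# LT or Savvis prefixes or next hops.
ROUTES: List[Tuple[ipaddress.IPv4Network, str]] = [
    (ipaddress.ip_network("72.36.128.0/18"), "oc48-path"),       # hypothetical supernet
    (ipaddress.ip_network("72.36.152.0/24"), "alternate-path"),  # hypothetical /24 moved first
]


def next_hop(address: str) -> Optional[str]:
    """Pick the route whose prefix contains `address` and is the most specific."""
    ip = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in ROUTES if ip in net]
    if not matches:
        return None
    # Longest-prefix match: the route with the largest prefix length wins.
    _, hop = max(matches, key=lambda route: route[0].prefixlen)
    return hop
```

With this table, `next_hop("72.36.152.82")` returns `"alternate-path"` while `next_hop("72.36.132.58")` still returns `"oc48-path"`; only moving the whole supernet (a /18 here, which contains 64 /24 blocks) shifts every address in it.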
EDIT: 19:06 CST
Hello,
Savvis has now routed all of the LT IP ranges off the OC48 circuits that were having issues; they are now using an OC192 circuit.
All hosts that were showing issues before should re-test and check whether they are still experiencing problems since the last change was made. Please test connectivity with all of your supported protocols, e.g. SSH, HTTP, HTTPS, POP3, SMTP and others.
We will have another call with Savvis in one hour, so any updates would be appreciated.
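One quick way to re-test several protocols at once is a TCP connect check against each service port. The sketch below is a minimal example using only Python's `socket` module; the port numbers are the conventional defaults and are an assumption, so adjust them if your services listen elsewhere. A successful handshake only shows that the port answers. It does not prove that larger transfers get through, so a pass here should still be followed by a real login or page load.

```python
import socket
from typing import Dict

# Conventional default ports (an assumption; your services may differ).
DEFAULT_PORTS: Dict[str, int] = {"SSH": 22, "SMTP": 25, "HTTP": 80, "POP3": 110, "HTTPS": 443}


def check_tcp_ports(host: str, ports: Dict[str, int], timeout: float = 5.0) -> Dict[str, bool]:
    """Try a TCP connect to each port; True means the handshake completed."""
    results: Dict[str, bool] = {}
    for name, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Covers refused connections, DNS failures, unreachable hosts and timeouts.
            results[name] = False
    return results
```

Usage: `check_tcp_ports("your-server.example", DEFAULT_PORTS)`, where the hostname is a placeholder for your own server.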
EDIT: 22:29 CST
Savvis has made another change on the uplink routing equipment. Can everyone please re-test their servers and services and see if there are any improvements? If you are still seeing an issue, please send us an updated traceroute, ping results and the specific protocol you are seeing errors with, and we will continue to investigate.
EDIT: 00:50 CST
Savvis has finished all work for this evening. We have been reviewing all reports coming in, and most people seem to be fully resolved at this time. We are continuing to monitor until 02:00 AM CST and will close the issue for tonight only. Starting tomorrow, we will work with Savvis over the next few days to get a clearer understanding of the true cause of the issue and what can be done to prevent a recurrence. We plan to remove the current static routes over the coming 30 days and will post updates before making those changes.
If you are still having problems connecting to your host, it is likely NOT related to the earlier problem and should be addressed by our support staff. Please open a *new* ticket and request a reboot or local console check. If you are still seeing network issues after that, please submit the full pings, traceroutes and other details, and we can re-open the issue with Savvis tonight and continue. At this time all systems appear to be clean and operating as expected. A full postmortem is still in progress, and in the coming days we will have more details on the true cause and sequence of events.
Thanks,
Jeremy

How soon will this be fixed? Any ETA?
My customers’ sites have now been down for 2 days….
I’m wondering this as well. I still can’t load my websites, or even test to ensure that a cPanel problem has been fixed because I cannot access the server.
If you want, I can check. I can reach both of my servers without problems (fingers crossed) ;) Thanks.
I am having a problem with an https:// site. Who should I contact about this issue? Thanks.
Services on my 3 servers are affected differently for different users. Some users can reach HTTP on one server but not another, while another user on a different network has the exact opposite experience, and some users are not affected at all.
I’ve been looking over traceroutes from various looking glasses, and I’m also seeing asymmetrical routing from various locations. Are you certain that you’re not announcing the same prefixes with different next-hops?
I asked some of the people with problems, and they still have the same problems. They can’t connect to the IP they were using when the main problem happened, but they can reach other IPs, even on the same server.
Alex
Alex,
That is exactly what I am seeing. I have been in contact with the datacenter manager and have sent him the most recent traceroute. I have given him full access to the server and offered him VPN access to my home machine, so they can experience the issues first hand if they can’t duplicate them there.
-JasonR
*I AM REPOSTING THIS HERE TO MAKE SURE THIS INFORMATION GETS PASSED ALONG IN THE NEXT CALL TO HELP RESOLVE THE ISSUE*
Hi,
Still the same issue here: running top freezes SSH, and sites randomly fail to connect. I found out some new things that might help Savvis/LT solve the problem.
1) If I leave the browser at “Waiting for domain…”, it eventually (after a few minutes, sometimes longer) loads the page.
2) One ODD thing is that any 404 or server-generated pages load quickly with no problems. For example, http://swifthost.net/nonexistantpage loads very quickly and gives me my custom error page, but any PHP or even basic HTML pages take a really long time.
(Anyone else with the same issue? Try loading a nonexistent page or a server-generated page [404, directory listings].)
3) Another thing: the server itself is running perfectly fine. I know this because when I open my websites via a proxy URL (megaproxy.com), they load perfectly and QUICKLY with no connection issues, but if I load the page directly over my own connection, I get all the same issues as before!
4) If I RDP into a Windows 2003 server I have running on another host that uses Cogent bandwidth, I can connect perfectly to my server. This includes ALL HTTP connections: sites load quickly, SSH works perfectly fine, and all commands work without freezing.
From what it looks like, Savvis is STILL experiencing issues, and I REALLY hope this gets resolved soon. It has been nearly 48 hours since this problem started!
My ticket number from LT is #LTTR #MOU-68176-630, and I will be posting this message as an update!
https:// is now working for me, but I have this issue:
11 8 8 14 204.70.193.26 hr2-tengig12-1.dallas1.savvis.net
12 8 8 8 216.39.81.34 -
13 Timed out Timed out Timed out -
14 Timed out Timed out Timed out -
15 Timed out Timed out Timed out -
16 Timed out Timed out Timed out -
Trace aborted.
Thanks.
SSH still locking up on one server, but fine on another.
Thanks for the updates everyone. Please post if you are fixed now or not. If you are not fixed please submit us the traceroute details and specific protocols that are failing so we can continue to investigate a resolution to this issue. Savvis is still on the call with me and working on their local routing equipment to isolate the outstanding problems that have been reported since my earlier update at 19:06 CST.
Thanks,
Jeremy
I ran a traceroute:
traceroute to vespinoy.com (72.36.152.82), 30 hops max, 40 byte packets
1 10.1.184.1 (10.1.184.1) 17.951 ms 17.280 ms 12.410 ms
2 202.8.255.1 (202.8.255.1) 10.102 ms 12.296 ms 12.703 ms
3 EdgeRouter-pos.d-one.net (202.8.224.136) 18.238 ms 15.705 ms 18.223 ms
4 203.177.55.50 (203.177.55.50) 12.101 ms 23.791 ms 20.958 ms
5 203.177.31.89 (203.177.31.89) 27.490 ms 20.836 ms 13.459 ms
6 203.177.254.185 (203.177.254.185) 215.822 ms 215.335 ms 219.224 ms
7 POS3-1.IG4.LAX1.ALTER.NET (157.130.214.193) 217.336 ms 216.670 ms 213.475 ms
8 0.so-7-0-0.XL1.LAX1.ALTER.NET (152.63.112.250) 226.432 ms 227.368 ms 218.506 ms
9 0.so-6-0-2.XT3.DFW9.ALTER.NET (152.63.0.57) 264.834 ms 255.366 ms 253.425 ms
10 0.so-6-0-0.BR6.DFW9.ALTER.NET (152.63.99.2) 263.032 ms 254.173 ms 262.450 ms
11 204.255.169.102 (204.255.169.102) 243.069 ms 241.138 ms 239.775 ms
12 cr1-pos0-3-3-1.dallas.savvis.net (204.70.194.66) 233.827 ms 234.714 ms 259.401 ms
13 hr2-tengig12-1.dallas1.savvis.net (204.70.193.26) 238.513 ms 235.358 ms 239.325 ms
14 216.39.69.238 (216.39.69.238) 238.394 ms 234.283 ms 234.628 ms
15 * * * Request timed out.
16 82.152.36.72.static.reverse.ltdomains.com (72.36.152.82) 235.288 ms 246.351 ms 239.281 ms
Trace complete
still can’t access my site.
No changes for me: traceroute from my local machine always ends in ‘Request timed out’. Ping works.
(I have also filed a request on this.)
Tom
Hello,
We are still experiencing problems and are unable to connect by any method. Does anyone know of any sort of ETA? My customers’ sites have also been down for 2 days now, and this is not good.
Hopefully you can get this issue resolved ASAP!
I also ran a tracert.
C:\Users\CyberGeek>tracert cybergeekweb.com
Tracing route to cybergeekweb.com [72.36.132.58]
over a maximum of 30 hops:
1
I still have the same access problem as ScottS.
My SSH client can connect to my server, but it suddenly locks up.
I have another server in another DC, and via that server the SSH connection runs smoothly.
I guess there is still some routing issue.
Tracert …
C:\Users\CyberGeek>tracert cybergeekweb.com
Tracing route to cybergeekweb.com [72.36.132.58]
over a maximum of 30 hops:
1
My server is still having issues. SSH times out after just one or two commands and my HTTP/HTTPS/POP/SMTP is mostly unresponsive.
I am still having intermittent issues on some sites. This is affecting sites hosted on three different servers in the LT datacenter, but strangely our other server in the LT datacenter has not had any issues.
I am seeing intermittent problems connecting to sites - one site on a server will work, another will not, 10 minutes later they’ve switched.
I’m seeing many of the same symptoms that Bruce mentioned above. SSH windows freeze when executing a command with lots of output (e.g. top, ps aux). On sites I can’t get to, the 404 pages load very quickly. If I try a wget on a page that does exist, I see very quickly that the server responds with a 200 OK code, but nothing arrives after that.
I hope this information helps. I sent in several traceroutes in ticket OST-23310-533 if you would care to look them over. I will not post them here for brevity and privacy.
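The symptom described here (status line and headers arrive quickly, small error pages load, but the body of a normal page never finishes) can be measured separately for a report. The sketch below uses Python's standard `urllib.request` to time the headers and the full body independently, with a socket timeout so a stalled body is reported instead of hanging. The function name `time_fetch` is hypothetical, and this only records the symptom; it does not diagnose its cause.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def time_fetch(url: str, timeout: float = 30.0) -> Tuple[Optional[int], float, Optional[float], int]:
    """Return (status, seconds_to_headers, seconds_to_full_body or None, bytes_read)."""
    start = time.monotonic()
    try:
        resp = urllib.request.urlopen(url, timeout=timeout)
        status = resp.status
    except urllib.error.HTTPError as err:
        # 404 and other error pages still carry a readable body.
        resp, status = err, err.code
    except OSError:
        # No status line at all within the timeout, or the connection failed.
        return None, time.monotonic() - start, None, 0
    headers_s = time.monotonic() - start
    received = 0
    try:
        while True:
            chunk = resp.read(8192)
            if not chunk:
                break
            received += len(chunk)
    except OSError:
        # The body stalled past the socket timeout: headers arrived, content did not.
        return status, headers_s, None, received
    finally:
        resp.close()
    return status, headers_s, time.monotonic() - start, received
```

A result such as `(200, 0.3, None, 4096)` would match the report above: a quick 200 OK followed by a body that never completes.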
I had two separate requests (different servers, one HTTP and one HTTPS) that were consistently failing; both started working within 5 minutes before the 22:29 update.
I hope it’s resolved for all others as well.
Hi Jeremy,
I have clients who just reported that they can get in now; I will post here if anyone reports problems. I want to thank you for keeping us posted about the problem.
Thank You
Alex
I’ve received reports from a dozen or so users who were previously unable to access portions of our web services who indicate they are now able to connect. I remain hopefully optimistic.
Thanks!
Everything seems to be working good so far! Finally, thanks for fixing this.
At 04:20 GMT:
WinMTR Statistics
Host Loss% Sent Recv Best(ms) Avrg(ms) Wrst(ms) Last(ms)
192.168.1.254 0 120 120 0 59 141 93
lo1.dr4.d12.xs4all.net 1 120 119 0 25 187 47
0.10ge-3-2-0.xr3.d12.xs4all.net 0 120 120 0 15 157 16
0.so-0-0-0.xr2.3d12.xs4all.net 2 120 118 0 16 187 16
134.222.97.17 0 120 120 0 14 156 15
obl-rou-1021.NL.eurorings.net 0 120 120 0 17 297 16
ldn-s2-rou-1021.UK.eurorings.net 0 119 119 15 22 94 16
ldn-s2-rou-1003.UK.eurorings.net 6 119 113 15 24 94 31
bcr2-ge-4-1-0.Londonlnx.savvis.net 0 119 119 15 22 94 93
204.70.193.118 0 119 119 15 24 125 32
cr1-pos-0-8-0-0.NewYork.savvis.net 1 119 118 15 106 188 16
cr1-pos-0-8-0-0.NewYork.savvis.net 62 119 46 109 148 546 110
dcr2-so-0-0-0.dallas.savvis.net 59 119 49 125 154 188 141
dcr1-so-6-0-0.dallas.savvis.net 52 119 58 125 147 187 187
cr1-pos0-3-3-1.dallas.savvis.net 62 119 46 125 151 188 141
hr2-tengig12-1.dallas1.savvis.net 72 119 34 125 154 188 172
dcr1-so-6-0-0.dallas.savvis.net 57 119 52 125 177 360 125
cr1-pos0-3-3-1.dallas.savvis.net 73 119 33 125 148 203 188
hr2-tengig12-1.dallas1.savvis.net 74 119 31 125 153 235 141
http://www.rkboz.nl 84 119 20 140 166 438 438
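A common rule of thumb for reading MTR-style output: loss at one intermediate hop that does not carry forward is often just that router rate-limiting its own ICMP replies, while loss that starts at a hop and persists through every later hop, including the destination, points at that segment. The sketch below applies this heuristic to the Sent/Recv columns above (London onwards). The 10% threshold is an arbitrary assumption, and the exact loss ratio it computes may differ slightly from WinMTR’s rounded % column.

```python
from typing import List, Optional, Tuple

# (host, sent, recv) copied from the WinMTR output above, London onwards.
ROWS: List[Tuple[str, int, int]] = [
    ("ldn-s2-rou-1021.UK.eurorings.net", 119, 119),
    ("ldn-s2-rou-1003.UK.eurorings.net", 119, 113),
    ("bcr2-ge-4-1-0.Londonlnx.savvis.net", 119, 119),
    ("204.70.193.118", 119, 119),
    ("cr1-pos-0-8-0-0.NewYork.savvis.net", 119, 118),
    ("cr1-pos-0-8-0-0.NewYork.savvis.net", 119, 46),
    ("dcr2-so-0-0-0.dallas.savvis.net", 119, 49),
    ("dcr1-so-6-0-0.dallas.savvis.net", 119, 58),
    ("cr1-pos0-3-3-1.dallas.savvis.net", 119, 46),
    ("hr2-tengig12-1.dallas1.savvis.net", 119, 34),
    ("dcr1-so-6-0-0.dallas.savvis.net", 119, 52),
    ("cr1-pos0-3-3-1.dallas.savvis.net", 119, 33),
    ("hr2-tengig12-1.dallas1.savvis.net", 119, 31),
    ("http://www.rkboz.nl", 119, 20),
]


def loss_pct(sent: int, recv: int) -> float:
    """Exact loss percentage for one hop."""
    return 100.0 * (sent - recv) / sent if sent else 0.0


def first_persistent_loss(rows: List[Tuple[str, int, int]], threshold: float = 10.0) -> Optional[Tuple[int, str]]:
    """Return (index, host) of the first hop from which loss >= threshold lasts to the end."""
    start: Optional[int] = None
    for i, (_, sent, recv) in enumerate(rows):
        if loss_pct(sent, recv) >= threshold:
            if start is None:
                start = i
        else:
            # Loss that does not carry forward is treated as isolated
            # (for example, a router rate-limiting its own ICMP replies).
            start = None
    return (start, rows[start][0]) if start is not None else None
```

On these rows the ~5% loss at ldn-s2-rou-1003 is ignored because the next hop shows none, and the function returns index 5, the second cr1-pos-0-8-0-0.NewYork.savvis.net entry, where loss jumps from under 1% to about 61% and stays high through the destination.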
72.232.50.114 is from United States(US) in region North America
TraceRoute to 72.232.50.114 []
Hop (ms) (ms) (ms) IP Address Host name
1 0 0 0 66.98.244.1 gphou-66-98-244-1.ev1servers.net
2 0 0 0 66.98.241.16 gphou-66-98-241-16.ev1servers.net
3 0 0 0 66.98.240.12 gphou-66-98-240-12.ev1servers.net
4 1 1 1 129.250.11.129 ge-1-11.r03.hstntx01.us.bb.gin.ntt.net
5 1 1 1 129.250.2.228 xe-0-1-0.r20.hstntx01.us.bb.gin.ntt.net
6 7 7 6 129.250.3.129 as-0.r20.dllstx09.us.bb.gin.ntt.net
7 178 8 58 129.250.2.154 po-1.r02.dllstx09.us.bb.gin.ntt.net
8 7 6 7 208.175.175.81 aer1-ge-5-5.dallasequinix.savvis.net
9 7 7 7 204.70.194.29 -
10 9 8 Timed out 204.70.194.66 cr1-pos0-3-3-1.dallas.savvis.net
11 15 15 15 204.70.193.26 hr2-tengig12-1.dallas1.savvis.net
12 14 14 14 216.39.81.34 -
13 Timed out Timed out Timed out -
14 Timed out Timed out Timed out -
15 Timed out Timed out Timed out -
16 Timed out Timed out Timed out -
Trace aborted.
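When comparing many traces like this one, it helps to summarize each one as "last hop that answered" plus "how many hops stayed silent after it". The sketch below parses the looking-glass format shown above (hop number, three RTT columns, IP, host name) with Python’s `re` module. The function name `last_responding_hop` is hypothetical, and the parser assumes IPv4 addresses and this particular column layout.

```python
import re
from typing import Optional, Tuple

HOP_RE = re.compile(r"^\s*(\d+)\s")                   # a line that starts with a hop number
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")  # first IPv4 address on the line


def last_responding_hop(trace: str) -> Tuple[Optional[Tuple[int, str]], int]:
    """Return ((hop, ip) of the last hop that answered, count of silent hops after it)."""
    last: Optional[Tuple[int, str]] = None
    silent_after = 0
    for line in trace.splitlines():
        hop = HOP_RE.match(line)
        if not hop:
            continue  # header, "TraceRoute to ..." or "Trace aborted." lines
        ip = IP_RE.search(line)
        if ip:
            last = (int(hop.group(1)), ip.group(1))
            silent_after = 0
        else:
            silent_after += 1
    return last, silent_after
```

For the trace above this returns `((12, "216.39.81.34"), 4)`: hop 12 is the last to answer, and the four hops after it, including where the server should be, never reply.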
Julius,
I suspect your host needs a reboot or similar. I am not able to hit it from the local area network or remote. Please open a *new* reboot or support ticket to have it looked at.
Thanks,
Jeremy
Jack,
I am having a hard time reading that output. Are you not able to connect to your site? I can reach it from this remote site with no errors or packet loss (PL). Can you re-run any tests you did to see if the packet loss persists, and then submit the results to our help desk if there is any loss? Either that, or paste the results to ‘http://pastebin.ca’ and post the link here.
Thanks,
Jeremy
The site seems to be up now, and shell access works too. Traceroute doesn’t (Request timed out), but I don’t really care about that, and it may be an issue with my local machine.
TomV,
It is normal for the second-to-last hop before your server to time out when you run a traceroute. That is where our network goes from public IP space to private IP space, and our edge gear is set up to drop all PING replies to those ranges as they cross the borders. If you repeat the test outbound from your host, you should not see that hop time out; instead it should show a device in a ‘10.xxx.xxx.xxx’ IP range. Also, please make sure you are not seeing timeouts caused by firewalls or kernel sysctl settings in use on your host.
Thanks,
Jeremy
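When reading an outbound trace from your host, it can help to flag which hops are in private address space, since those are the ones the edge gear treats differently at the public/private border. The sketch below uses the `is_private` property of Python’s standard `ipaddress` module, which covers the RFC 1918 ranges such as 10.0.0.0/8 as well as some other special-purpose ranges. The function name `classify_hops` is hypothetical, and the example addresses are taken from traces posted in this thread.

```python
import ipaddress
from typing import List, Tuple


def classify_hops(addresses: List[str]) -> List[Tuple[str, str]]:
    """Label each hop address as 'private' (e.g. 10.x.x.x) or 'public'."""
    return [
        (addr, "private" if ipaddress.ip_address(addr).is_private else "public")
        for addr in addresses
    ]
```

For example, `classify_hops(["10.1.184.1", "204.70.193.26"])` labels the first address private and the Savvis router public.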
Quick tests at 5am EST show my issue resolved. Will test more fully when I am actually awake.
-Jason
Thanks Jeremy, I did a reboot and it’s working now. Take care.
Juba
Jeremy, would you mind having a look at ticket #MEZ-55638-188? They’re telling me I have to wait 48-72 hours before a technician will have a look at it, and I have a client in Japan unable to connect to one of our servers (but several who can). Not sure if this is the same issue, but the traceroute shows a suspicious hop between Savvis and Level 3 Dallas.
Thanks!
Anyone get any response to SLA inquiries?
-Jason
Anyone with questions related to SLA credits or similar will need to contact our sales team at sales and you should get a response within 48 hours of making your submission. We are still investigating everything that occurred and will need 24 hours before anyone starts getting replies to requests that have already been submitted.
Thanks,
Jeremy
What is LT’s SLA and how are credits computed? Thanks, Joe
Joe,
Please contact sales at layeredtech.com and they can provide you a PDF copy of our SLA and the credits that are issued depending on the length of downtime experienced.
Thanks,
Jeremy
So I assume this whole routing issue is fixed now??? Or what? Can you guys post a notice saying whether it’s fixed or still ongoing?
I am still getting this:
72.232.50.115 is from United States(US) in region North America
TraceRoute to 72.232.50.115 [ns1.wbox2007.com]
Hop (ms) (ms) (ms) IP Address Host name
1 0 0 0 66.98.244.1 gphou-66-98-244-1.ev1servers.net
2 0 1 1 66.98.241.16 gphou-66-98-241-16.ev1servers.net
3 0 0 0 66.98.240.12 gphou-66-98-240-12.ev1servers.net
4 1 1 1 129.250.11.129 ge-1-11.r03.hstntx01.us.bb.gin.ntt.net
5 2 2 1 129.250.2.228 xe-0-1-0.r20.hstntx01.us.bb.gin.ntt.net
6 7 6 7 129.250.3.129 as-0.r20.dllstx09.us.bb.gin.ntt.net
7 148 7 207 129.250.2.154 po-1.r02.dllstx09.us.bb.gin.ntt.net
8 7 7 6 208.175.175.81 aer1-ge-5-5.dallasequinix.savvis.net
9 7 7 7 204.70.194.9 dpr1-ge-0-3-0.dallasequinix.savvis.net
10 Timed out 7 8 204.70.194.66 cr1-pos0-3-3-1.dallas.savvis.net
11 8 8 8 204.70.193.26 hr2-tengig12-1.dallas1.savvis.net
12 8 8 8 216.39.81.34 -
13 Timed out Timed out Timed out -
14 Timed out Timed out Timed out -
15 Timed out Timed out Timed out -
16 Timed out Timed out Timed out -
Trace aborted.
And I cannot access
Folks
If there are any open issues, please document your requests in a support ticket. Once the support ticket(s) are created our network team is able to address your needs quickly and directly. The forums are an excellent method of communication, however the support tickets are monitored and audited more closely. If anyone has an issue creating a ticket or other need, please feel free to contact me via email and I will assist you.
Thanks for all your patience and understanding
Miguel Richards
Manager, Data Center Operations
Layered Technologies Inc.
Mobile 214-878-4424
mrichards@layeredtechnologies.com
http://www.layeredtech.com
sales@layeredtech.com