Jon Lewis's Blog

Mon, 11 Jul 2022

impossible circuit

I'm preserving bits of this thread I started on the NANOG mailing list back in 2008 in case the old archives ever disappear:

From jlewis@lewis.org Sun Aug 10 23:15:47 2008 -0400
Date: Sun, 10 Aug 2008 23:15:47 -0400 (EDT)
From: Jon Lewis 
To: nanog@nanog.org
Subject: impossible circuit
Message-ID: 
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Content-Length: 7326
After all the messages recently about how to fix DNS, I was seriously tempted to title this message "And now, for something completely different", but impossible circuit is more descriptive.

Before you read further, I need everyone to put on their thinking WAY outside the box hats. I've heard from enough people already that I'm nuts and what I'm seeing can't happen, so it must not be happening...even though we see the results of it happening.

I've got this private line DS3. It connects cisco 7206 routers in Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).

According to the DLR, it's a real circuit, various portions of it ride varying sized OC circuits, and then it's handed off to us at each end the usual way (copper/coax) and plugged into PA-2T3 cards.

Last Tuesday, at about 2:30PM, "something bad happened." We saw a serious jump in traffic to Ocala, and in particular we noticed one customer's connection (a group of load sharing T1s) was just totally full. We quickly assumed it was a DDoS aimed at that customer, but looking at the traffic, we couldn't pinpoint anything that wasn't an expected flow.

Then we noticed the really weird stuff. Pings to anything in Ocala responded with multiple dupes and ttl exceeded messages from a Level3 IP. Traceroutes to certain IPs in Ocala would get as far as our Ocala router, then inexplicably hop onto Sprintlink's network, come back to us over our Level3 transit connection, get to Ocala, then hop over to Sprintlink again, repeating that loop as many times as max TTL would permit. Pings from router to router crossing just the DS3 would work, but we'd see 10 duplicate packets for every 1 expected packet. BTW, the cisco CLI hides dupes unless you turn on ip icmp debugging.

I've seen some sort of similar things (though contained within an AS) with MPLS and routing misconfigurations, but traffic jumping off our network (to a network to which we're not directly connected) was seemingly impossible. We did all sorts of things to troubleshoot it (studied our router configs in rancid, temporarily shut every interface on the Ocala side other than the DS3, changed IOS versions, changed out the hardware, opened a ticket with cisco TAC), but then it occurred to me that if traffic was actually jumping off our network and coming back in via Level3, I could see/block at least some of that using an ACL on our interface to Level3. How do you explain it, when you ping the remote end of a DS3 interface with a single echo request packet and see 5 copies of that echo request arrive at one of your transit provider interfaces?

Here's a typical traceroute with the first few hops (from my home internet connection) removed. BTW, hop 9 is a customer router conveniently configured with no ip unreachables.
  7  andc-br-3-f2-0.atlantic.net (209.208.9.138)  47.951 ms  56.096 ms  56.154 ms
  8  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  56.199 ms  56.320 ms  56.196 ms
  9  * * *
10  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  80.774 ms  81.030 ms  81.821 ms
11  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  75.731 ms  75.902 ms  77.128 ms
12  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  46.548 ms  53.200 ms  45.736 ms
13  vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.918 ms vlan79.csw2.Washington1.Level3.net (4.68.17.126)  55.438 ms vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.693 ms
14  ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137)  48.935 ms ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  49.317 ms ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141)  48.865 ms
15  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  59.642 ms  56.278 ms  56.671 ms
16  ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2)  47.401 ms  62.980 ms  62.640 ms
17  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  40.300 ms  40.101 ms  42.690 ms
18  ae-6-6.car1.Orlando1.Level3.net (4.69.133.77)  40.959 ms  40.963 ms  41.016 ms
19  unknown.Level3.net (63.209.98.66)  246.744 ms  240.826 ms  239.758 ms
20  andc-br-3-f2-0.atlantic.net (209.208.9.138)  39.725 ms  37.751 ms  42.262 ms
21  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  43.524 ms  45.844 ms  43.392 ms
22  * * *
23  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  63.752 ms  61.648 ms  60.839 ms
24  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  66.923 ms  65.258 ms  70.609 ms
25  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  67.106 ms  93.415 ms  73.932 ms
26  vlan99.csw4.Washington1.Level3.net (4.68.17.254)  88.919 ms  75.306 ms vlan79.csw2.Washington1.Level3.net (4.68.17.126)  75.048 ms
27  ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  69.508 ms  68.401 ms ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133)  79.128 ms
28  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  64.048 ms  67.764 ms  67.704 ms
29  ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18)  68.372 ms  67.025 ms  68.162 ms
30  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  65.112 ms  65.584 ms  65.525 ms
Our circuit provider's support people have basically just maintained that this behavior isn't possible and so there's nothing they can do about it. i.e. that the problem has to be something other than the circuit.

I got tired of talking to their brick wall, so I contacted Sprint and was able to confirm with them that the traffic in question really was inexplicably appearing on their network...and not terribly close geographically to the Orlando/Ocala areas.

So, I have a circuit that's bleeding duplicate packets onto an unrelated IP network, a circuit provider who's got their head in the sand and keeps telling me "this can't happen, we can't help you", and customers who were getting tired of receiving all their packets in triplicate (or more) saturating their connections and confusing their applications. After a while, I had to give up on finding the problem and focus on just making it stop. After trying a couple of things, the solution I found was to change the encapsulation we use at each end of the DS3. I haven't gotten confirmation of this from Sprint, but I assume they're now seeing massive input errors on the one or more circuits where our packets were/are appearing. The important thing (for me) is that this makes the packets invalid to Sprint's routers and so it keeps them from forwarding the packets to us. Cisco TAC finally got back to us the day after I "fixed" the circuit...but since it was obviously not a problem with our cisco gear, I haven't pursued it with them.

The only things I can think of that might be the cause are misconfiguration in a DACS/mux somewhere along the circuit path or perhaps a mishandled lawful intercept. I don't have enough experience with either or enough access to the systems that provide the circuit to do any more than speculate. Has anyone else ever seen anything like this?

If someone from Level3 transport can wrap their head around this, I'd love to know what's really going on...but at least it's no longer an urgent problem for me.
----------------------------------------------------------------------
  Jon Lewis                   |  I route
  Senior Network Engineer     |  therefore you are
  Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

Re: impossible circuit
From: Paul Wall
Date: Mon Aug 18 16:48:24 2008
Jon,

I think we can safely conclude from the information provided that you're looking at some sort of misconfigured traffic mirroring or [un]lawful intercept.

Sadly, as neither Sprint nor your loop provider will fess up, I don't think you're going to get much further on here.

Probably best to order a new loop and cancel the existing one.

Drive Slow, Paul
- Original message -
I just went ahead and "re-broke" the circuit ...

On 8/17/08, Jon Lewis wrote:
> On Tue, 12 Aug 2008, Jon Lewis wrote:
>
>>> What would happen if you pinged the Ocala router such that the TTL was 1
>>> when travelling over the DS3? From your traceroute it seems it travelled
>>> two IP hops that did not send ICMP error messages, but it might just be
>>> that the ICMP errors from the Ocala router are arriving first.
>>
>> Based on where the dupes are coming from, I assume pinging across the DS3
>> with TTL tuned to expire at the Ocala side would result in TTL exceeded
>> messages from both Ocala and the Sprint router where the packets are
>> injected into Sprint's network. It doesn't look as if IOS gives the option
>> to set TTL on ping...so I'd try this from a Linux machine in our data center.
>
> I just went ahead and "re-broke" the circuit for a bit by turning it back
> to hdlc to see if the issue is still there and to run some additional
> tests. Someone is still cross connecting our Orlando->Ocala traffic over
> to Sprint.
>
> I did your suggested ping with short TTL and the result was close to what
> I expected.
>
> $ traceroute ocalflxa-br-1
> traceroute to ocalflxa-br-1.atlantic.net (209.208.6.229), 30 hops max, 38 byte packets
>  1  209.208.25.165 (209.208.25.165)  0.539 ms  0.426 ms  0.388 ms
>  2  69.28.72.162 (69.28.72.162)  0.246 ms  0.351 ms  0.223 ms
>  3  andc-br-3-f2-0 (209.208.9.138)  0.559 ms  0.435 ms  0.471 ms
>  4  ocalflxa-br-1-s1-0 (209.208.112.98)  2.735 ms  *  2.656 ms
>
> So, I need a TTL of 4 to get there from this machine.
>
> $ ping -t4 ocalflxa-br-1
> PING ocalflxa-br-1.atlantic.net (209.208.6.229) 56(84) bytes of data.
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=0 ttl=252 time=2.68 ms
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=1 ttl=252 time=2.72 ms
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=2 ttl=252 time=2.88 ms
>
> Decrease ttl by one, and I get the expected ttl exceeded from the Orlando
> side of the circuit.
>
> $ ping -t 3 ocalflxa-br-1
> PING ocalflxa-br-1.atlantic.net (209.208.6.229) 56(84) bytes of data.
> From andc-br-3-f2-0.atlantic.net (209.208.9.138) icmp_seq=0 Time to live exceeded
>
> Now, here's a mild surprise. You'll notice that in the above -t4 trace, I
> didn't hear back from Sprint.
>
> $ ping -t 5 ocalflxa-br-1
> PING ocalflxa-br-1.atlantic.net (209.208.6.229) 56(84) bytes of data.
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=0 ttl=252 time=2.89 ms
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=1 ttl=252 time=3.10 ms
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=2 ttl=252 time=2.97 ms
> hmm...still no ttl exceeded from Sprint?
>
> $ ping -t 6 ocalflxa-br-1
> PING ocalflxa-br-1.atlantic.net (209.208.6.229) 56(84) bytes of data.
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=0 ttl=252 time=2.95 ms
> From sl-crs2-dc-0-5-3-0.sprintlink.net (144.232.19.93) icmp_seq=0 Time to live exceeded
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=1 ttl=252 time=2.78 ms
> From sl-crs2-dc-0-5-3-0.sprintlink.net (144.232.19.93) icmp_seq=1 Time to live exceeded
>
> $ ping -t 7 ocalflxa-br-1
> PING ocalflxa-br-1.atlantic.net (209.208.6.229) 56(84) bytes of data.
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=0 ttl=252 time=2.88 ms
> From sl-st20-ash-9-0-0.sprintlink.net (144.232.18.228) icmp_seq=0 Time to live exceeded
> 64 bytes from ocalflxa-br-1.atlantic.net (209.208.6.229): icmp_seq=1 ttl=252 time=2.84 ms
> From sl-st20-ash-9-0-0.sprintlink.net (144.232.18.228) icmp_seq=1 Time to live exceeded
>
> Is it just coincidence that there are 2 private IP hops in some
> traceroutes between us and Sprint? i.e. Look at this trace from cogent:
>
> Tracing the route to 209.208.33.1
>
>  1 fa0-8.na01.b005944-0.dca01.atlas.cogentco.com (66.250.56.189) 0 msec 4 msec 4 msec
>  2 gi3-9.3507.core01.dca01.atlas.cogentco.com (66.28.67.225) 160 msec 4 msec 8 msec
>  3 te3-1.ccr02.dca01.atlas.cogentco.com (154.54.3.158) 0 msec 0 msec 4 msec
>  4 vl3493.mpd01.dca02.atlas.cogentco.com (154.54.7.230) 28 msec 4 msec
>    te4-1.mpd01.dca02.atlas.cogentco.com (154.54.2.182) 52 msec
>  5 vl3494.mpd01.iad01.atlas.cogentco.com (154.54.5.42) 4 msec 4 msec
>    vl3497.mpd01.iad01.atlas.cogentco.com (154.54.5.66) 4 msec
>  6 timewarner.iad01.atlas.cogentco.com (154.54.13.250) 4 msec
>    peer-01-ge-3-1-2-13.asbn.twtelecom.net (66.192.252.217) 4 msec 12 msec
>  7 66-194-200-202.static.twtelecom.net (66.194.200.202) 28 msec 28 msec 32 msec
>  8 66-194-200-202.static.twtelecom.net (66.194.200.202) 32 msec 32 msec 28 msec
>  9 andc-br-3-f2-0.atlantic.net (209.208.9.138) 32 msec 32 msec 32 msec
> 10 172.22.122.1 32 msec 32 msec 32 msec
> 11 10.247.28.205 32 msec 32 msec 32 msec
> 12 sl-crs2-dc-0-5-3-0.sprintlink.net (144.232.19.93) 32 msec 32 msec 28 msec
> 13 sl-st20-ash-9-0-0.sprintlink.net (144.232.18.228) 28 msec 32 msec 32 msec
> 14 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 32 msec 32 msec 28 msec
> 15 vlan79.csw2.Washington1.Level3.net (4.68.17.126) 28 msec
>    vlan89.csw3.Washington1.Level3.net (4.68.17.190) 32 msec
>    vlan79.csw2.Washington1.Level3.net (4.68.17.126) 40 msec
> 16 ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137) 28 msec
>    ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129) 28 msec
>    ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133) 32 msec
> 17 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 48 msec 48 msec 56 msec
> 18 ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2) 44 msec 48 msec
>    ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 52 msec
> 19 ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149) 56 msec 104 msec 56 msec
> 20 ae-6-6.car1.Orlando1.Level3.net (4.69.133.77) 52 msec 52 msec 56 msec
> 21 unknown.Level3.net (63.209.98.66) 52 msec 52 msec 56 msec
> 22 andc-br-3-f2-0.atlantic.net (209.208.9.138) 52 msec 52 msec 56 msec
> 23 172.22.122.1 52 msec 56 msec 52 msec
> 24 10.247.28.205 52 msec 52 msec 56 msec
> 25 sl-crs2-dc-0-5-3-0.sprintlink.net (144.232.19.93) 52 msec 56 msec 52 msec
> 26 sl-st20-ash-9-0-0.sprintlink.net (144.232.18.228) 56 msec 56 msec 56 msec
> 27 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 52 msec 52 msec 52 msec
> 28 vlan99.csw4.Washington1.Level3.net (4.68.17.254) 52 msec
>    vlan69.csw1.Washington1.Level3.net (4.68.17.62) 56 msec
>    vlan89.csw3.Washington1.Level3.net (4.68.17.190) 56 msec
> 29 ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133) 64 msec
>    ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129) 52 msec 56 msec
> 30 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 76 msec 72 msec 72 msec
>
> I've seen the 172.22.122.1 & 10.247.28.205 hops before. They occasionally
> show up in traces when the traffic is jumping over to Sprint. Sometimes
> they don't show up though. i.e. Tracing from my house:
>
> traceroute to 209.208.33.1 (209.208.33.1), 30 hops max, 40 byte packets
>  1 172.31.0.1 (172.31.0.1) 0.336 ms 0.272 ms 0.268 ms
>  2 10.210.160.1 (10.210.160.1) 10.109 ms 11.719 ms 14.265 ms
>  3 gig7-0-4-101.orldflaabv-rtr1.cfl.rr.com (24.95.232.100) 15.302 ms 15.324 ms 16.687 ms
>  4 198.228.95.24.cfl.res.rr.com (24.95.228.198) 16.688 ms 18.812 ms 18.816 ms
>  5 te-3-3.car1.Orlando1.Level3.net (4.79.116.145) 20.084 ms 19.946 ms
>    te-3-1.car1.Orlando1.Level3.net (4.79.116.137) 21.328 ms
>  6 unknown.Level3.net (63.209.98.66) 19.900 ms 14.714 ms 14.689 ms
>  7 andc-br-3-f2-0.atlantic.net (209.208.9.138) 104.058 ms 11.932 ms 13.584 ms
>  8 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 15.872 ms 15.886 ms 17.238 ms
>  9 * * *
> 10 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 41.277 ms 41.964 ms 41.955 ms
> 11 sl-st20-ash-10-0.sprintlink.net (144.232.20.152) 43.360 ms 44.578 ms 35.635 ms
> 12 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 37.035 ms 37.062 ms 33.185 ms
> 13 vlan89.csw3.Washington1.Level3.net (4.68.17.190) 44.060 ms 44.057 ms
>    vlan99.csw4.Washington1.Level3.net (4.68.17.254) 39.603 ms
> 14 ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137) 38.123 ms
>    ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141) 39.546 ms
>    ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133) 38.115 ms
> 15 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 46.284 ms 46.275 ms 46.274 ms
> 16 ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 52.523 ms
>    ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2) 53.338 ms
>    ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 53.299 ms
> 17 ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149) 34.964 ms 39.582 ms 38.088 ms
> 18 ae-6-6.car1.Orlando1.Level3.net (4.69.133.77) 36.701 ms 38.144 ms 36.949 ms
> 19 unknown.Level3.net (63.209.98.66) 36.902 ms 37.750 ms 37.717 ms
> 20 andc-br-3-f2-0.atlantic.net (209.208.9.138) 37.729 ms 35.812 ms 35.048 ms
> 21 ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98) 37.485 ms 37.601 ms 36.495 ms
> 22 * * *
> 23 sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174) 56.459 ms 56.449 ms 57.709 ms
> 24 sl-st20-ash-10-0.sprintlink.net (144.232.20.152) 57.694 ms 57.692 ms 60.243 ms
> 25 te-10-1-0.edge2.Washington4.level3.net (4.68.63.209) 103.257 ms 100.829 ms 82.571 ms
> 26 vlan99.csw4.Washington1.Level3.net (4.68.17.254) 70.401 ms
>    vlan89.csw3.Washington1.Level3.net (4.68.17.190) 69.262 ms
>    vlan99.csw4.Washington1.Level3.net (4.68.17.254) 82.700 ms
> 27 ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137) 74.132 ms
>    ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129) 74.135 ms
>    ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137) 75.540 ms
> 28 ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85) 58.656 ms 60.838 ms 54.346 ms
> 29 ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 59.323 ms
>    ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2) 59.336 ms
>    ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18) 63.323 ms
> 30 ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149) 127.652 ms 57.884 ms 57.851 ms
>
> From the traces I've seen, it seems if the first Sprint hop is sl-bb20-dc,
> the private IP hops don't show up. If the first Sprint hop is sl-crs2-dc,
> then the private IP hops are there. I wonder if anyone from Sprint can
> shed some light on that?
>
> Unfortunately, the Sprint engineer I initially made contact with who was
> helpful and seemed curious about the issue seems to have vanished and
> isn't returning my calls or emails. Anyone else from Sprintlink care to
> play?
>
> ----------------------------------------------------------------------
>   Jon Lewis                   |  I route
>   Senior Network Engineer     |  therefore you are
>   Atlantic Net                |
> _________ http://www.lewis.org/~jlewis/pgp for PGP public key_________

-- Sent from Gmail for mobile | mobile.google.com

[/internet/routing] permanent link

Thu, 28 Mar 2013

BCP 38 - Or why all edge networks need some form of valid prefix filtering

The Internet has been dealing with amplification attacks dependent on source address spoofing at least as far back as the mid to late 1990s. Smurf attacks were the first such attack to which I had any exposure. In a smurf attack, the attacker would send large numbers of ping packets (icmp echo requests) to the broadcast addresses on a large number of networks with the source IP address spoofed to be their DDoS target's IP. All of the hosts on these networks would receive the broadcast echo requests, and many or all would respond en masse, flooding the spoofed target IP with echo replies. It wasn't hard to fill a T1 with one of these attacks, and T1s were pretty common transit pipes for small to mid sized ISPs back in the 90s. Over time, most networks disabled directed broadcast, and Smurf attacks became relatively ineffective and went out of style.
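
For reference, killing the reflector side of smurf was a one-liner per LAN interface on cisco gear (and it became the default behavior as of IOS 12.0); the interface name here is just an example:

interface FastEthernet0/0
 no ip directed-broadcast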

Today, the hip method for DDoSing is the DNS amplification attack. In this attack, the attacker sends queries as fast as they can for some DNS record of large size to a large list of "open recursive DNS servers". These are DNS servers willing to answer recursive queries for anyone on the Internet. It's believed there are currently roughly 27 million such DNS servers. The queries are sent with the source address forged to be the DDoS target's IP. Amplification factors of >70x can be achieved with these attacks due to the difference in size between the DNS query and the response. This means for every 1 Gb/s of bandwidth available to the attacker, they can generate ~70 Gb/s of attack traffic. Using this type of attack, someone with comparatively little bandwidth can generate an attack large enough to overwhelm nearly any network.
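
To make that ratio concrete, the classic demonstration is an EDNS0 "ANY" query, something along these lines (the resolver here is a placeholder, and response sizes vary by zone and resolver):

$ dig +bufsize=4096 ANY isc.org @<open recursive server>

A query well under 100 bytes can pull back several kilobytes of answer, all of it delivered to whatever address was forged into the query's source field.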

The common thread here is that both Smurf and DNS amplification attacks wouldn't be possible if the attackers couldn't spoof their target's IP address.

BCP 38 was written thirteen years ago (May 2000) to encourage network operators to institute source address validity filters in their networks. i.e. if a packet enters your network from a customer, you should drop that packet if its source address is not known to be one of your customer's addresses. Router vendors have implemented such filtering as an automated feature, but not on all platforms. Cisco calls it Unicast Reverse Path Forwarding, or uRPF. With uRPF enabled on an interface, traffic entering via that interface is dropped if its source address is not covered by a route pointing to that interface. i.e. suppose you have a customer on an interface such as:

interface FastEthernet1/0
 ip address 10.0.0.1 255.255.255.0
 ip verify unicast source reachable-via rx allow-self-ping
...
ip route 192.168.0.0 255.255.255.0 10.0.0.2

With the above interface and route statement, only traffic received on FastEthernet1/0 with source addresses in 10.0.0.0/24 or 192.168.0.0/24 will be forwarded. Traffic arriving via FastEthernet1/0 with any other source address will be dropped.
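
If you want to confirm uRPF is enabled and actually dropping spoofed packets, recent IOS versions will show you both; roughly:

show ip interface FastEthernet1/0 | include verify
show ip traffic | include RPF

The first should echo the reachable-via setting on the interface; the second includes a running count of packets dropped for failing the unicast RPF check.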

Unfortunately, not all gear supports this feature. Some gear "supports it", but with such severe limitations as to make it unusable. This is likely why BCP38 has gained so little traction over the years. On gear where uRPF is not supported or not usable, you're left with having to write and maintain an input ACL for every customer interface. On a layer 3 switch with 48 customer ports, that's 48 ACLs, some multiple of 48 chances to screw up and cause customer outages, hundreds of additional lines of config, and no obvious benefit to you or your customers.

For these reasons, many networks have gone without BCP38-style filtering.

This can't be allowed to continue. Attackers have recently demonstrated, using DNS amplification, that attacks in the hundreds of Gb/s are possible. Presumably, they're only limited by the number of hacked or other high-bandwidth hosts they have access to, and their imagination. Spamhaus has been the recent high profile target. Next week, it could be the major stock exchanges or entire countries. It's time for all networks to take responsibility for their traffic and stop spoofed source address packets from making it out to the Internet.

If you haven't implemented BCP38 filtering because your gear doesn't do uRPF and maintaining an ACL for every customer is "too hard", "doesn't scale", or perhaps is more ACL/config than your gear can handle, consider this alternative solution.

On a typical small ISP / service provider / hosting provider network, if you were to ACL every customer, you might need hundreds or even thousands of ACLs. However, if you were to put output filters on your transit connections, permitting traffic sourced from all IP networks "valid" inside your network, you might find that all you need is a single ACL of a handful to several dozen entries. i.e. All your transit output ACL needs to list is all of your own IP space, plus any IP space belonging to your customers who have their own.
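
A minimal sketch of such an output ACL (the prefixes here are documentation space standing in for your own allocations and your BGP customers' blocks):

ip access-list extended TRANSIT-OUT
 permit ip 192.0.2.0 0.0.0.255 any
 permit ip 198.51.100.0 0.0.0.255 any
 deny ip any any log
!
interface GigabitEthernet0/0
 description Transit to upstream
 ip access-group TRANSIT-OUT out

The log keyword on the final deny is optional, but it's a handy way to spot compromised or misconfigured hosts on your network attempting to spoof.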

Having one ACL to maintain that only needs changing if you get a new IP allocation or add/remove a customer who has their own IPs really isn't all that difficult. As far as the rest of the internet is concerned, this solves the issue of spoofed IP packets leaving your network.

If you want to test your ISP to see if it allows spoofing, the MIT Spoofer Project has binaries for common OSes and source code that can be downloaded and run to test various classes of spoofing from a host.

[/internet/routing] permanent link

Sat, 05 Feb 2011

Black Hole Routing

A number of Tier-1/Tier-2 network service providers support a feature called real time black hole routing triggered via BGP. In simple terms, with providers that support this, you can advertise a route to your transit provider(s) that tells them "I'd like you to null route this instead of routing it to me." Why would this be useful? The most likely situation is that an IP on your network is being DDoS'd (Distributed Denial of Service attack) hard enough that it's congesting your transit pipe(s), causing increased latency and/or packet loss for all of your internet traffic.

The usual way to do this (or express any other sort of desired upstream routing policy to your transit provider) is via BGP communities. These are set in the output route-map for your eBGP peering with the provider. The following is an example of how you might set up a system to allow for easy creation/removal of real time black hole routes (on cisco gear).

First:

  1. Make sure your provider(s) support this.
  2. Look up what community strings they use for this.
  3. You'll probably need to contact each provider and make sure they're set up to receive /32 IPv4 routes from you. Assuming they do prefix filtering, they may not automatically be set up to accept such specific routes from you.

Using Level3 as the example, a search for Level3 BGP communities will turn up that 3356:9999 is Level3's customer accepted community for telling Level3 to discard traffic for the tagged route.

Now, you could edit your Level3 output route-map and BGP config (insert a network statement), and then add a static route every time you want to create a real time black hole route, but doing it that way is time-consuming and error-prone, and you may not want everyone who has enable access mucking around in your eBGP config. Instead, why not set things up so all you have to do is create a special static route anywhere on your network?

This config assumes you have separate routers for transit connections and for internal routing, there are config changes that will need to be done on each.

On the transit router that talks to Level3:

ip community-list standard BLACKHOLE permit <your-ASN>:9999
!
route-map LEVEL3-OUTPUT permit 5
 match community BLACKHOLE
 set community 3356:9999
This assumes that LEVEL3-OUTPUT is already configured as your output route-map for your Level3 peering session.

On your internal routing router(s):

ip access-list extended match32
 permit ip any host 255.255.255.255
!
route-map blackhole permit 10
 match ip address match32
 match tag 9999
 set community <your-ASN>:9999
!
router bgp <your-ASN>
 redistribute static route-map blackhole
 redistribute ospf 1 route-map blackhole

If you have much experience with BGP, you're hopefully saying to yourself "but redistribution, especially of the IGP, into BGP is really dangerous". Well, the blackhole route-map limits the redistribution to only those routes that are tagged 9999 and are /32s. This means if someone gets clumsy and creates a route for a shorter network with the tag 9999, the route-map will not match that route, and it won't be redistributed into BGP. So this setup won't let you accidentally real time blackhole an entire CIDR block. The reason for redistributing both OSPF (or you might use ISIS) and static is that the route can then be created either on this router (as a static route) or on any other device participating in your IGP.

Once these config changes have been made, all you need to do to real time black hole route an IP is log into the internal router or any router in your IGP and

ip route <IP to black hole> 255.255.255.255 null0 tag 9999
This will null route the IP in your network and tell your transit providers to stop sending traffic for the IP to you.
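
Tearing it back down is just the inverse, and before you do, you can sanity-check that the route picked up the internal blackhole community:

show ip bgp <IP to black hole>
no ip route <IP to black hole> 255.255.255.255 null0 tag 9999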

A really neat side effect of this setup is that you can real time black hole an IP without null routing it internally.

Suppose the IP you want to real time black hole is part of a customer's /28, and that /28 is configured as the IP on the customer's access port. i.e.

interface FastEthernet0/1
 ip address <customer IP network> 255.255.255.240

You can log into that device, and

ip route <IP to black hole> 255.255.255.255 FastEthernet0/1 tag 9999

Now, the IP is still routed to the customer, but because the route is tagged with 9999, and assuming your customer aggregation routers redistribute static into your IGP (which for me is OSPF), the route will be in your IGP with the tag intact. Your internal router will see this and redistribute it into BGP with the internally used real time black hole community, and your transit router(s) will tag the route with the appropriate community to have your transit provider(s) real time black hole route it. The IP is still reachable inside your ASN, but to the rest of the internet, it's dead, as your transit providers are null routing it.
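
For completeness, the assumption in that last paragraph is that each aggregation router carries its statics into OSPF with something like the below; route tags ride along into the external LSAs, so the 9999 tag survives the trip:

router ospf 1
 redistribute static subnets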

[/internet/routing] permanent link

Sat, 09 Feb 2008

One Million Routes

So, you just upgraded your cisco 6500/7600 gear to Sup720-3BXLs because that's the lowest end supervisor module with enough TCAM for full internet routes (>244k routes). You may have read in the data sheet that it's capable of "1,000,000 IPv4 routes; 500,000 IPv6 routes." That should be plenty of room for growth, right? Well, maybe not as much as you think.

Somewhere buried in the fine print (ok, I can't actually find it even in fine print or an * or anywhere on the data sheet) is the fact that it's an either/or thing. i.e. The 3BXL can do 1,000,000 IPv4 routes (and no IPv6 at all), or it can do 500,000 IPv6 routes (and no IPv4 at all). In a real world installation, neither of those configs is terribly useful. The default settings allow for 524,288 IPv4 routes and 262,144 IPv6 routes...meaning in its default config, with full internet routes, a Sup720-3BXL is already at nearly half its capacity of IPv4 routes. You can examine this (using recent IOS versions) with the command:
show platform hardware capacity

Look for the output section labeled "L3 Forwarding Resources". i.e.
L3 Forwarding Resources
             FIB TCAM usage:                     Total        Used       %Used
                  72 bits (IPv4, MPLS, EoM)     524288      230589         44%
                 144 bits (IP mcast, IPv6)      262144           5          1%
This can be tuned with the config command mls cef maximum-routes ip <N>, where <N> is the number, in thousands, of IPv4 routes you want to be able to handle. i.e. With "mls cef maximum-routes ip 750", the above output changes to:
L3 Forwarding Resources
             FIB TCAM usage:                     Total        Used       %Used
                  72 bits (IPv4, MPLS, EoM)     770048      230459         30%
                 144 bits (IP mcast, IPv6)      139264           5          1%

Such a split may make more sense, as it leaves more room for anticipated IPv4 routing table growth, and in a perfect world, we really shouldn't see much more than a single IPv6 prefix per ASN.
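
If you decide to retune, it's an ordinary global config change, with one caveat: per cisco's documentation, maximum-routes changes don't take effect until the box is reloaded.

conf t
mls cef maximum-routes ip 750
end
write memory
reload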

Note: The numbers above, from a set of Sup720-3BXLs in a lab environment, reflect slightly filtered BGP feeds. "Full routes" would be closer to 240,000 routes.

[/internet/routing] permanent link

Sat, 19 Jan 2008

RIR Minimums BGP prefix-list

I originally posted this BGP filter to a couple of mailing lists, most notably the NANOG list, back in September 2007.

http://www.merit.edu/mail.archives/nanog/2007-09/msg00103.html

The reason I put this filter together is that lots of big cisco routers, in particular the 6500/7600 series with anything less than the Sup720-3BXL, were on the verge of running out of space (TCAM in the 6500/7600 case) to hold routes due to continued growth of the global BGP routing table. A large part of this global routing table "growth" is actually gratuitous deaggregation by networks that either don't care or don't even realize what they're doing. Most networks can live without these "garbage routes", and since I maintain a couple of 6500/Sup2 routers, I started working on contingency plans in case we were unable to upgrade to Sup720-3BXLs before the global routing table plus our internal routes hit the magic number (244k) at which point the Sup2 starts doing "bad things".

It should be noted that because some of the really clue deficient networks announce only the deaggregates of their CIDRs, using this filter may cause you to entirely lose routing information for such networks. Therefore, unless you're able to get away with that level of BOFHness ("fix your BGP if you want to talk to us"), I strongly suggest you add (if you don't already have them) one or more default routes to your various transit providers.
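
That safety net can be as simple as a static default pointing at each transit provider (the next hop below is a placeholder):

ip route 0.0.0.0 0.0.0.0 <transit provider next-hop IP>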

This BGP route filter is based largely on Barry Greene's work available from

ftp://ftp-eng.cisco.com/cons/isp/security/Ingress-Prefix-Filter-Templates/T-ip-prefix-filter-ingress-strict-check-v18.txt

While working on my version of ISP-Ingress-In-Strict, I noticed a bunch of inconsistencies between the expected RIR minimum allocations in Barry's ISP-Ingress-In-Strict and the data actually published by the various RIRs.

I've adjusted the appropriate entries and flipped things around so that, for each of the known RIR /8 or shorter prefixes, prefixes longer than the RIR specified minimums (or longer than /24, in cases where the RIR specifies minimums longer than /24!) are denied.

At the end of the prefix-list, any prefix /24 or shorter is allowed. The advantage to this setup is that known ranges are filtered on known RIR minimums. Anything omitted ends up being permitted as long as it's /24 or shorter.

If you currently use a distribute-list to filter incoming routes, you'll have to rewrite those rules in prefix-list format and merge them into the beginning of this prefix-list, as IOS (at least the versions I'm using) doesn't allow both an input prefix-list and input distribute-list on the same BGP peer.
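
Applying the finished list is the usual per-neighbor prefix-list config (the peer address is a placeholder):

router bgp <your-ASN>
 neighbor <transit provider peer IP> prefix-list ISP-Ingress-In-Strict in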

What follows is the latest version of what I originally posted to the NANOG list in September 2007.

-- jlewis lewis.org 20080118

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! APNIC http://www.apnic.net/db/min-alloc.html !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
ip prefix-list ISP-Ingress-In-Strict seq 4000 deny 58.0.0.0/8 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 4001 deny 59.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4002 deny 60.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4004 deny 116.0.0.0/6 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 4008 deny 120.0.0.0/6 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 4011 deny 124.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4013 deny 126.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4014 deny 202.0.0.0/7 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 4016 deny 210.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4018 permit 218.100.0.0/16 ge 17 le 24
ip prefix-list ISP-Ingress-In-Strict seq 4019 deny 218.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4021 deny 220.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 4023 deny 222.0.0.0/8 ge 21
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! ARIN http://www.arin.net/reference/ip_blocks.html#ipv4 !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
ip prefix-list ISP-Ingress-In-Strict seq 5000 deny 24.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5001 deny 63.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5002 deny 64.0.0.0/5 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5010 deny 72.0.0.0/6 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5014 deny 76.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5015 deny 96.0.0.0/6 ge 21
! these ge 25's are redundant, but left in for accounting purposes
ip prefix-list ISP-Ingress-In-Strict seq 5020 deny 198.0.0.0/7 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 5022 deny 204.0.0.0/7 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 5023 deny 206.0.0.0/7 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 5032 deny 208.0.0.0/8 ge 23
ip prefix-list ISP-Ingress-In-Strict seq 5033 deny 209.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 5034 deny 216.0.0.0/8 ge 21
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! RIPE NCC https://www.ripe.net/ripe/docs/ripe-ncc-managed-address-space.html !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
ip prefix-list ISP-Ingress-In-Strict seq 6000 deny 62.0.0.0/8 ge 20
ip prefix-list ISP-Ingress-In-Strict seq 6001 deny 77.0.0.0/8 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6002 deny 78.0.0.0/7 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6004 deny 80.0.0.0/7 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 6006 deny 82.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 6007 deny 83.0.0.0/8 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6008 deny 84.0.0.0/6 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6012 deny 88.0.0.0/7 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6014 deny 90.0.0.0/8 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6015 deny 91.0.0.0/8 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 6016 deny 92.0.0.0/6 ge 22
ip prefix-list ISP-Ingress-In-Strict seq 6020 deny 193.0.0.0/8 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 6021 deny 194.0.0.0/7 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 6023 deny 212.0.0.0/7 ge 20
ip prefix-list ISP-Ingress-In-Strict seq 6025 deny 217.0.0.0/8 ge 21
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! LACNIC http://lacnic.net/en/registro/index.html !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
ip prefix-list ISP-Ingress-In-Strict seq 7000 deny 189.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 7001 deny 190.0.0.0/8 ge 21
ip prefix-list ISP-Ingress-In-Strict seq 7002 deny 200.0.0.0/8 ge 25
ip prefix-list ISP-Ingress-In-Strict seq 7003 deny 201.0.0.0/8 ge 21
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! AFRINIC http://www.afrinic.net/index.htm !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
ip prefix-list ISP-Ingress-In-Strict seq 8000 deny 41.0.0.0/8 ge 23
ip prefix-list ISP-Ingress-In-Strict seq 8001 deny 196.0.0.0/8 ge 23
!
! Final "permit any any" statement.
! This allows all the original pre-RIR/RFC2050 allocations through.
! Additional filtering can be added if so desired.
!
!ip prefix-list ISP-Ingress-In-Strict seq 10100 deny 0.0.0.0/0 le 7
ip prefix-list ISP-Ingress-In-Strict seq 10200 permit 0.0.0.0/0 le 24

[/internet/routing] permanent link