Achieving high performance throughput in production networks

Les Cottrell, SLAC
Presented at the Internet 2 HENP Networking Working Group kickoff meeting, Ann Arbor, Michigan, Oct 26 '01.
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM); also supported by IUPAP.

High Speed Bulk Throughput

- Driven by data intensive science, e.g. data grids
- HENP data rates, e.g. BaBar 300 TB/year, collection doubling yearly, i.e. PBytes in a couple of years
- Data rate from the experiment today: 20 MBytes/s, 200 GBytes/day
- Multiple regional computer centers (e.g. Lyon-FR, RAL-UK, INFN-IT, LBNL-CA, LLNL-CA, Caltech-CA) need copies of the data
- Tier A gets 1/3 of the data in 1/3 year (at full rate); SLAC does not keep a copy
- A Boeing 747 gives high throughput, BUT poor latency (~2 weeks) and is very people intensive
- So we need high-speed networks and the ability to utilize them
- High speed today = a few hundred GBytes/day (100 GB/day is roughly 10 Mbits/s)
- Data volume growth ~ Moore's law

How to measure network throughput

- Selected about 2 dozen major collaborator sites in US, CA, JP, FR, CH, IT, UK over the last year: sites of interest to SLAC where we can get logon accounts
- Use iperf:
  - Choose the window size and number of parallel streams
  - Run for 10 seconds together with ping (loaded)
  - Stop iperf, run ping (unloaded) for 10 seconds
  - Change the window or number of streams and repeat (a scripted sketch of this loop follows below)
- Record streams, window, throughput (Mbits/s), loaded and unloaded ping responses, cpu utilization, real time
- Verify window sizes are set properly by using tcpdump: can't believe what the application tells you
- Note cpu speeds, interface speeds, operating system, path characteristics
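The measurement loop just described is easy to script. Below is a minimal sketch in Python of the procedure, with a hypothetical host name and placeholder window/stream values; it assumes a classic iperf server is running on the remote host and a Unix ping is available, and it leaves out parsing and recording of the outputs.

```python
import itertools
import subprocess

HOST = "iperf.example.org"        # hypothetical remote host running an iperf server
WINDOWS = ["64K", "256K", "1M"]   # TCP window sizes to try (placeholders)
STREAMS = [1, 2, 4, 8, 16]        # numbers of parallel streams to try

def run(cmd):
    """Run a command to completion and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for window, streams in itertools.product(WINDOWS, STREAMS):
    # 10 s iperf transfer with a concurrent ("loaded") ping
    ping = subprocess.Popen(["ping", "-c", "10", HOST],
                            stdout=subprocess.PIPE, text=True)
    iperf_out = run(["iperf", "-c", HOST, "-w", window,
                     "-P", str(streams), "-t", "10"])
    loaded_ping = ping.communicate()[0]

    # iperf has stopped: take 10 "unloaded" ping samples
    unloaded_ping = run(["ping", "-c", "10", HOST])

    # Record streams, window and the raw outputs for later analysis
    # (remember to verify the actual window size with tcpdump).
    print(window, streams, iperf_out, loaded_ping, unloaded_ping)
```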
Typical results

[Figure: iperf throughput vs. number of parallel streams for window sizes of 8 kB, 16 kB, 32 kB, 64 kB and 100 kB, with the Solaris default window size marked; these runs broke the 100 Mbps trans-Atlantic barrier]
- High throughput today usually = big windows and multiple streams
- Throughput improves linearly with streams for small windows

Windows vs Streams

- Often, for a fixed streams*window product, streams are more effective than window size, e.g. SLAC-CERN, Jul '01:

Window  | Streams | Mbps | Window*Streams
64 KB   | 25      | 68   | 1600 KB
128 KB  | 12      | 56   | 1536 KB
256 KB  | 6       | 66   | 1536 KB
512 KB  | 3       | 62   | 1536 KB
1024 KB | 2       | 26   | 2048 KB
2048 KB | 1       | 17   | 2048 KB
4096 KB | 1       | 25   | 4096 KB

- There is an optimum number of streams, above which performance flattens out
- It is common for throughputs to be asymmetric: more congestion one way, different routes, host dependencies

Windows vs Streams (cont.)

- Multiple streams are often more effective than big windows: more agile in the face of congestion
- Often easier to set up:
  - Need root to configure the kernel to set the max window
  - Network components may not support big windows
  - Some OSes treat max windows strangely
- May be able to take advantage of multiple paths
- But: may be considered over-aggressive (RFC 2914); can take more cpu cycles; how to know how many streams?

Iperf client CPU utilization

- As expected, it increases with throughput (mainly kernel): ~0.7 MHz per Mbit/s
- For a fixed throughput, fewer streams take less cpu, e.g. 1-4 streams take 20% less cpu than 8-16 streams for the same throughput (if you can get it)

Throughput quality improvements

[Figure: throughput quality improvements over time, bounded by TCP BW < MSS/(RTT*sqrt(loss)); examples include TCP bandwidth to IN2P3 and 340 Mbps Caltech-SLAC with GE]

Network Simulator (ns-2)

- From UCB; simulates the network: choice of TCP stack (Reno, Tahoe, Vegas, SACK), RTT, bandwidth, flows, windows, queue lengths
- Compare with measured results: agrees well, and confirms observations (e.g. linear growth in throughput for small window sizes as the number of flows increases)

Agreement of ns-2 with observed

[Figure: ns-2 predictions vs. observed throughputs]

ns-2 throughput and loss predictions

- Indicates that on an unloaded link one can get ~70% of the available bandwidth without causing noticeable packet loss
- Can get over 80-90% of the available bandwidth
- Can overdrive: no extra throughput, BUT extra loss

Simulator benefits

- No traffic on the network (NB: real throughput tests can use ~90% of the link)
- Can do what-if experiments
- No need to install iperf servers or have accounts
- No need to configure hosts to allow large windows
BUT:
- Need to estimate simulator parameters, e.g. RTT (use ping or synack) and bandwidth (use pchar, pipechar etc.; moderately accurate)
- AND it's not the real thing: need to validate against observed data, and need to simulate cross-traffic etc.

Impact on others

- Make ping measurements with and without iperf loading: compare loss and RTT, loaded vs. unloaded
- Looking at how to avoid the impact: e.g. QBSS/LBE, application pacing, a control loop on stdev(RTT) that reduces streams; want to avoid scheduling

File transfer

- Used bbcp (written by Andy Hanushevsky): similar methodology to iperf, except it runs for the file length rather than a fixed time; provides incremental throughput reports; supports /dev/zero; a duration option is being added
- Looked at /afs/, /tmp/, /dev/null; checked different file sizes
- Behavior with windows and streams is similar to iperf
- Thru_bbcp ~ 0.8 * Thru_iperf for modest throughputs (i.e. somewhat lower than for iperf)

Application rate-limiting

- bbcp has transfer rate limiting; one could feed network information (e.g. from Web100 or independent pinging) to bbcp to reduce/increase its transfer rate, or change its number of parallel streams
[Figure: no rate limiting vs. 15 MB/s rate limiting, each with 64 KB window and 32 streams]

Using bbcp to make QBSS measurements

- Run bbcp with source data /dev/zero and destination /dev/null, reporting throughput at 1 second intervals, with TOS=32 (QBSS)
- After 20 s, run bbcp with no TOS bits specified (BE)
- After 20 s, run bbcp with TOS=40 (Priority)
- After 20 more seconds, turn off Priority
- After 20 more seconds, turn off BE

QBSS test bed with Cisco 7200s

- Set up a QBSS testbed: configure router interfaces for 3 traffic types (QBSS, BE, Priority)
- Define a policy, e.g. QBSS 1%, Priority 30%, and apply it to the router interface queues
[Figure: testbed topology of Cisco 7200s linked at 10 Mbps, 100 Mbps and 1 Gbps]

Example of effects

[Figure: measured throughput of the QBSS, BE and Priority flows as each is turned on and off]
- Also tried: 1 stream for all, and Priority at 30%

QBSS with Cisco 6500

- 6500s + Policy Feature Card (PFC): routing by the PFC2, policing on the switch interfaces
- 2 queues, 2 thresholds each
- QBSS assigned to its own queue with 5% of the bandwidth: guarantees QBSS gets something
- BE and Priority traffic in the 2nd queue with 95% of the bandwidth
- Apply an ACL to the switch port to police Priority traffic to 30%
- Loaded RTT up to a factor of 10 larger than unloaded
- If both ping and iperf have QoS=Priority, the ping RTT is very variable since iperf is limited to 30%: RTT is quick when iperf is being limited, long when iperf transmits

Possible HEP usage

- Apply Priority to low volume, interactive voice/video-conferencing and real time control
- Apply QBSS to high volume data replication; leave the rest as Best Effort
- Since 40-65% of the bytes to/from SLAC come from a single application, we have modified it to enable setting of the TOS bits (a sketch of the mechanism follows below)
- Need to identify the bottlenecks and implement QBSS there; bottlenecks tend to be at the edges, so we hope to try this with a few HEP sites
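As an illustration of the TOS marking used in these tests (TOS=32 for QBSS, TOS=40 for Priority, BE unmarked), the sketch below shows the standard setsockopt mechanism an application can use; the destination host and port are placeholders, and this is not the actual bbcp modification.

```python
import socket

QBSS = 32       # TOS byte for scavenger (QBSS/LBE) traffic
PRIORITY = 40   # TOS byte for Priority traffic; Best Effort is the default (0)

def tos_socket(tos):
    """Return a TCP socket whose outgoing IP packets carry the given TOS byte."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return s

# Hypothetical bulk-replication transfer marked as QBSS, so it yields to
# Best Effort and Priority traffic at a QBSS-enabled bottleneck.
s = tos_socket(QBSS)
s.connect(("dst.example.org", 5001))   # placeholder destination
s.sendall(b"\x00" * 65536)             # emulate bbcp reading from /dev/zero
s.close()
```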
Acknowledgements for SC2001

Many people assisted in getting accounts, setting up servers, providing advice, software etc.: Suresh Man Singh, Harvey Newman, Julian Bunn (Caltech), Andy Hanushevsky, Paola Grosso, Gary Buhrmaster, Connie Logg (SLAC), Olivier Martin (CERN), Loric Totay, Jerome Bernier (IN2P3), Dantong Yu (BNL), Robin Tasker, Paul Kummer (DL), John Gordon (RL), Brian Tierney, Bob Jacobsen (LBL), Stanislav Shalunov (Internet 2), Joe Izen (UT Dallas), Linda Winkler, Bill Allcock (ANL), Ruth Pordes, Frank Nagy (FNAL), Emanuele Leonardi (INFN), Chip Watson (JLab), Yukio Karita (KEK), Tom Dunigan (ORNL), Andrew Daviel (TRIUMF), Paul Avery, Greg Goddard (UFL), Paul Barford, Miron Livny (UWisc), Shane Canon (NERSC), Andy Germain (NASA), Richard Baraniuk, Rolf Riedi (Rice).

SC2001 demo

- Send data from the SLAC/FNAL booth computers (emulating a tier 0 or 1 HENP site) to over 20 other sites with good connections in about 6 countries
- Throughputs from SLAC range from 3 Mbps to 300 Mbps
- Part of the bandwidth challenge proposal: saturate the 2 Gbps connection to the floor network
- Apply QBSS to some sites, Priority to a few, and leave the rest Best Effort
- See how QBSS works at high speeds: competing bulk throughput streams, plus interactive low throughput streams (look at the RTT with ping)

WAN throughput conclusions

- High FTP performance across WAN links is possible
- Even with a 20-30 Mbps bottleneck, one can do 100 GBytes/day
- Can easily saturate a fast Ethernet interface over the WAN
- Need GE NICs and >OC3 WANs to improve performance
- Performance is improving
- The OS must support big windows selectable by the application
- Need multiple parallel streams in some cases
- Loss is important, in particular the interval between losses
- Can get close to the max throughput with a small (<=32 MByte) window given sufficient (5-10) streams
- Improvements of 5 to 60 times in throughput by using multiple streams and larger windows (see the worked example at the end)
- It impacts other users; QBSS looks hopeful

More information

- IEPM/PingER home site: www-iepm.slac.stanford.edu/
- Bulk throughput site
- Transfer tools: slac.stanford.edu/abh/bbcp/, hep rl.ac.uk/Adye/talks/010402-ftp/html/sld015.htm
- TCP tuning
- QBSS measurements
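To make the window/stream arithmetic behind these conclusions concrete, here is a small worked example using the well-known Mathis et al. bound, TCP BW < MSS/(RTT*sqrt(loss)), together with the bandwidth*delay product; the RTT, loss rate, MSS and link speed below are illustrative assumptions, not measurements from this talk.

```python
from math import sqrt

# Illustrative path parameters (assumptions, not measurements from the talk)
rtt = 0.170        # round-trip time in seconds, e.g. a trans-Atlantic path
loss = 0.001       # packet loss probability
mss = 1460         # TCP maximum segment size in bytes
link = 100e6       # bottleneck capacity in bits/s

# Window needed to fill the pipe: bandwidth * delay
bdp = link / 8 * rtt
print(f"Bandwidth*delay window: {bdp / 1024:.0f} KBytes")

# Mathis et al. loss-limited bound for a single TCP stream:
# BW < MSS / (RTT * sqrt(loss))
per_stream = 8 * mss / (rtt * sqrt(loss))
print(f"Per-stream loss-limited BW: {per_stream / 1e6:.1f} Mbits/s")

# n parallel streams behave roughly like one stream with n times the
# window, which is why the streams*window product matters (cf. the
# SLAC-CERN table): aggregate BW grows ~linearly until the link fills.
for n in (1, 2, 4, 8, 16):
    agg = min(link, n * per_stream)
    print(f"{n:2d} streams: ~{agg / 1e6:.0f} Mbits/s")
```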