Posted 2017-11-01 23:00:00 GMT
I attended the LISA17 conference in San Francisco. Here are some very rough and opinionated notes from talks that interested me. If I made any mistakes let me know and please comment on the things I missed!
This tutorial was exceptional because the speaker has years of experience as one of the top Linux developers. Ted uses emacs!
Goals of performance tuning
Investigate bottlenecks with a rigorous scientific method: change one parameter at a time. Take lots of notes and make haste slowly. Measurement tools can impact the behaviour of the observed application.
Basic tools to start with
Example filesystem benchmark: fs_mark -s 10240 -n 1000 -d /mnt - creates 1000 files of 10kB each in /mnt and does an fsync after each. Results can be greatly improved by relaxing sync guarantees. Use a benchmark that matches your application exactly!
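For intuition, here is a rough Python sketch of what such a benchmark does (this is not fs_mark itself, and /mnt is just the assumed target directory from the example above):

```python
import os, time

# Rough illustration (not fs_mark): create 1000 files of 10 KiB each,
# fsync'ing after every file, and report the rate. The fsync per file is
# what makes cache-flush cost dominate.
target_dir = "/mnt"          # assumed mount point, as in the fs_mark example
payload = b"\0" * 10240      # 10 KiB per file

start = time.time()
for i in range(1000):
    path = os.path.join(target_dir, f"bench_{i:04d}")
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # force the data (and a cache flush) to stable storage
elapsed = time.time() - start
print(f"{1000 / elapsed:.1f} files/sec")
```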
Cache flush commands are ridiculously expensive. Google runs in production with the journal disabled, as it is so much faster, and there is a higher-level consistency guarantee provided by multi-machine replication. This cross-machine cluster filesystem also means RAID is unnecessary.
Ted Ts'o made this snap script for low impact performance monitoring extraction from production systems.
In terms of selecting hardware: note that seek time is complicated and should typically be reported statistically - worst case and average. Low-numbered LBA offsets, at the outer diameter of the disk, can be much faster to seek. Therefore you can put your most-used partitions at the first offsets of the disk to get a lot more performance - this is called short-stroking and can be a cheap way to get more performance. Filesystem software moves slowly as it has to be reliable, whereas hardware generally moves much faster.
HDDs at 15000rpm run hot and use a lot of power; many applications that used those have moved to SSDs. SSDs can also use a lot of power. They tend to fail on write. Random writes can be very slow - 0.5s average, 2s worst case for 4k random writes. You can wear out an SSD by writing a lot. See Disks for Data Centers for advice about selecting hardware (Ted wrote this). Think about how to use IOPS across the fleet (hot and cold storage). The interface - SATA 1.5Gbps or 3Gbps, or PCIe - may not be important given that e.g. random writes are slow. RAID generally does not make sense in today's world (at Google scale) and can suffer from read/modify/write cycles.
Now that we have containers, we can think about application-specific filesystems. For example, ReiserFS is good for small files, XFS is good for big RAID arrays and large files. Ext4 is not optimized for RAID.
Consider increasing journal size for small writes. Note Google disables the journal altogether.
Recommends Brendan Gregg's perf-tools, which use ftrace. These were introduced at LISA14.
Consider the multiqueue scheduler for very fast storage devices like NVMe.
Consider whether you want latency or throughput. Optimize for the bandwidth-delay product. Remember that increasing the window size takes memory; this can be tuned with net.core.rmem_max and net.core.wmem_max. Use nttcp to test, and reduce buffer sizes as much as possible to avoid bufferbloat.
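As a back-of-the-envelope example of the bandwidth-delay product (the link speed and RTT below are illustrative, not numbers from the talk):

```python
# The socket buffer needs to be at least bandwidth * RTT bytes to keep a
# single TCP stream running at full rate.
bandwidth_bits_per_sec = 1_000_000_000   # 1 Gbit/s (example value)
rtt_sec = 0.1                            # 100 ms round trip (example value)

bdp_bytes = bandwidth_bits_per_sec / 8 * rtt_sec
print(f"BDP = {bdp_bytes / 1e6:.1f} MB")  # ~12.5 MB, so net.core.rmem_max and
                                          # net.core.wmem_max must allow buffers this big
```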
UDP might be a better bet.
However, we can push TCP to be low latency. Disable Nagle with setsockopt TCP_NODELAY. Enable TCP_QUICKACK to disable delayed acknowledgments.
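A minimal sketch of setting those two options from Python (TCP_QUICKACK is Linux-only and is periodically cleared by the kernel, so latency-sensitive code typically re-enables it after each receive):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)       # disable Nagle
if hasattr(socket, "TCP_QUICKACK"):                               # Linux only
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)   # disable delayed ACKs
```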
Recommends considering an NFS appliance, e.g. NetApp.
Some tricks: use no_subtree_check. Bump up nfs threads to 128. Try to separate network and disk IO to different PCI buses - no longer necessarily relevant with PCIe. Make sure to use NFSv3 and increase rsize/wsize. Mount options: rw,intr. Make sure to tune for throughput, large mtu and jumbo frames.
NFSv4: only use NFSv4.1 on Linux 4+, see ;login magazine, June 2015.
If there is any swapping, first try adding more memory. Then add more and faster swap devices.
Paging will happen. majflt/s is the rate of page faults that result in IO; pgsteal/s is the rate at which page cache pages are recycled.
Try sar -W 3 and periodically send sysrq-m.
Note the memory hierarchy is important as closer caches are much faster.
The Translation Lookaside Buffer (TLB) caches the translation from virtual address to physical address. It can avoid up to six layers of lookup on a 64-bit system - costing thousands of cycles. There are only 32 or 64 entries in the TLB in a modern system.
The jiffies rate can greatly affect TLB thrashing by controlling the rate of task switches. Hugepages avoid consuming these entries. Kernel modules burn TLB entries, while the originally loaded kernel does not.
The perf tool can show TLB and cache statistics.
Experimentation with eBPF.
For JVM consider GC and JIT. Size the heap.
Tools: strace, ltrace, valgrind, gprof, oprofile, perf (like truss, ktruss). Purify might not be as good as valgrind.
perf is the new hotness. Minimal overhead, should be safe in production from a performance perspective. However, the advanced introspection capabilities may be undesirable for security.
There are many interesting extensions - like the ext4slower program which shows all operations on ext4 that take longer than 10ms.
Userspace locking: make sure to use futex(2).
Consider CPU affinity.
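For example, a process can pin itself to specific CPUs on Linux (the CPU numbers here are arbitrary):

```python
import os

# Pin the current process to CPUs 0 and 1 (Linux-only). Keeping a hot
# process on a fixed set of CPUs helps preserve its cache and TLB state.
os.sched_setaffinity(0, {0, 1})   # pid 0 means "this process"
print(os.sched_getaffinity(0))    # confirm the new CPU set
```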
Latency numbers that all programmers should know. Note this does not include random write for an SSD because that depends on a great many factors.
It's more addictive than pistachios!
It's time to shoot the engineer and put the darn thing into production.
Great way of learning about the whole stack!
This talk presented a valuable philosophy and attitude: that we should consider making repeatable re-usable reports. This goes against the grain of expectations around reporting which often frame reports as one-off tasks. The examples were very compelling.
Some background: R descends from S, which was written at Bell Labs by statisticians who were very familiar with UN*X. Data is dirty. The computations and software for data analysis should be trustworthy: they should do what they claim, and be seen to do so.
I've spent my entire career getting rid of spreadsheets.
Very rapid growth in CRAN R packages. Pipe operator %>%.
Used dplyr. Small repeatable pipelines for reports that can be reused. Very pretty code examples using dplyr and ggplot and the aforementioned pipe operators.
Your customers shouldn’t find problems before you do.
Onboarding a big new account with an expected 20k incidents per day, around 7M per year.
They wanted to test the load. The only thing that behaves like prod is prod.
Chaos Engineering is about experiments in realistic conditions. PagerDuty has Failure Friday - where they expose systems to controlled experiments.
Balance business and engineering.
Decided to create a tool to simulate load.
Noticeable customer impact from the first and second tests, but they still persisted, which was quite brave. The talk was very honest about the interpersonal and organisational issues that the project faced.
Tried to explain to an idealistic questioner why the staging environment is different from production.
Eben Freeman's talk on queuing is really good!
Recommends a talk by Rasmussen on failure boundaries.
Failure boundary is nonlinear.
Hard to apply queuing theory to the real world of capacity and ops, as it is difficult to figure out how much time is spent queuing in real systems.
Add a crosstalk (coherence) penalty with a coefficient k as a quadratic term to the denominator in Amdahl's law. The penalty represents the cost of communication.
Defines load as concurrency.
Suggests that load tests should try to fit the crosstalk penalty and Amdahl's law parameters. Claims that this fits many real-world scaling problems quite well, illustrated with some abstract examples.
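A small sketch of the model being described - Amdahl's law with an extra quadratic crosstalk term in the denominator, which has the shape of Gunther's Universal Scalability Law. The parameter values below are illustrative, not fitted to any real data:

```python
def throughput(n, sigma, k):
    """Relative throughput at concurrency n.

    sigma: serial (contention) fraction from Amdahl's law.
    k: crosstalk/coherence coefficient - the quadratic penalty term.
    """
    return n / (1 + sigma * (n - 1) + k * n * (n - 1))

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, round(throughput(n, sigma=0.05, k=0.001), 2))
# Throughput peaks and then falls off ("retrograde" scaling) once the
# quadratic crosstalk term dominates; a load test can fit sigma and k to
# measured throughput to locate that peak.
```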
Without effort, documentation will be scattered across multiple systems, and the costs are paid when ramping up new people. We should invest in documentation.
SRE is not rebranded ops, and should not try to build an NOC.
Beyond performance, we can trace things like system calls to find out where something is happening - for example, the stacktrace of the code that is printing a message.
The JVM can cooperate by adding more tracepoints with -XX:+ExtendedDTraceProbes.
The advantage of BPF over perf is that BPF can filter and aggregate events in the kernel, which can make things much faster than perf, which just transmits events to user space. BPF can calculate histograms and so on.
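As a minimal illustration of in-kernel aggregation (my own sketch, not one of the talk's tools, and assuming the bcc Python bindings and kernel headers are installed): count clone() calls per process in a BPF hash map, so only the final counts - not every event - cross into user space.

```python
from bcc import BPF
import time

prog = r"""
BPF_HASH(counts, u32, u64);

int trace_clone(struct pt_regs *ctx) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(tgid);    // aggregate in the kernel, per process
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

time.sleep(10)  # let the kernel do the counting
for pid, count in b["counts"].items():
    print(pid.value, count.value)
```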
Needs recent kernels - e.g. 4.9 for attaching to perf_events.
Many performance investigations can now occur without modifying any programs. For example, there are instrumentation scripts like dbslower that can print out slow queries in MySQL, and can be extended to other databases.
We can trace and find out the exact stacktrace where a query is printed.
Can trace GC: ustat tool and object creation with uobjnew.
Use opensnoop to find failed open syscalls. Then attach a trace for that specific error to a Java application.
Want to make it possible for people to bring their own software stack to run on the supercomputers at Los Alamos, and decided to explore containers. Unlike virtual machines, containers do not have a performance impact on storage or networking (InfiniBand).
Recommends this LWN article: Namespaces in operation.
Docker with OverlayFS can be slow on HPC. Therefore they built a system called Charliecloud with minimal performance impact and native filesystem performance.
Sometimes don't have source code or access to logs.
Many basic problems: 5% of veterans were incorrectly denied healthcare benefits because of a single simple bug.
Great value delivered by bug bounty programs.
API first!
Meaningful contribution - hugely impactful, bipartisan problems. Looking for software engineers and site reliability engineers for short tours of duty.
CEO at Shopify once turned off a server as a manual chaos monkey.
Continued to work on resiliency, with game days. Then thought about automating the game days to ensure that issues remain fixed and don't regress.
Want to maintain authenticity.
ToxiProxy interposes faults - latency, blackholing, and rejected connections - in the production systems, and is supported by automated testing in Ruby that asserts behaviour about the system.
Incident post-mortem fixes are checked and verified by injecting the faults again and checking application specific consequences. This confirms that fixes worked, and continue to work in the future.
The Resiliency Matrix declares the expected dependencies between runtime systems. ToxiProxy tests allow one to validate that the dependency matrix truly reflects production reality.
Common system performance tools like perf do not work well in containers, as the pid and filesystem namespaces are different. System wide statistics (e.g. for free memory) are published to containers which causes programs to make wrong assumptions: for example, Java does not understand how much memory is actually available in the container.
The situation is improving and there is ongoing integration of support for monitoring performance of containerized applications.
Understanding which limit is hit in the case of CPU containerization can be very confusing as there are many different limits.
PS. Brendan's talk from last year at LISA16 gives a general introduction to the advanced bpf tools: LISA16: Linux 4.X Tracing Tools: Using BPF Superpowers
Labs showing how to collect and query many system level properties like running processes from a distributed set of systems with a tool called osquery.
It can collect current state and also buffer logs of changes.
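For example, a query against osquery's processes table might look like this (a sketch that shells out to the osqueryi CLI, assumed to be installed, rather than depending on any particular deployment setup):

```python
import json
import subprocess

# Ask osqueryi for a JSON result set and print a few fields per process.
query = "SELECT pid, name, path FROM processes LIMIT 5;"
out = subprocess.run(["osqueryi", "--json", query],
                     capture_output=True, text=True, check=True)
for row in json.loads(out.stdout):
    print(row["pid"], row["name"], row["path"])
```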
20k on-prem VMs.
Human communications take a lot of time and we need to be careful that we're really communicating.
Establish some strong properties: that all flows are authenticated and encrypted.
No trust in the network. Automated policy, based on a Ruby DSL and Chef, reconfigures iptables rules to add IPSec routes between application tiers.
Related to Google's BeyondCorp.
Beyond the control aspects, another value of the approach is observability. Mentioned that another way of doing this is Lyft Envoy.
Mostly you still have to build your own.
Very thoughtful talk with a comprehensive coverage of various techniques.
EventBrite has 150M ticket sales per year. Very spiky traffic - it can quadruple in the space of one minute.
Bulkheads: partition systems to prevent cascading failures.
Canary testing: gradual rollout of new applications.
Graceful degradation.
Rate limiting. Understand capacity and control the amount of work you accept (see the token-bucket sketch after this list).
Timeouts. You even have to time out payment processors.
Caching.
Capacity planning.
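As an illustration of the rate-limiting idea mentioned above (a generic token bucket, not EventBrite's implementation):

```python
import time

class TokenBucket:
    """Accept work only while tokens are available, refilling at a fixed rate."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec          # steady-state requests/sec
        self.capacity = burst             # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # shed or queue this request

limiter = TokenBucket(rate_per_sec=100, burst=20)
accepted = sum(limiter.allow() for _ in range(1000))
print(f"accepted {accepted} of 1000 back-to-back requests")
```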
High energy and well presented talk.
Netflix: developers have root in production.
Should not be cargo-culted to places without the same culture of trust and top-quality talent.
Be careful about punching down.
Recognise the weight that your words carry coming from a successful company with specific constraints.
Culture of security conferences is toxic.
Adapt the sample rate as you watch your traffic, to scale observability infrastructure logarithmically with production. The sample rate should be recorded in each event, and the fraction of events kept should be reduced as traffic grows.
Honeycomb does this with honeytail. Another alternative is Jaeger, Uber's OpenTracing-compatible tracer, which uses a consistent sampler.
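A tiny sketch of the idea (the names and the target rate are made up; this is not how honeytail or Jaeger are implemented):

```python
import random

TARGET_EVENTS_PER_SEC = 100   # how much the observability backend should receive

def sample(event: dict, observed_events_per_sec: float):
    """Keep roughly TARGET_EVENTS_PER_SEC events by raising the "1 in N"
    sample rate as traffic grows, and record that rate in each kept event
    so totals can be re-weighted later."""
    sample_rate = max(1, round(observed_events_per_sec / TARGET_EVENTS_PER_SEC))
    if random.randrange(sample_rate) == 0:   # keep ~1 in sample_rate events
        event["sample_rate"] = sample_rate   # lets the backend multiply counts back up
        return event
    return None                              # dropped
```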
Distributed tracing is the new debugger.
Use Twitter's anomaly detection R library.
Need to be better at following checklists and the sterile cockpit rule (kicking out unqualified people). Avoid normalization of deviance. Lots to learn from the airline industry!
Think about telemetry.