John Fremlin's blog: LISA17 conference notes

Posted 2017-11-01 23:00:00 GMT

I attended the LISA17 conference in San Francisco. Here are some very rough and opinionated notes from talks that interested me. If I made any mistakes let me know and please comment on the things I missed!

Ted Ts'o: Linux performance tuning

This tutorial was exceptional because the speaker has years of experience as one of the top Linux developers. Ted uses emacs!

Goals of performance tuning

Investigate bottlenecks with rigorous scientific method: change one parameter at a time. Take lots of notes and make haste slowly. Measurement tools can impact behaviour of the observed application.

Basic tools to start with

Example filesystem benchmark: fs_mark -s 10240 -n 1000 -d /mnt - creates 1000 files each 10kB in /mnt and does fsync after each. Can be greatly improved by relaxing sync guarantees. Use the exact benchmark for your application!

Cache flush commands are ridiculously expensive. Google runs production with the journal disabled, as it is so much faster, and a higher-level consistency guarantee comes from multi-machine replication. This cross-machine cluster file-system also means RAID is unnecessary.
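To see that cost directly, here is a minimal Python sketch in the spirit of the fs_mark run above (file count and size are scaled down and arbitrary): writing the same files with and without an fsync per file.

```python
import os
import tempfile
import time

def write_files(n, size, sync):
    """Write n files of `size` bytes each; optionally fsync each one."""
    with tempfile.TemporaryDirectory() as d:
        start = time.perf_counter()
        for i in range(n):
            with open(os.path.join(d, f"f{i}"), "wb") as f:
                f.write(b"x" * size)
                if sync:
                    f.flush()
                    os.fsync(f.fileno())  # forces a cache-flush to the device
        return time.perf_counter() - start

t_sync = write_files(100, 10240, sync=True)
t_nosync = write_files(100, 10240, sync=False)
print(f"fsync per file: {t_sync:.3f}s   no fsync: {t_nosync:.3f}s")
```

On a rotating disk the fsync'd run is typically orders of magnitude slower, because every fsync waits for a cache flush; on tmpfs or a fast NVMe drive the gap narrows, which is exactly why you should benchmark with your own workload and hardware.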

Ted Ts'o made this snap script for low impact performance monitoring extraction from production systems.


In terms of selecting hardware: note that seek time is complicated and should typically be reported statistically - worst case and average. Low-numbered LBA offsets, at the outer diameter of the disk, can be much faster. Therefore you can put your most-used partitions at the first offsets of the disk - this is called short-stroking and can be a cheap way to get more performance. Filesystem software moves slowly as it has to be reliable; hardware generally moves much faster.

HDDs at 15000rpm run hot and use a lot of power; many applications that used them have moved to SSDs. SSDs can also use a lot of power. They tend to fail on write. Random writes can be very slow - 0.5s average, 2s worst case for 4k random writes. You can wear out an SSD by writing a lot. See Disks for Data Centers (which Ted wrote) for advice about selecting hardware. Think about how to use IOPS across the fleet (hot and cold storage). The interface - SATA at 1.5Gbps or 3Gbps, or PCIe - may not be important given that e.g. random writes are slow. RAID generally does not make sense in today's world (at Google scale) and can suffer from read/modify/write cycles.

We can think about application specific file-systems, now we have containers. For example, ReiserFS is good for small files, XFS good for big RAID arrays and large files. Ext4 is not optimized for RAID.

Consider increasing journal size for small writes. Note Google disables the journal altogether.

Recommends Brendan Gregg's perf-tools, built on ftrace. These were introduced at LISA14.

Also more advanced versions with lower overhead due to computing aggregates in kernel using eBPF, the BPF Compiler Collection (BCC): biosnoop, bitesize, biolatency, opensnoop.

Consider the multiqueue scheduler for very fast storage devices like NVMe.

Network tuning

Immediately check basic health: ethtool, ifconfig, ping, nttcp. Check the various offload functions and that the advanced capabilities of the card are being used.

Consider whether you want latency or throughput. Optimize the bandwidth delay product. Then remember that increasing window size takes memory; this can be tuned with net.core.rmem_max and net.core.wmem_max. Use nttcp to reduce buffer sizes as much as possible to avoid bufferbloat.
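As a back-of-the-envelope aid (not from the talk), the bandwidth-delay product is straightforward to compute; the link numbers below are invented for illustration.

```python
# Rough sketch: compute the bandwidth-delay product (BDP) to size TCP windows.

def bandwidth_delay_product(bits_per_second, rtt_seconds):
    """Bytes that must be in flight to keep the pipe full."""
    return int(bits_per_second * rtt_seconds / 8)

# e.g. a hypothetical 1 Gbit/s path with a 50 ms round-trip time:
bdp = bandwidth_delay_product(1_000_000_000, 0.050)
print(f"BDP = {bdp} bytes")  # a window smaller than this caps throughput

# net.core.rmem_max / net.core.wmem_max must be at least this large
# for the kernel to grow socket buffers to the full window; much
# larger buffers waste memory and invite bufferbloat.
```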

UDP might be a better bet.

However, we can push TCP to be low latency. Disable Nagle with setsockopt TCP_NODELAY. Enable TCP_QUICKACK to disable delayed acknowledgments.
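Both options can be set with a few lines of Python; a minimal sketch (the socket is never connected here, just configured):

```python
import socket

# Minimal sketch: a TCP socket tuned for low latency.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: small writes go out immediately
# instead of being coalesced while waiting for ACKs.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

# TCP_QUICKACK (Linux-only) disables delayed acknowledgments; the
# kernel can re-enable them, so it is often set again after recv().
if hasattr(socket, "TCP_QUICKACK"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

print("TCP_NODELAY =", nodelay)
s.close()
```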

NFS performance tuning

Recommends considering an NFS appliance, e.g. NetApp.

Some tricks: use no_subtree_check. Bump up nfs threads to 128. Try to separate network and disk IO to different PCI buses - no longer necessarily relevant with PCIe. Make sure to use NFSv3 and increase rsize/wsize. Mount options: rw,intr. Make sure to tune for throughput, large mtu and jumbo frames.

NFSv4: only use NFSv4.1 on Linux 4+, see ;login magazine, June 2015.

Memory tuning

If there is any swapping, first try adding more memory; otherwise add more and faster swap devices.

Paging will happen. majflt/s is the rate of faults that result in IO; pgsteal/s is the rate at which page-cache pages are recycled.

Try sar -W 3 and periodically send sysrq-m.

Note the memory hierarchy is important as closer caches are much faster.

The Translation Lookaside Buffer (TLB) caches translations from virtual to physical addresses. A hit can avoid up to six layers of lookup on a 64-bit system - costing thousands of cycles. There are only 32 or 64 entries in the TLB on a modern system.

The jiffies rate can greatly affect TLB thrashing by controlling rate of task switches. Hugepages avoid consuming these entries. Kernel modules burn TLB entries while the originally loaded kernel does not.

The perf tool can show TLB and cache statistics.

Application tuning

Experimentation with eBPF.

For JVM consider GC and JIT. Size the heap.

Tools: strace, ltrace, valgrind, gprof, oprofile, perf (like truss, ktruss). Purify might not be as good as valgrind.

perf is the new hotness. Minimal overhead, should be safe in production from a performance perspective. However, the advanced introspection capabilities may be undesirable for security.

There are many interesting extensions - like the ext4slower program which shows all operations on ext4 that take longer than 10ms.

Userspace locking: make sure to use futex(2).

Consider CPU affinity.
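On Linux, affinity can be set from Python without any external tools; a small side-effect-free sketch:

```python
import os

# Linux-specific sketch: pin this process to one CPU from its
# currently allowed set, then restore the original mask.
allowed = os.sched_getaffinity(0)   # 0 means "this process"
target = {min(allowed)}

os.sched_setaffinity(0, target)
pinned = os.sched_getaffinity(0)
print("pinned to CPUs:", pinned)

os.sched_setaffinity(0, allowed)    # undo, to keep the example side-effect free
```

Pinning keeps a hot process on one core so its cache and TLB working set stays warm, at the cost of giving the scheduler less freedom.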

Latency numbers that all programmers should know. Note this does not include random write for an SSD because that depends on a great many factors.


It's more addictive than pistachios!

It's time to shoot the engineer and put the darn thing into production.

Great way of learning about the whole stack!

Robert Ballance: Automating System Data Analysis Using R

This talk presented a valuable philosophy and attitude: that we should consider making repeatable re-usable reports. This goes against the grain of expectations around reporting which often frame reports as one-off tasks. The examples were very compelling.

Some background: R was written at Bell Labs by statisticians who were very familiar with UN*X. Data is dirty. The computations and software for data analysis should be trustworthy: they should do what they claim, and be seen to do so.

I've spent my entire career getting rid of spreadsheets.

Very rapid growth in CRAN R packages. Pipe operator %>%.

Used dplyr. Small repeatable pipelines for reports that can be reused. Very pretty code examples using dplyr and ggplot and the aforementioned pipe operators.

Renee Lung: Testing Before You Scale & Making Friends While You Do It

Your customers shouldn’t find problems before you do.

Onboarding a big new account with an expected 20k incidents per day, around 7M per year.

They wanted to test the load. The only thing that behaves like prod is prod.

Chaos Engineering is about experiments in realistic conditions. PagerDuty has Failure Friday - where they expose systems to controlled experiments.

Balance business and engineering.

Decided to create a tool to simulate load.

There was noticeable customer impact from the first and second tests, but they persisted, which was quite brave. The talk was very honest about the interpersonal and organisational issues that the project faced.

Tried to explain why the staging environment is different from production to an idealistic questioner.

Baron Schwartz: Scalability Is Quantifiable: The Universal Scalability Law

Eben Freeman's talk on queuing is really good!

Recommends a talk by Rasmussen on failure boundaries.

Failure boundary is nonlinear.

Hard to apply queuing theory to the real world of capacity and ops, as difficult to figure out how much time is spent queuing in real systems.

Add a crosstalk (coherence) penalty with a coefficient k as a quadratic term to the denominator in Amdahl's law. The penalty represents the cost of communication.

Defines load as concurrency.

Suggests that load-tests should try to fit the crosstalk penalty and Amdahl's law parameters. Claims that this fits quite well to many real world scaling problems with some abstract examples.
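The model itself fits in a few lines; the parameter values below are invented for illustration.

```python
def usl_throughput(n, sigma, kappa):
    """Universal Scalability Law: relative throughput at concurrency n.

    sigma: contention penalty (the serial fraction, as in Amdahl's law)
    kappa: crosstalk/coherence penalty (the quadratic term)
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With no contention and no crosstalk, scaling is linear.
assert usl_throughput(10, 0.0, 0.0) == 10

# With crosstalk, throughput peaks and then retrogrades as n grows.
xs = [usl_throughput(n, 0.05, 0.001) for n in (1, 16, 64, 256)]
print([round(x, 1) for x in xs])  # → [1.0, 8.0, 7.8, 3.2]
```

Fitting sigma and kappa to measured load-test points is what lets you extrapolate where the retrograde region begins.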

Chastity Blackwell: The 7 Deadly Sins of Documentation

Without effort, documentation ends up scattered across multiple systems, and the cost is paid in ramping up new people. We should invest in documentation.

Blake Bisset; Jonah Horowitz: Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful SRE Program Like Netflix and Google

SRE is not rebranded ops, and should not try to build a NOC.

Sasha Goldshtein: Fast and Safe Production Monitoring of JVM Applications with BPF Magic

Beyond performance, we can trace things like system calls to find out where something is happening - for example, the stacktrace of the code that is printing a message.

The JVM can cooperate by adding more tracepoints -XX:+ExtendedDTraceProbes.

The advantage of BPF as opposed to perf, is that BPF can filter and aggregate events in the kernel, which can make things much faster than perf, which just transmits events. BPF can calculate histograms and so on.

Needs recent kernels - 4.9 kernel for the perf_events attaching.

DB performance tracing

Many performance investigations can now happen without modifying any programs. For example, there are instrumentation scripts like dbslower that can print out which queries in MySQL are slow, and can be extended to other databases.

We can trace and find out the exact stacktrace where a query is printed.

GC performance tracing

Can trace GC: ustat tool and object creation with uobjnew.

Trace open file failures

Use opensnoop to find failed open syscalls. Then attach a trace for that specific error to a Java application.

Michael Jennings: Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC

Want to make it possible for people to bring their own software stack to run on the supercomputers at Los Alamos, and decided to explore containers. Unlike virtual machines, they do not have a performance impact on storage or networking (InfiniBand).

Recommends this LWN article: Namespaces in operation.

Docker with OverlayFS can be slow on HPC. Therefore they built a system called Charliecloud with minimal performance impact and native file-system performance.

Matt Cutts and Raquel Romano: Stories from the Trenches of Government Technology

Sometimes don't have source code or access to logs.

Many basic problems: 5% of veterans incorrectly denied healthcare benefits from a single simple bug.

Great value delivered by bug bounty programs.

API first!

Meaningful contribution - hugely impactful, bipartisan problems. Looking for software engineers and site reliability engineers for short tours of duty.

Jake Pittis: Resiliency Testing with Toxiproxy

The CEO at Shopify once turned off a server as a manual chaos monkey.

Continued to work on resiliency, with gamedays. Then thought about automating the game days to ensure that issues remain fixed and don't regress.

Want to maintain authenticity.

ToxiProxy interposes latency, blackholing, and rejecting connections in the production systems and then is supported by automated testing in Ruby that asserts behaviour about the system.

Incident post-mortem fixes are checked and verified by injecting the faults again and checking application specific consequences. This confirms that fixes worked, and continue to work in the future.

The Resiliency Matrix declares the expected dependencies between runtime systems. ToxiProxy tests allow one to validate that the dependency matrix truly reflects production reality.
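The idea can be sketched in a few lines; everything below is hypothetical (the real Toxiproxy is a Shopify project with a Go proxy and Ruby test helpers, and these names and the matrix shape are invented for illustration).

```python
# Toy sketch of validating a resiliency matrix against observed behaviour.

def check_resiliency(matrix, outage, observed):
    """matrix maps service -> {dependency: expected behaviour when that
    dependency fails, e.g. "degrade" or "fail"}. `observed` maps
    service -> behaviour seen while the outage was injected."""
    expectations = {svc: deps[outage]
                    for svc, deps in matrix.items() if outage in deps}
    return all(observed.get(svc) == want
               for svc, want in expectations.items())

matrix = {"checkout": {"redis": "degrade", "mysql": "fail"}}

# Inject a redis outage (with Toxiproxy this would blackhole the port),
# then assert the app degraded gracefully rather than erroring.
assert check_resiliency(matrix, "redis", {"checkout": "degrade"})
```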

Brendan Gregg: Linux Container Performance Analysis

Common system performance tools like perf do not work well in containers, as the pid and filesystem namespaces are different. System wide statistics (e.g. for free memory) are published to containers which causes programs to make wrong assumptions: for example, Java does not understand how much memory is actually available in the container.

The situation is improving and there is ongoing integration of support for monitoring performance of containerized applications.

Understanding which limit is hit in the case of CPU containerization can be very confusing as there are many different limits.

PS. Brendan's talk from last year at LISA16 gives a general introduction to the advanced bpf tools: LISA16: Linux 4.X Tracing Tools: Using BPF Superpowers

Teddy Reed and Mitchell Grenier: osquery—Windows, macOS, Linux Monitoring and Intrusion Detection

Labs showing how to collect and query many system level properties like running processes from a distributed set of systems with a tool called osquery.

It can collect current state and also buffer logs of changes.

Heather Osborn: Vax to K8s: Ticketmaster's Transformation to Cloud Native Devops

Tech Maturity model.

20k on-prem VMs.

Kevin Barron: Coherent Communications—What We Can Learn from Theoretical Physics

Human communications take a lot of time and we need to be careful that we're really communicating.

Evan Gilman and Doug Barth: Clarifying Zero Trust: The Model, the Philosophy, the Ethos

Establish some strong properties: that all flows are authenticated and encrypted.

No trust in the network. Automated policy, based on a Ruby DSL and Chef, reconfigures iptables rules to add IPsec routes between application tiers.

Related to Google's BeyondCorp.

Beyond the control aspects, another value of the approach is observability. Mentioned that another way of doing this is Lyft Envoy.

Mostly you still have to build your own.

Brian Pitts: Capacity and Stability Patterns

Very thoughtful talk with a comprehensive coverage of various techniques.

EventBrite has 150M ticket sales per year. Traffic is very spiky: it can quadruple in under a minute.

Bulkheads: partition systems to prevent cascading failures.

Canary testing: gradual rollout of new applications.

Graceful degradation.

Rate limiting. Understand capacity and control amount of work you accept.

Timeouts. Even have to timeout payment processors.

Capacity planning.

Corey Quinn: "Don't You Know Who I Am?!" The Danger of Celebrity in Tech

High energy and well presented talk.

Netflix: developers have root in production.

Should not be cargo-culted to places without the same culture of trust and top-quality talent.

Be careful about punching down. Recognise the weight that your words carry coming from a successful company with specific constraints.

Culture of security conferences is toxic.

Ben Hartshorne: Sample Your Traffic but Keep the Good Stuff!

Adapt sample rate as you're watching your traffic, to scale observability infrastructure logarithmically with production. Sample rate should be recorded in event, and reduced in proportion to traffic.

Honeycomb does this with honeytail. Another alternative is Uber's OpenTracing-based Jaeger, which uses a consistent sampler.
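A toy sketch of the idea (not Honeycomb's actual implementation; event shapes and names are invented): record the sample rate in each kept event so totals can be reconstructed, and keep the interesting events at full rate.

```python
import random

def sample(events, base_rate):
    """Keep roughly 1/base_rate of events, recording the sample rate in
    each kept event. "Good stuff" (here: server errors) is always kept."""
    kept = []
    for e in events:
        if e.get("status", 200) >= 500:
            kept.append({**e, "sample_rate": 1})       # keep all errors
        elif random.random() < 1.0 / base_rate:
            kept.append({**e, "sample_rate": base_rate})
    return kept

# Reconstruct an estimate of the original count by summing the recorded
# rates - this is why the rate must travel with the event.
events = [{"status": 200} for _ in range(100_000)]
kept = sample(events, base_rate=100)
estimate = sum(e["sample_rate"] for e in kept)
print(len(kept), "kept; estimated original count:", estimate)
```

Scaling base_rate up and down with traffic is what makes the observability pipeline grow logarithmically instead of linearly with production.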

Mohit Suley: Debugging @ Scale

Distributed tracing is the new debugger.

Use Twitter's anomaly detection R library.

Jon Kuroda: System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields

Need to be better at following checklists and observing the sterile cockpit rule (no non-essential activity during critical phases). Avoid normalization of deviance. Lots to learn from the airline industry!

Think about telemetry.
