Hi, I'm Conrad

This website is where I keep a log of things, sort of an online notepad if you will. Mostly focused on topics relating to software engineering. After a few years of full-stack web development at an online insurance brokerage company, I am now a site reliability engineer at a bank.

Outside work, I maintain a couple of open source projects related to password management:

  • rooster, a password manager;
  • rpassword, a cross-platform Rust library for reading passwords from a terminal.

To read my latest article, click/tap on the arrow on the right-hand side of this page.

For years, I've been taking pictures of all kinds of things with my smartphone. The photo above, which shows a wall in Malta, is just one example. Currently, I keep all these pictures in a single folder without any kind of sorting. It has become large enough that I have no idea what this folder contains. I'd like to sort these images.

Deleting duplicate images

I've noticed that some images are duplicated. But how many really are?

I use fdupes to automatically find and remove duplicate files:

fdupes -rdN Pictures/

This removes images that are perfect duplicates. But what about almost identical images? What if a PNG was duplicated as a JPEG? I look for software that can help me with that and find Geeqie. This tutorial explains how I can use Geeqie to quickly find similar images. With that, I remove a few dozen more duplicate images.

Sorting images

For me, sorting images by year is sufficient. I don't need to know where the images were taken or what they are about. If it's important enough, I'll remember anyway. So I try to sort them by year. The images' Exif data is helpful for that. Given a directory of images (only images, no subfolders), this script creates a directory named after the year in which each photo was taken and then moves the photo into that directory.

#!/bin/bash

set -e

find Pictures/ -maxdepth 1 -type f -print0 | while IFS= read -r -d '' file
do
    # Bypasses any non-image files
    if ! file "$file" | grep -q image; then
        continue
    fi

    # Bypasses files that don't have Exif data
    if ! exif="$(identify -verbose "$file" | grep "exif:")"; then
        continue
    fi

    exif_year="$(echo "$exif" | grep exif:DateTimeOriginal | cut -d' ' -f6 | cut -d':' -f1)"

    if [ -n "$exif_year" ]; then
        echo "$file" '->' "$exif_year"
        test -d "$exif_year" || mkdir "$exif_year"
        mv "$file" "$exif_year"/
    fi
done

For instance, if there is one image from 2005 and one from 2018, it creates this directory structure:

2005
 -> one-image-from-2005.jpg
2018
 -> another-image-from-2018.jpg

A few photos do not have any Exif data, so I sort these by hand.

Creating value from useless images

After sorting those images, I realize a lot of them have no value for me personally. They don't show anything personal: public places, airports, etc. I briefly think about deleting them. But what if, instead, I put my "useless" images on a stock photo website?

These images may not be useful to me right now, but they may well be useful to someone else. Or maybe they'll be useful to me in a few years and having a personal bank of images I can pick from would be cool.

So I try to understand how to remove private information from these images. They can contain Exif data, which sometimes contains private information. To avoid leaking it, I strip it all away:

find Pictures/ -type f -exec exiv2 rm {} \;

Now, if I want some photos to credit me as the photographer, I can include a copyright notice:

exiv2 -M"set Exif.Image.Copyright Copyright Conrad Kleinespel" copyrighted-photo.jpg

I also want to strip any potentially private information from the image names themselves. And since the images' extensions are inconsistent (jpg, JPG, jpeg, JPEG), I'd like to normalize them all to jpg. So I create another script for that:

#!/bin/bash

set -e

find Pictures/ -type f -print0 | while IFS= read -r -d '' file
do
    # Bypasses any non-image files
    if ! file "$file" | grep -q image; then
        continue
    fi

    sha1="$(sha1sum "$file" | cut -d' ' -f1)"
    ext="$(echo "$file" | grep -o '\.[[:alnum:]]*$' | tr '[:upper:]' '[:lower:]')"

    if [ "$ext" = ".jpeg" ]; then
        ext=".jpg"
    fi

    new_name="$sha1$ext"
    mv "$file" "$new_name"
done

This creates image names like 3914678325556c51762c3c709c322d4357b2163d.jpg, without any personally identifiable information.

I've been trying to migrate this blog, currently built on Hugo, to Ghost. The goal is to migrate a bunch of markdown files located in content/codebase/*.md to the format accepted by Ghost, all the while preserving URLs, meta descriptions and images without manual work for each post.

Preserving URLs

Regarding URLs, Ghost has built-in support for custom permalinks via a routes.yaml file:

routes:

collections:
  /:
    permalink: /codebase/{year}/{month}/{day}/{slug}/
    template:
      - index

taxonomies:
  tag: /tag/{slug}/
  author: /author/{slug}/

Preserving images

I manually put images from Hugo into Ghost's content/images folder, with the same paths as in the Hugo version of the blog, to prevent image URLs from changing during the migration. That's enough for Ghost to pick up on them.

Transforming markdown to Ghost's JSON format

Here's the script I use to convert raw markdown to Ghost's JSON format:

const { readdirSync, readFileSync } = require('fs')
const { basename } = require('path')

// I have my posts in the `content/codebase` directory
const listPosts = () => {
    return readdirSync(`${__dirname}/content/codebase`)
        .map(el => `${__dirname}/content/codebase/${el}`)
}

const createPost = (image, title, slug, publishedAt, markdown) => {
    return {
        "title": title,
        "slug": slug,
        "feature_image": image,
        "mobiledoc": JSON.stringify({
            version: '0.3.1',
            markups: [],
            atoms: [],
            cards: [['markdown', {cardName: 'markdown', markdown }]],
            sections: [[10, 0]]
        }),
        "status": "published",
        // These dates end up somewhat of an approximation, because
        // I would have to extract them from the .md files' metadata
        // which is more work and of little use in my case
        "published_at":  publishedAt,
        "created_at":  publishedAt,
        "updated_at":  publishedAt,
    }
}

// Extracts post data from the Markdown files used in Hugo
//
// The original Hugo files look somewhat like this:
//
// <example lang="hugo">
//   +++
//   title = "Hello world"
//   slug = "hello-world-url"
//   date = "2019-04-16"
//   +++
//   Post content is here...
// </example>
const createPostDataFromFileContent = (filename, fileContent) => {
    const contentRegexp = /^\+\+\+((.|\n)+)\+\+\+((.|\n)+)$/m
    const titleRegexp = /title = "(.+)"/
    const dateRegexp = /date = (.+)/
    const imageRegexp = /image = "(.+)"/

    const contentMatches = fileContent.match(contentRegexp)

    const header = contentMatches[1]
    const titleMatches = header.match(titleRegexp)
    const dateMatches = header.match(dateRegexp)
    const imageMatches = header.match(imageRegexp)

    const image = imageMatches[1]
    const title = titleMatches[1]
    const date = (new Date(dateMatches[1])).getTime()
    const markdown = contentMatches[3].trim()
    // In my case, the filenames are the same as the slug
    const slug = basename(filename.substring(0, filename.length - ('.md'.length)))

    return createPost(
        image,
        title,
        slug,
        date,
        markdown
    )
}

const postsData = listPosts()
    .map(filename => ({ filename, fileContent: readFileSync(filename).toString() }))
    .map(({ filename, fileContent }) => createPostDataFromFileContent(filename, fileContent))

// Prints the posts in JSON format for Ghost, which can be used to debug
// or create a .json file to import into Ghost, like so:
//
// <example lang="shell">
//   node hugo-to-ghost.js > ghost.json
// </example>
console.log(JSON.stringify(
    {
        "meta": {
            "exported_on": 1408552443891,
            "version": "3.1.0",
        },
        "data": {
            "posts": postsData,
        },
    }
))

In August 2018, I wrote about the challenges of moving from AWS to Heroku. A new challenge has presented itself on the same website since then: a large number of request timeouts at unpredictable times during the day.

What we saw in Heroku

For the last few months, we've regularly had days where the Heroku metrics would look like this:

What we can see here is that lots of requests take between 8 and 30 seconds, with 30 seconds being the maximum request timeout for Heroku.

These timeouts appeared completely unpredictably. Rebooting the database server and application servers would fix the issue for a few days, before the same things happened again. And again. And again.

Our first thought was that some requests were very slow and were probably clogging up the database server, until eventually, it would be overloaded. So we started looking at which requests initiated the timeouts, first with the good old heroku logs command from the Heroku CLI.

But that wasn't it. And PHP logs didn't reveal anything useful either. So we looked at our database metrics in the AWS RDS dashboard.

What we saw in AWS RDS metrics

We didn't necessarily know what to look for in database metrics because we hadn't had any database performance issues up until then.

What we could see for sure was that when the website crashed, the number of simultaneous database connections was very high. And read IOPS were also very high.

Read IOPS in AWS metrics

Queue depth in AWS metrics

We didn't know why, though. And we didn't know where else to look to get better insights. After talking with some developers outside the company, we realized we needed more logs. So here comes ELK.

Adding an ELK stack into the mix

We set up an ELK stack (Elasticsearch, Logstash, Kibana) and sent it all the logs we could possibly get (Heroku, Nginx, PHP, MariaDB, etc).

We made sure to import Heroku server metrics through Heroku Labs. The useful features in our case were log-runtime-metrics, runtime-dyno-metadata and runtime-heroku-metrics, which would give us access to CPU and RAM usage metrics.
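
For reference, each of these features can be enabled with a one-line Heroku CLI command (the app name below is a placeholder), followed by a restart so the new metrics start flowing:

heroku labs:enable log-runtime-metrics --app my-app
heroku labs:enable runtime-dyno-metadata --app my-app
heroku restart --app my-app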

ELK is great in that it allows us to search for something in all logs at once, without having to be too specific about what we're looking for. Heroku timeouts can be searched with the type:heroku AND at:error query:

Once we got ELK set up, we waited until the timeouts reappeared, hoping we would finally figure out what was going on.

But when the timeouts reappeared, we still were unable to figure out their root cause. The logs confirmed that SQL requests were taking a long time to complete, but didn't tell us why the requests were suddenly slow.

However, these new logs helped to rule out what the problem was not:

  • metrics from Heroku Labs features confirmed that there was no application server overload;
  • through custom data parsing and graphs in ELK, we could see that our background jobs were not the issue either, as there was nothing odd going on in times of high request timeout rate;
  • there was no significant increase in traffic during high request timeout rate, which indicated we were not getting spammed by malicious 3rd parties.

Given all that, it was time to dig deeper into the database server configuration and metrics.

AWS RDS default configuration

Diving back into the MariaDB documentation and the AWS configuration, we noticed that some of MariaDB's default timeouts were not suitable for use with Heroku's 30-second timeout.

For instance, the max_statement_time configuration value was set to use the "engine default". MariaDB disables max_statement_time by default. So we set it to 25 seconds, which is more than enough.
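
For reference, this is the MariaDB variable we changed; on RDS it is set through a parameter group rather than at runtime, so the statements below are just an illustration of the value:

-- abort any statement that runs for more than 25 seconds
SET GLOBAL max_statement_time = 25;

-- check the current value
SHOW VARIABLES LIKE 'max_statement_time';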

We cycled through some other settings and ended up with this updated configuration:

Default MariaDB configuration (left) and updated configuration (right)

AWS RDS best practices

While the updated configuration gave us clearer error messages in the logs, the instability was still very much present. So we had a closer look at the AWS RDS manuals and stumbled onto the AWS RDS best practices:

An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable.

We had a look at the metrics again and saw this:

Swap in AWS metrics

In addition, as seen above, we had very unstable read IOPS. Now things started to light up. The problem suddenly seemed simple: our database server was not powerful enough; it didn't have enough RAM.

So we upgraded the AWS RDS instances.

That was it:

  • IOPS are low and stable;
  • SWAP is 0Mb and stable;
  • random crashes with high SQL times are gone.

Everything seems so obvious after the fact. We could have just upgraded the server from the start. Adding more servers is a commonly used solution after all. But then again, it's not always scalable to "just add more hardware". More importantly, it doesn't fix the underlying issue in all cases.

In this case it did, though. Going through the process of trying to understand the problem before adding hardware means we now understand IOPS and swap metrics, and we monitor them proactively.

Over the past 3 years, I feel like I've seen a lot of interesting error cases happen as a full-stack web developer focusing on PHP-driven backends. Pretty much any error that I thought would not happen to me ended up happening anyway. Murphy's law.

Here are some of the error cases I've run into and their effects. I'm writing this down as a reminder of which curious error cases should be handled in a product that needs highly reliable execution.

Timeouts on networked services

Connection timeout

Connection timeouts are somewhat special. They happen in some rare cases, such as when domain name resolution is impossible. That can happen when doing an HTTP request to an API.

In PHP, one of the most used HTTP clients is Guzzle. The way Guzzle handles errors can be a bit surprising. Guzzle allows exceptions to be turned off, and I use this feature whenever enhanced error handling is required. You'd think that this makes exception handling useless:

$response = $http->request('GET', '/some-uri', [
    'connect_timeout' => 5,
    'exceptions' => false,
]);

if ($response->getStatusCode() !== 200) {
   // handle unexpected status code
   // ...
   return;
}

Surprisingly, with exceptions turned off in early versions of Guzzle 6, connection timeouts will still raise an exception. More specifically, a ConnectException. This means that exceptions must be handled even when exceptions are disabled, with something like this:

try {
    $response = $http->request('GET', '/some-uri', [
        'connect_timeout' => 5,
        'exceptions' => false,
    ]);
} catch (ConnectException $e) {
    // handle connection timeout
    // ...
    return;
}

if ($response->getStatusCode() !== 200) {
   // handle unexpected status code
   // ...
   return;
}

This behavior has been made clearer in recent Guzzle versions. The exceptions option is now called http_errors, which makes it more obvious that it only prevents exceptions for HTTP requests that get a response (which excludes timeouts).
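
With a recent Guzzle version, the first example would therefore look something like this (a sketch, the rest of the handling stays the same):

$response = $http->request('GET', '/some-uri', [
    'connect_timeout' => 5,
    // 'exceptions' is now called 'http_errors'
    'http_errors' => false,
]);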

Next, let's move on to response timeouts.

Response timeout

The response timeout is a bit different. It happens when the network request takes too long to get a response. In the case of HTTP, that's usually because the headers of the response take a long time to come back.

With Guzzle, that's the timeout option:

$guzzle->request('GET', '/some-uri', [
    'timeout' => 5,
]);

Usually, something like 5 seconds is more than enough. If it isn't, that's often a sign that things need to be rethought.

Requests that take more than 5 seconds are usually requests that do heavy computation synchronously. Turning the heavy computation into an asynchronous task, handled with a webhook or a polling API is preferable if possible.

Whenever heavy computation needs to be synchronous, 5 seconds might not be enough. In that specific case, using idempotency can make the API immune to timeout errors. This article by Stripe on idempotency is a great read.
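
As a sketch of what that can look like with Guzzle, assuming an API that honors an Idempotency-Key header the way Stripe does:

// generate one key per logical operation and reuse it across retries
$idempotencyKey = bin2hex(random_bytes(16));

$response = $http->request('POST', '/charges', [
    'timeout' => 5,
    'headers' => [
        'Idempotency-Key' => $idempotencyKey,
    ],
]);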

Out of memory

Let's move on to another class of errors which happen all too often and can be quite catastrophic, "Out of memory" (or "OOM") errors. These errors happen every time a process uses more RAM than it is allowed to use.

In the case of a PHP website, the maximum amount of RAM is specified with the memory_limit INI directive in the php.ini file. Errors similar to OOM might also occur if PHP receives a file larger than the upload_max_filesize directive or a request body larger than the post_max_size directive.

Here's a quick example of what these values might look like in php.ini:

memory_limit = 64M
upload_max_filesize = 32M
post_max_size = 32M

OOM errors due to exceeding RAM usage (the case of memory_limit being overrun) happen mostly when you load too many objects into memory. That can happen if you do something like this on a very large SQL table:

$query = $dbConnection->prepare("SELECT * FROM blog_posts");
$query->execute();
$result = $query->fetchAll(PDO::FETCH_ASSOC);

Here, fetchAll might try to load 1000 blog posts into RAM at once. This can require more than 64M of RAM and thus cause an "Out of memory" error. This classic case of OOM can be worked around in Symfony, in PHP in general and in other languages with some techniques I've previously described in "Memory management with Symfony commands".
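
A minimal sketch of the same query processed row by row instead; how much the driver still buffers client-side depends on its configuration, but at least the application only hydrates one row at a time:

$query = $dbConnection->prepare("SELECT * FROM blog_posts");
$query->execute();

while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
    // process a single row, then let it go out of scope
}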

For OOM errors that come from requests that are too large (the case of upload_max_filesize or post_max_size being overrun), you might want to look into APIs that let you delegate file uploads to dedicated services. If you're using S3 for file storage, then AWS S3 pre-signed upload URLs might be the way to go.
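
As a rough sketch with the AWS SDK for PHP (bucket, key and expiry are placeholders), the application only hands out a URL and the client uploads directly to S3:

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region'  => 'eu-west-1',
    'version' => 'latest',
]);

$command = $s3->getCommand('PutObject', [
    'Bucket' => 'my-upload-bucket',
    'Key'    => 'uploads/some-user-file.jpg',
]);

// the client can PUT the file to this URL without going through PHP
$presignedUrl = (string) $s3->createPresignedRequest($command, '+15 minutes')->getUri();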

Race condition

During development, there is only one user on a website: the developer. Because of that, race conditions rarely occur during development.

In production, things are different. Lots of users are active at once. That means that the slightest possibility for race condition might become a reality.

The most common error I've seen is the incorrect use of temporary files.

In PHP, a temporary file is created with the tempnam function.

Oftentimes, I've seen code that generates PDFs and looks like this:

$tempfile = tempnam(sys_get_temp_dir(), 'user_data_');
if ($tempfile === false) {
    // handle error
    // ...
    return;
}

@rename($tempfile, $tempfile . '.pdf');
$tempfile = $tempfile . '.pdf';

generatePdf($tempfile);

Note how there is no error handling for the call to rename. That will be key in just a moment.

If this script runs on the same server enough times, eventually, two runs will run into a race condition. User A might end up with the PDF of user B, or vice-versa. More specifically, here's what could happen:

  • user A requests a personal data export in PDF form
  • server creates /tmp/user_data_432565 for user A
  • server renames /tmp/user_data_432565 to /tmp/user_data_432565.pdf
  • user B requests a personal data export in PDF form
  • server creates /tmp/user_data_432565 (which is now available, since the file from user A was renamed to something else) for user B
  • rename to /tmp/user_data_432565.pdf fails because /tmp/user_data_432565.pdf already exists from user A's request, but because there is no error handling, the request of user B continues
  • server puts data of user A in /tmp/user_data_432565.pdf
  • server puts data of user B in /tmp/user_data_432565.pdf
  • server sends /tmp/user_data_432565.pdf to user A and B

And there it is: user A and user B will end up with the data of user B, because it was the last data to be put in the file used by both requests.

Handling all possible errors is of course a simple solution here. But race conditions can occur in harder-to-debug cases as well, such as when you use threads or use the database without proper transaction management.
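
For illustration, here is roughly the same flow with the missing error handling added (generatePdf is the same hypothetical helper as above):

$tempfile = tempnam(sys_get_temp_dir(), 'user_data_');
if ($tempfile === false) {
    // handle error
    // ...
    return;
}

// check the result of rename instead of silencing it with @
if (!rename($tempfile, $tempfile . '.pdf')) {
    unlink($tempfile);
    // handle error
    // ...
    return;
}
$tempfile = $tempfile . '.pdf';

generatePdf($tempfile);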

Let's have a look at another type of error, full disks.

Full-disk

Yes, disks become full. I've had this happen multiple times. Unless you have monitoring systems in place, you probably won't be notified of the disks being full and data will be dropped silently.

The first time I encountered a full disk, we were still at an early stage, using a single server for everything, including storage of uploaded files. These files filled up the disk and at some point, new files could not be uploaded. Too bad, because our code depended on uploads functioning properly.

The second time, we had installed Docker, a piece of software that had known disk usage issues for a long time, which we didn't know about before seeing our website go down.

The third time, it was because we were storing log files in an s3fs mount. What we missed was that a failure to mount the s3fs would not trigger any error and instead, files would silently fill up the server's disk until everything blew up.

So basically, what I learned from all of that is that the more immutable a piece of infrastructure is, the easier it is to reason about. If at all possible, mutable data should be stored on bespoke infrastructure with corresponding monitoring services in place.

Null pointer errors

When used without good reason, nullable types are like cancer. They start small and then spread throughout your codebase before they start causing mayhem.

So what's the problem with nullable types?

The problem with nullable types is not so much null itself. The value null can be handled perfectly well.

Instead, what I've seen time and time again is that junior engineers will use nullable types when they don't want to take the time to understand their software's requirements. They "just" use a nullable type and that's it. They don't worry about what might happen if the value is ever null, because in their initial design, the value is never null.

Then, another engineer comes along and in their new code, the value is null at some point. But the old code doesn't handle null.

Let's look at an invalid use of a nullable type: take an Address type with a streetNumber field that is nullable even though every Address is supposed to have a street number. This kind of thing happens a lot and ends up causing frequent bugs.

And so things start to break down. And the bug tracker starts filling up with errors like these:

nullable-error

So yeah, avoid nullable types if at all possible.

Reasonable uses of nullable types

The only reasonable uses for nullable types are when:

  • you have no time to work properly (which can happen if you need a feature shipped very quickly);
  • or when null is intuitively a valid value.

For instance, here's a valid use case for a nullable type: you have a BlogPost type persisted in a database, and this object has a publishedAt field, which is either a date, if the blog post is published, or null if it is not yet published.

A trick to get around nullable types

I've often seen nullable fields being used as a way to avoid creating a new object. For instance, you have a User type that has addressStreetNumber, addressPostcode and addressCity fields that are all nullable, because maybe the user has not yet informed you about their address — in which case all three fields are null.

User
  addressStreetNumber: string | null
  addressPostcode:     string | null
  addressCity:         string | null

This causes a problem because now you might have users where only addressStreetNumber is null while addressPostcode and addressCity are not. What would that mean? Is such a user valid?

An architectural trick to get around that is to make the User type have an address field which is of type Address. And make the Address type have 3 non-nullable fields named streetNumber, postcode and city. Now, each user can either have an address, or no address. But each address is valid 100% of the time.

User
  address: Address | null

Address
  streetNumber: string
  postcode:     string
  city:         string

Reducing the amount of places where null is used makes code easier to reason about and reduces the number of bugs in the long run.

SSL certificate of API invalid

More and more, SSL certificate renewals are getting automated, what with Let's Encrypt and all. But not everything is automated yet. This leaves room for occasional human errors.

I've had an SSL certificate be updated in a backwards-incompatible manner once. It happened on a Saturday morning, so as not to disturb customers during the most active hours, which in this case were weekday hours.

A critical service should try to handle SSL failures securely and gracefully. That means never falling back to insecure connections, and falling back to a safe alternative whenever possible. As a last resort, dynamically disabling the feature that uses the failing SSL connection whilst having alerts going out to on-call engineers can be a wise temporary solution.

API unavailable

For all of the reasons discussed in this article, web services break. Bugs occur. And servers fail. That's when an API may become unavailable. It may be unavailable for a few seconds only, or for a prolonged period of time.

When an API is unavailable for a brief moment, there is a simple method we've used: retrying. That's when you run the API call multiple times until it works. Usually, retrying will look something like this:

$retries = 5;
while ($retries !== 0) {
    $apiResult = doApiCall();
    if ($apiResult) {
        break;
    }
    $retries--;
}

if (!$apiResult) {
    // handle error
    // ...
    return;
}

// handle successful API call
// ...

Retries are not ideal though: the more you retry, the longer the code runs, so retrying should be kept to a minimum. As software engineers, it is a good idea to assume that APIs can become unavailable at any time, and that failing to handle unavailability gracefully will make our product unusable.
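
When retries are unavoidable, spacing them out gives the API a chance to recover; here is a sketch that reuses the hypothetical doApiCall() from above with exponential backoff:

$maxRetries = 5;
$apiResult = null;

for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
    $apiResult = doApiCall();
    if ($apiResult) {
        break;
    }
    // wait 1s, 2s, 4s, 8s... between attempts
    sleep(2 ** $attempt);
}

if (!$apiResult) {
    // handle error
    // ...
    return;
}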

Recently, I've been looking into performance issues on an SQL-based search feature. The website in question, built on Symfony, paginates with the PagerfantaBundle. It seems that optimizing this bundle for use with large tables is a somewhat common issue.

Use IN instead of = xxx OR

This one is simple enough. Using WHERE x IN (a, b, c) is faster than using WHERE x = a OR x = b OR x = c. Doctrine's query builder makes this simple. Let's say we want to list users that are banned or suspended; we could do something like this:

$statuses = [
  1, // banned
  2, // suspended
];
$qb->andWhere($qb->expr()->in('user.status', $statuses));

In such a case, having an index on user.status helps too (see "Indexes" further down).

When joining, put ORDER BY in subquery

In SQL, at least in MariaDB, there is a performance issue known as "late row lookups", which can be worked around by putting the ORDER BY clause inside a subquery. Although it sounds counter-intuitive to me that adding a subquery would speed things up, it does work.

The idea is to change something like:

SELECT my_user.id, my_cars.id
FROM my_user
JOIN my_cars ON my_cars.user_id = my_user.id
ORDER BY my_user.id DESC

To something like:

SELECT DISTINCT my_user.id, my_cars.id
FROM
    (
        SELECT ordered_user.id
        FROM my_user ordered_user
        ORDER BY ordered_user.id DESC
        LIMIT 999999999
    ) ordered_user
JOIN my_user ON my_user.id = ordered_user.id
JOIN my_cars ON my_cars.user_id = my_user.id

The LIMIT 999999999 is required because MariaDB will only use ORDER BY in a subquery if a LIMIT is also provided.

Avoid prefix wildcards with LIKE

When an SQL query uses a prefix wildcard, indexes are not used: even if the email column is indexed, the index will not be used. On a large table, this query will be slow:

SELECT * FROM my_user WHERE email LIKE '%john@example.com%'

Instead, more often than not, there are ways to avoid the prefix wildcard. For instance, in a CRM, it is often the case that a customer service representative copy-pastes the email address of a customer. This means the email is exact, and I can run this query instead:

SELECT * FROM my_user WHERE email = 'john@example.com'

Or, if I want speed and a bit of flexibility, I can match a prefix, in which case the index is used:

SELECT * FROM my_user WHERE email LIKE 'john@example.com%'

This works especially well with data types such as phone numbers. If a searched phone number is valid — which in PHP can be checked with Google's libphonenumber library — it is probably OK to do a WHERE phone = xxx query. If, on the other hand, the phone number is incorrect, that may mean that it is incomplete, in which case I can fall back to a slower but more flexible WHERE phone LIKE '%xxx%' query. I find that using data validation can be a good way to build efficient SQL queries on the fly and reduce query time in most cases, reverting to slow queries only when data validation leaves me unsure what to do.
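
As a sketch, assuming the PHP port of libphonenumber (the giggsey/libphonenumber-for-php package), a default region of FR and a $searchTerm variable coming from the search form:

use libphonenumber\NumberParseException;
use libphonenumber\PhoneNumberUtil;

$phoneUtil = PhoneNumberUtil::getInstance();
try {
    $isValid = $phoneUtil->isValidNumber($phoneUtil->parse($searchTerm, 'FR'));
} catch (NumberParseException $e) {
    $isValid = false;
}

if ($isValid) {
    // complete phone number: exact match, the index on the phone column can be used
    $sql = 'SELECT * FROM my_user WHERE phone = :phone';
} else {
    // incomplete input: fall back to the slower wildcard search
    $sql = 'SELECT * FROM my_user WHERE phone LIKE :phonePattern';
}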

Don't show page numbers unless you need to

Pagerfanta's Doctrine ORM adapter is initialized like this, where $qb is a Doctrine query builder:

$pager = new Pagerfanta(new \Pagerfanta\Adapter\DoctrineORMAdapter($qb));

This Pagerfanta adapter counts the number of pages. This helps display something like this in the front-end:

pagination-1

This looks nice, doesn't it? Displaying this data seems to have become kind of an automatism; its usefulness is not necessarily questioned. But is it useful? In the case I'm working on, it's useless. What matters more is to display the right data on the first page.

And fetching the number of pages from the database takes time. Getting rid of it completely may just be an acceptable solution. This is how I do that with Pagerfanta:

$pager = new Pagerfanta(new \Pagerfanta\Adapter\CallbackAdapter(
    function() use($limit, $offset) {
        // approximation that allows always having one more page so as to get a "Next page" link
        return $limit * 2 + $offset;
    },
    function() use($qb) {
        return $qb->getQuery()->getResult();
    }
));

Setting indexes

Indexes can speed up queries in some situations. In a Symfony website, indexes can be set up with the Index annotation on an entity class:

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Table(name="my_user", indexes={
 *   @ORM\Index(name="idx_status", columns={"status"}),
 * })
 * @ORM\Entity(repositoryClass="AppBundle\Repository\UserRepository")
 */
class User
{
    /**
     * @var integer
     *
     * @ORM\Column(type="integer")
     */
    protected $status;
}

Composite indexes can be used to further speed up specific types of queries that use WHERE on multiple fields in a single query.
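
For instance, if a search always filters on status and city together (city being a hypothetical column here), a composite index could be declared like this:

/**
 * @ORM\Table(name="my_user", indexes={
 *   @ORM\Index(name="idx_status_city", columns={"status", "city"}),
 * })
 * @ORM\Entity(repositoryClass="AppBundle\Repository\UserRepository")
 */
class User
{
    // ...
}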

Recently, I've been migrating a website from AWS to Heroku to reduce the maintenance burden of AWS EC2 instances and make the website more reliable.

Planning for failure

Before migrating the servers, I make sure there is an easy way to switch back to the old infrastructure, just in case something goes wrong. That turns out to be useful later on.

The domain of the website points to an AWS ELB via a CNAME record. Heroku also uses CNAME records. So, to switch back from the new, potentially buggy infrastructure to the old one, all I have to do is keep the old servers up and update the CNAME record.

Having a low "Time To Live" (TTL) for the CNAME record is important to enable quick reverts.

ttl

Failing to handle the load

Before moving to Heroku, I try using another PaaS, an alternative to Heroku if you will. It seems very simple to use. No configuration whatsoever. At first sight, I'm very enthusiastic about it. And in my initial tests, the website works nicely with it. No lag. No crashes. No timeouts. I think I'm off to a great start.

But then my enthusiasm catches up to me. I flip the DNS switch over to the new infrastructure. Almost instantly, everything starts crashing. Hundreds of timeouts. The website is completely unusable. Stress starts growing. Support requests start coming in to ask what's going on. At this point, I call the PaaS support team for help. But they are unable to assist me.

So I look for possible bottlenecks. That too, is difficult, because the PaaS provides almost no monitoring tools. That's when I realise how important monitoring tools are. Not to mention how much more I should have load tested the new infrastructure before putting it on the front-line.

So yes, using this PaaS is a mistake. It doesn't work for my use case. So I switch back to the old AWS infrastructure. Instantly, the website starts delivering quick responses again.

That's how I understand some of Heroku's value proposition for PHP hosting:

  • it lets me customize the Nginx & PHP configuration;
  • it lets me scale the number of servers up and down on demand;
  • it lets me monitor incoming traffic.

But before blindly trusting Heroku's feature-set like I have done with the first PaaS, I want to make sure that, this time around, the new infrastructure works properly under production conditions.

What to load test

The website I'm migrating has different kinds of work loads: asynchronous tasks run by workers, static content via Wordpress, APIs which themselves call out to multiple APIs, etc. All of these things have to at least be as fast and stable as they are with AWS.

My first pain point is understanding what the bottleneck would be. What should I load test?

My friend Kim is of great help. He gives me tips about log analysis and walks me through the AWS monitoring dashboards, among other things. They provide lots of useful graphs that help me understand the number of concurrent requests I should plan for. Kim also recommends I take a close look at the number of concurrent database connections and transactions in the AWS RDS metrics, which helps define more accurately which tests are required.

ec2-monitoring

For inspiration, I highly recommend the article about how GitHub tested a change to the way merges were processed. Not everyone needs to go as far as GitHub, but if GitHub is as stable as it is despite its high number of users, it is probably because of all the work that goes into testing changes.

How to load test

After I have a somewhat good idea of what to load test, I run tests with more traffic than what would happen in production for the different kinds of workloads that the website runs. If the servers hold up, that's a good indication that they would also remain available under production traffic.

I hear about ab, the "Apache HTTP server benchmarking tool". What it does is throw a specific number of requests at a specific URL with a predefined concurrency level. Here is what it looks like for 1000 requests with a concurrency of 200:

ab -n 1000 -c 200 https://mywebsite.herokudns.com/some-url-to-load-test

The output looks something like this:

ab-results

The load tests of all workloads show good results. Under high-load scenarios, increasing the number of servers is painless and doesn't interrupt live traffic. Everything looks good. So I go ahead and flip the DNS switch again. We are now in production with the new Heroku-based setup.

Edge case detection through monitoring

Heroku has a very helpful "Metrics" view, which displays the number of 4xx and 5xx requests. That's a good first indicator when it comes to the number of failed requests.

logs

The automated alerts that Heroku sends out can be a bit out of place, but they are generally helpful and help avoid downtime. At the slightest rise in failed requests, a notification is sent out, which can be used to take action: increase the number of servers, look for ways to optimize CPU/memory use, etc.

Looking at Google Analytics performance metrics and PageSpeed Insights can also help track down some bugs.

Some unexpected slowness does appear though:

  • Heroku disables GZIP by default, which slows down requests for static content like JavaScript and CSS;
  • Heroku has a default memory limit per request of 32MB, which breaks requests that use more than 32MB of RAM;
  • Moving from a single server on AWS to load-balanced Heroku breaks file-based caching.

I guess the main takeaway for me is that it takes much longer than expected to iron out all of these bugs.

Just a quick note on /bin/sh. It doesn't always point to the same underlying binary.

/bin/sh can point to any other shell

/bin/sh is often a symlink to /bin/bash. But on newer Ubuntu releases, /bin/sh points to /bin/dash, and Debian does the same by default. My laptop, which runs Arch Linux, has /bin/sh pointing to /bin/bash.
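
A quick way to check what /bin/sh resolves to on a given machine:

readlink -f /bin/sh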

Following this, I find it useful to use only one shell in a given codebase. bash is widely used and recommended in Google's shell style guide, so it can be a good default.

Most PHP projects eventually pull in open source libraries from Composer. All of these Composer dependencies must be updated regularly to avoid security holes or incompatibility issues with new software. I do a couple of things before updating a dependency to make sure the update doesn't break anything.

Figuring out what might break

Some Composer packages provide a CHANGELOG.md file, which is a great way to know if updating will break a website. But reading through dozens of changelogs can take a long time. Because most Composer packages (or at least well known ones) rely on SemVer to publicize what might have changed between two releases, there is usually no need to read through changelogs.

SemVer says that any software package should have a version of the form <major>.<minor>.<patch>, for instance 5.1.34.

Whenever the major part changes, there is a breaking change in the package. In that case, I verify whether any code is affected and needs to be updated. Whenever the minor part changes, there shouldn't be any breaking change, but since in practice I've seen breaking changes in minor updates, I prefer to read the changelog. I have very rarely seen breaking changes with patch updates, although they can happen if the code relies on the buggy behavior.
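
This is also why the caret constraints in composer.json are usually safe: a hypothetical requirement like the one below allows minor and patch updates but never a new major version.

{
    "require": {
        "guzzlehttp/guzzle": "^6.3"
    }
}

With this constraint, composer update may install 6.3.4 or 6.4.0, but never 7.0.0.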

Viewing composer version changes after an update

To get an overview of version changes after a composer update, I use git diff and grep:

git diff composer.lock | grep -B1 -E '            "version"'

The output looks like this:

composer-update-diff

From there, I take a closer look at the changelogs from packages for which a major or minor version has changed.

Here are some of the things I think are useful to keep memory usage under control in Symfony commands that process large databases.

Batching

Two naive approaches to looping over all contents of a database would be:

  • Select all data at once and migrate it in one run: the network is only accessed once but this can result in an "Out of memory" error because there is probably too much data for the server's RAM;
  • Run the migration on each row, selecting rows one by one: memory usage is low because we only ever load one row into memory at a time, but the network is accessed as many times as there are rows, which is very slow.

Using batching allows having a somewhat low and stable memory footprint while reducing network overhead. With batching, I select rows 100 by 100 (or 1000 by 1000, or whatever fits your use case) and run the migration on each batch. It might look something like this in PHP:

$batchSize = 100;

// SQL's "OFFSET" tends to be slow on large tables. A faster approach is to select rows ordered by ID
// and to set a minimum ID with a "WHERE" clause. Something like:
//
//     SELECT * FROM user WHERE id > :minimumId ORDER BY id ASC
//
// This achieves the effect of an "OFFSET" but is significantly faster.
$minimumId = 0;

do {
    $batch = getBatchOfUsersOrderedById($minimumId, $batchSize);
    $batchCount = count($batch);
    if ($batchCount > 0) {
        $minimumId = $batch[$batchCount - 1]->getId();
        foreach ($batch as $user) {
            // ...
        }
    }
} while ($batchCount === $batchSize);

Putting the main loop's body inside a separate method

PHP automatically clears memory of variables that go out of scope, somewhat like what C does with its stack variables. Any variable that is created in a function is cleared once the function ends. To allow PHP to clear more memory, I put the main loop's content inside a separate method. It looks something like this:

$batchSize = 100;
$minimumId = 0;

do {
    list($batchCount, $minimumId) = $this->runOneBatch($batchSize, $minimumId);
} while ($batchCount === $batchSize);

And in a separate method named runOneBatch:

function runOneBatch(int $batchSize, int $minimumId): array {
    $batch = getBatchOfUsersOrderedById($minimumId, $batchSize);
    $batchCount = count($batch);
    if ($batchCount > 0) {
        $minimumId = $batch[$batchCount - 1]->getId();
        foreach ($batch as $user) {
            // ...
        }
    }
    return [$batchCount, $minimumId];
}

Marking unused variables as collectable by garbage collection

In PHP, variables are garbage collected, which means that I usually don't have to manage the memory they use manually. Instead, PHP handles memory management automatically. This works well most of the time. But for long-running tasks, it has its limits. Because PHP does not always know that a variable has become useless, it keeps it in memory just in case. This increases memory usage.

To tell PHP that a variable has become useless, I set it to null and unset() it:

$users = loadLotsOfUsersFromMysql();

// ... do some stuff with the users ...

$users = null;
unset($users);

Triggering garbage collection

PHP's garbage collector uses heuristics to figure out when the best time is to remove unused variables from memory. In long-running tasks, I find it useful to manually trigger a GC cycle:

gc_collect_cycles();

Clearing Doctrine's entity manager

Internally, Doctrine, the ORM used with Symfony, keeps a reference to each entity it has loaded into memory, unless I explicitly tell it to drop the reference. To keep memory usage down, I like to clear the entity manager, which removes references to all entities currently known to Doctrine:

/** @var \Doctrine\ORM\EntityManagerInterface $em */
$em = $this->getContainer()->get('doctrine')->getManager();
$em->clear();

SQL logging

The SQL logger of Doctrine is enabled in Symfony by default and can be a source of memory leaks during long-lived commands. I usually disable the SQL logger in configuration or with something like this:

/** @var \Doctrine\DBAL\Connection $connection */
$connection = $this->getContainer()->get('doctrine')->getManager()->getConnection();
$connection->getConfiguration()->setSQLLogger(null);

Here's what I understand from data science after taking Coursera's crash course on data science from Johns Hopkins University.

What data science is

Data science is the science of answering questions with data.

Use of statistics in data science

It usually involves the use of statistics, which can be broken down into 4 things.

  1. Descriptive analysis, which is the exploration of the data to get an idea of what it tells, this can include creating graphs or visualizations of raw data;
  2. Inference, which is drawing of conclusions from samples, this might be an estimation of the number of voters for the past polls, inference can be seen as the formalization of the descriptive analysis;
  3. Prediction, which is an attempt to say what the future data will be based on current or past data, this takes the form of linear regression or other machine learning algorithms, as well as deep learning with constructs such as neural networks;
  4. Experimental design, which is the definition of methods needed to conduct meaningful experiments, randomization of the population group is a key aspect of experimental design as it helps to get better generalizability.

Machine learning

There are 2 main types of machine learning:

  1. Unsupervised learning: trying to see things previously unseen - for instance, groups of people within a population, such as "voters of the Republicans" or "voters of the Democrats";
  2. Supervised learning: trying to predict something based on previously known data - for instance, trying to predict the height of a son based on the height of his father.

Even though they are related, there are some differences between traditional statistics and machine learning. Traditional statistics cares primarily about model simplicity and interpretability. Machine learning cares primarily about model performance.

Structure of a data science project

The structure of a data science project is as follows:

  1. Asking a question whose answer will determine what to do next;
  2. Gathering data that will help answer the question;
  3. Exploratory data analysis, to get a vague idea of what an answer might be;
  4. Formal data analysis, to get a deeper understanding of the data;
  5. Interpretation, to infer conclusions from the data samples;
  6. Communication of the experiment, which gives the answer to the question;
  7. The data science project is over; now it's decision time.

Sometimes, exploratory data analysis can come first, leading to questions which you can then try to answer using the steps above.

Illusion of progress

The course mentions the paper Classifier Technology and the Illusion of Progress, by David John Hand. The paper explains how a lot of current machine learning algorithms may well be too complicated to yield value in real-life projects. Some technologies have a theoretical advantage that is very hard to transpose to the real world due to unexpected edge cases or a lack of domain knowledge on the part of the executor.

Output of a data science project

A data science project should have some kind of output. Ideally the output should be reproducible, meaning it should contain all the data and code needed to run the experiment from scratch. This is useful for auditing the experiment or for running it with another data set.

The output of a data science project could be:

Good outcomes

The best outcome of a data science project is to have an answer to the question the project is meant to answer. That and the previously discussed output. Some other good outcomes may be increased knowledge or the conclusion that the data set is not suited to answer the question, in which case you can decide to look for more data or to drop the question. Either way, it's useful to know if the data set is insufficient.

Bad outcomes

Sometimes, even in the hands of smart people, machine learning can make bad predictions, as was the case with some predictions by Google Flu Trends:

In the 2012-2013 season, it predicted twice as many doctors’ visits as the US Centers for Disease Control and Prevention (CDC) eventually recorded. — New Scientist

And sometimes, machine learning can yield very good performance but be too expensive to run at scale in production, as was the case with an algorithm meant to improve Netflix recommendations:

(...) the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. — Netflix blog

Some other papers are also mentioned in the course.

I'm trying to better understand where endianness comes from and its impacts on code performance. Here's what I've got so far.

Description

First, here's how integer 0x0A0B0C0D, or 168496141, is stored in little-endian and big-endian.

In little-endian, the least significant bytes come first:

address | byte value
--------------------
0x00    | 0x0D -> least significant
0x01    | 0x0C
0x02    | 0x0B
0x03    | 0x0A

In big-endian, the most significant bytes come first:

address | byte value
--------------------
0x00    | 0x0A -> most significant
0x01    | 0x0B
0x02    | 0x0C
0x03    | 0x0D

Network performance

Endianness is somewhat linked to networking, because network byte order is always big-endian. So I wonder if endianness could have an impact on the performance of networking code.

The question "What are the pros and cons of little-endian versus big-endian?" on Quora seems to indicate that neither one is better than the other. In addition, the Intel 64 instruction set has a BSWAP instruction that is specially made to convert integers from big-endian to little-endian and vice-versa. So even with an x64 CPU, endianness looks to be heavily optimized.

bswap

This instruction is used by operating system kernels. For instance, on my Arch Linux installation, BSWAP appears in the /usr/include/bits/byteswap.h header:

byteswap-header

I trust that Intel optimizes these functions as much as possible. So I suppose the performance impact of calling BSWAP is small. In addition, there are several cases where swapping byte order can be completely avoided:

  • big-endian machines already use network byte order internally, so they don't need to swap byte order before sending data off to the network;
  • little-endian machines that talk to each other over TCP do not need to swap bytes since the bytes are read in-order on the receiving end.

The only case where endianness should really cause a slowdown is when a big-endian machine talks to a little-endian machine.

Math and typecasts

Akash Sharma makes a good point for little-endian. He explains how little-endian makes common maths operations easier:

Consider an example where you want to find whether a number is even or odd. Now that requires testing the least significant bit. If it is 0 it is even. In Little Endian Ordering least significant byte is stored at starting address. So just retrieve the byte and look at its last bit. — Akash Sharma

And how little-endian makes typecasts to smaller types easier:

For example you want to typecast a 4 bytes data type to 2 bytes data type. In Little Endian Ordering it is pretty straightforward as you just need to retrieve the first 2 bytes from starting address in order and you would get the correct number. — Akash Sharma

Consistency and the coin toss

The paper "On holy wars and a plea for peace" by Danny Cohen also gives some interesting background behind the decision to use big-endian for network byte order. This paper is referenced in rfc1700 which defines network byte-order.

Cohen's paper points to the inconsistency of some little-endian systems back in the eighties, when the paper was written:

Most computers were designed by Big-Endians, who under the threat of criminal prosecution pretended to be Little-Endians, rather than seeking exile in Blefuscu. They did it by using the B0-to-B31 convention of the Little-Endians, while keeping the Big-Endians' conventions for bytes and words. — Danny Cohen

This sounds like a joke, but Cohen goes on to describe how the M68000 microprocessor was little-endian for bits but big-endian for words (bytes), double-words and quad-words. And he points to the better consistency of some big-endian systems:

The PDP10 and the 360, for example, were designed by Big-Endians: their bit order, byte-order, word-order and page-order are the same. The same order also applies to long (multi-word) character strings and to multiple precision numbers. — Danny Cohen

Finally, it ends by saying that making a decision and having all computer scientists agree on it will be extraordinarily difficult. For this reason, it suggests:

How about tossing a coin ??? — Danny Cohen

Almost 40 years later, the network folks have settled on big-endian and the majority-share CPUs (x86, ARM) are little-endian. In software development though, the choice is not clear-cut. Luckily, there is probably no need to worry about bit-endianness, only byte-endianness. Partial win, I guess.

I'm looking to improve test coverage for Rooster, a simple password manager for geeks I've been building with the help of the community.

Why add tests now

A few days ago, contributor Dmitry V. Roenko updated deprecated serialization code and added unit tests to his work. Oleg Yamnikov, another contributor behind the recent command improvements and the upcoming import command, tells me he'd like to write tests as well.

Inspired by them, I challenge myself to write 1 test a day for the next 30 days with the hope that this makes testing fun and digestible, whilst revealing unnoticed bugs and making it easier to contribute code to Rooster.

Implemented tests

Here's a list of tests I add during these 30 days:

Overall, it turns out to be both a fun experiment as well as a good way to improve the quality of Rooster.

For some time, I've wanted a backup of my dotfiles, just in case.

Backing up configuration files

I could keep a Git repository, but currently, a simple folder on my laptop, for which I have complete backups via restic, is good enough for me. I rarely change my dotfiles, and I prefer using tools that do the heavy lifting for me without too much configuration.

Restoring our configuration

Whenever I need to restore dotfiles, I have this little script that will make symlinks as needed:

#!/bin/bash

set -euo pipefail

cd dotfiles
for f in $(find . -type f | sed 's#^./##'); do
    # create the directory in which to put the config file's symlink
    echo -n "mkdir $HOME/$(dirname "$f")"
    read
    mkdir -p "$HOME/$(dirname "$f")"

    # create the symlink
    echo -n "ln -sf $HOME/dotfiles/$f $HOME/$f"
    read
    ln -sf "$HOME/dotfiles/$f" "$HOME/$f"
done
cd ..

It shows me what it is about to do, asks for confirmation and then creates the symlink where it should be. For instance, ~/dotfiles/.bashrc gets symlinked to ~/.bashrc. Then, when Bash looks for its configuration, it will go to ~/.bashrc, which points to ~/dotfiles/.bashrc.

I'm setting up my Thinkpad with Arch Linux and full-disk encryption.

Veracrypt

The first encryption method I look at is Veracrypt. During a job interview, a company asked me questions about Veracrypt. I didn't know what it was. The interviewer told me it has a feature called "encrypted containers" that allows you to encrypt a specific set of files on a computer to keep them safe: SSH keys, password files, etc.

It also seems like with Veracrypt, as long as I have 1.5 times the used disk space, I can encrypt my entire disk without having to reinstall everything. That would save a lot of time. Unfortunately, I don't have enough free disk space for that though.

So, I need to reinstall my system.

dm-crypt and LUKS

The Arch Linux wiki often gives good tips on how to do things "the Arch way". What it recommends is not Veracrypt. Instead, searching "arch full disk encryption" yields this documentation about dm-crypt. The section named "Simple partition layout with LUKS" seems to be exactly what I need, no more, no less:

  • data is protected in case of physical theft of the computer;
  • easy to install;
  • single pre-allocated disk partition.

Things that are possible with other dm-crypt setups but that I don't need:

  • possibility to change partition layouts;
  • plausible deniability in case the computer is seized by the courts.

In addition, dm-crypt seems to be somewhat compatible with TPMs, a security feature my computer has. Maybe I can use that later if I choose dm-crypt now. So dm-crypt it is.

On YouTube, I find this tutorial that explains how to install Arch Linux with dm-crypt: https://www.youtube.com/watch?v=AQxk1uUQ-pQ.

Some things in this video are unknown to me.

First, the author talks about the filesystem btrfs. It looks like this filesystem has forward error correction of corrupted files.

Second, the author speaks about the role of the "hardware clock" that is configured with the hwclock command. A Google search tells me that this clock, unlike the one Linux maintains while running, keeps going after I turn off the computer, so that when I boot it up again the next day, Linux can know what time it is without asking the internet. The hwclock man page tells me that the hardware clock is sometimes called the CMOS clock. From what I remember, CMOS is the name of the BIOS battery. If that's the case, then we can rely on it, as it can apparently keep working for 6 years or more.

One more thing I learn in this video is the existence of the syslinux bootloader. Most tutorials on the internet talk about grub, which seems to be the default bootloader when you want something that works fast without having to tweak a lot of settings. syslinux looks more lightweight and thus a little less powerful. But I don't need all of grub's features (custom filesystems, encrypted /boot partition, etc.). So I do as the video author suggests and set up syslinux.

In the end, I do end up with a working system and full disk encryption, except for the boot partition. There are still a few errors that show up on boot. They don't seem critical but should be looked at.

UEFI

One thing I notice about the previous install is that I used the legacy BIOS instead of the more modern UEFI. This works just fine in 99% of cases because UEFI's advantages seem aimed at enterprise uses, not so much at PC use. But I like bleeding edge stuff, so I decide to look for a way to reinstall in UEFI mode. As a side note, I later learn that UEFI is not only aimed at enterprise use cases but is also the de-facto standard for modern computers, so I'm glad to be using it now.

Once again, there is a good YouTube tutorial on installing Arch Linux with UEFI: https://www.youtube.com/watch?v=MMkST5IjSjY

In this tutorial, I discover yet another bootloader, systemd-boot. It looks similar to syslinux, but it comes pre-installed with systemd, which is the service manager in Arch Linux. This means that it will be on my system in all cases. I might as well just use it.

In deciding which bootloader to settle for, I found this schematic about boot loaders — released under the terms of the GNU FDL license — from the Arch wiki to be very helpful:

bootloader

It would be interesting to use btrfs for the /boot partition to prevent file corruption. But I'd have to use syslinux for that, which means installing another software package. I'll stick with systemd-boot and a vfat filesystem for the /boot partition. After finishing up the UEFI and systemd-boot installation, I reboot my computer.

PCCT error

Now comes the time to fix the pesky errors that show up during boot. They are still there.

pcct

Looking for Error parsing PCC subspaces from PCCT on Google, I can see that I'm not alone in this. A Redditor suggests downgrading from the bleeding edge Linux kernel to the LTS ("Long Term Support") version:

Have you tried rolling back the kernel or installing and running linux-lts? — monotykamary

The LTS kernel should be more stable so I try applying this piece of advice.

pacman -S linux-lts

And regenerate the ramdisk used during boot:

mkinitcpio -p linux-lts

Finally, I update the boot configuration in /boot/loader/entries/arch.conf so it looks like this:

title Arch Linux
linux /vmlinuz-linux-lts
initrd /intel-ucode.img
initrd /initramfs-linux-lts.img
options cryptdevice=/dev/sda2:root root=/dev/mapper/root rw

Now I reboot. And the errors are gone.

MMC error

There is another error that shows up during boot, something like mmc0: Unknown controller version (3). You may experience problems. This is apparently just a warning, so there is no strict need to fix it right now. But I don't like seeing warnings every single time I boot my laptop.

mmc

A bug report on the Linux bug tracker suggests setting the debug_quirks2 option of the sdhci kernel module to 0x4. I try doing that by creating a file at /etc/modprobe.d/mmc.conf with the following content:

options sdhci debug_quirks2="0x4"

After regenerating the initramfs and rebooting, the error still shows up though.

Since I very rarely use the SD card slot, which is what mmc is here for, I decide to simply deactivate kernel modules related to memory cards in /etc/modprobe.d/mmc.conf:

blacklist mmc_core
blacklist sdhci_pci
blacklist sdhci

After calling mkinitcpio -p linux-lts and rebooting, all errors are gone. If I ever need the memory card slot, I'll have to remember to reactivate the kernel modules first.

I'm taking the deep dive into x86-64 assembly. Or at least I'm dipping my toes.

Introduction to assembly

Because my computer has a 64-bit CPU, the easiest way to get started is probably to use the assembly that runs on it. That would be x86-64, although I sometimes see the terms x64 or amd64 being used too. I am new to assembly: I have seen some assembly code and studied 68K assembly at university, but that was only on paper. I've never actually compiled and run assembly code. I want to remedy that.

Looking for x64 linux assembly on DuckDuckGo, I find this Stackoverflow thread that has some good tips on learning assembly, both on what to do and what not to do.

What not to do:

The Intel documentation is a bad place to start learning assembler.

What to do:

Take a look here (http://asm.sourceforge.net) it's the best place for Linux Assembly development, you will find resources, docs and links.

This website, asm.sourceforge.net, seems a bit complicated though. So I try to find some easier guides, ideally aimed at total beginners. This series of tutorials by 0xAX turns out to be a goldmine: it documents the author's learning steps for x86-64 assembly and looks like a great place to start. It lists basic information like which registers are available and their different forms (64 bits, 32 bits, etc.) with the corresponding names (i.e. rax, eax, ax, al for the first register).

It gives the command I need to install the NASM compiler on Ubuntu:

sudo apt-get install nasm

And the commands I need to compile and link an assembly file to actually get a working program:

nasm -f elf64 -o hello.o hello.asm
ld -o hello hello.o

Like assembly itself for the most part, nasm is new to me and I'd like to know more. DuckDuckGo leads me to the nasm documentation. I highly recommend glancing through it: it is easy to read and contains lots of tips for assembly development with nasm.

nasm-docs-1

One thing this documentation teaches me is that there are 3 different ways to define strings with nasm:

db 'Hello world', 10 ; with single quotes
db "Hello world", 10 ; with double quotes
db `Hello world\n`   ; with backticks, \n, \r, \t, etc are available

This documentation also gives a trick to get the length of a string. It looks like a pointer subtraction, $ being the current assembly position, i.e. right after the end of the string, and message being the pointer to the beginning of the string:

message db  'hello, world'
msglen  equ $-message
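
The same idea can be written in C as a loose analogy (my own illustration, not something from the nasm docs): take a pointer just past the last character and subtract the pointer to the first one.

#include <stdio.h>

int main(void) {
    const char message[] = "hello, world";
    // `end` plays the role of `$`: the position right after the string
    const char *end = message + sizeof message - 1;
    printf("msglen = %lu\n", (unsigned long)(end - message)); // prints 12
    return 0;
}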

Calling conventions

School 42 has a course on assembly. It says the following:

Be aware of the "calling conventions"

Not knowing what "calling conventions" are, I go back to DuckDuckGo which leads me to wiki.osdev.org:

calling-conventions-1

What I understand from this graphic is that the "calling convention" defines what registers have which role when I call a function. For instance, if I call a function with 2 parameters, I'll put the first parameter value in rdi and the second parameter value in rsi. The function must put the return value in rax.

This means that, where I would do this in C:

int return_value_of_my_function = my_function(42);
printf("%d\n", return_value_of_my_function);

In assembly, I would do this (pseudo-code):

rdi <= 42
call my_function
rdi <= "%d\n"
rsi <= rax
call printf

Stack alignment

In school 42's e-learning platform, the video dedicated to assembly talks about "stack alignment", without giving much information about what it is. Searching for "stack alignment" on DuckDuckGo, I find this StackOverflow answer on "stack alignment" which is a great introduction to the topic. Worth a read!

First program, without libc

Now that I have the basics down, I want to create a real program. One that I can run in my terminal. I start with "Hello world".

To write a functional "Hello world", I need to call the write system call. How can I do a syscall with nasm though? A Stackoverflow question about syscalls suggests reading the psABI-x86-64. And here is what that document says:

A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

The number of the syscall has to be passed in register %rax.

Moreover, I know from looking at the ELF file format that read-only data is saved in the .rodata section and that executable code is saved in the .text section. So here is the first version of assembly "Hello world":

section .rodata
    msg:    db 'hello, world', 10
    msglen: equ $-msg

section .text
    main:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall

What's happening here? I put 1 into rax and then use the syscall instruction. This calls write, which has syscall number 1, as can be seen in this syscall table. Then I compile and run:

nasm hello.s -f elf64 -o hello.o && ld hello.o -o hello

But, I get a warning:

ld: warning: cannot find entry symbol _start; defaulting to 0000000000400080

I guess there is a mistake here. It looks like, unlike with C, the entry point can't be called main. Looking for "cannot find entry symbol _start asm" on DuckDuckGo, I find out that my code should look like this:

.text
        global _start

    _start:
        ; code goes here

The global keyword indicates that the symbol _start is accessible from outside the current module, hello.s in this case. And looking at elf's man page, I can actually see that this global keyword is translated into a STB_GLOBAL flag:

stb-global-1

Anyway, I replace main with _start and add global _start:

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global _start

    _start:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall

Then, I recompile and execute the program with ./hello:

Hello world
Segmentation fault (core dumped)

Woops! But I'm not the only one with this issue, as this Stackoverflow question on segmentation faults with NASM shows:

Because ret is NOT the proper way to exit a program in Linux, Windows, or Mac!!!! For Windows it is ExitProcess and Linux is is system call - int 80H using sys_exit, for x86 or using syscall using 60 for 64Bit or a call to exit from the C Library if you are linking to it.

So I try using the exit syscall:

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global _start

    _start:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall
        ; exit(0)
        mov rdi, 0
        mov rax, 60
        syscall

This time, compiling and running works as expected.
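
Just to double-check my understanding of which value goes where, the same two system calls can also be written in C using glibc's raw syscall() wrapper. This is a quick sketch of my own, not something from the tutorial:

#define _GNU_SOURCE // for syscall()
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "Hello world\n";

    // same mapping as the assembly: the syscall number (rax) comes first,
    // then the arguments that would go into rdi, rsi and rdx
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    syscall(SYS_exit, 0);
    return 0; // never reached, SYS_exit terminates the process
}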

First program, with libc

Now that I've got a working program without libc, I want to try adding libc into the mix. I'd also like to use printf instead of the write syscall. The first step is to try compiling with libc, using gcc instead of ld:

nasm hello.s -f elf64 -o hello.o && gcc -Wall -Wextra -Werror -o hello hello.o

I get an error. This time, it looks like main is missing:

hello.o: In function `_start':
hello.s:(.text+0x0): multiple definition of `_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.text+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
collect2: error: ld returned 1 exit status

So I go back to the code and replace _start with main. I also replace exit(0) with return 0 because, when main returns, the libc startup code calls exit with main's return value.

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global main
    main:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall
        ; return 0
        mov rax, 0
        ret

After compiling and running ./hello, I get "Hello world". It works as expected.

Replacing write with printf

Now that I have the libc in place, my next goal is to replace write with printf. On 42's e-learning platform, there was an example printf call that showed I had to use extern printf. The extern keyword is described in the NASM documentation:

EXTERN (...) is used to declare a symbol which is not defined anywhere in the module being assembled, but is assumed to be defined in some other module and needs to be referred to by this one.

This is what I get once I've added these bits of code:

section .rodata
    format: db 'Hello %s', 10
    name:   db 'Conrad'

section .text
        global main
        extern printf
    main:
        ; printf(format, name)
        mov rdi, format
        mov rsi, name
        call printf
        ; return 0
        mov rax, 0
        ret

To make things a bit more interesting, I set printf's first argument to a format string as opposed to a plain string.

After compiling and running the program, I get a Segmentation fault once again. Because printf takes a variable number of arguments, I think maybe I'm not using the calling convention correctly. As is often the case, Stackoverflow has a question and an answer about the printf calling convention:

Yes, RAX (actually AL) should hold the number of XMM registers used.

I have no idea what XMM registers are though, so I don't think I've used any. I try to add mov rax, 0 before calling printf and see what happens:

Hello Conrad
Conrad

Better! The segfault is gone. But for some reason, my first name is output twice. After fiddling around a bit, I realise that printf expects a C string with a \0 at the end. Without the terminator, printf probably keeps reading past the newline into the adjacent bytes of name, which would explain the duplicate. After adding \0, this is what my assembly code looks like:

section .rodata
    format: db 'Hello %s', 10, 0
    name:   db 'Conrad', 0

section .text
        global main
        extern printf
    main:
        ; printf(format, name)
        mov rdi, format
        mov rsi, name
        ; no XMM registers
        mov rax, 0
        call printf
        ; return 0
        mov rax, 0
        ret

I compile and run ./hello:

Hello Conrad

It works!

Rewriting bzero

Now, I'd like to clone a "simple" function from the libc in assembly. This should be a good way to learn about other x86-64 instructions: I need to find out how to do loops and index incrementation. The prototype of bzero is void bzero(void *s, size_t n). This means that the rdi register holds the pointer s and the rsi register holds the number of bytes to set to 0. The simplest way to clone bzero should be something like this:

while (--rsi >= 0) {
    rdi[rsi] = 0;
}

For --rsi, there seems to be a well-suited DEC instruction, as can be seen in the x64 instruction set published by Intel.

For the >= 0 comparison, the Jcc instructions seem appropriate. They set the instruction pointer to a specific address depending on predefined conditions:

The condition codes used by the Jcc, CMOVcc, and SETcc instructions are based on the results of a CMP instruction.

For rdi[rsi] = 0, I think the MOV instruction should work. The nasm docs on indexing show how to tell MOV to write at the rsi index of rdi: mov word [addr], val.

Knowing this, the assembly version of bzero should look more like this:

while (1) {
    rsi--; // DEC
    if (rsi < 0) return; // CMP, JL
    rdi[rsi] = 0; // MOV, JMP
}

I end up with the following assembly version of bzero:

section .text
    global my_bzero
    my_bzero:
    .loop:
        ; rsi--
        dec rsi
        ; if (rsi < 0) return
        cmp rsi, 0
        jl .ret
        ; rdi[rsi] = 0
        mov byte [rdi+rsi], 0
        jmp .loop
    .ret:
        ret

I've replaced mov word [...], 0 with mov byte [...], 0 after realising that word means 16 bits instead of 8 bits. My goal is to copy one byte (8 bits) at a time. And I've named my function my_bzero so it doesn't conflict with the libc's bzero. Then, I try my bzero with the following C program:

#include <stdio.h>

#define ZEROED_LEN (10)

void my_bzero(void* addr, long unsigned len);

int main() {
    char test[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    my_bzero(test, ZEROED_LEN);
    printf("%d\n", test[0]);
    printf("%d\n", test[1]);
    printf("%d\n", test[2]);
    printf("%d\n", test[3]);
    printf("%d\n", test[4]);
    printf("%d\n", test[5]);
    printf("%d\n", test[6]);
    printf("%d\n", test[7]);
    printf("%d\n", test[8]);
    printf("%d\n", test[9]);

    return 0;
}

Everything works as expected.

"Repeat String Operations"

School 42's course on assembly suggests learning about "Repeat String Operations". In the x64 instruction set, I see references to these instructions: REP, REPE, REPZ, REPNE and REPNZ. They are supposed to be useful for functions like strlen, so that's what I'd like to try and clone from libc.

The x64 instruction set shows that the operands of these instructions are of type m8, m16, etc:

scas-1

However, in this course about "String instructions" by Ben Howard, there are examples without operands:

MOV     DI, DX ;Starting address in DX (assume ES = DS)
MOV     AL, 0  ;Byte to search for (NUL)
MOV     CX, -1 ;Start count at FFFFh
CLD            ;Increment DI after each character
REPNE SCASB    ;Scan string for NUL, decrementing CX for each char

So I try using SCAS without operands to reproduce strlen's behavior in assembly:

section .text
    global my_strlen

    my_strlen:
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scas   ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

But that doesn't compile:

hello.s:7: error: parser: instruction expected

After multiple searches, and for a reason I don't remember at the time of writing this, I search for "nasm prefix instruction expected" on Google and luckily find a tip regarding the use of "repeat string operations" with nasm on Stackoverflow:

NASM doesn't support the 'LODS', 'MOVS', 'STOS', 'SCAS', 'CMPS', 'INS', or 'OUTS' instructions, but only supports the forms such as 'LODSB', 'MOVSW', and 'SCASD' which explicitly specify the size of the components of the strings being manipulated.

Given that my goal is to compare each and every byte, one after the other, I guess I can use the "byte version", that's SCASB:

section .text
    global my_strlen

    my_strlen:
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scasb  ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

This compiles fine. So I test it with a C program:

#include <stdio.h>

#define ZEROED_LEN (10)

void my_bzero(void* addr, long unsigned len);
long unsigned my_strlen(const char *s);

int main() {
    char test[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    my_bzero(test, ZEROED_LEN);
    printf("%d\n", test[0]);
    printf("%d\n", test[1]);
    printf("%d\n", test[2]);
    printf("%d\n", test[3]);
    printf("%d\n", test[4]);
    printf("%d\n", test[5]);
    printf("%d\n", test[6]);
    printf("%d\n", test[7]);
    printf("%d\n", test[8]);
    printf("%d\n", test[9]);

    printf("length: %lu\n", my_strlen("test"));
    printf("length: %lu\n", my_strlen(""));
    printf("length: %lu\n", my_strlen("hello world"));

    return 0;
}

The output looks OK at first sight:

0
0
0
0
0
0
0
0
0
0
length: 4
length: 0
length: 11

However, there is actually a bug. When I remove the bzero code, I have the following test program:

#include <stdio.h>

long unsigned my_strlen(const char *s);

int main() {
    printf("length: %lu\n", my_strlen("test"));
    printf("length: %lu\n", my_strlen(""));
    printf("length: %lu\n", my_strlen("hello world"));
    printf("length: %lu\n", my_strlen("bla"));

    return 0;
}

And with that, the output is not at all OK:

length: 18446744073709551615
length: 0
length: 11

I think to myself: I know that the char test[10] array was allocated on the stack, maybe this is the "stack alignment" coming back to bite me? But after looking around a bit, I realise that Ben Howard puts -1 in the CX register. When I do this, my code works too:

section .text
    global my_strlen

    my_strlen:
        mov rcx, -1
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scasb  ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

This is the output I get with the test program in C:

length: 4
length: 0
length: 11

Copying and pasting code from Ben Howard without understanding it is no fun though. So I look for the reason why mov rcx, -1 magically fixes things. The answer is in the REP instruction algorithm inside the x64 instruction set documentation:

WHILE CountReg ≠ 0
DO
  Service pending interrupts (if any);
  Execute associated string instruction;
  CountReg ← (CountReg – 1);
  IF CountReg = 0
    THEN exit WHILE loop; FI;
  IF (Repeat prefix is REPZ or REPE) and (ZF = 0)
  or (Repeat prefix is REPNZ or REPNE) and (ZF = 1)
    THEN exit WHILE loop; FI;
OD;

This pseudo-code shows that if CountReg (RCX in my case) reaches 0, the loop stops. What matters to me, though, is the Repeat prefix is REPNE condition: the loop should exit when the NUL byte is found, so RCX must not reach 0 before that happens. Setting it to -1, the largest possible unsigned value, makes sure of that. It also probably explains the bogus length from earlier: if RCX happened to contain 0, the string instruction would never execute, rdi would never move, and the subtraction followed by dec would yield -1, i.e. 18446744073709551615.
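
To fix the semantics in my head, here is a rough C model of what repne scasb does in my strlen. This is my own sketch, assuming the direction flag is clear so the pointer moves forward:

#include <stdio.h>

// rough C model of `repne scasb` with al = 0 and rcx = -1
static unsigned long model_strlen(const char *s) {
    const unsigned char *rdi = (const unsigned char *)s;
    unsigned char al = 0;
    unsigned long rcx = (unsigned long)-1; // like `mov rcx, -1`

    while (rcx != 0) {
        unsigned char current = *rdi++; // scasb: read one byte, advance rdi
        rcx--;
        if (current == al) {            // REPNE exits as soon as the byte matches al
            break;
        }
    }
    // like `sub rdi, rsi` followed by `dec rdi`
    return (unsigned long)(rdi - (const unsigned char *)s) - 1;
}

int main(void) {
    printf("length: %lu\n", model_strlen("hello world")); // prints 11
    return 0;
}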

I'm learning about ELF, the binary file format for Linux, BSD and others. It is full of interesting information and serves as a good introduction to assembly programming.

High level overview

My first search leads me to the Wikipedia page about ELF. It says something about "tables" and "sections" but everything is still blurry and I want to get a deeper understanding. To learn more, I open up man elf in my terminal.

man-elf-1

Things start to clear up a bit. There are 4 parts in an ELF file:

  • the ELF header;
  • the program header table;
  • the section header table;
  • and data referenced by said tables.

Header structure

Further down in the manpage, there are more details about the structure of the ELF header. Lots of stuff to examine. It would be interesting to write a C program to try and print out the values contained in this header, for instance for the /bin/ls binary.

elf-header-1

I start off by mapping /bin/ls to memory, so I can later read the header bits:

#include <string.h>
#include <sys/mman.h>
#include <elf.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    const char *ls = NULL;
    int fd = -1;
    struct stat stat = {0};

    // open the /bin/ls file with read-only permission
    fd = open("/bin/ls", O_RDONLY);
    if (fd < 0) {
        perror("open");
        goto cleanup;
    }

    // find the size of /bin/ls
    if (fstat(fd, &stat) != 0) {
        perror("stat");
        goto cleanup;
    }

    printf("Everything is OK!\n");

    // put the contents of /bin/ls in memory
    ls = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (ls == MAP_FAILED) {
        perror("mmap");
        goto cleanup;
    }

    // close file descriptors and free memory before the program exits
    cleanup:
    if (fd != -1) {
        close(fd);
    }
    if (ls != NULL && ls != MAP_FAILED) {
        munmap((void *)ls, stat.st_size);
    }

    return 0;
}

I use mmap because it allows me to access the file contents as if they were a large vector of bytes, instead of using multiple successive calls to read. And because this program is only for testing purposes, I forgo out-of-bounds checking for now.

Then I compile with gcc and run the program against /bin/ls:

gcc -Wall -Wextra -Werror elf.c -o elf && ./elf /bin/ls

The debug message "Everything is OK!" is output to my terminal, good.

Now, I'd like to replace this debug message with code that reads specific parts of /bin/ls based on what the ELF header tells me. I want to go over each property of the Elf64_Ehdr structure and learn about its meaning.

However, I'm not interested in the properties e_ident[EI_VERSION], e_machine, e_version and e_flags because the manpage is quite clear about what they do.

e_ident[EI_MAG0] and friends

It looks like ELF files start with a series of bytes that allow quickly recognizing them as ELF files and not, say, a JPEG or an MP4. This doesn't guarantee that a file containing these bytes is necessarily a valid ELF file, but at least we can expect that it probably is.

I test this against /bin/ls:

if (
    (unsigned char)ls[EI_MAG0] == 0x7f &&
    (unsigned char)ls[EI_MAG1] == 'E' &&
    (unsigned char)ls[EI_MAG2] == 'L' &&
    (unsigned char)ls[EI_MAG3] == 'F'
) {
    printf("Yes, this is an ELF file!\n");
}

I see "Yes, this is an ELF file!", so /bin/ls has the correct header for an ELF file. Good first step.

e_ident[EI_CLASS]

From the docs, the memory slot EI_CLASS should contain either ELFCLASS32 or ELFCLASS64. These values respectively mean that the ELF binary is meant to run on a 32-bit or a 64-bit system. My laptop has a 64-bit CPU and I installed a 64-bit version of Linux, so I expect ELFCLASS64:

if ((unsigned char)ls[EI_CLASS] == ELFCLASS64) {
    printf("Yes, this file is made for 64 bits!\n");
}

And indeed, I get some output, so /bin/ls on my laptop is a 64-bit binary.

I'm curious to see what happens with a 32-bit program on my 64-bit Linux though. So I try to find a way to compile a little program for 32 bits and display its ELF header. Here's what I've got:

# install the 32 bits libc
sudo apt install gcc-multilib

# create a test file in C that displays "Hello world!"
echo '#include <stdio.h>' >> elf-32.c
echo 'int main() { printf("Hello world!\n"); return 0; }' >> elf-32.c

# compile the file for 32 bits
gcc -m32 -Wall -Wextra -Werror elf-32.c

# read the EI_CLASS value
readelf -h a.out | grep -e '^\s*Class'

The call to readelf shows Class: ELF32 as expected.

On a side note, the manpage talks about an ELFCLASSNONE value possible for EI_CLASS. Although it's unclear for me at first, I eventually find out it is used to indicate an invalid EI_CLASS by programs that analyse ELF binary files, for instance readelf.

e_ident[EI_DATA]

Now, I move on to the EI_DATA field. This memory slot says whether the ELF file is meant for little-endian or big-endian computers. I always thought this was a choice at the operating system level; actually, it's a choice at the CPU level. As an example, my 64-bit Intel CPU uses little-endian. And it turns out that Linux supports big-endian architectures as well, for instance AVR32 microcontrollers.
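
As a quick sanity check, unrelated to the ELF parsing program itself, a few lines of C are enough to see which byte order my own CPU uses:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0x01020304;
    const unsigned char *bytes = (const unsigned char *)&value;

    // on a little-endian CPU, the least significant byte (0x04) comes first in memory
    if (bytes[0] == 0x04) {
        printf("little-endian\n");
    } else if (bytes[0] == 0x01) {
        printf("big-endian\n");
    }
    return 0;
}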

For little-endian, the manpage says I should expect EI_DATA to be ELFDATA2LSB:

if ((unsigned char)ls[EI_DATA] == ELFDATA2LSB) {
    printf("Yes, compiled for little-endian!\n");
}

And sure enough, it does.

e_ident[EI_OSABI]

Next up is the EI_OSABI field, which should tell me about the operating system and ABI. At this point, I first have to look up what ABI means: it defines how different pieces of already compiled code interface with each other. Since I run Linux, I expect ELFOSABI_LINUX here:

if ((unsigned char)ls[EI_OSABI] == ELFOSABI_LINUX) {
    printf("Yes, this is a Linux binary!\n");
}

But for some reason, I don't get any output here. So my /bin/ls is not ELFOSABI_LINUX. I go back to the manpage to see if I've overlooked anything. If the ABI of /bin/ls isn't Linux, maybe it is ELFOSABI_NONE or ELFOSABI_SYSV. I don't know what it is, but since it looks like the most generic one, it's worth a try.

Here's the test:

if ((unsigned char)ls[EI_OSABI] == ELFOSABI_SYSV) {
    printf("Yes, this is a SYSV binary!\n");
}

And, this time I get the confirmation message.

I think ELFOSABI_SYSV is probably the ABI that was used on the System V operating system before Linux chose to use the same one. However, I'm not 100% sure of that. It seems that the EI_OSABI memory slot is a source of great confusion. On the one hand, some people say that EI_OSABI identifies the kernel:

elf_osabi identifies the kernel, not userland

On the other hand, it seems that most ELF files on Linux use ELFOSABI_SYSV instead of ELFOSABI_LINUX:

All the standard shared libraries on my Linux system (Fedora 9) specify ELFOSABI_NONE (0) as their OSABI.

Except on recent versions of Fedora:

Possibly the vendor is cross compiling on FreeBSD or using a very recent Fedora system where anything using STT_GNU_IFUNC will be marked as ELFOSABI_LINUX.

And finally, man elf says that some flags can be platform-specific:

Some fields in other ELF structures have flags and values that have platform-specific meanings; the interpretation of those fields is determined by the value of this byte.

In the end, what I get from this is that Linux mostly uses the SYSV ABI, but maybe someday the LINUX ABI will contain Linux specific things that the Linux kernel will then be able to interpret in a somewhat different way than it would with the SYSV ABI.

e_type

Next up in the ELF header structure is the e_type field. This should be the type of ELF file currently being read. It could be one of: ET_REL, ET_CORE, ET_NONE, ET_EXEC, ET_DYN.

ET_EXEC and ET_DYN seem quite clear to me, an executable and a shared library. But I have no idea what ET_REL, ET_CORE and ET_NONE are.

ET_REL

I did so many Google searches about this that I can't remember them all.

What I ended up understanding, though, is that ET_REL stands for "relocatable": a piece of code that isn't in its final place yet and will be combined with other code to form an executable.

From my somewhat partial understanding, that could mean object files or static libraries.

ET_CORE

According to gabriel.urdhr.fr, ET_CORE is used for core dumps. I want to see this with my own eyes though. So I try to create a core dump by triggering a segfault:

echo 'int main() {return (*(int *)0x0);}' > test.c
gcc -Wall -Wextra -Werror test.c
./a.out

I get a segmentation fault as expected, and a core dump:

Segmentation fault (core dumped)

However, I can't find the core dump file anywhere. As it turns out, Ubuntu doesn't write core dump files to disk by default. Instead, Apport intercepts them and sends crash reports to a webservice managed by Canonical to help them fix bugs. I learn through a StackOverflow thread on core dumps that I can see where core dumps go by running cat /proc/sys/kernel/core_pattern as root.

Then I learn about the gcore utility which allows generating a core dump for any running program. Running readelf -a on that core dump shows "Type: CORE (Core file)" as expected.

I also try looking for core dumps on an Arch Linux laptop. On Arch Linux, coredumps are stored in /var/lib/systemd/coredump/. And again, running readelf -a shows "Type: CORE (Core file)" for those files.

ET_NONE

As with ELFCLASSNONE, it seems like ET_NONE is rarely used. I wonder if this value is used by ELF file parsers to indicate invalid ELF file contents when needed. Given that I had the same question about ELFCLASSNONE before, I ask on Stackoverflow. Just minutes later, someone answers:

Anywhere validity of the ELF file is verified. Here is an example from the Linux kernel tools.

Tests

Now, I want to run some code just to make sure that I understand the different values of e_type:

# create a minimal program
echo 'int main() { return 0; }' > test.c
# compile to executable form, should be ET_EXEC
gcc -Wall -Wextra -Werror test.c
readelf -h a.out | grep -e '^\s*Type'

# create a minimal library
echo 'int test() { return 0; }' > test.c

# compile to object form, should be ET_REL
gcc -c -fPIC -Wall -Wextra -Werror test.c
readelf -h test.o | grep -e '^\s*Type'

# compile to a static library, should be ET_REL too
ar -r libtest.a test.o
readelf -h libtest.a | grep -e '^\s*Type'

# compile to a shared library, should be ET_DYN
gcc -shared test.o -o libtest.so
readelf -h libtest.so | grep -e '^\s*Type'

# create a coredump, should be ET_CORE
sudo gcore 1
readelf -h core.1 | grep -e '^\s*Type'

The output is as expected:

  Type:                              EXEC (Executable file)
  Type:                              REL (Relocatable file)
  Type:                              REL (Relocatable file)
  Type:                              DYN (Shared object file)
  Type:                              CORE (Core file)

What's a bit surprising is that running the same tests on Arch Linux does not yield the same results: instead of EXEC for the first line, I get DYN. This may be related to PIE, per this StackOverflow thread about "EXEC" being reported as "DYN". Maybe gcc has different default options on Arch Linux than it does on Ubuntu.

e_entry

According to the man page, e_entry points to the first instruction that the CPU will execute when a program starts. Compiling a very simple program with some assembly should allow me to verify this:

void _start() {
    // _exit(0)
    asm(
        "mov $0, %rdi;"
        "mov $60, %rax;"
        "syscall;"
    );
}

Then I compile and link, with ld instead of gcc to avoid including the libc:

gcc -c -Wall -Wextra -Werror test.c && ld test.o

Finally, in the test program, I print out the e_entry:

printf("Entry: %lu\n", ((Elf64_Ehdr *)ls)->e_entry);

Here, the output value is 4194480 so I expect to find the assembly instructions from above precisely at that address in the a.out binary. I dump the binary using objdump -d a.out:

a.out:     file format elf64-x86-64

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	55                   	push   %rbp
  4000b1:	48 89 e5             	mov    %rsp,%rbp
  4000b4:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
  4000bb:	48 c7 c0 3c 00 00 00 	mov    $0x3c,%rax
  4000c2:	0f 05                	syscall
  4000c4:	90                   	nop
  4000c5:	5d                   	pop    %rbp
  4000c6:	c3                   	retq

The assembly instructions are at addresses 0x4000b4, 0x4000bb (0x3c is 60 in base 10) and 0x4000c2. The _start function is located at address 0x4000b0, which is 4194480 in base 10, as expected.

e_phoff, program headers

Next up in the ELF header is e_phoff, which refers to program headers. Further down in the ELF manual is more information about program headers:

program_header-1

These headers describe bits of memory called "segments" which contain the instructions and data that will be loaded at process startup. This means that object files most likely don't contain program headers, since they are not meant to be loaded into memory directly (shared libraries, which are loaded at runtime, do have them). Running readelf -l test.o confirms that.

At this point, I'm unsure what precisely is in a "segment". What's the difference between a "section" and a "segment"? They feel like they are the same. Before studying "sections", whose understanding will hopefully help me understand "segments", I want to run a few tests to see if I can read the contents of a program header.

#define TEST_HEADER_TYPE(type, value, name)\
    if ((type) == (value)) {\
        printf("%s\n", (name));\
    }

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    uint32_t type = ph->p_type;

    // all the p_type can be found in /usr/include/elf.h
    TEST_HEADER_TYPE(type, PT_NULL, "PT_NULL");
    TEST_HEADER_TYPE(type, PT_LOAD, "PT_LOAD");
    TEST_HEADER_TYPE(type, PT_DYNAMIC, "PT_DYNAMIC");
    TEST_HEADER_TYPE(type, PT_INTERP, "PT_INTERP");
    TEST_HEADER_TYPE(type, PT_NOTE, "PT_NOTE");
    TEST_HEADER_TYPE(type, PT_SHLIB, "PT_SHLIB");
    TEST_HEADER_TYPE(type, PT_PHDR, "PT_PHDR");
    TEST_HEADER_TYPE(type, PT_TLS, "PT_TLS");
    TEST_HEADER_TYPE(type, PT_NUM, "PT_NUM");
    TEST_HEADER_TYPE(type, PT_LOOS, "PT_LOOS");
    TEST_HEADER_TYPE(type, PT_GNU_EH_FRAME, "PT_GNU_EH_FRAME");
    TEST_HEADER_TYPE(type, PT_GNU_STACK, "PT_GNU_STACK");
    TEST_HEADER_TYPE(type, PT_GNU_RELRO, "PT_GNU_RELRO");
    TEST_HEADER_TYPE(type, PT_LOSUNW, "PT_LOSUNW");
    TEST_HEADER_TYPE(type, PT_SUNWBSS, "PT_SUNWBSS");
    TEST_HEADER_TYPE(type, PT_SUNWSTACK, "PT_SUNWSTACK");
    TEST_HEADER_TYPE(type, PT_HISUNW, "PT_HISUNW");
    TEST_HEADER_TYPE(type, PT_HIOS, "PT_HIOS");
    TEST_HEADER_TYPE(type, PT_LOPROC, "PT_LOPROC");
    TEST_HEADER_TYPE(type, PT_HIPROC, "PT_HIPROC");
}

Here's what I get:

PT_PHDR
PT_INTERP
PT_LOAD
PT_LOAD
PT_DYNAMIC
PT_NOTE
PT_GNU_EH_FRAME
PT_GNU_STACK
PT_GNU_RELRO

The reassuring thing is that I find the same types of program headers when I run readelf -l /bin/ls.

e_shoff, section headers

On to section headers. The manual doesn't say much about them:

A file's section header table lets one locate all the file's sections.

Differences with segments

It seems like sections (pointed to by section headers) and segments (pointed to by program headers) are closely related. There are some differences between program and section headers though. The ELF header has a field for section headers that has no program-header equivalent: e_shstrndx. It is the index of the section header describing the section .shstrtab, which contains the names of all of the sections. Inception!

What this tells me is that I can probably get the buffer that contains all section names like this:

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;

Elf64_Shdr *shstrtab_header = (Elf64_Shdr *)((char *)ls + (eh->e_shoff + eh->e_shentsize * eh->e_shstrndx));

// this is a buffer that contains the names of all sections
const char *shstrtab = (const char *)ls + shstrtab_header->sh_offset;

What's the format of the shstrtab buffer though? The ELF specification defines the format of a "String table". What stands out for me is that each section has a name. That name is stored in the section string table as a C string at a specified index.

strtab-1
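
To check my understanding, I can extend the test program to walk the whole section header table and print every section name through that buffer. This is a sketch under the same assumptions as before: ls is the mmap'ed file, eh and shstrtab come from the snippet above, and e_shentsize is assumed to be sizeof(Elf64_Shdr), which is the case for ordinary files.

Elf64_Shdr *sh_table = (Elf64_Shdr *)((char *)ls + eh->e_shoff);

for (int i = 0; i < eh->e_shnum; i++) {
    // sh_name is an index into shstrtab, where the name is stored as a C string
    printf("section %2d: %s\n", i, &shstrtab[sh_table[i].sh_name]);
}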

Commonality with segments

Sure, there are differences between sections and segments. But some things are the same. For instance, program headers have a type PT_DYNAMIC and section headers have SHT_DYNAMIC. This "Introduction to ELF" by Red Hat says they are the same:

SHT DYNAMIC dynamic tags used by rtld, same as PT DYNAMIC

But I have a problem. I can't find anything that explains what "dynamic tags used by rtld" means. Page 42 of the ELF specification shows that segments of type PT_DYNAMIC and sections of type SHT_DYNAMIC contain ElfN_Dyn structs. Back in elf's man page, we can see that these structs contain a d_tag field. Probably the "dynamic tag" that Red Hat was talking about.

dtag-1

This d_tag is interesting. The manual tells me that it can have the type DT_NEEDED, in which case it contains an index into the string table .strtab, so that I can effectively get the name of a shared library that the ELF file depends on. I want to try this with the following steps:

  • find the PT_DYNAMIC segment;

  • inside the PT_DYNAMIC segment, find the d_tag with type DT_STRTAB so I know where the .strtab is;

  • inside the PT_DYNAMIC segment, for each d_tag of type DT_NEEDED, print out the C string that is inside .strtab at index d_un.d_val, which should be the name of a shared library.

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    if (ph->p_type == PT_DYNAMIC) {
        const Elf64_Dyn *dtag_table = (const Elf64_Dyn *)(ls + ph->p_offset);

        // look for the string table that contains the names of the
        // shared libraries needed by this ELF file
        const char *strtab = NULL;
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_STRTAB) {
                strtab = (const char *)dtag_table[j].d_un.d_ptr;
            }
        }

        // if there is no string table, we are stuck
        if (strtab == NULL) {
            break;
        }

        // now, I'll look for shared library names inside the string table
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_NEEDED) {
                printf("shared lib: %s\n", &strtab[dtag_table[j].d_un.d_val]);
            }
        }

        // only look into PT_DYNAMIC
        break;
    }
}

At first sight, everything works as expected. When I run my test program on the a.out ELF file from earlier instead of /bin/ls, that is, when I look for the shared libraries that a.out needs, I get:

shared lib: libc.so.6

However, when I try to read the shared libraries of system binaries like /bin/ls, I get a segmentation fault. gdb points at my printf call and tells me that the memory I am trying to access isn't available:

strtab = 0x401030 <error: Cannot access memory at address 0x401030>

That's odd, because readelf -a /bin/ls | grep 401030 shows that the .dynstr section is actually at address 0x401030. I'm a little lost here. Again, I call Stackoverflow to the rescue. And the same person that previously answered does so one more time:

If the binary isn't running, you need to relocate this value by $where_mmaped - $load_addr. The $where_mmaped is your ls variable. The $load_addr is the address where the binary was statically linked to load (usually it's the p_vaddr of the first PT_LOAD segment; for x86_64 binary the typical value is 0x400000).

Now that I think about it, there was something about that in the manual, but I had overlooked it:

When interpreting these addresses, the actual address should be computed based on the original file value and memory base address.

With this help, here's how memory is laid out:

elf-strtab-offset-1

Here's the fixed version of the previous test program:

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
// look for the PT_LOAD segment
const char *load_addr = NULL;
uint32_t load_offset = 0;
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    // I've found the PT_LOAD segment
    if (ph->p_type == PT_LOAD) {
        load_addr = (const char *)ph->p_vaddr;
        load_offset = ph->p_offset;
        break;
    }
}

// if there is no PT_LOAD segment, we are stuck
if (load_addr == NULL) {
  goto cleanup;
}

for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    if (ph->p_type == PT_DYNAMIC) {
        const Elf64_Dyn *dtag_table = (const Elf64_Dyn *)(ls + ph->p_offset);

        // look for the string table that contains the names of the
        // shared libraries needed by this ELF file
        const char *strtab = NULL;
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_STRTAB) {
              // mark the position of the string table
              const char *strtab_addr = (const char *)dtag_table[j].d_un.d_ptr;
              uint32_t strtab_offset = load_offset + (strtab_addr - load_addr);
              strtab = ls + strtab_offset;
            }
        }

        // if there is no string table, we are stuck
        if (strtab == NULL) {
          break;
        }

        // now, I'll look for shared library names inside the string table
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_NEEDED) {
              printf("shared lib: %s\n", &strtab[dtag_table[j].d_un.d_val]);
            }
        }

        // only look into PT_DYNAMIC
        break;
    }
}

Now, everything works as expected, with both a.out and /bin/ls.

I conclude that program headers are mostly useful when an executable is loaded and run, whereas section headers are mostly useful to shared/object files during the link phase. Both types of headers point to memory that will end up in the final executable, but they don't describe its contents in the same way or with the same granularity.

GOT and PLT

In the man elf, there are multiple references to GOT and PLT. The best explanation I could find is "PLT and GOT, the key to code sharing and dynamic libraries" at TechNovelty.org:

  • the GOT stores addresses of variables that are unknown at compile time but known at link time;
  • the PLT stores addresses of functions from shared libraries that are unknown both at compile time and at link time (when compiled with -fPIC, which is often the case), and are resolved dynamically at runtime.

Through additional searches about GOT and PLT on DDG, I find an article that shows how to exploit the GOT to execute a shellcode. Another interesting find is this paper published by Leviathan Security about a modified ELF file format specifically made for forensic analysis.
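
To see the PLT with my own eyes, I write a tiny program (my own experiment, not from the article above) and look at its disassembly:

#include <stdio.h>

int main(void) {
    // puts lives in the shared libc, so the compiler emits a call to the
    // local puts@plt stub; the real address of puts lands in the GOT at runtime
    puts("hello through the PLT");
    return 0;
}

After compiling with gcc, objdump -d a.out should show the call going through puts@plt, and objdump -R should list the GOT slot that the dynamic linker fills in with the real address of puts.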