Hi, I'm Conrad

This website is where I keep a log of things, sort of an online notepad if you will. Mostly focused on topics relating to software engineering.

After a few years of full-stack web development at an online insurance brokerage company, with a bit of devops here and there, I am now a site reliability engineer at a bank.

Outside work, I maintain a couple of open source projects related to password management:

  • rooster, a password manager;
  • rpassword, a cross-platform Rust library for reading passwords from a terminal.

To read my latest article, click/tap on the arrow on the right-hand side of this page.

For years, I've been taking pictures of all kinds of things with my smartphone. The photo above, which shows a wall in Malta, is just one example.

I had been storing all these pictures in a single folder without any kind of sorting. It had gotten big enough that I no longer had any idea what it contained.

It bothered me. I like my stuff well organized. So I sorted them and thought I'd share the process and scripts I used. Maybe this can help you too.

Deleting duplicate images

I had a lot of duplicate files, so deleting them was a good first step to make sorting easier later on. This is easily done on Linux with the fdupes command, as follows:

fdupes -rdN Pictures/

This will remove images that are perfect duplicates. If you want to go one step further and remove similar images (for instance, the same image but in two different sizes), you may want to use something such as Geeqie. This tutorial explains how you can use Geeqie to quickly find similar images. It's helped me remove a few dozen images that were resized down but had the same content.

Sorting images

For me, sorting images by year is sufficient. I don't need to know where the images were taken or what they were about. If it's important enough, I'll remember anyway.

So I tried to sort them by year. The images' Exif data is helpful for that.

I created a script that sorts images by year. Given a directory of images (only images, no folders), it creates a directory named after the year in which each photo was taken, then moves the photo into that directory.

For instance, if you have one image from 2005 and one from 2018, it will create this directory structure:

2005
 -> one-image-from-2005.jpg
2018
 -> another-image-from-2018.jpg

Alright, let's get to the code:

#!/bin/bash

set -e

find Pictures/ -maxdepth 1 -type f -print0 | while IFS= read -r -d '' file
do
    # Skip any non-image files ("if !" keeps set -e from aborting the
    # script when grep finds no match)
    if ! file "$file" | grep -q image; then
        continue
    fi

    # Skip files that don't have Exif data
    if ! exif="$(identify -verbose "$file" | grep "exif:")"; then
        continue
    fi

    exif_year="$(echo "$exif" | grep exif:DateTimeOriginal | cut -d' ' -f6 | cut -d':' -f1)"

    if [ -n "$exif_year" ]; then
        echo "$file" '->' "$exif_year"
        test -d "$exif_year" || mkdir "$exif_year"
        mv "$file" "$exif_year"/
    fi
done

The -maxdepth 1 is important here: it stops find from descending into subdirectories, so the script only looks at the unsorted photos at the top level.

A few photos did not have any Exif data, so I sorted these by hand.

That was it. Photos are now properly sorted!

Creating value from useless images

After sorting those images, I realized a lot of them had no value for me personally. The quality is not professional and they don't show anything personal (they're pictures of public places, airports, etc.).

I briefly thought about deleting them. Then I thought:

What if I put my "useless" images on a stock photo website?

Those images may not be useful to me right now, but they may well be useful to someone else. Or maybe they'll be useful to me in a few years and having a personal bank of images I can pick from would be cool.

So I had a look at how I could remove private information from these images. They can contain Exif data, which sometimes includes private details such as GPS coordinates. To avoid leaking anything, I'm stripping all of it:

find Pictures/ -type f -exec exiv2 rm {} \;

Now, I still want the photos to describe me as the photographer and I want to include copyright data in those images. So I'm adding back some Exif tags:

exiv2 -M"set Exif.Image.Copyright Copyright Conrad Kleinespel" *

I also want to strip any potentially private information from the image names themselves. The extensions are somewhat inconsistent too: there was a mix of jpg, JPG, jpeg and JPEG, which I'd like to normalize to jpg.

So I created this script to normalize image names:

#!/bin/bash

set -e

find Pictures/ -type f -print0 | while IFS= read -r -d '' file
do
    # Skip any non-image files
    if ! file "$file" | grep -q image; then
        continue
    fi

    sha1="$(sha1sum "$file" | cut -d' ' -f1)"
    # The escaped dot matches the literal "." before the extension
    ext="$(echo "$file" | grep -o '\.[[:alnum:]]*$' | tr '[:upper:]' '[:lower:]')"

    if [ "$ext" = ".jpeg" ]; then
        ext=".jpg"
    fi

    new_name="$sha1$ext"
    mv "$file" "$new_name"
done

This creates image names like 3914678325556c51762c3c709c322d4357b2163d.jpg, without any personally identifiable information and with a normalized lowercase extension.

I'm still trying to figure out the best way to put these images online. Ideally, I need something with automated image tagging or something that allows a seamless search experience without any manual work.

Just recently, I migrated this blog from Hugo to Ghost. Here's a quick write-up of how I did it; it may be useful in the future.

The goal was to migrate a bunch of markdown files located in content/codebase/*.md to the format accepted by Ghost, all the while making sure that URLs, meta descriptions and images were preserved without manual work for each post.

In order to keep URLs the same as before, I used Ghost's custom route feature, following this write up by Andrew Aadland: https://andrewaadland.me/2018-11-05-preserving-publish-date-in-url-ghost-2-0/

In my case, that meant having the following routes.yml file:

routes:

collections:
  /:
    permalink: /codebase/{year}/{month}/{day}/{slug}/
    template:
      - index

taxonomies:
  tag: /tag/{slug}/
  author: /author/{slug}/

I manually put the images from Hugo into Ghost's content/images folder, keeping the same paths as in the Hugo version of the blog, so that image URLs would not change during the migration.

And now to the commented Node.js script for Markdown content migration:

// Using synchronous function versions makes things a bit simpler and
// is perfectly suitable for a one-time-use migration script
const { readdirSync, readFileSync } = require('fs')
const { basename } = require('path')

// I have my posts in the `content/codebase` directory
const listPosts = () => {
    return readdirSync(`${__dirname}/content/codebase`)
        .map(el => `${__dirname}/content/codebase/${el}`)
}

// Creates a post in Ghost's JSON format described here:
// https://ghost.org/docs/api/v3/migration/#json-file-structure
const createPost = (image, title, slug, publishedAt, markdown) => {
    return {
        "title": title,
        "slug": slug,
        "feature_image": image,
        "mobiledoc": JSON.stringify({
            version: '0.3.1',
            markups: [],
            atoms: [],
            cards: [['markdown', {cardName: 'markdown', markdown }]],
            sections: [[10, 0]]
        }),
        "status": "published",
        // These dates end up somewhat of an approximation, because
        // I would have to extract them from the .md files' metadata
        // which is more work and of little use in my case
        "published_at":  publishedAt,
        "created_at":  publishedAt,
        "updated_at":  publishedAt,
    }
}

// Extracts post data from the Markdown files used in Hugo
//
// The original Hugo files look somewhat like this:
//
// <example lang="hugo">
//   +++
//   title = "Hello world"
//   slug = "hello-world-url"
//   date = "2019-04-16"
//   +++
//   Post content is here...
// </example>
const createPostDataFromFileContent = (filename, fileContent) => {
    const contentRegexp = /^\+\+\+((.|\n)+)\+\+\+((.|\n)+)$/m
    const titleRegexp = /title = "(.+)"/
    const dateRegexp = /date = (.+)/
    const imageRegexp = /image = "(.+)"/

    const contentMatches = fileContent.match(contentRegexp)

    const header = contentMatches[1]
    const titleMatches = header.match(titleRegexp)
    const dateMatches = header.match(dateRegexp)
    const imageMatches = header.match(imageRegexp)

    const image = imageMatches[1]
    const title = titleMatches[1]
    const date = (new Date(dateMatches[1])).getTime()
    const markdown = contentMatches[3].trim()
    // In my case, the filenames are the same as the slug
    const slug = basename(filename.substring(0, filename.length - ('.md'.length)))

    return createPost(
        image,
        title,
        slug,
        date,
        markdown
    )
}

const postsData = listPosts()
    .map(filename => ({ filename, fileContent: readFileSync(filename).toString() }))
    .map(({ filename, fileContent }) => createPostDataFromFileContent(filename, fileContent))

// Prints the posts in JSON format for Ghost, which can be used to debug
// or create a .json file to import into Ghost, like so:
//
// <example lang="shell">
//   node hugo-to-ghost.js > ghost.json
// </example>
console.log(JSON.stringify(
        {
        "meta": {
            "exported_on":1408552443891,
            "version":"3.1.0",
        },
        "data": {
            "posts": postsData,
        },
    }
))

In August 2018, I wrote about the challenges of moving from AWS to Heroku. A new challenge has presented itself on the same website since then: a large number of request timeouts at unpredictable times during the day.

What we saw in Heroku

For the last few months, we've regularly had days where the Heroku metrics would look like this:

What we can see here is that lots of requests take between 8 and 30 seconds, with 30 seconds being the maximum request timeout for Heroku.

These timeouts appeared completely unpredictably. Rebooting the database server and application servers would fix the issue for a few days, before the same things happened again. And again. And again.

Our first thought was that some requests were very slow and were probably clogging up the database server, until eventually, it would be overloaded. So we started looking at which requests initiated the timeouts, first with the good old heroku logs command from the Heroku CLI.

But that wasn't it. And PHP logs didn't reveal anything useful either. So we looked at our database metrics in the AWS RDS dashboard.

What we saw in AWS RDS metrics

We didn't necessarily know what to look for in database metrics because we hadn't had any database performance issues up until then.

What we could see for sure was that when the website crashed, the number of simultaneous database connections was very high. And read IOPS were also very high.

Read IOPS in AWS metrics

Queue depth in AWS metrics

We didn't know why, though. And we didn't know where else to look to get better insights. After talking with some developers outside the company, we realized we needed more logs. So here comes ELK.

Adding an ELK stack into the mix

We set up an ELK stack (Elasticsearch, Logstash, Kibana) and sent it all the logs we could possibly get (Heroku, Nginx, PHP, MariaDB, etc.).

We made sure to import Heroku server metrics through Heroku Labs. The useful features in our case were log-runtime-metrics, runtime-dyno-metadata and runtime-heroku-metrics, which would give us access to CPU and RAM usage metrics.

ELK is great in that it allows us to search for something in all logs at once, without having to be too specific about what we're looking for. Heroku timeouts can be searched with the type:heroku AND at:error query:

Once we got ELK set up, we waited until the timeouts reappeared, hoping we would finally figure out what was going on.

But when the timeouts reappeared, we were still unable to figure out their root cause. The logs confirmed that SQL requests were taking a long time to complete, but didn't tell us why the requests were suddenly slow.

However, these new logs helped to rule out what the problem was not:

  • metrics from Heroku Labs features confirmed that there was no application server overload,
  • through custom data parsing and graphs in ELK, we could see that our background jobs were not the issue either, as there was nothing odd going on in times of high request timeout rate,
  • there was no significant increase in traffic during high request timeout rate, which indicated we were not getting spammed by malicious 3rd parties.

Given all that, it was time to dig deeper into the database server configuration and metrics.

AWS RDS default configuration

Diving back into the MariaDB documentation and the AWS configuration, we noticed that some of MariaDB's default timeouts were not suitable for use with Heroku's 30-second timeout.

For instance, the max_statement_time configuration value was set to use the "engine default". MariaDB disables max_statement_time by default. So we set it to 25 seconds, which is more than enough.
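
In our case this was changed through the RDS configuration, but for illustration, MariaDB also lets you set the limit per connection. From PHP, that could look something like this (the $pdo connection is hypothetical):

// MariaDB's max_statement_time is expressed in seconds: statements running
// longer than this are aborted by the server
$pdo->exec('SET SESSION max_statement_time = 25');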

We cycled through some other settings and ended up with this updated configuration:

Default MariaDB configuration (left) and updated configuration (right)

AWS RDS best practices

While the updated configuration gave us clearer error messages in logs, the instability was still very much present, so we had a closer look at the AWS RDS manuals and we stumbled onto the AWS RDS best practices:

An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable.

We had a look at the metrics again and saw this:

Swap in AWS metrics

In addition, as seen above, we had very unstable read IOPS. Now things started to light up. The problem suddenly seemed simple: our database server was not powerful enough; it didn't have enough RAM.

So we upgraded the AWS RDS instances.

That was it:

  • IOPS are low and stable,
  • swap usage is 0 MB and stable,
  • random crashes with high SQL times are gone.

Everything seems so obvious after the fact. We could have just upgraded the server from the start. Adding more servers is a commonly used solution after all. But then again, it's not always scalable to "just add more hardware". More importantly, it doesn't fix the underlying issue in all cases.

In this case it did, though. Going through the process of trying to understand the problem before adding hardware means we understand IOPS and swap metrics, and we now monitor them proactively.

Over the past 3 years, I feel like I've seen a lot of interesting error cases as a full-stack web developer focusing on PHP-driven backends. Pretty much every error I thought would never happen to me ended up happening anyway. Murphy's law.

Here are some of the error cases I've run into and their effects. I'm writing this down as a reminder of which curious error cases should be handled in a product that needs highly reliable execution.

Timeouts on networked services

Connection timeout

Connection timeouts are somewhat special. They happen in some rare cases, such as when domain name resolution is impossible. That can happen when doing an HTTP request to an API.

In PHP, one of the most used HTTP clients is Guzzle. The way Guzzle handles errors can be a bit surprising. Guzzle allows exceptions to be turned off, a feature I use whenever fine-grained error handling is required. You'd think that this makes exception handling unnecessary:

$response = $http->request('GET', '/some-uri', [
    'connect_timeout' => 5,
    'exceptions' => false,
]);

if ($response->getStatusCode() !== 200) {
   // handle unexpected status code
   // ...
   return;
}

Surprisingly, with exceptions turned off in early versions of Guzzle 6, connection timeouts will still raise an exception. More specifically, a ConnectException. This means that exceptions must be handled even when exceptions are disabled, with something like this:

try {
    $response = $http->request('GET', '/some-uri', [
        'connect_timeout' => 5,
        'exceptions' => false,
    ]);
} catch (ConnectException $e) {
    // handle connection timeout
    // ...
    return;
}

if ($response->getStatusCode() !== 200) {
   // handle unexpected status code
   // ...
   return;
}

This behavior has been made clearer in recent Guzzle versions. The exceptions option is now called http_errors, which makes it clearer that it only prevents exceptions for HTTP requests that get a response (which excludes timeouts).

Next, let's move on to response timeouts.

Response timeout

The response timeout is a bit different. It happens when the network request takes too long to get a response. In the case of HTTP, that's usually because the headers of the response take a long time to come back.

With Guzzle, that's the timeout option:

$guzzle->request('GET', '/some-uri', [
    'timeout' => 5,
]);

Usually, something like 5 seconds should be more than enough. If 5 seconds is not enough, that usually means things need to be rethought.

Requests that take more than 5 seconds are usually requests that do heavy computation synchronously. Turning the heavy computation into an asynchronous task, handled with a webhook or a polling API, is preferable when possible.

Whenever heavy computation needs to be synchronous, 5 seconds might not be enough. In that specific case, using idempotency keys can make it safe to retry calls that time out. This article by Stripe on idempotency is a great read.
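
For example, Stripe's API accepts an Idempotency-Key header: retrying a timed-out call with the same key is then safe. With Guzzle, that could look something like this (the key generation and endpoint are just for illustration):

// Reuse the same key when retrying, so the server can detect that it has
// already processed this request and won't, say, charge the customer twice
$idempotencyKey = bin2hex(random_bytes(16));

$response = $http->request('POST', '/v1/charges', [
    'timeout' => 5,
    'headers' => [
        'Idempotency-Key' => $idempotencyKey,
    ],
]);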

Out of memory

Let's move on to another class of errors which happen all too often and can be quite catastrophic, "Out of memory" (or "OOM") errors. These errors happen every time a process uses more RAM than it is allowed to use.

In the case of a PHP website, the maximum amount of RAM is specified with the memory_limit INI directive in the php.ini file. Errors similar to OOM might also occur if you receive files larger than the upload_max_filesize directive allows, or a request body larger than the post_max_size directive allows.

Here's a quick example of what these values might look like in php.ini:

memory_limit = 64M
upload_max_filesize = 32M
post_max_size = 32M

OOM errors due to exceeding RAM usage (the case of memory_limit being overrun) happen mostly when you load too many objects into memory. That can happen if you do something like this on a very large SQL table:

$query = $dbConnection->prepare("SELECT * FROM blog_posts");
$query->execute();
$result = $query->fetchAll(PDO::FETCH_ASSOC);

Here, fetchAll will try to load every blog post in the table into RAM at once. As soon as that requires more than 64M of RAM, it causes an "Out of memory" error. This classic case of OOM can be worked around in Symfony, in PHP in general, and in other languages with some techniques I've previously described in "Memory management with Symfony commands".
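
As a rough sketch of one workaround, with the same hypothetical $dbConnection PDO instance, you can fetch rows one at a time and, with the MySQL driver, disable buffered queries so the full result set never sits in RAM:

// With the MySQL driver, PDO buffers entire result sets client-side by
// default, so turn that off for very large queries
$dbConnection->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$query = $dbConnection->prepare("SELECT * FROM blog_posts");
$query->execute();

// Fetch one row at a time instead of loading everything with fetchAll
while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
    // process $row
    // ...
}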

For OOM errors that come from requests that are too large (the case of upload_max_filesize or post_max_size being overrun), you might want to look into APIs that let you delegate file uploads to dedicated services. If you're using S3 for file storage, then AWS S3 pre-signed upload URLs might be the way to go.

Race condition

During development, there is only one user on a website: the developer. Because of that, race conditions rarely occur during development.

In production, things are different. Lots of users are active at once. That means the slightest possibility of a race condition might become a reality.

The most common error I've seen is the incorrect use of temporary files.

In PHP, a temporary file is created with the tempnam function.

I've often seen code that generates PDFs and looks like this:

$tempfile = tempnam(sys_get_temp_dir(), 'user_data_');
if ($tempfile === false) {
    // handle error
    // ...
    return;
}

@rename($tempfile, $tempfile . '.pdf');
$tempfile = $tempfile . '.pdf';

generatePdf($tempfile);

Note how there is no error handling for the call to rename. That will be key in just a moment.

If this script runs on the same server enough times, eventually, two runs will run into a race condition. User A might end up with the PDF of user B, or vice-versa. More specifically, here's what could happen:

  • user A requests a personal data export in PDF form
  • server creates /tmp/user_data_432565 for user A
  • server renames /tmp/user_data_432565 to /tmp/user_data_432565.pdf
  • user B requests a personal data export in PDF form
  • server creates /tmp/user_data_432565 (which is now available, since the file from user A was renamed to something else) for user B
  • rename to /tmp/user_data_432565.pdf fails because /tmp/user_data_432565.pdf already exists from user A's request, but because there is no error handling, the request of user B continues
  • server puts data of user A in /tmp/user_data_432565.pdf
  • server puts data of user B in /tmp/user_data_432565.pdf
  • server sends /tmp/user_data_432565.pdf to user A and B

And there it is: user A and user B will end up with the data of user B, because it was the last data to be put in the file used by both requests.

Handling all possible errors is of course a simple solution here. But race conditions can occur in harder-to-debug cases as well, such as when you use threads or use the database without proper transaction management.
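
A minimal sketch of what that could look like, reusing the hypothetical generatePdf() helper from above:

$tempfile = tempnam(sys_get_temp_dir(), 'user_data_');
if ($tempfile === false) {
    // handle error
    // ...
    return;
}

$pdfFile = $tempfile . '.pdf';

// Depending on the platform, rename() either fails or silently overwrites
// when the target already exists, so check for both instead of silencing
// errors with "@"
if (file_exists($pdfFile) || !rename($tempfile, $pdfFile)) {
    // handle error
    // ...
    return;
}

generatePdf($pdfFile);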

Let's have a look at another type of error, full disks.

Full-disk

Yes, disks become full. I've had this happen multiple times. Unless you have monitoring systems in place, you probably won't be notified of the disks being full and data will be dropped silently.

The first time I encountered a full disk, we were still at an early stage, using a single server for everything, including storage of uploaded files. These files filled up the disk and at some point, new files could not be uploaded. Too bad, because our code depended on uploads functioning properly.

The second time, we had installed Docker, a piece of software which had disk usage issues for a long time, something we only found out after seeing our website go down.

The third time, it was because we were storing log files in an s3fs mount. What we missed was that a failure to mount the s3fs would not trigger any error and instead, files would silently fill up the server's disk until everything blew up.

What I learned from all of that is that the more immutable a piece of infrastructure is, the easier it is to reason about. If at all possible, mutable data should be stored on bespoke infrastructure with corresponding monitoring services in place.

Null pointer errors

When used without good reason, nullable types are like cancer. They start small and then spread throughout your codebase before they start causing mayhem.

So what's the problem with nullable types?

The problem with nullable types is not so much null itself. The value null can be handled perfectly well.

Instead, what I've seen time and time again is that junior engineers will use nullable types when they don't want to take the time to understand their software's requirements. They "just" use a nullable type and that's it. They don't worry about what might happen if the value is ever null, because in their initial design, the value is never null.

Then, another engineer comes along and in their new code, the value is null at some point. But the old code doesn't handle null.

Let's see an invalid use of a nullable type: take an Address type with a streetNumber field that is nullable even though every Address is supposed to have a street number. This kind of thing happens a lot and ends up causing frequent bugs.

And so things start to break down. And the bug tracker starts filling up with errors like these:

nullable-error

So yeah, avoid nullable types if at all possible.

Reasonable uses of nullable types

The only reasonable uses for nullable types are when:

  • you have no time to do things properly (which can happen if you need a feature shipped very quickly),
  • or when null intuitively expresses a valid value.

For instance, here's a valid use case for a nullable type: you have a BlogPost type persisted in a database, and this object has a publishedAt field, which is either a date, if the blog post is published, or null if it is not yet published.

A trick to get around nullable types

I've often seen nullable fields being used as a way to avoid creating a new object. For instance, you have a User type with addressStreetNumber, addressPostcode and addressCity fields that are all nullable, because maybe the user has not yet given you their address, in which case all three fields are null.

User
  addressStreetNumber: string | null
  addressPostcode:     string | null
  addressCity:         string | null

This causes a problem because now, you might have users where only addressStreetNumber is null while addressPostcode and addressCity are not. What would that mean? Is such a user valid?

An architectural trick to get around that is to make the User type have an address field of type Address, and to make the Address type have 3 non-nullable fields named streetNumber, postcode and city. Now, each user either has an address or no address. But each address is valid 100% of the time.

User
  address: Address | null

Address
  streetNumber: string
  postcode:     string
  city:         string
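
In PHP, a minimal sketch of that design could look like this (class and field names simply follow the structure above):

class Address
{
    private $streetNumber;
    private $postcode;
    private $city;

    // All three fields are required: an Address is always complete
    public function __construct(string $streetNumber, string $postcode, string $city)
    {
        $this->streetNumber = $streetNumber;
        $this->postcode = $postcode;
        $this->city = $city;
    }
}

class User
{
    // The only nullable thing is the address as a whole
    /** @var Address|null */
    private $address;

    public function setAddress(Address $address): void
    {
        $this->address = $address;
    }

    public function getAddress(): ?Address
    {
        return $this->address;
    }
}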

Reducing the amount of places where null is used makes code easier to reason about and reduces the number of bugs in the long run.

SSL certificate of API invalid

More and more, SSL certificate renewals are getting automated, what with Let's Encrypt and all. But not everything is automated yet. This leaves room for occasional human errors.

I've had an SSL certificate updated in a backwards-incompatible manner once. It happened on a Saturday morning, chosen so as not to disturb customers during the most active hours, which in this case were weekday hours.

A critical service should try to handle SSL failures securely and gracefully. That means never falling back to insecure connections, and switching to a safe alternative whenever possible. As a last resort, dynamically disabling the feature that uses the failing SSL connection while alerts go out to on-call engineers can be a wise temporary solution.

API unavailable

For all of the reasons discussed in this article, web services break. Bugs occur. And servers fail. That's when an API may become unavailable. It may be unavailable for a few seconds only, or for a prolonged period of time.

When an API is unavailable for a brief moment, there is a simple method we've used: retrying. That's when you run the API call multiple times until it works. Usually, retrying will look something like this:

$retries = 5;
while ($retries !== 0) {
    $apiResult = doApiCall();
    if ($apiResult) {
        break;
    }
    $retries--;
}

if (!$apiResult) {
    // handle error
    // ...
    return;
}

// handle successful API call
// ...

Retries are not ideal though: the more you retry, the longer the code runs, so keep retries to a minimum where possible. As software engineers, it is a good idea to assume that APIs can become unavailable at any time, and that failing to handle unavailability gracefully will make our product unusable.
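
One way to soften that cost is to wait between attempts. Here's a minimal sketch with exponential backoff, reusing the hypothetical doApiCall() from above:

$apiResult = null;
$maxRetries = 5;

for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
    $apiResult = doApiCall();
    if ($apiResult) {
        break;
    }

    // Wait 1s, 2s, 4s, ... before the next attempt so a briefly
    // unavailable API gets some time to recover
    if ($attempt < $maxRetries - 1) {
        sleep(2 ** $attempt);
    }
}

if (!$apiResult) {
    // handle error
    // ...
    return;
}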

Recently, I was tasked with improving the performance of a CRM's search capabilities with SQL, that is, without going through the setup of ElasticSearch or Algolia.

The CRM's pagination was based on the Pagerfanta bundle for Symfony. It seems that optimizing this bundle for use with large tables is a somewhat common issue.

In our case, the problematic queries sometimes took up to 15 seconds for a simple paginated list of items. So I dug into some SQL and got the query time down to 2 seconds. Here are some of the things I learned.

Use IN instead of = xxx OR

This one is simple enough. Using WHERE x IN (a, b, c) is faster than using WHERE x = a OR x = b OR x = c.

Symfony makes this simple with Doctrine's query builder. Let's say we want to list users that are banned or suspended; we could do something like this:

$statuses = [
  1, // banned
  2, // suspended
];

$qb->andWhere($qb->expr()->in('user.status', $statuses));

In such a case, having an index on user.status helps too (see "Set indexes" further down).

When joining, put ORDER BY in subquery

In SQL, at least in MariaDB, there is a performance issue related to "late row lookups", which can be worked around by putting any ORDER BY clause inside a subquery.

To me that sounded counter-intuitive at first, but the subquery does actually speed things up.

The idea is to change something like:

SELECT my_user.id, my_cars.id
FROM my_user
JOIN my_cars ON my_cars.user_id = my_user.id
ORDER BY my_user.id DESC

To something like:

SELECT DISTINCT my_user.id, my_cars.id
FROM
    (
        SELECT ordered_user.id
        FROM my_user ordered_user
        ORDER BY ordered_user.id DESC
        LIMIT 999999999
    ) ordered_user
JOIN my_user ON my_user.id = ordered_user.id
JOIN my_cars ON my_cars.user_id = my_user.id

The LIMIT 999999999 is required because MariaDB will only use ORDER BY in a subquery if a LIMIT is also provided.

Avoid prefix wildcards with LIKE

When running an SQL query with a prefix wildcard, indexes are not used.

Let's take this query as an example:

SELECT * FROM my_user WHERE email LIKE '%foobar@gmail.com%'

With this, even if the email column is indexed, the index will not be used. With a large table, this query will be slow.

Instead, what I've found is that more often than not, there are ways to avoid the prefix wildcard entirely. For instance, in a CRM, it is often the case that a customer service representative will copy-paste the email address of a customer. This means the email is exact, so you can run this query instead:

SELECT * FROM my_user WHERE email = 'foobar@gmail.com'

Or, if you still want speed and a bit of flexibility, you can do this:

SELECT * FROM my_user WHERE email LIKE 'foobar@gmail.com%'

This works especially well with data types such as phone numbers.

If a searched phone number is valid (which can be checked with Google's libphonenumber library), it's probably OK to do a WHERE phone = xxx query. If, on the other hand, the phone number is invalid, that may mean it is incomplete, in which case you can fall back to a slower but more accurate WHERE phone LIKE '%xxx%' query.

I've found that using data validation is a good way to build efficient SQL queries on the fly and reduce query time in most cases, falling back to slow queries only when data validation leaves you unsure what to do.
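
Here's a rough sketch of that idea with Doctrine's query builder; isValidPhoneNumber() is a hypothetical stand-in for whatever validation you use (libphonenumber, for instance):

// If the searched phone number is valid, use a fast, index-friendly
// exact match; otherwise fall back to a slow but forgiving LIKE
if (isValidPhoneNumber($searchedPhone)) {
    $qb->andWhere('user.phone = :phone')
        ->setParameter('phone', $searchedPhone);
} else {
    $qb->andWhere('user.phone LIKE :phone')
        ->setParameter('phone', '%' . $searchedPhone . '%');
}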

Don't show page numbers unless you need to

The CRM I was working on used Pagerfanta's Doctrine ORM adapter:

$pager = new Pagerfanta(new \Pagerfanta\Adapter\DoctrineORMAdapter($qb));

This Pagerfanta adapter counts the total number of results to compute the number of pages. That count is what lets you display something like this in your front-end:

pagination-1

This looks nice, doesn't it?

As developers, it makes us feel smart to display that kind of information. Displaying it has become almost a reflex. We don't question its usefulness.

But is it useful? In our case, and probably in lots of cases, it's mostly useless. What matters more is to display the right data on the first page.

And fetching the number of pages from the database takes time. Getting rid of it completely was an acceptable solution in our case. This is how we do that with Pagerfanta:

$pager = new Pagerfanta(new \Pagerfanta\Adapter\CallbackAdapter(
    function() use($limit, $offset) {
        // approximation that allows always having one more page
        return $limit * 2 + $offset;
    },
    function() use($qb) {
        return $qb->getQuery()->getResult();
    }
));

Set indexes

We've discussed the fact that indexes can speed up queries in some situations. In a Symfony website, indexes can be added with the Index annotation on an entity class:

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Table(name="my_user", indexes={
 *   @ORM\Index(name="idx_status", columns={"status"}),
 * })
 * @ORM\Entity(repositoryClass="AppBundle\Repository\UserRepository")
 */
class User
{
    /**
     * @var integer
     *
     * @ORM\Column(type="integer")
     */
    protected $status;
}

You may also use composite indexes to speed up specific types of queries that use WHERE on multiple fields in a single query.
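
For instance, a composite index covering status and a hypothetical created_at column could be declared like this:

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Table(name="my_user", indexes={
 *   @ORM\Index(name="idx_status_created_at", columns={"status", "created_at"}),
 * })
 * @ORM\Entity(repositoryClass="AppBundle\Repository\UserRepository")
 */
class User
{
    // ...
}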

A few months ago, I migrated an enterprise website from AWS to Heroku, to reduce the maintenance burden of AWS EC2 instances and make the website more reliable. Here's how that went and what I learned from it.

Planning for failure

Before migrating the servers, I made sure there was an easy way to switch back to the old infrastructure, just in case something went wrong. As we'll see, this would turn out to be useful.

The domain of the website pointed to an AWS ELB via a CNAME record. On Heroku, CNAMEs are also used.

So, to switch back from the new, potentially buggy, infrastructure to the old one, all I had to do was keep the old servers up and update the CNAME record.

Having a low "Time To Live" (TTL) for the CNAME record was important to enable quick reverts.

ttl

First try and failure to handle the load

Before moving to Heroku, I actually tried using another PaaS we'll call Cocorico. It seemed very simple to use. No configuration whatsoever. I was very enthusiastic about giving this platform a shot.

In my initial tests, the website I was working on worked nicely. No lag. No crashes. No timeouts. I was off to a great start.

But then my enthusiasm got the better of me. I flipped the DNS switch over to the new infrastructure. Almost instantly, everything started crashing. We had hundreds of timeouts. The website was completely unusable.

Stress started growing. People started asking what was going on. At this point, I thought I could call Cocorico's support team for help. But they were unable to assist me.

So I looked for possible bottlenecks. That, too, was difficult, because Cocorico provided almost no monitoring tools. That's when I realized how important monitoring tools are. Not to mention how much more I should have load tested Cocorico's infrastructure before putting it on the front line.

I had to come to terms with the fact that moving to Cocorico had been a mistake on my part. So I switched back to the old AWS infrastructure. Instantly, the website started delivering quick responses again.

Learning from this first experience, I now knew I'd need the ability to:

  • customize the Nginx & PHP configuration
  • easily scale the number of servers up and down
  • monitor incoming traffic in case issues arose

Heroku seemed like a perfect fit because it supports all of this (and a bit more). But before blindly trusting Heroku's feature-set like I had done with Cocorico, I wanted to make sure that, this time around, the new infrastructure was going to work properly under production conditions.

Figuring out what to load test

The website I was migrating had different kinds of workloads: asynchronous tasks run by workers, static content via WordPress, APIs which themselves call out to multiple APIs, etc.

All of these things had to be at least as fast and stable as they were on AWS. But I didn't really know how to find out what the current load was on the AWS servers.

My good friend Kim was of great help. He gave me a whole bunch of tips about log analysis.

Among other things, he walked me through the AWS monitoring dashboards. They provide lots of useful graphs that help you understand the number of concurrent requests you must be able to handle.

This is just one example of the data available for an EC2 instance.

ec2-monitoring

Kim also recommended I take a close look at the number of concurrent database connections and transactions, which would help more accurately define what tests were required.

For inspiration, I highly recommend the article about how GitHub tested a change to the way merges were processed. Not everyone needs to go as far as GitHub does in this case, but if GitHub is as stable as it is, it is probably because of all the work that goes into testing changes.

How to load test

After I had a somewhat good idea of what to load test, I ran tests with way more traffic than what would happen in production for the different kinds of workloads that the website runs.

If the servers held up, that would be a good indication that they would also remain available under production traffic.

My colleague Romain pointed me to a list of load testing tools, one of which was ab, the Apache HTTP server benchmarking tool. It throws a specific number of requests at a given URL with a predefined concurrency level.

Here is what it looks like for 1000 requests with a concurrency of 200:

ab -n 1000 -c 200 https://mywebsite.herokudns.com/some-url-to-load-test

And then, ab produced this kind of report:

ab-results

The load tests of all workloads yielded good results. Under high-load scenarios, increasing the number of servers was painless and did not interrupt live traffic. Everything looked good.

So I went ahead and flipped the DNS switch again. We were now in production with the new Heroku based setup. Now, it was time for ironing out edge cases through monitoring.

Edge case detection through monitoring

Heroku has a very helpful "Metrics" view, which displays the number of 4xx and 5xx requests. That's a good first indicator when it comes to the number of failed requests.

logs

The automated alerts that Heroku sends out can sometimes feel a bit overzealous, but they are generally helpful and help avoid downtime: at the slightest rise in failed requests, a notification goes out, which can be used to take action (increase the number of servers, look for ways to optimize CPU/memory use, etc.).

Looking at Google Analytics performance metrics and PageSpeed Insights also helped uncover some bugs.

Some of the unexpected slowdowns that were discovered were due to:

  • Heroku not enabling GZIP by default, which slowed down requests for some static content like Javascript and CSS,
  • Heroku having hard memory limits per request, which broke requests that used more than 32 megabytes of memory,
  • Moving from one server to a multi-server setup, which broke file based caching.

The important lesson for me here, was that it took much longer than expected to iron out all of these bugs.

A few weeks in, bugs are becoming less frequent. The website has become much more reliable than it was on AWS. And giving differentiated access to team members was made easier by Heroku's easy-to-use ACL model.

A few weeks ago, I ran a shell script on an Ubuntu server and it didn't produce the same result as on my laptop. The reason for its behavior surprised me.

/bin/sh can point to any other shell

For years, I was used to running scripts with bash or sh, and I had no problems using one or the other. The scripts seemingly behaved the same. That's because, on the systems I was using, /bin/sh was a symlink to /bin/bash.

But recently, I had some issues with a shell script that was set to use sh. The script looked somewhat like this:

#!/bin/sh

# some commands here
# ...

I'd have expected this to run properly on any system that had bash installed, because that's what I was used to up to that point.

But after digging through the internet, I found out that on newer Ubuntu releases, /bin/sh points to /bin/dash. My laptop, which runs Arch Linux, has /bin/sh pointing to /bin/bash, as did the older Debian systems I was used to.

A quick example

bash and dash can make a shell script behave in very different ways, despite both being used as default shells on Debian-based Linux distributions. One example of something that works in bash but not in dash is the if [[ .. ]] syntax:

#!/bin/bash

if [[ 1 = 1 ]]; then
    echo 'works on bash'
fi

And the thing is, even if we assume that the script above is called test.sh, running dash test.sh will return an error, because when dash is invoked explicitly, the #!/bin/bash shebang is ignored.

Best practices for shell scripting

Following this, I've been consistently using the same shell for all of my work. In my case, that's bash, partly because it's the default shell on Arch Linux, which I use daily, but also because it is widely used and even recommended in Google's shell style guide.

Most PHP projects eventually pull in open source libraries from Composer. They make complex topics easier (PDF manipulation, software development kits for APIs, user authentication helpers, etc).

All of these Composer dependencies must be updated regularly to avoid security holes or incompatibility issues with our constantly evolving software practices.

Updating Composer dependencies in PHP projects while avoiding regressions can be daunting.

How to figure out what might break

Some Composer packages provide a CHANGELOG.md file, which is a great way to know if updating will break a website. But reading through dozens of changelogs can take a long time.

Because most Composer packages (or at least well known ones) rely on SemVer to publicize what might have changed between two releases, there is usually no need to read through changelogs.

SemVer says that any software package should have a version of the form <major>.<minor>.<patch>, for instance 5.1.34.

Whenever the major part changes, there is a breaking change in the package. That means we will have to verify if our code is affected, or else something might stop working.

Whenever the minor part changes, there is a new feature in the package but it shouldn't break existing code. In practice, I've experienced breakages with minor version changes, so tread carefully.

Whenever the patch part changes, a bug was fixed. This is non-breaking, unless we "used" the bug as expected behavior, which we should avoid if at all possible.

How to leverage this information with Composer

As we just saw, only major and minor version bumps usually break our software. That means that we only need to read the changelog files of packages which have had their major or minor version updated.

In a large project with lots of dependencies, how can we know what was updated?

First, update composer.json to reflect our needs.
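
For instance, with SemVer-style caret constraints, composer.json might contain something like this (the package name is just a placeholder):

{
    "require": {
        "vendor/some-package": "^5.1"
    }
}

The ^5.1 constraint lets composer update pull in new minor and patch releases (5.2.0, 5.1.35, and so on) but never the next major version.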

Then, run composer update, which will overwrite the composer.lock file and lock in dependency versions (which helps with reproducible builds).

composer-update-status

Then, run the following command, which shows only packages which have changed versions:

git diff composer.lock | grep -B1 -E '            "version"'

The output looks like this:

composer-update-diff

From there, take a closer look at the changelogs of packages for which a major or minor version change has occurred. If need be, update the PHP code to reflect those changes so that our website keeps running properly.

Making changes to large databases in Symfony projects usually requires a command. Keeping it free of "Out of memory" errors takes some work.

Let's have a look at some of the things we can do to keep memory usage under control.

Batching

Code needs to be performant and work with large amounts of data. Having both requires some non-trivial code. If you've been working on migrations for a while, you might have gone through both of these naive approaches — I sure have:

  • select all of the data at once and migrate it in one run: the network is only accessed once but this inevitably results in an "Out of memory" error because there is probably too much data for your server's RAM,
  • run the migration on each row, selecting rows one by one: memory usage is low because we only ever load one row into memory at a time, but the network is accessed as many times as there are rows, which is very slow.

The approach for low memory usage and high performance? Batching.

With batching, you select rows 100 by 100 (or 1000 by 1000, or whatever fits your use case) and run the migration on each batch. This results in predictable memory usage while reducing network access.

It might look something like this in PHP:

$batchSize = 100;

// We don't want to select the same batch multiple times. We could
// use an SQL "OFFSET" clause for this. But "OFFSET" tends to be slow
// on large tables. A faster approach is to select rows ordered by ID
// and to set a minimum ID with a "WHERE" clause. Something like:
//
//     SELECT * FROM user WHERE id > :minimumId ORDER BY id ASC
//
// This achieves the effect of an "OFFSET" but is significantly faster.
$minimumId = 0;

do {
    $batch = getBatchOfUsersOrderedById($batchSize, $minimumId);
    $batchCount = count($batch);

    if ($batchCount > 0) {
        // Only read the last ID when the batch is not empty, and use it
        // as the lower bound for the next batch
        $minimumId = $batch[$batchCount - 1]->getId();

        foreach ($batch as $user) {
            // Do something for each user
            // ...
        }
    }

// If the batch is smaller than the maximum batch size, it means we have
// gone over all of the users and we are done
} while ($batchCount === $batchSize);

Putting the main loop's body inside a separate method

PHP automatically clears memory of variables that go out of scope, somewhat like what C does with its stack variables, if you're familiar with that. Basically, any variable that is created in a function is cleared once the function ends.

What you can do to allow PHP to clear more memory is to put the main loop's content inside a separate method. It looks something like this:

$batchSize = 100;
$minimumId = 0;

do {
    list($batchCount, $minimumId) = $this->runOneBatch($batchSize, $minimumId);
} while ($batchCount === $batchSize);

And in a separate method named runOneBatch:

function runOneBatch(int $batchSize, int $minimumId): array {
    $batch = getBatchOfUsersOrderedById($batchSize, $minimumId);
    $batchCount = count($batch);

    if ($batchCount > 0) {
        // Only read the last ID when the batch is not empty, and use it
        // as the lower bound for the next batch
        $minimumId = $batch[$batchCount - 1]->getId();

        foreach ($batch as $user) {
            // Do something for each user
            // ...
        }
    }

    return [$batchCount, $minimumId];
}

Marking unused variables as collectable by garbage collection

In PHP, variables are said to be garbage collected, which means that you usually don't have to manage the memory they use manually. Instead, PHP handles memory management automatically.

This works well most of the time. But for long-running tasks, it has its limits. Because PHP does not always know that a variable has become useless, it keeps it in memory just in case. This increases memory usage.

To tell PHP that a variable has become useless, you can set it to null and unset() it:

$users = loadLotsOfUsersFromMysql();

// ... do some stuff with the users ...

$users = null;
unset($users);

Manual garbage collection cycle

PHP's garbage collector uses heuristics to figure out when the best time is to remove unused variables from memory. In long-running tasks, you may want to force PHP to remove unused variables to keep memory usage low.

Running garbage collection manually in PHP goes like this:

gc_collect_cycles();

Clearing Doctrine's entity manager

Internally, Doctrine, the ORM used by Symfony, keeps a reference to each entity it has loaded into memory, unless you explicitly tell it to drop the reference.

To keep memory usage down, you may want to clear the entity manager, which will remove references to all entities currently known to Doctrine:

/** @var \Doctrine\ORM\EntityManagerInterface $em */
$em = $this->getContainer()->get('doctrine')->getManager();
$em->clear();

SQL logging

There is one feature of Doctrine that is enabled in Symfony by default and can be a source of memory leaks during long-lived commands: the SQL logger.

You can disable the SQL logger with something like this:

/** @var \Doctrine\DBAL\Connection $connection */
$connection = $this->getContainer()->get('doctrine')->getManager()->getConnection();
$connection->getConfiguration()->setSQLLogger(null);

Putting it all together

Here's an example of what a memory efficient long running migration script might look like when you put all of the above tips together:

<?php

namespace App\Command;

use App\Entity\User;
use App\Repository\User as UserRepository;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Style\SymfonyStyle;

class MigrateUsersToNewDatabaseFormatCommand extends ContainerAwareCommand
{
    protected function configure()
    {
        $this
            ->setName('app:migrate-users')
            ->setDescription('Migrate users to new database format')
        ;
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $style = new SymfonyStyle($input, $output);

        /** @var \Doctrine\ORM\EntityManagerInterface $em */
        $em = $this->getContainer()->get('doctrine')->getManager();

        /** @var \Doctrine\DBAL\Connection $connection */
        $connection = $em->getConnection();
        $connection->getConfiguration()->setSQLLogger(null);

        /** @var \App\Repository\User $users */
        $users = $em->getRepository(User::class);

        // There are a lot of rows to migrate, so we migrate them little
        // by little to use less RAM. You can tweak this depending on
        // what you find works best for your use case.
        $batchSize = 100;

        // The fastest way to mimic an OFFSET clause is to store the last
        // migrated User.id and to select rows above that ID instead of
        // using an actual OFFSET.
        $minimumId = 0;

        do {
            list($batchCount, $minimumId) = $this->runOneBatch($em, $users, $style, $batchSize, $minimumId);
        } while ($batchCount === $batchSize);
    }

    function runOneBatch(EntityManagerInterface $em, UserRepository $users, SymfonyStyle $style, int $batchSize, int $minimumId): array {
        $batch = $users->getBatchOfUsersOrderedById($batchSize, $minimumId);
        $batchCount = count($batch);

        if ($batchCount > 0) {
            $minimumId = $batch[$batchCount - 1]->getId();

            /** @var User $user */
            foreach ($batch as $user) {
                try {
                    // Do some stuff with the current user, for instance
                    // set a newly added field to "false"
                    $user->setAccessRestricted(false);
                } catch (\Exception $e) {
                    // Handle the exception if needed
                    // ...

                    // Once the exception is handled, it is no longer needed
                    // so we can mark it as useless and garbage-collectable
                    $e = null;
                    unset($e);
                }

                $em->persist($user);

                // We are done updating this user, so we mark it as unused,
                // this way PHP can remove it from memory
                $user = null;
                unset($user);
            }

            // For each batch of users, we display the memory usage in MB so
            // that we can see if it grows during testing: if it does grow,
            // there is most likely a memory leak somewhere
            $memoryUsage = memory_get_usage(true) / 1024 / 1024;
            $style->text("Batch finished with memory: ${memoryUsage}M");

            // Once the batch of users is updated, we don't need it anymore
            // so we mark it as garbage-collectable
            $batch = null;
            unset($batch);

            // Flushing and then clearing Doctrine's entity manager allows
            // for more memory to be released by PHP
            $em->flush();
            $em->clear();

            // Just in case PHP would choose not to run garbage collection,
            // we run it manually at the end of each batch so that memory is
            // regularly released
            gc_collect_cycles();
        }

        return [$batchCount, $minimumId];
    }

}

This week, I took Coursera's crash course on data science from Johns Hopkins University. From the use of statistics to the structure of a data science project, here's what I learned.

What is data science?

Data science is the science of answering questions with data.

Use of statistics in data science

Data science usually involves the use of statistics, which can be broken down into 4 activities.

Descriptive analysis, which is the exploration of the data to get an idea of what it tells us. This can include creating graphs or visualizations of raw data.

Inference, which is the drawing of conclusions from samples. This might be an estimate of voter turnout based on past polls. Inference can be seen as a formalization of the descriptive analysis.

Prediction, which is an attempt to say what the future data will be based on current or past data. This takes the form of linear regression or other machine learning algorithms, as well as deep learning with constructs such as neural networks.

Experimental design, which is the definition of methods needed to conduct meaningful experiments. Randomization of the population group is a key aspect of experimental design as it helps to get better generalizability.

Machine learning

There are 2 main types of machine learning:

  1. unsupervised learning: trying to see things previously unseen - for instance, groups of people within a population, such as "voters of the Republicans" or "voters of the Democrats",
  2. supervised learning: trying to predict something based on previously known data - for instance, trying to predict the height of a son based on the height of his father.

Even though they are related, there are some differences between traditional statistics and machine learning. Traditional statistics cares primarily about model simplicity and interpretability. Machine learning cares primarily about model performance.

Structure of a data science project

The structure of a data science project is as follows:

  1. a question is asked, the answer to which will determine what to do next
  2. data is gathered that will help answer the question
  3. exploratory data analysis gives a vague idea of what an answer might be
  4. formal data analysis gives a deeper understanding of the data
  5. interpretation infers conclusions from the data samples
  6. communication of the experiment gives the answer to the question
  7. the data science project is over; now it's decision time

Sometimes, exploratory data analysis can come first, leading to questions which you can then try to answer using the steps above.

Illusion of progress

The course mentions the paper Classifier Technology and the Illusion of Progress, by David John Hand.

The paper by D. J. Hand explains how a lot of current machine learning algorithms may well be too complicated to yield value for real life projects.

Some technologies have a theoretical advantage that is very hard to transpose into the real world, due to unexpected edge cases or a lack of domain knowledge on the part of the executor.

Output of a data science project

A data science project should have some kind of output.

Ideally the output should be reproducible, meaning it should contain all the data and code needed to run the experiment from scratch. This is useful for auditing the experiment or for running it with another data set.

The outcome of a data science project can be good or bad.

Good outcomes

The best outcome of a data science project is to have an answer to the question the project is meant to answer. That and the previously discussed output.

Some other good outcomes may be increased knowledge or the conclusion that the data set is not suited to answer the question, in which case you can decide to look for more data or to drop the question. Either way, it's useful to know if the data set is insufficient.

Bad outcomes

Sometimes, even in the hands of smart people, machine learning can make bad predictions, as was the case with some predictions by Google Flu Trends:

In the 2012-2013 season, it predicted twice as many doctors’ visits as the US Centers for Disease Control and Prevention (CDC) eventually recorded. — New Scientist

And sometimes, machine learning can yield very good performance but be too expensive to run at scale in production, as was the case with an algorithm meant to improve Netflix recommendations:

(...) the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. — Netflix blog

Other papers I have yet to read

Some other papers are mentioned in the course; I'd like to read those as well.

Computers can be little- or big-endian. Wondering what endianness is all about, where it originated, and how it impacts code performance? I was too.

A first look at endianness

Let's take the integer 0x0A0B0C0D, or 168496141, as an example.

In little-endian, the least significant bytes come first, which gives us:

address | byte value
--------------------
0x00    | 0x0D -> least significant
0x01    | 0x0C
0x02    | 0x0B
0x03    | 0x0A

In big-endian, the most significant bytes come first, which gives us:

address | byte value
--------------------
0x00    | 0x0A -> most significant
0x01    | 0x0B
0x02    | 0x0C
0x03    | 0x0D

But why bother having two different ways to save integers in memory?

Network performance

The question "What are the pros and cons of little-endian versus big-endian?" on Quora seems to indicate that neither one is better than the other.

But x86-64 is a very popular CPU architecture and it is little-endian. Same with ARM CPUs, even recent ones: they use little-endian encoding. So with these CPU architectures, networked programs need to reorder bytes on a regular basis to accommodate the network byte order: big-endian.

Given all of this byte shuffling, does network performance suffer? Probably, but not by much.

I just checked the Intel 64 instruction set and it has a BSWAP instruction that is specially made to convert integers from big-endian to little-endian and vice-versa:

bswap

This instruction is used by operating system kernels. For instance, on my Arch Linux installation, BSWAP appears in the /usr/include/bits/byteswap.h header:

byteswap-header

I trust that Intel probably optimized these functions as much as possible. So I suppose the performance impact of calling BSWAP is small. In addition, there are several cases where swapping byte order can be completely avoided:

  • big-endian machines already use network byte order internally, so they don't need to swap byte order before sending data off to the network
  • little-endian machines that talk to each other over TCP do not need to swap bytes since the bytes are read in-order on the receiving end

The only case where endianness should really cause a slowdown is when a big-endian machine talks to a little-endian machine.
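
As an aside, C programs rarely call BSWAP by hand; they usually go through htonl and ntohl from <arpa/inet.h>, which boil down to a byte swap on little-endian machines and to nothing at all on big-endian ones. A minimal sketch of the round trip:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h> // htonl / ntohl

int main(void) {
    uint32_t host_value = 0x0A0B0C0D;

    // convert from host byte order to network byte order (big-endian)
    uint32_t net_value = htonl(host_value);

    // and back again; the round trip must give back the original value
    uint32_t back = ntohl(net_value);

    printf("host: 0x%08X, network: 0x%08X, round trip: 0x%08X\n",
           host_value, net_value, back);
    return 0;
}

On a little-endian machine, the network value prints as 0x0D0C0B0A; on a big-endian machine, htonl and ntohl change nothing.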

Math and typecasts

However, Akash Sharma makes a good point for little-endian.

He explains how little-endian makes common maths operations easier:

Consider an example where you want to find whether a number is even or odd. Now that requires testing the least significant bit. If it is 0 it is even. In Little Endian Ordering least significant byte is stored at starting address. So just retrieve the byte and look at its last bit. — Akash Sharma

And how little-endian makes typecasts to smaller types easier:

For example you want to typecast a 4 bytes data type to 2 bytes data type. In Little Endian Ordering it is pretty straightforward as you just need to retrieve the first 2 bytes from starting address in order and you would get the correct number. — Akash Sharma
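
Both points are easy to check in C. Here is a small sketch of my own (assuming a little-endian CPU, and using memcpy to stay clear of strict-aliasing issues): the parity test only needs the byte at the lowest address, and truncating to a smaller type just reads the first bytes in place:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t value = 0x0A0B0C0D; // 168496141, an odd number
    const unsigned char *bytes = (const unsigned char *)&value;

    // parity: on little-endian, the byte at the lowest address holds the
    // least significant bit, so looking at bytes[0] is enough
    printf("%s\n", (bytes[0] & 1) ? "odd" : "even");

    // truncation: the low 16 bits (0x0C0D) sit at the start of the object,
    // so a 2-byte read from the same address gives the truncated value
    uint16_t low;
    memcpy(&low, bytes, sizeof(low));
    printf("low 16 bits: 0x%04X\n", low);

    return 0;
}

On a big-endian machine, the same 2-byte read from the start of the object would give the high half, 0x0A0B, instead.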

Consistency and the coin toss

The paper "On holy wars and a plea for peace" by Danny Cohen also gives some interesting background behind the decision to use big-endian for network byte order. This paper is referenced in rfc1700 which defines network byte-order, which is why it makes sense to talk about it here.

Cohen's paper points to the inconsistency of some little-endian systems back in the eighties, when the paper was written:

Most computers were designed by Big-Endians, who under the threat of criminal prosecution pretended to be Little-Endians, rather than seeking exile in Blefuscu. They did it by using the B0-to-B31 convention of the Little-Endians, while keeping the Big-Endians' conventions for bytes and words. — Danny Cohen

This sounds like a joke, but Cohen goes on to describe how the M68000 microprocessor was little-endian for bits but big-endian for words (bytes), double-words and quad-words.

And it points to the better consistency of some big-endian systems:

The PDP10 and the 360, for example, were designed by Big-Endians: their bit order, byte-order, word-order and page-order are the same. The same order also applies to long (multi-word) character strings and to multiple precision numbers. — Danny Cohen

Finally, it ends by saying that making a decision and having all computer scientists agree on it will be extraordinarily difficult. For this reason, it suggests:

How about tossing a coin ??? — Danny Cohen

Almost 40 years later, the network folks have settled for big-endian and the majority-share CPUs (x86, arm) are little-endian. In software development though, the choice is not clear. Luckily, we probably will never have to worry about bit-endianness, only byte-endianness. Partial win, I guess.

Rooster is a simple password manager for geeks. After 2 years of development, its feature set is complete and it's time to add automated tests to the mix. I'm doing that 1 test a day, over 30 days.

Why I'm testing for 30 days

A few days ago, first time Rooster contributor Dmitry V. Roenko updated deprecated serialization code and he had the awesome idea to add unit tests to his work. Oleg Yamnikov, another notable Rooster contributor behind the recent command improvements and the upcoming import command, has told me he'd like to write tests as well.

They have inspired me to start unit-testing my contributions.

So today, I'm starting the "30 Days Of Tests" challenge: each and every day for the next 30 days, I'll write 1 unit test. My hope is that it will make testing fun and digestible, whilst revealing unnoticed bugs and making it easier to contribute unit tested code to Rooster.

Implemented tests

Until September 10th 2017, I'll update this article daily with a link to the most recent unit-test and what I've learned from that unit-test. Here we go:

Takeaway

33 days later, here we are. Rooster now has 30 tests (unit and integration).

But wait, why 33 days?

Well, running this challenge during vacation wasn't the smartest idea. So I ended up being a bit late at times. But having the goal of 1 test a day was still a good way to build up a nice set of tests.

And now, we can run Rooster's tests with cargo test and ./integration-tests.sh whenever we need to check whether an upgrade broke anything.

And that's pretty awesome!

As software engineers, we spend time improving our tools, which leads to increased output and better software. It is important to back up this work so that it doesn't go to waste.

Backing up configuration files

As software engineers, customizing our development environments can lead to big boosts in productivity and fewer bugs.

All the work that goes into tweaking your Neovim, Sublime or VSCode configuration pays for itself once you get intelligent code analysis, automatic refactorings, proper syntax highlighting and fast keyboard shortcuts.

Backing up this configuration is just as important as backing up any other files on your system. Usually, this configuration comes in the form of so-called "dotfiles", which are not put in your Dropbox or Google Drive folder. And dotfiles might not be accounted for in backup software focused on the needs of the general population.

So what can we do to keep a safe backup of our configuration files?

The most common way of doing that is keeping them in Git, a tool that lets us track changes in files and push them to a remote server so we can restore them later.

Here are some of the dotfiles I currently have stored in Git:

.asoundrc
.bashrc
.config
.gitconfig
.local
.npmrc
.profile
.ssh
.Xdefaults
.xinitrc

Restoring our configuration

If our hard drive ever breaks down, or our computer gets stolen, we need a way to restore the files we have previously saved in Git.

After running a git clone of the dotfiles' Git repository, I run this little script that will make symlinks to where the system will actually look for the configuration files:

#!/bin/bash

cd dotfiles
for f in $(find . -type f | sed 's#^./##'); do
    # create the directory in which to put the config file's symlink
    echo -n "mkdir -p $HOME/$(dirname "$f") "
    read
    mkdir -p "$HOME/$(dirname "$f")"

    # create the symlink
    echo -n "ln -sf $HOME/dotfiles/$f $HOME/$f "
    read
    ln -sf "$HOME/dotfiles/$f" "$HOME/$f"
done
cd ..

When I run this script, it shows me what it is about to do, asks for confirmation and then creates the symlink where it should be.

For instance, ~/dotfiles/.bashrc gets symlinked to ~/.bashrc. Then, when Bash looks for its configuration, it will go to ~/.bashrc, which points to ~/dotfiles/.bashrc, at which point Bash can read the custom configuration I have been improving over the last few years.

Setting up the Thinkpad W541 with Arch Linux and full-disk encryption was a bit daunting. So I wrote down some notes during the process. Here they are.

This post might be useful to you if you have a W541 as well. And it will probably be useful to me when I need a new Thinkpad and want to install Arch Linux on it.

Veracrypt

The first encryption method I look at is Veracrypt.

During a job interview, a company asked me questions about Veracrypt. I didn't know what it was. The interviewer told me it had a feature called "encrypted containers" that lets you encrypt a specific set of files on a computer to keep them safe: ssh keys, password files, etc.

So I thought, "hmm, I might need that to protect my keys too!".

A feature that seems particularly interesting to me is encrypting the disk that is currently being used to run my operating system. That means I wouldn't have to reinstall my entire system. I like that because I am a lazy programmer.

But, unfortunately, it seems Veracrypt requires 1.5x the used disk space for the encryption to work. Probably because it copies files in the encrypted format before moving them. I don't have enough free disk space for that though.

So, OK, I'll just reinstall. And while I am at it, I'll try setting up encryption with Arch Linux. I have used it before and have wanted to try using it again for some time.

dm-crypt and LUKS

Veracrypt is available on Linux, so I could encrypt files using that. However, the Arch Linux wiki often gives good tips on how to do things "the Arch way", so let's have a look at that first.

Searching "arch full disk encryption" yields this documentation about dm-crypt. The section named "Simple partition layout with LUKS" seems to be exactly what I need, no more, no less:

  • data is protected in case of physical theft of the computer
  • easy to install
  • single pre-allocated disk partition

Things that are possible with other dm-crypt setups but that I don't need:

  • possibility to change partition layouts
  • plausible deniability in case the computer is seized by the courts

In addition, dm-crypt seems to be somewhat compatible with TPMs, which is a security feature my computer has and I might want to use later.

So OK then, I'll go with dm-crypt.

On YouTube, there is a straightforward tutorial that helps with installing Arch Linux with dm-crypt: https://www.youtube.com/watch?v=AQxk1uUQ-pQ

I have to say, I don't understand everything the author is doing. So let's dig in!

The first thing I don't know about is btrfs, which the author chooses as his filesystem. Looks like this filesystem can detect corrupted files and even repair them when it has a redundant copy. Wow! I need that too!

The second thing I don't know about is the role of the "hardware clock" that is configured with the hwclock command. A quick Google search tells me that this clock, contrary to the one Linux maintains in software, keeps running after I turn off the computer, so that when I boot it up again the next day, Linux can know what time it is without asking the internet. Sweet!

Even better, the hwclock man page tells me that the hardware clock is sometimes called the CMOS clock. Wait, "CMOS"? That's the name of the BIOS battery. If that's the case, we can rely on it: apparently, it can keep working for 6 years or more. Nice!

One more thing I learn in this video is the existence of the syslinux bootloader. Most tutorials on the internet talk about grub. It seems to be the default bootloader when you want something that works out of the box without having to tweak a lot of settings. syslinux looks more lightweight and thus a little less powerful. But I don't need all of grub's features (custom filesystems, encrypted /boot partition, etc). So I do as the video author suggests and set up syslinux.

In the end, I get a working system with full disk encryption (except for the boot partition, but there is no private data in there, so I don't mind it getting stolen).

There are still a few errors that show up on boot, but I'll fix those later. They don't seem critical and I'd like to try a few other things first.

UEFI

One thing I notice with the previous install is that I used the legacy BIOS instead of the more modern UEFI. This works just fine in 99% of cases because UEFI's advantages seem aimed at enterprise use, not so much personal use. But I like bleeding edge stuff, so I decide to look for a way to reinstall in UEFI mode.

Once again, there is a good YouTube tutorial on installing Arch Linux with UEFI: https://www.youtube.com/watch?v=MMkST5IjSjY

In this tutorial, I discover yet another bootloader: systemd-boot. It is similar to syslinux, but it comes pre-installed with systemd, which is the service manager in Arch Linux. This means that whether or not I use systemd-boot, it will be on my system. I might as well just use it.

In deciding which bootloader to settle for, I found this schematic about boot loaders (released under the terms of the GNU FDL license) from the Arch wiki to be very helpful:

bootloader

It would be interesting to use btrfs for the /boot partition to prevent file corruption. But I'd have to use syslinux for that, which means installing another software package. I'll stick with systemd-boot and a vfat filesystem for the /boot partition.

After finishing up the UEFI & systemd-boot installation, I reboot my computer.

PCCT error

Now comes the time to fix the pesky errors that show up during boot.

They are still there!

pcct

Looking for Error parsing PCC subspaces from PCCT on Google, I can see that I'm not alone in this. A fellow Redditor suggests simply downgrading from the bleeding edge Linux kernel to the LTS ("Long Term Support") version:

Have you tried rolling back the kernel or installing and running linux-lts? — monotykamary

The LTS kernel should be more stable so I'll try applying this piece of advice. If the errors vanish, it'll mean that the bleeding edge kernel was the cause.

Let's install the LTS kernel:

pacman -S linux-lts

And regenerate the ramdisk used during boot:

mkinitcpio -p linux-lts

Finally, I'll update the boot configuration in /boot/loader/entries/arch.conf so it looks like this:

title Arch Linux
linux /vmlinuz-linux-lts
initrd /intel-ucode.img
initrd /initramfs-linux-lts.img
options cryptdevice=/dev/sda2:root root=/dev/mapper/root rw

Now I can reboot.

Wow, the errors are gone. Awesome!

MMC error

There is another error that shows during boot: mmc0: Unknown controller version (3). You may experience problems. This is apparently a warning. That means there is no need to necessarily fix this now. But I don't like seeing warnings every single time I boot my laptop.

mmc

A bug report on the Linux bug tracker suggests setting the debug_quirks option of the sdhci kernel module to 0x4. I try doing that by creating a file at /etc/modprobe.d/mmc.conf with the following content:

options sdhci debug_quirks2="0x4"

After regenerating the initramfs and rebooting, the error still shows up though.

Since I very rarely use the SD card slot (which is what mmc is here for), I decide to simply deactivate kernel modules related to memory cards in /etc/modprobe.d/mmc.conf:

blacklist mmc_core
blacklist sdhci_pci
blacklist sdhci

After calling mkinitcpio -p linux-lts and rebooting, all errors are gone!

If I ever need the memory card slot, I'll have to remember to reactivate the kernel modules first.

Summary

I now have a good idea of what is possible in terms of disk encryption on Linux, both with Veracrypt and dm-crypt. I've installed Arch Linux in both legacy BIOS and UEFI modes. I've set up the system with 2 boot loaders, syslinux and systemd-boot. I was also able to correct some errors that appeared on boot. For some other errors, I had to resort to deactivating the kernel modules that were causing them. And finally, I now know what the btrfs filesystem is good for and how the hardware clock works.

One thing I'd like to do in the future is make use of the TPM to further prevent malicious people from messing with my laptop. That's a story for another day.

I've worked mostly on web development projects and I want to learn about languages that are closer to the metal. So, this week, I'm taking the deep dive into x86-64 assembly.

Starting off

Because my computer has a 64-bit x86 CPU, the easiest way to get started is probably to use the assembly dialect it runs natively. That's x86-64 (which is apparently also referred to as x64 or AMD64).

I am new to assembly. I have seen some assembly code and have studied 68K assembly at University. But that was only on paper and I've never actually compiled and run assembly code. Let's remedy that!

Looking for x64 linux assembly on DuckDuckGo, I find this Stackoverflow thread that has some good tips, both on what to do and what not to do.

What not to do:

The Intel documentation is a bad place to start learning assembler.

What to do:

Take a look here (http://asm.sourceforge.net) it's the best place for Linux Assembly development, you will find resources, docs and links.

This website, asm.sourceforge.net, seems a bit complicated though. I'll try to find some easier guides. Guides aimed at total beginners.

This series of tutorials by 0xAX is a goldmine. It documents the learning steps for x86-64 assembly. It looks like a great place to start learning assembly.

It lists basic information like what registers are available and their different forms (64 bits, 32 bits, etc) with the corresponding names (ie rax, eax, ax, al for the first register).

It gives you the command you need to install the NASM compiler on Ubuntu:

sudo apt-get install nasm

And the commands you need to compile and link an assembly file so you actually get a working program:

nasm -f elf64 -o hello.o hello.asm
ld -o hello hello.o

nasm is new to me. I'd like to know more.

DuckDuckGo leads me to the nasm documentation. I highly recommend glancing through it: it is an easy read and contains plenty of tips for assembly development with nasm.

nasm-docs-1

One thing this documentation teaches me is that there are 3 different ways to define strings with nasm:

db 'Hello world', 10 ; with single quotes
db "Hello world", 10 ; with double quotes
db `Hello world\n`   ; with backticks, \n, \r, \t, etc are available

This documentation also gives a trick to get the length of a string. It looks like a pointer subtraction, $ being the current assembly position, ie the address right after the end of the string, and message being the address of the beginning of the string:

message db  'hello, world'
msglen  equ $-message

Calling conventions

42, the school at which I studied until the end of January 2017, has a course on assembly and it says the following:

Be aware of the "calling conventions"

Not knowing what "calling conventions" are, I go back to DuckDuckGo which leads me to wiki.osdev.org:

calling-conventions-1

What I understand from this graphic is that the "calling convention" defines what registers have what role when you call a function. For instance, if I call a function with 2 parameters, I'll put the first parameter value in rdi and the second parameter value in rsi. The function must put the return value in rax.

This means that, where I would do this in C:

int return_value_of_my_function = my_function(42);
printf("%d\n", return_value_of_my_function);

In assembly, I would do this (pseudo-code):

rdi <= 42
call my_function
rdi <= "%d\n"
rsi <= rax
call printf

Stack alignment

In 42's e-learning platform, the video dedicated to assembly talks about "stack alignment", without giving too much information about what it is. Searching for "stack alignment" on DuckDuckGo yields no easy to understand explanations.

Given that I'm not hindered by stack alignment for now, I'll keep going and come back to it only if it actually poses a problem.

Update 08/2017: Since I wrote this post, I found this StackOverflow answer on "stack alignment" which is a great introduction to the topic. Worth a read!

Hello world

Now that I have the basics down, I'd like to create a real program. One that I can run in my terminal. Nothing better than a good old "Hello world"!

To write a functional "Hello world", I'll need to call the write system call. How can I do a syscall with nasm? A Stackoverflow question about syscalls suggests reading the psABI-x86-64. And here is what that document says:

A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

The number of the syscall has to be passed in register %rax.

Moreover, when I had a look at the ELF file format, I saw that read-only data is saved in the .rodata section and that executable code is saved in the .text section. So here is the first version of the assembly "Hello world":

section .rodata
    msg:    db 'hello, world', 10
    msglen: equ $-msg

section .text
    main:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall

What's happening here? I put 1 into rax and then use the syscall instruction. This calls write, which has syscall number 1, as can be seen in this syscall table.
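
For comparison, the same call can be written in C with glibc's generic syscall() wrapper. This is just an illustration of the convention (syscall number in rax, arguments in rdi, rsi and rdx), not how the assembly version is built:

#define _GNU_SOURCE
#include <unistd.h>      // syscall()
#include <sys/syscall.h> // SYS_write

int main(void) {
    const char msg[] = "hello, world\n";

    // write(1, msg, len): SYS_write is syscall number 1 on x86-64; the
    // wrapper puts the number in rax and the arguments in rdi, rsi, rdx
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}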

Then I compile with:

nasm hello.s -f elf64 -o hello.o && ld hello.o -o hello

But, I'm getting a warning:

ld: warning: cannot find entry symbol _start; defaulting to 0000000000400080

I made a mistake here. I assumed that the first function to be called would be main, as in C. But it seems like this is not the case with assembly.

Looking for "cannot find entry symbol _start asm" on DuckDuckGo, I find out that my code should look like this:

.text
        global _start

    _start:
        ; code goes here

The global keyword indicates that the symbol _start is accessible from outside the current module, hello.s in this case. And if you have a look at the man elf, you can actually see that this global keyword is translated into a STB_GLOBAL flag:

stb-global-1

Anyway, so I replace main with _start and add global _start:

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global _start

    _start:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall

Then, I recompile and execute the program with ./hello, which gives me the following output:

Hello world
Segmentation fault (core dumped)

That's not good!

But I'm not the only one that runs into this issue, as this Stackoverflow question on segmentation faults with NASM shows:

Because ret is NOT the proper way to exit a program in Linux, Windows, or Mac!!!! For Windows it is ExitProcess and Linux is is system call - int 80H using sys_exit, for x86 or using syscall using 60 for 64Bit or a call to exit from the C Library if you are linking to it.

Let's try to apply this by using the exit syscall:

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global _start

    _start:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall
        ; exit(0)
        mov rdi, 0
        mov rax, 60
        syscall

This time, after I've compiled and run the program, everything works, yay!

Now, with libc

In the previous examples, I was compiling without the libc. I'd like to be able to use printf instead of the write syscall now, so I'll need the libc.

Let's start by compiling (more specifically linking) with gcc instead of ld. gcc will automatically add the libc into the mix:

nasm hello.s -f elf64 -o hello.o && gcc -Wall -Wextra -Werror -o hello hello.o

Oh! This time, it looks like main is missing:

hello.o: In function `_start':
hello.s:(.text+0x0): multiple definition of `_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.text+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
collect2: error: ld returned 1 exit status

Let's go back to using main instead of _start:

section .rodata
    msg:    db 'Hello world', 10
    msglen: equ $-msg

section .text
        global main
    main:
        ; write(1, msg, msglen)
        mov rdi, 1
        mov rsi, msg
        mov rdx, msglen
        mov rax, 1
        syscall
        ; return 0
        mov rax, 0
        ret

Note how I also replaced exit(0) with return 0, because the libc startup code automatically calls the exit syscall with the value returned from main.

After compiling and running ./hello, I get "Hello world". All is good!

And now, with printf

Now that I have the libc in place, I'll replace write with printf.

On 42's e-learning platform, there was an example printf call that showed I had to use extern printf. The extern keyword is described in the NASM documentation:

EXTERN (...) is used to declare a symbol which is not defined anywhere in the module being assembled, but is assumed to be defined in some other module and needs to be referred to by this one.

This is what I get once I've added these bits of code:

section .rodata
    format: db 'Hello %s', 10
    name:   db 'Conrad'

section .text
        global main
        extern printf
    main:
        ; printf(format, name)
        mov rdi, format
        mov rsi, name
        call printf
        ; return 0
        mov rax, 0
        ret

To make things a bit more interesting, I set printf's first argument to a format as opposed to a regular string.

After compiling and running the program, I get a Segmentation fault once again. Because printf has variable arguments, I'm thinking maybe I got the calling convention wrong. As is often the case, Stackoverflow has a question and an answer about the printf calling convention:

Yes, RAX (actually AL) should hold the number of XMM registers used.

I have no idea what XMM registers are at this point, so I don't think I've used any. I'll just try to add mov rax, 0 before calling printf and see what happens:

Hello Conrad
Conrad

Better! The segfault is gone. But for some reason, my first name is output twice. After fiddling around a bit, I realise that printf expects a C string that has \0 at the end.

After adding \0, this is what my assembly code looks like:

section .rodata
    format: db 'Hello %s', 10, 0
    name:   db 'Conrad', 0

section .text
        global main
        extern printf
    main:
        ; printf(format, name)
        mov rdi, format
        mov rsi, name
        ; no XMM registers
        mov rax, 0
        call printf
        ; return 0
        mov rax, 0
        ret

I compile and run ./hello:

Hello Conrad

It works!

A "simple" function, bzero

Now, let's try to clone some "simple" functions from the libc, but in assembly. This will be a good way to learn about new x86-64 instructions, since I will need loops and index incrementation.

The prototype of bzero is void bzero(void *s, size_t n). This means that the rdi register will be a pointer to s and the rsi register will be the number of bytes to set to 0. The simplest way to clone bzero would probably be something like this:

while (--rsi >= 0) {
    rdi[rsi] = 0;
}

For --rsi, there seems to be a well-suited DEC instruction, as can be seen in the x64 instruction set published by Intel.

For the >= 0 comparison, the Jcc instructions seem appropriate. They set the instruction pointer to a specific address depending on predefined conditions:

The condition codes used by the Jcc, CMOVcc, and SETcc instructions are based on the results of a CMP instruction.

For rdi[rsi] = 0, I think the MOV instruction should work. However, I don't know how to tell MOV to copy at the rsi index of rdi. The nasm docs on indexing know though: it is mov word [addr], val.

Knowing this, the assembly version of bzero would be something like this:

while (1) {
    rsi--; // DEC
    if (rsi < 0) return; // CMP, JL
    rdi[rsi] = 0; // MOV, JMP
}

I end up with the following assembly version of bzero:

section .text
    global my_bzero
    my_bzero:
    .loop:
        ; rsi--
        dec rsi
        ; if (rsi < 0) return
        cmp rsi, 0
        jl .ret
        ; rdi[rsi] = 0
        mov byte [rdi+rsi], 0
        jmp .loop
    .ret:
        ret

I've replaced mov word [...], 0 with mov byte [...], 0 after realising that word meant 16 bits instead of 8 bits. My goal is to copy one byte (8 bits) at a time. And I've named my function my_bzero so it doesn't conflict with the libc's bzero.

Now, I'll test my bzero with the following C program:

#include <stdio.h>

#define ZEROED_LEN (10)

void my_bzero(void* addr, long unsigned len);

int main() {
    char test[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    my_bzero(test, ZEROED_LEN);
    printf("%d\n", test[0]);
    printf("%d\n", test[1]);
    printf("%d\n", test[2]);
    printf("%d\n", test[3]);
    printf("%d\n", test[4]);
    printf("%d\n", test[5]);
    printf("%d\n", test[6]);
    printf("%d\n", test[7]);
    printf("%d\n", test[8]);
    printf("%d\n", test[9]);

    return 0;
}

I can change the ZEROED_LEN constant to check for a variety of behaviors.

Everything works as expected. Nice!

"Repeat String Operations"

42's course on assembly suggests learning about "Repeat String Operations". In the x64 instruction set, I see references to instructions: REP, REPE, REPZ, REPNE and REPNZ.

These instructions are supposed to be useful for functions like strlen, so that's what I'll try to clone from libc.

The x64 instruction set shows that the operands of these instructions are of type m8, m16, etc:

scas-1

However, in this course about "String instructions" by Ben Howard, there are examples without operands:

MOV     DI, DX ;Starting address in DX (assume ES = DS)
MOV     AL, 0  ;Byte to search for (NUL)
MOV     CX, -1 ;Start count at FFFFh
CLD            ;Increment DI after each character
REPNE SCASB    ;Scan string for NUL, decrementing CX for each char

So I try using SCAS without operands to reproduce strlen's behavior in assembly:

section .text
    global my_strlen

    my_strlen:
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scas   ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

But that doesn't compile:

hello.s:7: error: parser: instruction expected

After multiple searches, and for a reason I don't remember, I searched for "nasm prefix instruction expected" on Google and luckily found a tip regarding the use of "repeat string operations" with nasm on Stackoverflow:

NASM doesn't support the 'LODS', 'MOVS', 'STOS', 'SCAS', 'CMPS', 'INS', or 'OUTS' instructions, but only supports the forms such as 'LODSB', 'MOVSW', and 'SCASD' which explicitly specify the size of the components of the strings being manipulated.

Given that my goal is to compare each and every byte, one after the other, I'll use the "byte version", that's SCASB:

section .text
    global my_strlen

    my_strlen:
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scasb  ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

This compiles fine!

Let's test with the following C program:

#include <stdio.h>

#define ZEROED_LEN (10)

void my_bzero(void* addr, long unsigned len);
long unsigned my_strlen(const char *s);

int main() {
    char test[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    my_bzero(test, ZEROED_LEN);
    printf("%d\n", test[0]);
    printf("%d\n", test[1]);
    printf("%d\n", test[2]);
    printf("%d\n", test[3]);
    printf("%d\n", test[4]);
    printf("%d\n", test[5]);
    printf("%d\n", test[6]);
    printf("%d\n", test[7]);
    printf("%d\n", test[8]);
    printf("%d\n", test[9]);

    printf("length: %lu\n", my_strlen("test"));
    printf("length: %lu\n", my_strlen(""));
    printf("length: %lu\n", my_strlen("hello world"));

    return 0;
}

The output looks OK at first sight:

0
0
0
0
0
0
0
0
0
0
length: 4
length: 0
length: 11

However, there is actually a bug. When I remove the bzero code, I have the following test script:

#include <stdio.h>

long unsigned my_strlen(const char *s);

int main() {
    printf("length: %lu\n", my_strlen("test"));
    printf("length: %lu\n", my_strlen(""));
    printf("length: %lu\n", my_strlen("hello world"));
    printf("length: %lu\n", my_strlen("bla"));

    return 0;
}

And with that, the output is not at all OK:

length: 18446744073709551615
length: 0
length: 11

What's happening here? I know that the char test[10] array was allocated on the stack. Maybe this is the mysterious "stack alignment" coming back to bite me?

Actually, it's not. After looking around a bit, I realise that Ben Howard puts -1 in the CX register. When I do this, my code works too:

section .text
    global my_strlen

    my_strlen:
        mov rcx, -1
        mov rsi, rdi ; backup rdi
        mov al, 0    ; look for \0
        repne scasb  ; actually do the search
        sub rdi, rsi ; save the string length
        dec rdi      ; don't count the \0 in the string length
        mov rax, rdi ; save the return value
        ret

This is the output I get with the test program in C:

length: 4
length: 0
length: 11

Copying and pasting code from Ben Howard without understanding it is no fun though. So I'll look for the reason why mov rcx, -1 magically fixes things. The answer is in the REP instruction algorithm inside the x64 instruction set documentation:

WHILE CountReg ≠ 0
DO
  Service pending interrupts (if any);
  Execute associated string instruction;
  CountReg ← (CountReg – 1);
  IF CountReg = 0
    THEN exit WHILE loop; FI;
  IF (Repeat prefix is REPZ or REPE) and (ZF = 0)
  or (Repeat prefix is REPNZ or REPNE) and (ZF = 1)
    THEN exit WHILE loop; FI;
OD;

This pseudo-code shows that the loop stops as soon as CountReg (RCX in my case) reaches 0, even if the byte in AL has not been found yet. So for the "Repeat prefix is REPNE" condition to do its job, RCX must not run out before the NUL byte is found. The simplest way to guarantee that is to set it to -1, which as an unsigned count is the largest possible value: the scan always hits the NUL byte long before the counter gets back to 0.
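
To convince myself, here is a rough C model of what repne scasb does (scan_for_nul is a made-up helper, only there to illustrate the role of the count register):

#include <stdio.h>
#include <stddef.h>

// rough C model of "mov al, 0; repne scasb": scan bytes until AL is found
// or the count register runs out
static size_t scan_for_nul(const char *s, unsigned long long count) {
    const char *p = s;
    while (count != 0) {        // WHILE CountReg != 0
        char c = *p++;          // SCASB compares one byte and advances RDI
        count--;                // CountReg <- CountReg - 1
        if (c == '\0') break;   // REPNE stops when the byte matches AL
    }
    return (size_t)(p - s - 1); // same as "sub rdi, rsi" then "dec rdi"
}

int main(void) {
    // if RCX happens to be 0 on entry, the loop body never runs and the
    // "length" underflows, much like the buggy output above
    printf("count = 0  -> %zu\n", scan_for_nul("test", 0));

    // with RCX = -1 (the largest unsigned value), the NUL byte is always
    // found long before the counter runs out
    printf("count = -1 -> %zu\n", scan_for_nul("test", (unsigned long long)-1));
    return 0;
}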

Summary

I've discovered x86-64 assembly basics: registers, libc function calls, creating custom functions and calling them from C code, and various instructions (CMP, DEC, MOV, JMP, SUB, RET, JL, POP, PUSH and REPNE SCASB).

I have a lot left to learn, most notably what stack alignment is all about. And of course, there are tons of instructions and register types (XMM for instance) that could be helpful.

This week, I'm learning about ELF, the binary file format for Linux, BSD and others. As we will see, it is full of interesting information and will serve as a good introduction to assembly programming.

High level overview

My first search led me to the Wikipedia page about ELF. It says something about "tables" and "sections" but everything is still blurry and I want to know every last detail about ELF.

To learn more, I open up the man elf in my terminal.

man-elf-1

OK, this is starting to clear things up a bit. There are 4 parts in an ELF file:

  • the ELF header,
  • the program header table,
  • the section header table,
  • and data referenced by said tables.

Method

I'll briefly explain how I plan to learn about ELF.

I tend to understand a technical subject better by programming something related to said subject. So my goal here is to build up a functional program that will read an ELF file bit by bit. As I learn more, I'll add more code to the program so it prints out more information.

Let's get started by examining what can be found in the ELF header:

elf-header-1

Lots of stuff! We can see that the ELF header is either an Elf32_Ehdr structure or an Elf64_Ehdr structure, depending on which processor architecture the binary file was built for. My computer has a 64 bits processor, so I'll use Elf64_Ehdr.

The ELF file I will be reading is /bin/ls, that's the file behind the ls command which lists files in a directory.

I start off with the following elf.c file:

#include <string.h>
#include <sys/mman.h>
#include <elf.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    const char *ls = MAP_FAILED;
    int fd = -1;
    struct stat stat = {0};

    // open the /bin/ls file with read-only permission
    fd = open("/bin/ls", O_RDONLY);
    if (fd < 0) {
        perror("open");
        goto cleanup;
    }

    // find the size of /bin/ls
    if (fstat(fd, &stat) != 0) {
        perror("stat");
        goto cleanup;
    }

    printf("Everything is OK!\n");

    // put the contents of /bin/ls in memory
    ls = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (ls == MAP_FAILED) {
        perror("mmap");
        goto cleanup;
    }

    // close file descriptors and free memory before the program exits
    cleanup:
    if (fd != -1) {
        close(fd);
    }
    if (ls != MAP_FAILED) {
        munmap((void *)ls, stat.st_size);
    }

    return 0;
}

For reading /bin/ls, I'm using mmap. This allows me to access the contents as if they were one large array of bytes, instead of using multiple successive calls to read.
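
For reference, here is a hedged sketch of what the read()-based alternative would look like (read_whole_file is a made-up helper); with mmap, I don't have to manage this buffer and loop myself:

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

// fill a heap buffer with successive read() calls; the caller already knows
// the file size from fstat() and must free() the returned buffer
static char *read_whole_file(const char *path, size_t size) {
    char *buf = malloc(size);
    if (buf == NULL) {
        return NULL;
    }

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return NULL;
    }

    size_t total = 0;
    while (total < size) {
        ssize_t n = read(fd, buf + total, size - total);
        if (n <= 0) { // error or unexpected end of file
            free(buf);
            close(fd);
            return NULL;
        }
        total += (size_t)n;
    }

    close(fd);
    return buf;
}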

Note that in the following tests, I won't be checking "out of bounds" references. So I won't do this:

if (i < file_size) {
    do_something(file_contents[i])
}

Instead, I'll do this:

do_something(file_contents[i]);

This is clearly unsafe in a production context, but for testing, it'll be alright.

I can compile this program with the following command:

gcc -Wall -Wextra -Werror elf.c -o elf && ./elf /bin/ls

For now, all is good. The debug message "Everything is OK!" is output to my terminal.

Little by little, I'll now replace this debug message with code that will read specific parts of /bin/ls based on what the ELF header tells me. This means I'll go over each and every property of the Elf64_Ehdr structure and learn about their meaning.

However, I will not study e_ident[EI_VERSION], e_machine, e_version and e_flags because the man elf is quite clear about what they do.

e_ident[MAG*]

An ELF file apparently contains a series of bytes that make it easy to recognize as an ELF file and not a jpg or an mp4. This does not mean that a file containing these bytes is necessarily a valid ELF file, but at least we can expect that it probably is.

Let's check if /bin/ls contains those bytes:

if (
    (unsigned char)ls[EI_MAG0] == 0x7f &&
    (unsigned char)ls[EI_MAG1] == 'E' &&
    (unsigned char)ls[EI_MAG2] == 'L' &&
    (unsigned char)ls[EI_MAG3] == 'F'
) {
    printf("Yes, this is an ELF file!\n");
}

OK! /bin/ls does start with \x7fELF.

e_ident[EI_CLASS]

The memory slot EI_CLASS contains either ELFCLASS32 or ELFCLASS64. These values respectively mean that the binary ELF file is meant to work on a 32 bits system or a 64 bits system.

if ((unsigned char)ls[EI_CLASS] == ELFCLASS64) {
    printf("Yes, this file is made for 64 bits!\n");
}

On my machine, /bin/ls was compiled for a 64 bits CPU.

Now, I'll test that EI_CLASS actually contains ELFCLASS32 for binaries that have been compiled for 32 bits CPUs. Here's how I can do that:

# install the 32 bits libc
sudo apt install gcc-multilib

# create a test file in C that displays "Hello world!"
echo '#include <stdio.h>' >> elf-32.c
echo 'int main() { printf("Hello world!\n"); return 0; }' >> elf-32.c

# compile the file for 32 bits
gcc -m32 -Wall -Wextra -Werror elf-32.c

# read the EI_CLASS value
readelf -h a.out | grep -e '^\s*Class'

And I get the following output:

  Class:                             ELF32

So for a 32 bits binary, I do get the ELF32, or ELFCLASS32, value in EI_CLASS. Awesome!

Note how there is also an ELFCLASSNONE value possible for EI_CLASS. I'm not sure what this could be used for.

Update 08/2017: After playing a bit more with ELF, I now understand what ELFCLASSNONE exists for: it is used to indicate an invalid EI_CLASS by programs that analyse ELF binary files, for instance readelf.

e_ident[EI_DATA]

Let's take a look at EI_DATA now. This memory slot says whether the ELF file is meant for little-endian or big-endian computers.

I always thought that this was a choice at the operating system level. Actually, it's a choice at the CPU level. My x86-64 Intel CPU uses little-endian. And it turns out that Linux supports big-endian architectures as well, for instance AVR32.

Anyway, let's see if /bin/ls was indeed compiled for a little-endian CPU:

if ((unsigned char)ls[EI_DATA] == ELFDATA2LSB) {
    printf("Yes, compiled for little-endian!\n");
}

And I get the "Yes, compiled for little-endian!" message, so all is good!

e_ident[EI_OSABI]

The ELF header also allows one to know for which operating system and ABI (ie the conventions that let separately compiled pieces of code call each other) the ELF file was created.

I expect /bin/ls to have been compiled for ELFOSABI_LINUX since I'm running the tests on Ubuntu. Let's confirm this:

if ((unsigned char)ls[EI_OSABI] == ELFOSABI_LINUX) {
    printf("Yes, this is a Linux binary!\n");
}

Woops, the confirmation message is not displayed. That means that ls[EI_OSABI] is not ELFOSABI_LINUX. Unexpected!

I'll check the docs one more time to see what I've overlooked.

If the ABI of /bin/ls is not Linux, maybe it is ELFOSABI_NONE (or ELFOSABI_SYSV). I don't know what it is, but since it looks like the most generic one, it's worth a try.

Here's the test:

if ((unsigned char)ls[EI_OSABI] == ELFOSABI_SYSV) {
    printf("Yes, this is a SYSV binary!\n");
}

And, this time I get the confirmation message. Yes!

I think ELFOSABI_SYSV is probably the ABI that was used on the System V operating system, which Linux later chose to follow. However, I'm not 100% sure of that.

It seems that the memory slot EI_OSABI is the source of great confusion.

On the one hand, some people say that EI_OSABI identifies the kernel:

elf_osabi identifies the kernel, not userland

On the other hand, it seems that most ELF files on Linux use ELFOSABI_SYSV instead of ELFOSABI_LINUX:

All the standard shared libraries on my Linux system (Fedora 9) specify ELFOSABI_NONE (0) as their OSABI.

Except on recent versions of Fedora :

Possibly the vendor is cross compiling on FreeBSD or using a very recent Fedora system where anything using STT_GNU_IFUNC will be marked as ELFOSABI_LINUX.

And finally, the man elf says the following:

Some fields in other ELF structures have flags and values that have platform-specific meanings; the interpretation of those fields is determined by the value of this byte.

So what I think is important to remember here is that Linux mostly uses the SYSV ABI, but maybe someday the LINUX ABI will contain Linux-specific things that the Linux kernel will then be able to interpret in a somewhat different way than it would with the SYSV ABI.

e_type

Let's move on to the next piece of information the ELF header gives us access to: e_type. That's apparently the type of ELF file currently being read. It could be one of:

  • ET_REL,
  • ET_CORE,
  • ET_NONE,
  • ET_EXEC, an executable file such as /bin/ls,
  • ET_DYN, a shared library.

ET_EXEC and ET_DYN seem quite evident to me. But I had to do some reading to understand what ET_REL, ET_CORE and ET_NONE are.

The ET_REL mystery

I did so many Google searches about this that I can't remember them all.

What I ended up understanding, though, is that ET_REL stands for "relocatable". That means a piece of code that is not in its final place and will be combined with other code to form an executable.

That's typically object files or static libraries.

The ET_CORE mystery

Asking DuckDuckGo about "ELF ET_CORE", I am redirected to a description of ET_CORE. This page tells me that ET_CORE files are coredumps.

Coredumps contain a copy of the crashed process's memory at the moment of the crash, which can help programmers debug the crash after the fact.

But how can I get access to a coredump if I want to check for the ET_CORE flag inside it?

My first idea is to willingly make a program crash with a segmentation fault because I remember seeing "core dumped" when that happens. Here we go:

echo 'int main() {return (*(int *)0x0);}' > test.c
gcc -Wall -Wextra -Werror test.c
./a.out

I get a segmentation fault as expected. And a coredump.

Segmentation fault (core dumped)

However, I can't find the coredump file anywhere.

While reading up on coredumps on Stackoverflow, I learn about gcore. This tool makes generating a coredump much simpler:

sudo gcore <pid>

Update 08/2017: By default, Ubuntu does not generate coredump files. Instead, Apport sends crash reports to a webservice managed by Canonical to help them fix bugs. That's why I couldn't find the coredump file on my machine. You can see where coredumps go on your Linux system by running cat /proc/sys/kernel/core_pattern as root.

The ET_NONE mystery

Same as ELFCLASSNONE, it seems like ET_NONE is never actually used by anyone. I wonder if maybe this value is meant to be used by ELF file parsers to indicate invalid ELF file contents when needed. For instance, if you try to execute a file that has an invalid e_type, maybe Linux internally uses ET_NONE to represent "this file is not valid, don't use it". Or something like that.

Given that I have had the same question with ELFCLASSNONE before, I'll ask on Stackoverflow. Just minutes later, an engineer from Google has an answer:

Anywhere validity of the ELF file is verified. Here is an example from the Linux kernel tools.

Tests

Now, let's run some tests just to make sure that I understand the different values of e_type.

# create a minimal program
echo 'int main() { return 0; }' > test.c
# compile to executable form, should be ET_EXEC
gcc -Wall -Wextra -Werror test.c
readelf -h a.out | grep -e '^\s*Type'

# create a minimal library
echo 'int test() { return 0; }' > test.c

# compile to object form, should be ET_REL
gcc -c -fPIC -Wall -Wextra -Werror test.c
readelf -h test.o | grep -e '^\s*Type'

# compile to a static library, should be ET_REL too
ar -r libtest.a test.o
readelf -h libtest.a | grep -e '^\s*Type'

# compile to a shared library, should be ET_DYN
gcc -shared test.o -o libtest.so
readelf -h libtest.so | grep -e '^\s*Type'

# create a coredump, should be ET_CORE
sudo gcore 1
readelf -h core.1 | grep -e '^\s*Type'

Sure enough, we get the expected output:

  Type:                              EXEC (Executable file)
  Type:                              REL (Relocatable file)
  Type:                              REL (Relocatable file)
  Type:                              DYN (Shared object file)
  Type:                              CORE (Core file)

Update 08/2017: I've just run the tests again on Arch Linux and the very first test, the one that shows "EXEC" on Ubuntu, now shows "DYN". I think this may be related to PIE per this StackOverflow thread about "EXEC" being reported as "DYN". Maybe gcc has different default options on Arch Linux than it does on Ubuntu.

e_entry

According to man elf, e_entry points to the first instruction that the CPU will execute when you launch a program. Let's test this with the most basic assembly we can use. We should be able to see our assembly code at the address pointed to by e_entry.

void _start() {
    // _exit(0)
    asm(
        "mov $0, %rdi;"
        "mov $60, %rax;"
        "syscall;"
    );
}

I'll compile and link with:

gcc -c -Wall -Wextra -Werror test.c && ld test.o

I've manually linked with ld because linking with gcc automatically adds the libc (and all its magic) which I don't want for these tests.

And now, in our ELF test script, we'll print e_entry:

printf("Entry: %lu\n", ((Elf64_Ehdr *)ls)->e_entry);

In my case, e_entry has the value of 4194480. So, I expect to find the assembly instructions from above precisely at that address in the a.out binary. Let's see:

objdump -d a.out

This is the output:

a.out:     file format elf64-x86-64

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:	55                   	push   %rbp
  4000b1:	48 89 e5             	mov    %rsp,%rbp
  4000b4:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
  4000bb:	48 c7 c0 3c 00 00 00 	mov    $0x3c,%rax
  4000c2:	0f 05                	syscall
  4000c4:	90                   	nop
  4000c5:	5d                   	pop    %rbp
  4000c6:	c3                   	retq

The assembly instructions are indeed at addresses 0x4000b4, 0x4000bb (0x3c is 60 in base 10) and 0x4000c2. The _start function is located at address 0x4000b0, which is 4194480 in base 10.

As expected! Great!

Program headers

Usefulness

Further down in the ELF manual, I see information about program headers:

program_header-1

Apparently, these headers describe bits of memory called "segments" which contain instructions that will run at process startup.

This means that object files most likely don't contain program headers, since those files are not executable. A quick readelf -l test.o confirms that.

Alright, but now, what precisely is in a "segment"? What's the difference between a "section" and a "segment"? They feel like they are the same.

Before studying "sections", which will hopefully help me understand "segments", I'd like to run a few tests to see if I can read the contents of a program header.

Tests

The program headers are made accessible via e_phoff, e_phentsize and e_phnum. And each header contains a p_type.

I'd like to see the values of the p_type for each header in /bin/ls:

#define TEST_HEADER_TYPE(type, value, name)\
    if ((type) == (value)) {\
        printf("%s\n", (name));\
    }

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    uint32_t type = ph->p_type;

    // all the p_type can be found in /usr/include/elf.h
    TEST_HEADER_TYPE(type, PT_NULL, "PT_NULL");
    TEST_HEADER_TYPE(type, PT_LOAD, "PT_LOAD");
    TEST_HEADER_TYPE(type, PT_DYNAMIC, "PT_DYNAMIC");
    TEST_HEADER_TYPE(type, PT_INTERP, "PT_INTERP");
    TEST_HEADER_TYPE(type, PT_NOTE, "PT_NOTE");
    TEST_HEADER_TYPE(type, PT_SHLIB, "PT_SHLIB");
    TEST_HEADER_TYPE(type, PT_PHDR, "PT_PHDR");
    TEST_HEADER_TYPE(type, PT_TLS, "PT_TLS");
    TEST_HEADER_TYPE(type, PT_NUM, "PT_NUM");
    TEST_HEADER_TYPE(type, PT_LOOS, "PT_LOOS");
    TEST_HEADER_TYPE(type, PT_GNU_EH_FRAME, "PT_GNU_EH_FRAME");
    TEST_HEADER_TYPE(type, PT_GNU_STACK, "PT_GNU_STACK");
    TEST_HEADER_TYPE(type, PT_GNU_RELRO, "PT_GNU_RELRO");
    TEST_HEADER_TYPE(type, PT_LOSUNW, "PT_LOSUNW");
    TEST_HEADER_TYPE(type, PT_SUNWBSS, "PT_SUNWBSS");
    TEST_HEADER_TYPE(type, PT_SUNWSTACK, "PT_SUNWSTACK");
    TEST_HEADER_TYPE(type, PT_HISUNW, "PT_HISUNW");
    TEST_HEADER_TYPE(type, PT_HIOS, "PT_HIOS");
    TEST_HEADER_TYPE(type, PT_LOPROC, "PT_LOPROC");
    TEST_HEADER_TYPE(type, PT_HIPROC, "PT_HIPROC");
}

Here's what I get:

PT_PHDR
PT_INTERP
PT_LOAD
PT_LOAD
PT_DYNAMIC
PT_NOTE
PT_GNU_EH_FRAME
PT_GNU_STACK
PT_GNU_RELRO

The reassuring thing is that I find the same types of program headers when I run readelf -l /bin/ls, so all is good here!

Section headers

Usefulness

On to section headers. The manual doesn't say much about them:

A file's section header table lets one locate all the file's sections.

As I said before, it seems like sections (pointed to by section headers) and segments (pointed to by program headers) are closely related.

There are some differences between program and section headers though. Let's find out exactly what they are.

Differences

The ELF header has one section-related field with no program-header equivalent: e_shstrndx. It is the index of the section header for the .shstrtab section, which contains the names of all of the sections. Inception!

Anyway, what this tells me is that we can probably get the buffer that contains all section names like this:

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;

Elf64_Shdr *shstrtab_header = (Elf64_Shdr *)((char *)ls + (eh->e_shoff + eh->e_shentsize * eh->e_shstrndx));

// this is a buffer that contains the names of all sections
const char *shstrtab = (const char *)ls + shstrtab_header->sh_offset;

What's the format of the shstrtab buffer though? Searching for "elf format" on DDG leads me to the ELF specification. It clearly defines the format of a "String table".

strtab-1

The important thing to note is that each section has a name. That name is stored in the section string table as a C string at a specified index.
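
As a quick check, here is a hedged sketch of a helper (print_section_names, my own name) that slots into the test program above and prints every section's name out of .shstrtab:

// given the mmap'ed file contents (the `ls` buffer from the test program),
// walk the section header table and print every section's name
static void print_section_names(const char *ls) {
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)ls;

    // locate the section header of .shstrtab, the table of section names
    const Elf64_Shdr *shstrtab_header =
        (const Elf64_Shdr *)(ls + eh->e_shoff + eh->e_shentsize * eh->e_shstrndx);
    const char *shstrtab = ls + shstrtab_header->sh_offset;

    // sh_name is an index into .shstrtab where the section's name starts
    for (int i = 0; i < eh->e_shnum; i++) {
        const Elf64_Shdr *sh =
            (const Elf64_Shdr *)(ls + eh->e_shoff + eh->e_shentsize * i);
        printf("section %2d: %s\n", i, &shstrtab[sh->sh_name]);
    }
}

Running this against /bin/ls should list familiar names such as .text, .rodata and .shstrtab itself.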

Maybe less important, but very interesting: section names seem very close to the kind of thing you see in assembly language. I have never programmed in assembly before. I've only seen bits of assembly here and there. But maybe knowing ELF will help me learn assembly down the road.

Update 06/2017: I have started learning x86-64 assembly and knowing about ELF has turned out to be very useful.

Similarities

Sure, there are differences between sections and segments. But some things are the same. For instance, program headers have a type PT_DYNAMIC and section headers have SHT_DYNAMIC. This "Introduction to ELF" by Red Hat says they are the same:

SHT DYNAMIC dynamic tags used by rtld, same as PT DYNAMIC

But I have a problem. I can't find anything that explains what "dynamic tags used by rtld" means. Page 42 of the ELF specification shows that segments of type PT_DYNAMIC and sections of type SHT_DYNAMIC contain ElfN_Dyn structs. Back in man elf, we can see that these structs contain a d_tag field. Probably the "dynamic tag" that Red Hat was talking about.

dtag-1

This d_tag is interesting! The manual tells me that it can have the type DT_NEEDED, in which case it contains an index into the string table .strtab, so that we can effectively get the name of a shared library that the ELF file we are reading depends on.

Let's try this out! Here's what I plan to do:

  • find the PT_DYNAMIC segment,
  • inside the PT_DYNAMIC segment, find the d_tag with type DT_STRTAB so I know where the .strtab is,
  • inside the PT_DYNAMIC segment, for each d_tag of type DT_NEEDED, print out the C string that is inside .strtab at index d_un.d_val (that should be the name of a shared library).

Let's go:

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
// look for the PT_DYNAMIC segment
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    // I've found the PT_DYNAMIC segment
    if (ph->p_type == PT_DYNAMIC) {
        const Elf64_Dyn *dtag_table = (const Elf64_Dyn *)(ls + ph->p_offset);

        // look for the string table that contains the names of the
        // shared libraries needed by this ELF file
        const char *strtab = NULL;
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_STRTAB) {
              strtab = (const char *)dtag_table[j].d_un.d_ptr;
            }
        }

        // if there is no string table, we are stuck
        if (strtab == NULL) {
          break;
        }

        // now, I'll look for shared library names inside the string table
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_NEEDED) {
              printf("shared lib: %s\n", &strtab[dtag_table[j].d_un.d_val]);
            }
        }

        // only look into PT_DYNAMIC
        break;
    }
}

And I compile again with gcc -g -Wall -Wextra -Werror test.c.

At first sight, everything works as expected. When I run ./a.out against the a.out ELF file itself instead of /bin/ls, that is, when I try to find out what shared libraries a.out needs, I get:

shared lib: libc.so.6

However, when I try to read shared libraries of system binaries like /bin/ls, I get a segmentation fault. gdb tells me that something is wrong with my printf call. Specifically, gdb tells me that the memory I am trying to access is not available:

strtab = 0x401030 <error: Cannot access memory at address 0x401030>

That's odd, because readelf -a /bin/ls | grep 401030 shows that the .dynstr section is actually at address 0x401030.

I'm a little lost here. Again, I call Stack Overflow to the rescue. And the same Google engineer as before answers one more time:

If the binary is not running, you need to relocate this value by $where_mmaped - $load_addr. The $where_mmaped is your ls variable. The $load_addr is the address where the binary was statically linked to load (usually it's the p_vaddr of the first PT_LOAD segment; for x86_64 binary the typical value is 0x400000).

Now that I think about it, there was something about that in the manual, but I had overlooked it:

When interpreting these addresses, the actual address should be computed based on the original file value and memory base address.

Thanks to "Employed Russian" over on Stack Overflow and to the manual, I now understand how the memory is laid out. I've made a little image out of it for future reference:

elf-strtab-offset-1
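
In code, the translation described in that answer boils down to something like this (same variable names as in the fix below; ls is the pointer to the mmapped file, and load_addr/load_offset come from the first PT_LOAD program header):

// d_un.d_ptr as stored in the file is a virtual address, not a file offset,
// so translate it relative to where the first PT_LOAD segment gets loaded
uint32_t strtab_offset = load_offset + (strtab_addr - load_addr);
const char *strtab = ls + strtab_offset;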

Knowing this, I can fix my test program:

Elf64_Ehdr *eh = (Elf64_Ehdr *)ls;
// look for the PT_LOAD segment
const char *load_addr = NULL;
uint32_t load_offset = 0;
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    // I've found the PT_LOAD segment
    if (ph->p_type == PT_LOAD) {
        load_addr = (const char *)ph->p_vaddr;
        load_offset = ph->p_offset;
        break;
    }
}

// if there is no PT_LOAD segment, we are stuck
if (load_addr == NULL) {
  goto cleanup;
}

// look for the PT_DYNAMIC segment
for (int i = 0; i < eh->e_phnum; i++) {
    Elf64_Phdr *ph = (Elf64_Phdr *)((char *)ls + (eh->e_phoff + eh->e_phentsize * i));
    // I've found the PT_DYNAMIC segment
    if (ph->p_type == PT_DYNAMIC) {
        const Elf64_Dyn *dtag_table = (const Elf64_Dyn *)(ls + ph->p_offset);

        // look for the string table that contains the names of the
        // shared libraries needed by this ELF file
        const char *strtab = NULL;
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_STRTAB) {
              // mark the position of the string table
              const char *strtab_addr = (const char *)dtag_table[j].d_un.d_ptr;
              uint32_t strtab_offset = load_offset + (strtab_addr - load_addr);
              strtab = ls + strtab_offset;
            }
        }

        // if there is no string table, we are stuck
        if (strtab == NULL) {
          break;
        }

        // now, I'll look for shared library names inside the string table
        for (int j = 0; 1; j++) {
            // the end of table of Elf64_Dyn is marked with DT_NULL
            if (dtag_table[j].d_tag == DT_NULL) {
                break;
            }

            if (dtag_table[j].d_tag == DT_NEEDED) {
              printf("shared lib: %s\n", &strtab[dtag_table[j].d_un.d_val]);
            }
        }

        // only look into PT_DYNAMIC
        break;
    }
}

Now, everything works as expected, with both a.out and /bin/ls.

From all of this, what I conclude is that program headers matter when an executable is loaded and run, whereas section headers matter to object and shared files during the link phase. Both types of headers point at the same underlying contents of the file, but they don't describe those contents in the same way or with the same granularity.
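
A quick way to see both views of the same file side by side is to compare readelf's two listings; readelf -l even prints a section-to-segment mapping at the end:

readelf -S /bin/ls   # section headers: the linker's view
readelf -l /bin/ls   # program headers: the loader's view, plus the section-to-segment mapping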

GOT and PLT

In man elf, there are multiple references to "GOT" and "PLT". Given that I have no idea what these mean, I'll see if I can find anything about those abbreviations.

The best explanation I could find is "PLT and GOT, the key to code sharing and dynamic libraries" at TechNovelty.org:

  • the GOT stores the addresses of variables (and functions) that are unknown at compile time; the dynamic linker fills them in at load time or at runtime,
  • the PLT stores small stubs for calls to functions from shared libraries, whose addresses are unknown both at compile time and at link time (when code is compiled with -fPIC, which is often the case); the stubs jump through the GOT so these addresses can be resolved lazily at runtime (a quick way to inspect both is shown below).
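
To poke at the PLT and GOT of the little test program from earlier, something like this should do (flags and output may vary with your toolchain):

objdump -d -j .plt a.out   # disassemble only the PLT stubs
readelf -r a.out           # relocations; the R_X86_64_JUMP_SLOT entries point into the GOT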

After some additional queries about GOT and PLT on DDG, I find an article that shows how to exploit the GOT to execute a shellcode.

Another interesting find is this paper published by Leviathan Security about a modified ELF file format specifically made for forensic analysis.

Summary

While writing this article, I've learned about the ELF file format and its contents (the ELF header, the program and section headers, etc.), the readelf command, how a shared library gets linked to an executable, and the different kinds of ELF files (core, exec, object, shared).

This introduction to ELF has also made me aware of the GOT and the PLT, which will allow me to better understand certain types of hacks in the future.