CategoryRamblings

Know your data structures – List vs Dictionary vs HashSet

Are there any cases when it doesn’t really matter how your data is structured, as long as you’re fulfilling the task at hand? Or is it always important to use the perfect data structure for the job? Let’s find out!

Those collections have quite different purposes and use cases. Specifically, Lists should be used when all you have to do is stuff like enumerating the items or accessing them randomly via index.

Lists are very similar to plain arrays. Essentially they are an array of items that grow once its current capacity is exceeded. It’s the standard and probably the most used collection. Items can be accessed randomly via the [] operator at constant time. Adding or removing at the end costs O(1) as well, except when capacity is exceeded. Doing it in the beginning or the middle requires all the items to be shifted.

Dictionaries and HashSets instead are specialised collections intended for fast-lookup scenarios. They basically map the item with a key built using an hash function. That key can be later on used to quickly retrieve the associated item.

They both share more or less the same asymptotic complexity for all the operations. The real difference is the fact that with a Dictionary we can create key-value pairs (with the keys being unique), while with an HashSet we’re storing an unordered set of unique items.

It’s also extremely important to note that when using HashSets, items have to properly implement GetHashCode() and Equals() .


On Dictionaries instead that is obviously needed for the Type used as key.

I wrote a very small profiling application to check lookup times of List, Dictionary and Hashset. Let’s do a quick recap of what these collections are. It first generates an array of Guids and uses it as source dataset while running the tests.

The code is written in C# using .NET Core 2.2 and was executed on a Macbook Pro mid 2012. Here’s is what I’ve got:

Collection creation
Collection creation

Lists here perform definitely better, likely because Dictionaries and HashSets have to pay the cost of creating the hash used as key for every item added.

Collection creation and lookup
Collection creation and lookup

Here things start to get interesting: the first case shows the performance of creation and a single lookup. More or less the same stats as simple creation. In the second case instead lookup is performed 1000 times, leading to a net win of Dictionary and HashSets. This is obviously due to the fact that a lookup on a List takes linear time ( O(n) ), being constant instead ( O(1) ) for the other two data structures.

Lookup of a single item
Lookup of a single item

In this case Dictionaries and HashSet win in both executions, due to the fact that the collections have been populated previously.

Lookup in a Where()
Lookup in a Where()

For the last example the system is looping over an existing dataset and performing a lookup for the current item. As expected, Dictionaries and HashSet perform definitely better than List.

It’s easy to see that in almost all the cases makes no difference which data structure is used if the dataset is relatively small, less than 10000 items. The only case where the choice matters is when we have the need to cross two collections and do a search.

The importance of setting the boundaries (of your domain models)

First article of the year! I really wanted to start writing this few weeks ago, but honestly I wasn’t inspired enough.
Now that I’ve spent a good portion of the Christmas break reading blogs, books and watching courses on Pluralsight, I still don’t feel inspired enough.

I guess it’s due to how I spent the other portion (eating, sleeping and playing with my kids, mostly), which left me basically without any energy at all.

But as they say, the first step is always the hardest, no?

Lately I’ve been putting some effort into improving my DDD techniques. For those of you who are still living under a rock, please consider taking a copy of the marvelous blue book by Eric Evans.

I still am in the middle of the learning process, even though I probably started this path years ago. Might be a sign of impostorism. Or it might be the fact that in this job, as in many other jobs, you never stop learning.

Not going to discuss about DDD now, or what the benefits are. I shall leave this for another post.

This is going to be just a very quick introduction about boundaries and aggregates instead. What’s an aggregate? An example should make things easier.

We can consider an Order our aggregate and Order Lines compose its internal state.

I said “internal” for a reason: the Order is the “entry-point”. We can’t have Order Lines without an Order that contains them and obviously an Order without Order Lines is pretty useless.

At the same time, you can’t access Order Lines from the outside world without going through the Order.

Clear? Definitely not rocket science.

Why “aggregate” ? Well, because you’re combining things together and building up a structure that mimics your current domain. The Order and the Order Lines are entities and value objects that will eventually form the Aggregate.

I’ll talk more about the distinction between Entities and Value Objects in another post, for now just take that for granted (or lookup on Google!).

For now just think that eventually your system will be a composition of multiple Aggregates acting and interacting together, and the quality of their interaction will be a representation of how good you know your Domain. Communication is the key here!

At this point I would say that the trick is to find the right balance and being able to identify the right boundaries for your Aggregates. Why? Simply because of divide-et-impera.

Our domain might be complex from the beginning, or become extremely complex over time. Everybody has seen this happening at some point. So take a deep breath, talk to your Domain Expert ™ and identify what are the edges and the sets of your entities. That’s it. Compartmentalize.

There’s a lot more to write on this, and I definitely will. When? That’s definitely a good question! A lot is going on in my life these days and weeks and I have the sensation that 2019 is going to be a crucial year for everyone here.

Ciao!

talk @ UniSa – 12/12/2018

Every couple of weeks or so the University of Salerno hosts former students like me to talk a bit about their professional experience. Yesterday was my turn!

I was invited by my old teacher Vittorio Scarano (hi prof!) to talk in front of a big audience of grads and undergrads. I had a lot of fun telling them my story, we discussed about my current Company, my daily job and routine and a little bit about Design Patterns (because, why not).

I got very interesting questions from them at the end and it’s amazing to see how much passion these young students have and how strong is their desire to learn and improve.

So a big, big thank you to everyone who attended! And just FYI:

  1. my team is always hiring, so make sure to check the latest open positions
  2. a very good friend of mine is hiring as well in Salerno, check it out!
  3. don’t forget to register to the DevDay Salerno community!

List of useful Docker commands

In the last few weeks I’ve been doing several experiments with Docker, just trying to grasp the main idea and maybe even come up with something useful.

As often happens with tools these days, there’s an entire world of command line tools that you should learn.

OR you can just be lazy as me and keep a list of the most common ones 🙂

I’ll be updating this list from time to time, just to keep track of what I’ve used so far and be a quick reference for my sloppy memory.

docker login [HOST] # in case you need auth to pull or push images
docker build -t [TAG]
docker images # shows a list of the local images along with the size
docker push [TAG]
docker rm $(docker ps -a -q) # this will remove ALL your containers. Be careful!
docker system prune -a # this will remove ALL your images and containers. Very useful when you run low on space, but be careful!

The next command will display the Docker daemon logs, extremely useful when you don’t know what happened (basically me most of the time) and you’re looking for an answer, even if obscure.

It depends on the OS you’re running on:

  • Ubuntu (old using upstart ) – /var/log/upstart/docker.log
  • Ubuntu (new using systemd ) – sudo journalctl -fu docker.service
  • Boot2Docker – /var/log/docker.log
  • Debian GNU/Linux – /var/log/daemon.log
  • CentOS – /var/log/daemon.log | grep docker
  • CoreOS – journalctl -u docker.service
  • Fedora – journalctl -u docker.service
  • Red Hat Enterprise Linux Server – /var/log/messages | grep docker
  • OpenSuSE – journalctl -u docker.service
  • OSX – ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/log/d‌​ocker.log
  • Windows – Get-EventLog -LogName Application -Source Docker -After (Get-Date).AddMinutes(-5) | Sort-Object Time, as mentioned here.

(thanks, Scott)

Immutable Builder Pattern

This time we’ll talk about the Immutable Builder Pattern, but with a twist: the resulting instance has to be immutable.

From time to time I need to move away from the routine, just to avoid getting bored. Also, taking short breaks might help viewing things under a different perspective. Anyways, while working on Statifier I felt the need to get back to the roots and dive a little bit into the Design Patterns World ™ .

Straight from Wikipedia:

“the intent of the Builder design pattern is to separate the construction of a complex object from its representation”

This pattern is indeed extremely useful when you have to create an instance of a complex class, eg. with a long list of properties that have to be initialized.

It can be confused sometimes with the Factory pattern, but there’s a very important difference: Factories can be used to create an object instance based on some rules and return an interface. The result will be any concrete implementation of that interface, but consumers won’t (and don’t need to ) know which one.

Builders instead know how to create a single, specific class and the result is exactly an instance of that class.

And why should I be using the Builder if it can create just one thing? Well maybe because the creation is complex and requires several steps. Maybe you cannot just do something like builder.Build(param1, param2…..) , but you need a more structured approach.

The Wikipedia page has already some good examples of how the pattern can be implemented so I won’t talk much about that. 

But what is immutability anyway? Why do I need it?

At its core, immutability is a simple, very important concept: don’t let anyone from the outside update the internal state of your objects. That’s it.

Coding-wise, it translates basically into classes with no setters at all. All the properties will be initialized in the cTor, along with all the dependencies.

And why should I use it? Here’s an excellent article by Martin Fowler (yeah I love quoting him). Take your time, I’ll wait.

Ok so now let’s get back to out Builder. Here’s a sample implementation:

Few points to note:

  1. Vehicle cTor is private. Only the Builder can create an instance, and this is the main reason why the Builder class is declare inside the Vehicle class.
  2. all properties on Vehicle are get-only. Internal state cannot be changed, immutability is ensured. Good luck changing them.
  3. every call to Build() creates a new instance of Vehicle. This is again made to ensure immutability but comes with a twist: you have to be more careful if you want to reuse an instance of the builder to create more vehicles. You’ve been warned.

© 2019 Davide Guida

Theme by Anders NorenUp ↑