Firestore Tips #1 & #2: How to Play God in Firestore, and Firestore vs. SQL

Plus: Why going against the alphabet makes queries easier

Lukas Oldenburg
The Bounce

--

Long before Firestore made it into the toolkit of Google Tag Manager’s Server-Side offering, it was already a powerful document database. Having worked with Firestore for some years now, I want to share a few simple tips that may help you avoid the mistakes I made. First though, I will compare Firestore with classical SQL-style databases.

“God in a Firestore” (OpenAI DALL·E)

Part 2 is out: Read about the properties every Firestore document needs and how to handle concurrent updates to documents.

Firestore vs. classical SQL databases/BigQuery

I have been using Firestore for 3 years now. It has become an integral part of all my Google Cloud Platform data flows (e.g. for the Adobe Analytics Component Manager for Google Sheets). However, understanding the difference between Firestore and classical SQL databases was at first not that easy. After all, both are “databases”, right?

Google Cloud Firestore is a super-fast document database, i.e., it is not a relational database where multiple tables can be joined by common identifiers (like MySQL databases). Nor is it like BigQuery, a column-oriented Data Warehouse. You can imagine Firestore as a big container of documents, where each document is like its own JS object/Python dictionary. The structure of each document is completely flexible: Document 1 may have the 3 keys “saladType”, “sausageType” and “cheeseType”, which all have string values. Document 2 may have just “sausageType”, but its value is not a string but an array of objects (even though that would likely be dumb, it is possible). Document 2 could also have some additional keys (e.g. “houseAddress” and “schoolAddress”).
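As a minimal sketch of that flexibility with the Python client (the “lunchOrders” collection and the document IDs are just made-up examples), both of these writes into the same collection are perfectly fine:

from google.cloud import firestore

db = firestore.Client()
# Document 1: three keys, all with string values
db.collection("lunchOrders").document("doc1").set({
    "saladType": "caesar",
    "sausageType": "bratwurst",
    "cheeseType": "gruyere"
})
# Document 2: fewer keys, a different data type for "sausageType", plus additional keys
db.collection("lunchOrders").document("doc2").set({
    "sausageType": [{"name": "chorizo", "spicy": True}],
    "houseAddress": "Somestreet 1",
    "schoolAddress": "Otherstreet 2"
})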

That is the total opposite of a rigid relational or BigQuery-style SQL database, where you need to clearly define each column (the closest equivalent to a “key”) and its data type (string, number, map etc.). Sending in a row with a column that has not been defined? Error. A column’s value does not adhere to the defined data type? Error.

Not in Firestore. That makes it perfect to store unnormalized, i.e. flexible, semi-structured or highly nested data.

Looking at a Firestore document that contains a data layer error event. Every data layer event has a different payload, so the variables and properties inside of each document can vary massively.

Firestore is furthermore perfect for retrieving individual documents by their IDs. So if you know you need the address and other data about customer “c1234”, you simply ask for document “c1234” and you have it within milliseconds.
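In the Python client, such a point read is essentially a one-liner (a sketch; the “customers” collection name is an assumption):

from google.cloud import firestore

db = firestore.Client()
# fetch exactly one document by its ID
snapshot = db.collection("customers").document("c1234").get()
if snapshot.exists:
    customer = snapshot.to_dict()  # the document's fields as a Python dictionary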

Compare that to BigQuery where we usually wait seconds or minutes depending on the query, even if it returns just one row. And the costs for such a query are usually significantly higher.

Another good thing about Firestore is that you can update or delete individual documents super-fast and with no limits. BigQuery, in contrast, has strict limits on the number of DELETE or UPDATE statements, and engineers often have to find questionable and resource-intensive ways to work around those limitations. BigQuery also relies heavily on caching, so when another INSERT or UPDATE query alters a BigQuery table and you run a SQL statement just a minute later, your query might return results based on the table’s state from a couple of minutes ago. That never happens in Firestore.
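Updating or deleting a single document is just as direct (again a minimal sketch with assumed collection and field names):

from google.cloud import firestore

db = firestore.Client()
doc_ref = db.collection("customers").document("c1234")
# update() only touches the fields you pass in; everything else stays as it is
doc_ref.update({"status": "inactive"})
# or remove the document entirely
doc_ref.delete()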

That does not mean Firestore is better than BigQuery or other SQL databases. Its use case is simply very different. BigQuery is not made to be a transactional or application database. Its strength is complex analytical queries on millions of rows. Which is exactly what Firestore stinks at. So: Where BigQuery excels is where Firestore fails. Getting more than just one document (think: one row in BQ) is possible, but as soon as you try to get hundreds of documents, Firestore becomes slow and highly inefficient, as you have to iterate over a “generator” object and read out every single document in a “for” loop.
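This is the kind of multi-document read the next paragraph warns about. A rough sketch (the “scriptRuns” collection, the “result” filter and the “duration” field are just examples):

from google.cloud import firestore

db = firestore.Client()
# stream() returns a generator of document snapshots
# (newer client versions prefer the FieldFilter syntax for where())
docs = db.collection("scriptRuns").where("result", "==", "done").stream()
durations = []
for doc in docs:  # every document pulled out of the generator is one billed read
    durations.append(doc.to_dict().get("duration", 0))
avg_duration = sum(durations) / len(durations) if durations else 0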

Your Firestore costs are based on document reads, among other things. So you definitely don’t want to do such multi-document reads frequently. Quickly exporting a couple of thousand rows is thus not something you should use Firestore for. Analyzing the data in Firestore collections (like “give me the average duration of a script run by looking at all documents in the scriptRuns collection and their duration property values and then calculating the average”) is also a should-not if you have thousands of documents.

Furthermore, the queries you can run against Firestore databases are very limited, and queries with more than one condition (or even a combination of “key” equals “value” plus a sorting order other than “ascending”) require you to create an index for that query first. This takes hours initially and incurs additional costs.

80% of the limited querying options of Firestore

So much for what I have found to be the biggest differences between Firestore and SQL-style databases. However, “SQL-style” is very broad, as there are so many SQL database types, so I am likely not covering the SQL sphere sufficiently. Feel free to correct me in the comments, thanks! 😇

Now to my first tip when working with Firestore:

Tip #1: Use an alphabetically inverted document ID

The default sorting method of Firestore is “ascending”. And you can’t change the default. That means the document with the alphabetically lowest ID will always be returned first. It also means that, if your document ID starts with a timestamp or an incrementing number, you will always see the oldest document first. I am showing it here based on the interface, but the logic is the same if you query Firestore via code (e.g., when getting a “stream” of documents):

In most of my use cases, I want to get the newest documents first, not the oldest ones. But for that, I already need to use a filter query (switching to “descending” order):
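In code, that descending sort looks roughly like this (assuming a “loggedAt” timestamp field like the one used further below):

from google.cloud import firestore

db = firestore.Client()
# explicitly sort by a timestamp field in descending order to get the newest documents first
newest_first = (
    db.collection("scriptRuns")
    .order_by("loggedAt", direction=firestore.Query.DESCENDING)
    .stream()
)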

If I now want to combine that with a query like “result equals ‘done’”, I get the error that “Sorting isn’t applicable for certain types of conditional filtering”. So even a simple query like “give me all completed scripts sorted by date, newest first” becomes a challenge. You can overcome this by creating an index, but that is a bit cumbersome, increases costs, and you need to create an index for every query combination you want to use, which is inefficient.
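The code equivalent of that combination is a filter on one field plus a sort on another. In my experience, the Python client then raises a “FailedPrecondition” error that includes a link for creating the missing composite index (a sketch, with the same assumed field names as above):

from google.cloud import firestore

db = firestore.Client()
# equality filter on one field + ordering by another field -> needs a composite index first
query = (
    db.collection("scriptRuns")
    .where("result", "==", "done")
    .order_by("loggedAt", direction=firestore.Query.DESCENDING)
)
docs = list(query.stream())  # fails until the composite index has been created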

So a better approach is to prepend your document ID with a value which, when sorted alphabetically, lists the newest document first:

We thus need to assign the lowest values to the newest documents. How can you do that? You simply take a date far into the future, e.g., Jan 1, 2100 (assuming that our architecture will be obsolete by then… 😅), and then subtract the current date and time from it. As long as we don’t time-travel, the number becomes lower as we approach Jan 1, 2100. In Python, that could look like this (feel free to make this more efficient):

import time
from datetime import timezone, date
from google.api_core.datetime_helpers import DatetimeWithNanoseconds

SCRIPT_RUN_ID = "your-actual-document-ID"
# get the current time into logged_at
logged_at = DatetimeWithNanoseconds.now(timezone.utc)
# get the 1st of January 2100
jan1_2100 = time.mktime((date(2100, 1, 1)).timetuple())
# turn the current time into a timetuple timestamp
logged_at_ts = time.mktime(logged_at.timetuple())
# subtract the current time from the 1st of January 2100 and turn it into an integer number, e.g. "3623784"
decreasing_ts = int(jan1_2100 - logged_at_ts)
# prepend it to the script run ID, e.g. "3623784-your-actual-document-ID"
log_id = f"{decreasing_ts}-{SCRIPT_RUN_ID}"
# Create the Firestore document, using the original ID in the "id" field so you can easily reference it directly later.
firedoc = {
    "id": SCRIPT_RUN_ID,
    "loggedAt": logged_at,
    # the rest of your document values come here
}
# now create your document based off of the firedoc dictionary as usual
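For completeness, the actual write could then look something like this (a minimal sketch; the “scriptRuns” collection name is just an assumption):

from google.cloud import firestore

db = firestore.Client()
# use the inverted-timestamp ID as the document ID, so the default ascending sort returns the newest document first
db.collection("scriptRuns").document(log_id).set(firedoc)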

Note that Firestore timestamp fields are nanosecond-precise, so you need the “DatetimeWithNanoseconds” class to accomplish that, as Python’s standard datetime module “only” provides precision to the microsecond, and Firestore will reject microsecond-specific timestamps.

Of course, “newest first” is just one example. The takeaway is: When designing your document IDs, think about the order you would like to have by default when querying documents, and then create the ID logic accordingly.

Addendum: Hernán had a helpful note that you should take into account if your Firestore database serves a high-volume environment (500 document creations per second and up). Using “monotonically increasing” (or, as here, decreasing) values like timestamps in document IDs can be problematic in these cases if you don’t handle the volume well.

Tip #2: Play God: Decide over the “Time to Live”

In many cases, your Firestore database will just accumulate more and more documents over time. Take our Data Layer Error tracking example. Another case for me is monitoring Cloud Function runs (runtime, timeouts, etc.), where each Cloud Function run gets its own document. Or architectures where you create Firestore documents containing task lists, then work through the tasks, continuously updating the document with the progress.

Once the tasks are done, there is some value in keeping the Firestore documents around for a while for debugging and maybe reporting purposes. But nobody will care whether e.g. the Adobe Analytics Product Classification import half a year ago ran to completion. Since all these documents cause unnecessary storage costs and may have negative impacts on your Firestore queries, you should delete those that you don’t need anymore.

Similar to Cloud Storage lifecycle management, Firestore introduced its own lifecycle management not so long ago.

And because it is Firestore, it is flexible: You simply specify the key that holds the timestamp at which the document should be deleted. That timestamp is the document’s “Time To Live” (TTL). So when creating the document, you already tell Firestore when it should perish — like a God. In this example, I use the expireAt field for that purpose:

from datetime import timedelta, timezone
from google.api_core.datetime_helpers import DatetimeWithNanoseconds

now = DatetimeWithNanoseconds.now(timezone.utc)
doc_content = {
    "id": "{{the-document-ID}}",
    "createdAt": now,
    "status": "created",
    "expireAt": now + timedelta(days=90)  # document should be deleted in 90 days
}
# Now create the document in Firestore as usual ...

In the Firestore UI, click on the trash icon on the left side, then “Create Policy” on the top and follow the steps to set up your TTL. You need to do that for each collection where you want documents to “die” after their “time to live” is up.

This is what heaven looks like? Quite boring…

What happens if I create a document without an expireAt key? The document will live forever.

That’s it for my first tips & tricks episode with Firestore. In the meantime, Part 2 has come out: Read about the properties every Firestore document needs and how to handle concurrent updates to documents.

And if you want to read my content right in your mailbox, immediately after I publish it and without the Medium reading restrictions for non-paying users, subscribe! I will not use your contact details for anything but this purpose.

--

Digital Analytics Expert. Owner of dim28.ch. Creator of the Adobe Analytics Component Manager for Google Sheets: https://bit.ly/component-manager