Firestore Tips 3-5: Properties any Document Needs & Handling Concurrent Document Updates

How to securely update keys and increment numbers “atomically”

Lukas Oldenburg
The Bounce


Imagine two people writing a message on the same piece of paper, at the same time, without seeing each other’s message until after they are done. The outcome will likely be an illegible mix. Similar things can happen in Firestore if you don’t handle concurrent document updates properly. The second episode of my series with Firestore tips helps you with that.

But first, let’s start with something that is already so automatic for me that I almost forgot to include it in this list:

“Concurrent handling of Firestore documents” (OpenAI DALL·E)

Tip #3: The properties that any document should have

One thing that takes a bit of practice to get used to is the difference between get() and to_dict(): While get() gets you a “Snapshot” with meta information like the document’s id, to_dict() gives you a proper dictionary (JSON) with the document’s content like you see it in the Firestore interface:

from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()  # assumes default credentials and project are configured

# Example 1: Get 1 document:
# Specify the collection and document you want to retrieve
doc_ref = db.collection("myCollection").document("TheDocumentId")

# Gets a "Snapshot" with meta information about the document, e.g. `id`, `create_time`
doc_snap = doc_ref.get()
print(doc_snap.id)  # print document ID
print(doc_snap.create_time)  # print timestamp when doc was created

# Check if the document exists and get its content into a dictionary
if doc_snap.exists:
    doc_content = doc_snap.to_dict()

# ---
# Example 2: Get one or more documents from a "stream" via a filtered query:
# Create a stream of documents (as a Python "generator" object)
filtered_doc_stream = db.collection("myCollection").where(
    filter=FieldFilter("FieldToFilter", "==", "myFilter")
).stream()

# Iterate through the stream results:
for doc_snap in filtered_doc_stream:
    print(doc_snap.id)  # In a stream, no "get()" is needed to get the meta info.
    doc_content = doc_snap.to_dict()
    print(doc_content)  # document content as dictionary

While you can access important meta information like the document’s ID via the snapshot, I find it a lot easier if the most important meta info is also part of the document’s content. That meta info in my case is usually:

  • id: The document’s ID
  • lastUpdated: I very often need this to filter or sort by the “newest” or “last edited” documents
  • createdAt or startedAt (when the document creation coincides with the start of a process): Similar to lastUpdated, I often need to filter or sort by the newest or oldest documents
  • expireAt: For the Time-to-Live logic (when this document shall die), see part 1.
Example Firestore document. You can see the “expireAt”, “id”, “lastUpdated” and “startedAt” properties.

Having these properties in every document offers the following benefits:

  • You can use stream queries to filter and sort documents by their ID, the document creation or last update time. I need that — a lot!
  • It is easier and requires less code to handle everything via a proper dictionary (the Python equivalent to a JavaScript “object”) than having to jump back and forth between the doc meta info from the Snapshot and the dictionary.
  • Having all that meta info in the document content makes copying and exporting of the document content easy, e.g. converting document properties into columns of a dataframe, a BigQuery table, or simply converting the document to JSON for further processing.
  • If you ever need to migrate your Firestore data to another system or create a backup, having a self-contained document that includes both content and meta information makes the process straightforward.
  • Last but not least, you can use these fields when creating an index (if you have to; as I mentioned in part 1, I avoid this as much as I can, because a smart document structure often makes indexes unnecessary). With that meta information available as fields in your documents, you can use composite indexes to build more complex queries without additional read costs.

These benefits are not relevant for every use case. It always depends on the purpose of the document collection and how you need to be able to access it. I would however at least always have an id property.
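
As a rough illustration of the properties listed above, the following sketch builds a document body that carries its own meta information. The helper name with_meta, the ttl_days parameter and the 30-day default are assumptions made for this example and not part of the Firestore API.

```python
from datetime import datetime, timedelta, timezone


def with_meta(doc_id, content, created_at=None, ttl_days=30):
    """Return a copy of `content` enriched with id, createdAt, lastUpdated and expireAt.

    Hypothetical helper: the name, the parameters and the 30-day default TTL
    are assumptions for illustration only.
    """
    now = datetime.now(timezone.utc)
    doc = dict(content)  # copy so the caller's dictionary stays untouched
    doc["id"] = doc_id
    doc["createdAt"] = created_at or now  # pass the original value on later updates
    doc["lastUpdated"] = now
    doc["expireAt"] = now + timedelta(days=ttl_days)  # field a TTL policy could point at
    return doc


doc = with_meta("TheDocumentId", {"task1": "running"})
print(sorted(doc))
```

With a Firestore client at hand, the resulting dictionary could then be written via db.collection("myCollection").document(doc["id"]).set(doc).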

Missed Part 1? Read about the differences between BigQuery and other SQL-based databases and Firestore, how to play God in Firestore, and how you can save query complexity and costs by designing your index wisely.

Tip #4: Use ‘update plus dot notation’ to update only specific keys in a document

When working with data pipelines, you often have multiple operations running at the same time. I have some pipelines where multiple Cloud Functions can access and update the same Firestore document at nearly the same time. This can cause really difficult problems when one script overwrites what the other script just changed.

Consider the following case:

Assume a Firestore document has 3 properties. Let’s represent them as a JSON object:

{
  "task1": "running",
  "task2": "running",
  "task3": "completed"
}

Now assume the following happens in this order:

  1. Script 1 gets the document into memory.
  2. Script 2 gets the document into memory.
  3. Script 1, which is responsible for handling task 1, notes that task 1 is finished. So it sends an update to Firestore with the updated full document as follows:
{
  "task1": "completed", // changed by script 1
  "task2": "running",
  "task3": "completed"
}

4. Just 10 milliseconds later, script 2, which is responsible for handling task 2, notes that task 2 is also finished and updates the document. Since it got the document into memory before script 1 updated it, it still has “task1” as “running” in its memory. So after script 2 updates the document, the document looks as follows:

{
  "task1": "running", // back to "running", as script 2 overwrote script 1!
  "task2": "completed", // changed by script 2
  "task3": "completed"
}

Now imagine a typical scenario where there is a monitoring task that checks which tasks need to be handled (this is what I do for example when running huge bulk tasks like getting Adobe Analytics Component Usage for all segments, dimensions, metrics and date ranges or deleting thousands of segments in one run). This task monitor now sees that task 1 is still running, even though it started a long time ago. Depending on the logic, it might mark the task as “failed” or restart it, suspecting a timeout. Not what we want!

The solution is the update statement with dot notation (documentation). It is similar to Python’s built-in dict.update() method: Instead of passing back the whole updated document, you update only the keys you name, e.g. just a single key, and leave everything else unchanged. So, for example, script 2 would run:

doc_ref = db.collection('myCollection').document('TheDocumentId')
doc_ref.update({'task2': 'completed'})  # this will change only task2, not overwrite task1 again

# or, with a dynamic reference:
task_id = 2
doc_ref.update({f'task{task_id}': 'completed'})
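
To make the difference tangible, the sketch below replays the two-script sequence from above against a plain dictionary that stands in for the stored document. It is a deterministic model and not Firestore code; the function names are made up for this illustration.

```python
import copy

INITIAL = {"task1": "running", "task2": "running", "task3": "completed"}


def replay_full_overwrite():
    """Both scripts read first, then each writes back its whole copy."""
    store = copy.deepcopy(INITIAL)       # stand-in for the stored document
    script1_copy = dict(store)           # script 1 reads
    script2_copy = dict(store)           # script 2 reads
    script1_copy["task1"] = "completed"
    store = script1_copy                 # script 1 writes the full document
    script2_copy["task2"] = "completed"
    store = script2_copy                 # script 2 writes its stale full document
    return store


def replay_key_updates():
    """Each script sends only the key it owns, like update({'taskN': ...})."""
    store = copy.deepcopy(INITIAL)
    store.update({"task1": "completed"})  # script 1
    store.update({"task2": "completed"})  # script 2
    return store


print(replay_full_overwrite())  # task1 is back to "running"
print(replay_key_updates())     # both tasks are "completed"
```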

You can of course also update multiple keys in one run:

from google.api_core.datetime_helpers import DatetimeWithNanoseconds

doc_ref.update({"lastUpdated": DatetimeWithNanoseconds.now(), "task2": "completed"})

I call it “with dot notation” because similar to JavaScript objects (but unlike regular Python dictionaries), you can use a dot (.) to access sub-properties directly and safely. Imagine our document looks like this:

{
  "tasks": {
    "0": {
      "status": "running",
      "startedAt": "2023-08-11 12:34:56.123456789"
    },
    "1": {
      "status": "running",
      "startedAt": "2023-08-11 11:33:52.123456789"
    }
  }
}

To update tasks["0"], we do:

updated_task0 = {
    "status": "completed",
    "startedAt": "2023-08-11 12:34:56.123456789",
    "completedAt": "2023-08-11 12:38:56.223423456"
}

doc_ref.update({'tasks.0': updated_task0})  # note the "."

If this sounds too obvious for you, I included it due to my own stupidity aka learning experience. I was not aware of this for quite a while.
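
A small pure-Python model may help to see what the dot changes. The function apply_update below is a hypothetical stand-in for how update() treats its keys, not Firestore code: a dotted key is followed into the nested map and replaces only that path, while a key without dots replaces the whole top-level field.

```python
import copy


def apply_update(document, changes):
    """Simplified model of update(): dotted keys address nested fields.

    Illustration only; it ignores quoting, field transforms and deletes.
    """
    result = copy.deepcopy(document)
    for key, value in changes.items():
        *parents, leaf = key.split(".")
        target = result
        for part in parents:
            target = target.setdefault(part, {})  # walk into the parent maps
        target[leaf] = value
    return result


doc = {
    "tasks": {
        "0": {"status": "running"},
        "1": {"status": "running"},
    }
}

# Dotted key: only tasks.0 is replaced, tasks.1 survives.
with_dot = apply_update(doc, {"tasks.0": {"status": "completed"}})

# No dot: the whole "tasks" map is replaced, so tasks.1 is gone.
without_dot = apply_update(doc, {"tasks": {"0": {"status": "completed"}}})

print(with_dot)
print(without_dot)
```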

Last but not least: If you (like me) like to use the set statement a lot instead of update because set also creates a document in case it does not exist yet, you can get a similar result to ‘update with dot notation’ by adding merge=True to set commands (more in Google’s documentation).
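
As a sketch of that alternative, the helper below marks one task as completed through set with merge=True. The function name mark_task_completed and the tasks.<id>.status layout are assumptions carried over from the example document above; doc_ref is expected to be a Firestore document reference. With set, the nesting is expressed through nested dictionaries rather than a dotted key.

```python
def mark_task_completed(doc_ref, task_id):
    """Set tasks.<task_id>.status to "completed", creating the document if needed.

    `doc_ref` is assumed to be a Firestore DocumentReference (for this sketch,
    anything with a compatible `set(data, merge=True)` method works).
    """
    partial = {"tasks": {str(task_id): {"status": "completed"}}}
    doc_ref.set(partial, merge=True)  # merge=True keeps fields not named in `partial`
    return partial
```

It could be called as mark_task_completed(db.collection("myCollection").document("TheDocumentId"), 0).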

Tip #5: Use “Increment” to in- or decrement a value atomically

Now imagine though that the key you must update is not exclusively reserved for your script, like in the task ID example above. Instead, a key can be updated by many different sources.

Example: You have a script that makes thousands of changes through an API that limits how many requests it can handle at the same time. So you want to make sure that the number of currently running tasks is stored centrally. For that, you can use a simple Firestore document that keeps track of this number in a single numeric field, runningTasks, which starts at 0.

Now imagine we have 2 API tasks that start at almost the same time. Both tasks want to tell this Firestore document to increment the value of runningTasks by 1. In the end, the value should be 2.

But if we first had to get the document into memory, read out the value of runningTasks and then increment it and then update the document, we would run into a similar issue as before: If both tasks get the document in their original state, both would increment from 0 to 1, and then we would end up with 1 and not 2 in the runningTasks field. Similarly, when the tasks end, they would decrement the value. If there is no time overlap this time, we would then end up with -1 currently running tasks… 😝

Luckily, Firestore has a solution for that: Increment. With it, you can atomically increment (or decrement) the value of a key. “Atomically” means the change is applied as one indivisible step on the server, so another increment happening at the same time cannot get lost or overwrite yours. When using “Increment”, you don’t need to first get the document into memory, read the value, increment it and then send the update. Instead, you update the document directly (again with update plus dot notation):

from google.cloud import firestore  # provides firestore.Increment

doc_ref = db.collection('myCollection').document('runningTasks')

# Increment the value of the `runningTasks` key by 1
doc_ref.update({'runningTasks': firestore.Increment(1)})
# Decrement by 1:
doc_ref.update({'runningTasks': firestore.Increment(-1)})
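
To see why the read-free update matters, the sketch below replays both approaches with a plain dictionary standing in for the Firestore document. It is a deterministic model of the race described above, not Firestore code; the function names are made up for this illustration.

```python
def read_modify_write(start_value=0):
    """Two tasks read the counter before either one writes back."""
    store = {"runningTasks": start_value}
    seen_by_task_a = store["runningTasks"]      # task A reads 0
    seen_by_task_b = store["runningTasks"]      # task B reads 0 as well
    store["runningTasks"] = seen_by_task_a + 1  # task A writes 1
    store["runningTasks"] = seen_by_task_b + 1  # task B writes 1 again
    return store["runningTasks"]


def server_side_increment(start_value=0):
    """Each task only sends "+1"; the store applies the deltas one after another."""
    store = {"runningTasks": start_value}
    for delta in (1, 1):                        # one delta per task, no prior read
        store["runningTasks"] += delta
    return store["runningTasks"]


print(read_modify_write())      # 1: one start got lost
print(server_side_increment())  # 2: both starts are counted
```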

That’s it for my second tips & tricks episode with Firestore. Maybe there will be more? Stay tuned!



Digital Analytics Expert. Owner of dim28.ch. Creator of the Adobe Analytics Component Manager for Google Sheets: https://bit.ly/component-manager