1. Subcollections

The connector currently supports extracting subcollections. When configuring it, you need to explicitly specify which ones you want to extract.

The most efficient way to extract subcollections is to use Collection Groups, which allow retrieving subcollections from different parent documents in a single query.

Imagine a restaurants collection in which each restaurant document has its own reviews subcollection.

With Collection Groups, the connector retrieves all reviews from all restaurants in a single query. This speeds up extraction significantly, especially for large collections with nested subcollections.
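To make the idea concrete, here is a minimal sketch in plain Python that models which documents a Collection Group covers. It does not call Firestore; the document IDs and the hotels collection are invented for illustration, and collection_group is a hypothetical helper, not part of the connector.

```python
# Hypothetical document paths; all IDs are invented for illustration.
DOCUMENT_PATHS = [
    "restaurants/r1",
    "restaurants/r1/reviews/a",
    "restaurants/r1/reviews/b",
    "restaurants/r2",
    "restaurants/r2/reviews/c",
    "hotels/h1/reviews/d",
]


def collection_group(paths, collection_id):
    """Return every document whose immediate parent collection has the given ID."""
    matches = []
    for path in paths:
        segments = path.split("/")
        # Document paths alternate collection/document, so the parent
        # collection ID is always the second-to-last segment.
        if len(segments) >= 2 and segments[-2] == collection_id:
            matches.append(path)
    return matches


reviews = collection_group(DOCUMENT_PATHS, "reviews")
for path in reviews:
    print(path)
```

Note that the group matches on the subcollection name alone, so the review under hotels is returned together with the restaurant reviews. With the google-cloud-firestore Python client, a comparable query is issued through client.collection_group("reviews"); how the connector calls it internally is not described here.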

2. Incremental vs full sync

Incremental sync is supported for both collections and subcollections when using Collection Group queries. To set up incremental mode with Collection Groups, you have to create an index on the incremental key.

For example, if you’re using an updated_at field as the incremental key, you must create an index on that same field so that Collection Group queries can filter on it.
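As a rough sketch, the snippet below builds an index definition for that case. It assumes the index is managed through the Firebase CLI's firestore.indexes.json file (it can also be created in the console), and that the subcollection is called reviews; both are assumptions, and the exact field layout should be checked against the official index documentation before deploying.

```python
import json


def collection_group_index(collection_id, field):
    """Build a firestore.indexes.json-style definition (assumed layout)."""
    return {
        "indexes": [],
        "fieldOverrides": [
            {
                "collectionGroup": collection_id,
                "fieldPath": field,
                "indexes": [
                    # Keep the default single-collection indexes...
                    {"order": "ASCENDING", "queryScope": "COLLECTION"},
                    {"order": "DESCENDING", "queryScope": "COLLECTION"},
                    {"arrayConfig": "CONTAINS", "queryScope": "COLLECTION"},
                    # ...and add collection-group scope for the incremental key.
                    {"order": "ASCENDING", "queryScope": "COLLECTION_GROUP"},
                    {"order": "DESCENDING", "queryScope": "COLLECTION_GROUP"},
                ],
            }
        ],
    }


index_definition = collection_group_index("reviews", "updated_at")
print(json.dumps(index_definition, indent=2))
```

The collection-scope entries are repeated because a field override replaces the default indexes for that field rather than adding to them.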

More information about this can be found in the official Firestore documentation.

3. Schema

All streams have the following properties:

  • _id: a string field corresponding to the document ID.
  • document: a stringified JSON version of the document.

Additionally, there are some properties that might be included depending on the extraction mode and sync type.

For incremental sync, the replication key is mapped to a field:

  • replication_key: a date-time field corresponding to the incremental key selected by the user during the stream configuration.

For subcollections extracted using Collection Groups, an additional field records where each document lives:

  • path: a string indicating the full path of the document when extracting subcollections.
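The properties above can be illustrated with a small sketch of how one record might be assembled. The build_record helper, its parameters, and the sample document are hypothetical; only the four field names (_id, document, replication_key, path) come from this page, and the exact date-time and JSON formatting used by the connector may differ.

```python
import json
from datetime import datetime, timezone


def build_record(path, data, incremental_key=None, include_path=False):
    """Assemble a record with the stream properties described above."""
    record = {
        # The document ID is the last segment of the document path.
        "_id": path.split("/")[-1],
        # The whole document is stored as a JSON string; non-JSON types
        # such as datetime are converted with str() in this sketch.
        "document": json.dumps(data, default=str, sort_keys=True),
    }
    if incremental_key is not None:
        # Only present for incremental sync.
        record["replication_key"] = data[incremental_key].isoformat()
    if include_path:
        # Only present for subcollections extracted via Collection Groups.
        record["path"] = path
    return record


record = build_record(
    "restaurants/r1/reviews/a",
    {"rating": 5, "updated_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    incremental_key="updated_at",
    include_path=True,
)
print(record)
```

Downstream, the document field has to be parsed (for example with json.loads) before its individual attributes can be queried.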

4. Connector configuration

  1. In the Sources tab, click the “add source” button at the top right of your screen. Then, select the Firestore option from the list of connectors.

  2. Click Next and you’ll be prompted to add your access. Check the instructions next to each configuration option to learn where to find the required parameters for the connection.

  3. These are the available configurations for this source:

    • Credentials file: The credentials file for the service account linked to your Firestore project. Make sure the account has access to perform read operations on your collections. You should upload the JSON credentials file directly.
    • Database name: The database name to extract data from. If not provided, the default database for the account will be used.
    • Batch size: The number of documents to process in a single batch. Keep in mind that larger batch sizes may cause timeouts when reading the document stream.
    • Subcollection extraction mode: Determines how subcollections are handled by the connector.
      • Nested documents: Recursively fetches subcollections for each parent collection, embedding subcollections within parent documents.
      • Collection group: Extracts subcollections as separate streams, using Collection Group queries to speed up the extraction process.
      • None: Ignores subcollections.
    • Filter collections by name: Restricts discovery to specific collections in the database. This speeds up the discovery process when you’re interested in just a few collections but your database contains many.
    • Filter subcollections by name: Filter nested subcollections to avoid extracting unnecessary data. This is useful when you just need a subset of subcollections from a given document. This configuration depends on the Subcollection extraction mode chosen.
      • For Nested documents mode, use the notation collection.sub_collection. For example, use conversations.messages to extract the subcollection messages from the top-level collection conversations. A wildcard is also accepted if you want to get all nested subcollections: conversations.messages.* will extract all nested subcollections under conversations -> messages.
      • For Collection group mode, simply enter the name of each subcollection you want to extract. Please note that subcollections with the same name under different root-level collections will be mapped to the same stream. It is good practice to use unique names for subcollections to avoid this behavior.
    • Start date: Starting point for incremental syncs (ISO-8601 format).

    Best Practices

    • Whenever possible, configure indexes so that subcollections can be extracted with Collection Group queries. This significantly reduces extraction time and resource usage.
    • If configuring your streams as incremental, make sure each document includes a date-time field that indicates when it was last updated. This is necessary to ensure data integrity and consistency when performing incremental extractions.
    • Be explicit when defining filters for collections and subcollections to avoid extracting data that won’t be useful for you. This helps reduce the cost of both the cloud resources needed to perform the extraction and Firestore itself. You can always add more streams later in the process if needed.
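To compare the Nested documents and Collection group extraction modes described in the configuration list, the sketch below models both on a tiny in-memory dataset. The data, the helper names, and the way subcollections are embedded (as a list under the subcollection name) are assumptions made for illustration; the connector's actual output layout may differ.

```python
# Hypothetical data: parent documents and subcollection documents keyed by path.
PARENTS = {
    "conversations/c1": {"topic": "billing"},
    "conversations/c2": {"topic": "login"},
}
MESSAGES = {
    "conversations/c1/messages/m1": {"text": "Hi"},
    "conversations/c1/messages/m2": {"text": "Thanks"},
    "conversations/c2/messages/m3": {"text": "Help"},
}


def nested_documents(parents, children, subcollection):
    """Nested documents mode: one stream, subcollection embedded in each parent."""
    result = []
    for parent_path, data in parents.items():
        prefix = f"{parent_path}/{subcollection}/"
        embedded = [doc for path, doc in children.items() if path.startswith(prefix)]
        result.append({**data, subcollection: embedded})
    return result


def collection_group_streams(parents, children, subcollection):
    """Collection group mode: parents and subcollection become separate streams."""
    parent_collection = next(iter(parents)).split("/")[0]
    return {
        parent_collection: list(parents.values()),
        subcollection: [
            {"path": path, **doc}
            for path, doc in children.items()
            if path.split("/")[-2] == subcollection
        ],
    }


nested = nested_documents(PARENTS, MESSAGES, "messages")
streams = collection_group_streams(PARENTS, MESSAGES, "messages")
print(len(nested), "nested parent documents")
print({name: len(docs) for name, docs in streams.items()})
```

In the nested form each parent must be visited to reach its messages, while the collection group form reads all messages in one pass and keeps the path so each message can still be traced back to its parent.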

4. Select your Firestore streams

  1. The next step is to tell us which streams you want to bring in. Each stream in the list corresponds to a top-level collection or subcollection in Firestore. You can select entire groups of streams or only a subset of them.

    Tip: You can find a stream more quickly by typing its name.

  2. Click Next.

5. Configure your Firestore data streams

  1. Customize how you want your data to appear in your catalog. Select a name for each table (which will contain the fetched data) and the type of sync.
    • Table name: we suggest using the same name as the collection, but feel free to customize it. You can also add a prefix to make this process faster.
    • Sync type: you can choose between INCREMENTAL and FULL_TABLE.
      • Incremental: every time the extraction happens, we’ll get only the new data, which is good if, for example, you want to keep every record ever fetched. For this to work, you need a valid date-time incremental key inside your documents.
      • Full table: every time the extraction happens, we’ll get the current state of the data, which is good if, for example, you don’t want to have deleted data in your catalog. However, keep in mind that this increases resource usage such as computing time and storage.
  2. Click Next.
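The difference between the two sync types can be sketched in a few lines of plain Python. This is an illustrative model only: the documents, the updated_at key, and the strict "greater than" comparison against the bookmark are assumptions, and the connector's internal logic is not documented here.

```python
from datetime import datetime

# Hypothetical documents with an updated_at incremental key.
DOCS = [
    {"_id": "a", "updated_at": datetime.fromisoformat("2024-01-05T00:00:00+00:00")},
    {"_id": "b", "updated_at": datetime.fromisoformat("2024-02-10T00:00:00+00:00")},
    {"_id": "c", "updated_at": datetime.fromisoformat("2024-03-15T00:00:00+00:00")},
]


def full_table(docs):
    """FULL_TABLE: every run returns the current state of all documents."""
    return list(docs)


def incremental(docs, bookmark, key="updated_at"):
    """INCREMENTAL: return documents updated after the bookmark, plus the new bookmark."""
    fresh = sorted((d for d in docs if d[key] > bookmark), key=lambda d: d[key])
    new_bookmark = fresh[-1][key] if fresh else bookmark
    return fresh, new_bookmark


# The configured Start date (ISO-8601) serves as the first bookmark.
start_date = datetime.fromisoformat("2024-02-01T00:00:00+00:00")
first_run, bookmark = incremental(DOCS, start_date)
second_run, bookmark = incremental(DOCS, bookmark)

print([d["_id"] for d in full_table(DOCS)])
print([d["_id"] for d in first_run])
print([d["_id"] for d in second_run])
```

A document deleted in Firestore simply stops appearing in full-table results, whereas an incremental run never sees the deletion, which is why incremental mode keeps every record ever fetched.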

6. Configure your Firestore data source

  1. Describe your data source for easy identification within your organization. You can include details such as what data it brings and which team it belongs to.

  2. To define your Trigger, consider how often you want data to be extracted from this source. This decision usually depends on how frequently you need the new table data updated (every day, once a week, or only at specific times).

Check your new source!

  1. Click Next to finalize the setup. Once completed, you’ll receive confirmation that your new source is set up!

  2. You can view your new source on the Sources page, where you can also monitor its execution and completion. To see it in your Catalog, you have to wait for the pipeline to run. If needed, manually trigger the pipeline by clicking on the refresh icon. Once executed, your new table will appear in the Catalog section.

If you encounter any issues, reach out to us via Slack, and we’ll gladly assist you!