MongoDB High Availability: Minimum Configuration
If you have 1 Primary node, 1 Secondary node, and 1 Arbiter, and you use a majority write concern, then you are requiring that every write be acknowledged by both the Primary and the Secondary.
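For example (a sketch using the PyMongo driver; the hostnames, replica set name, and database/collection names are assumptions), a majority write concern can be requested per collection:

```python
from pymongo import MongoClient, WriteConcern

# Connect to the replica set (hostnames and replica set name are assumptions).
client = MongoClient("mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0")

# Require acknowledgement from a majority of voting members; in a PSA set that
# means both the Primary and the Secondary, since the Arbiter holds no data.
orders = client["shop"]["orders"].with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
)

orders.insert_one({"item": "widget", "qty": 1})
```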
As soon as either of those data-bearing nodes goes offline, you will no longer be able to write to a majority of nodes (at least 2, in this case), so majority writes will fail. When you bring the Primary back online, it will rejoin without issue because no new writes have taken place in the meantime.
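In practice, a w:majority write does not fail instantly when the majority is unavailable; it blocks until it is satisfied or until the configured wtimeout expires. A sketch of handling that in the application, reusing the orders collection from the example above:

```python
from pymongo.errors import WTimeoutError

try:
    # With the Secondary down, only the Primary can acknowledge the write, so
    # the majority cannot be reached and the configured wtimeout expires.
    orders.insert_one({"item": "gadget", "qty": 2})
except WTimeoutError:
    # The write may still have been applied locally on the Primary; it simply
    # was not acknowledged by a majority within the timeout.
    print("majority write concern could not be satisfied in time")
```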
However, if you don’t require a majority write concern and the Primary goes down, there is a chance that the Secondary promoted to Primary has not yet replicated all of the writes. In that case, the former Primary will come back online, realize that it has writes the current Primary does not, and roll them back.
The best approach for high availability (HA) is to have three data-bearing nodes: 1 Primary and 2 Secondaries. If you use a majority write concern, you are guaranteeing that a majority of the nodes have received each write. With three data-bearing nodes, you are therefore guaranteed that at least one of the two Secondaries is up to date if disaster strikes the Primary.
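As a sketch (hostnames and replica set name are assumptions), a replica set with three data-bearing nodes can be initiated like this:

```python
from pymongo import MongoClient

# Connect directly to the node that will issue replSetInitiate.
client = MongoClient("mongodb://mongo1:27017/?directConnection=true")

# One Primary candidate and two Secondaries, all data-bearing and voting.
client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo1:27017"},
        {"_id": 1, "host": "mongo2:27017"},
        {"_id": 2, "host": "mongo3:27017"},
    ],
})
```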
A configuration with 2 data-bearing nodes and an Arbiter is NOT an HA MongoDB configuration, and it will remain susceptible to data loss until you reconfigure it with 3 data-bearing nodes.
We have explained why 3 data-bearing nodes are recommended for high availability and for data that is durable across multiple nodes.
As we also mentioned, if we want to offer a 2 node + arbiter configuration for less critical apps, we can:
a) still use w:majority, but that means we will wait for a Primary to be available again, or
b) use w:1 in the application and be prepared to cache, in the application, some time window of writes so we can check whether they were applied to the new Primary after a failover, or manually re-apply the data from the rollback files (see the sketch after this list).
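Here is a minimal sketch of option b), assuming a simple in-application cache of recently written _ids that is re-checked after a failover; the collection, window length, and helper names are illustrative only:

```python
import time
from collections import deque

from pymongo import MongoClient

client = MongoClient("mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0", w=1)
orders = client["shop"]["orders"]

# (timestamp, _id) pairs for writes that were only acknowledged with w:1.
recent_writes = deque()
WINDOW_SECONDS = 300  # how far back we are willing to re-check

def insert_order(doc):
    result = orders.insert_one(doc)
    recent_writes.append((time.time(), result.inserted_id))
    # Drop entries that have aged out of the re-check window.
    while recent_writes and time.time() - recent_writes[0][0] > WINDOW_SECONDS:
        recent_writes.popleft()
    return result.inserted_id

def recheck_after_failover():
    # After a failover, verify each cached write survived on the new Primary;
    # anything missing was lost in a rollback and has to be re-applied (or
    # recovered manually from the rollback data files).
    return [
        _id for _, _id in recent_writes
        if orders.count_documents({"_id": _id}) == 0
    ]
```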
Option a) is certainly easier if you are fine with waiting for manual failover, which seems similar to how RDBMSs are often set up with a DR server.
If we use 2 + arbiter, it is best to have the Arbiter on a third node. If we only have 2 machines, then usually we would put the Arbiter on the Primary's machine, so that a failure of the Secondary's machine does not also cost the Primary its voting majority and force it to step down.
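For completeness, a sketch of the 2 + arbiter configuration (hostnames are assumptions; the arbiter is shown on a third host here, but with only two machines it would run on the Primary's machine under a different port). The arbiterOnly member votes in elections but holds no data:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongo1:27017/?directConnection=true")

client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo1:27017"},                       # data-bearing, intended Primary
        {"_id": 1, "host": "mongo2:27017"},                       # data-bearing Secondary
        {"_id": 2, "host": "mongo3:27017", "arbiterOnly": True},  # votes only, no data
    ],
})
```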
How we handle failover, though, depends on whether we choose option a) or b).
Certainly, with all of this, we need hardware that is sufficient for the load, which seems to be what triggered the issue here.