How to control duplicate posts?

Are you getting duplicate posts on your WordPress website?

By default, WPeMatico includes duplicate post controls, which can be disabled or adjusted if necessary.

In addition, when a running campaign finds a duplicate post, it stops fetching, because it assumes that all the following posts are also duplicates. This saves time and resources. You can disable these controls in the settings or individually within each campaign.

WPeMatico’s duplicate control works with three different methods. Developers who want to extend the features can also add new methods through WordPress filters.

Two of these methods are activated by default. They are:

1- Control of duplicate posts by title.

The duplicate control by post title compares the feed items with the posts that already exist. If it finds a post with the same title as one of the feed items, it treats that item as a duplicate and stops the process. This control works with all posts already inserted in WordPress, whether they were published manually or by a WPeMatico campaign.
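The title check described above can be pictured with a small Python sketch. This is only an illustration, not WPeMatico's code: the function name, the data layout and the exact-match comparison are all assumptions made for the example.

```python
def fetch_until_title_match(feed_items, existing_titles):
    """Collect feed items until one matches an existing post title."""
    new_items = []
    for item in feed_items:  # assumed feed order: newest first
        if item["title"] in existing_titles:  # exact match is an assumption
            break  # duplicate found: stop the process
        new_items.append(item)
    return new_items


# Titles of every post on the site, no matter how each post was created.
manual_posts = {"Hello world"}
campaign_posts = {"Weekly roundup"}
existing = manual_posts | campaign_posts

feed = [
    {"title": "Brand new story"},
    {"title": "Weekly roundup"},
    {"title": "Older unseen story"},
]
new_titles = [item["title"] for item in fetch_until_title_match(feed, existing)]
print(new_titles)
```

Only the first item is kept here, because the second one matches an existing title and ends the run.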

1.1- Allow posts with the same title.

The duplicate control by post title can be deactivated to allow posts with the same title. The same feed item will still not be duplicated, because the duplicate control by hash of the source continues to work.

 

2- Control of duplicate posts by hash.

Duplicate checking by hash complements duplicate checking by title, which can fail in many cases.

For example, if the feed has items with the same title, you can disable the title control and keep the hash control.

Because the hash is checked only against the last retrieved item, relying on it alone may generate duplicate posts when duplicate checking by title is unchecked in the campaign.

The duplicate control hash is stored in the campaign. If you delete a post in order to fetch it again, you must also delete the hash from the campaign; otherwise the campaign will think that the post still exists and will not fetch it.
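To see why deleting the post alone is not enough, here is a simplified Python model of a hash kept in the campaign. It is a sketch under assumptions, not the plugin's implementation: the `last_hash` key, the choice of SHA-256 and hashing the item URL are all hypothetical.

```python
import hashlib


def item_hash(item_url):
    """Hash an item's source URL (what gets hashed is an assumption)."""
    return hashlib.sha256(item_url.encode("utf-8")).hexdigest()


def run_campaign(campaign, feed_urls):
    """Fetch URLs newest-first until the stored hash is reached."""
    fetched = []
    for url in feed_urls:
        if item_hash(url) == campaign.get("last_hash"):
            break  # reached the last item retrieved by the previous run
        fetched.append(url)
    if feed_urls:
        # Only the newest item is remembered for the next run.
        campaign["last_hash"] = item_hash(feed_urls[0])
    return fetched


campaign = {}  # hypothetical campaign record
feed = ["https://example.com/b", "https://example.com/a"]

first_run = run_campaign(campaign, feed)   # everything is new
second_run = run_campaign(campaign, feed)  # stored hash matches at once

# The hash lives in the campaign, so it has to be cleared there as well.
campaign.pop("last_hash")
third_run = run_campaign(campaign, feed)   # items are fetched again
```

In this model the second run fetches nothing, and the items come back only after the stored hash is removed from the campaign record.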

 

3- The third, extra option to avoid duplicates is deactivated by default.

This option adds an extra duplicate filter by source permalink in a meta field value. As the name suggests, for each item in the feed it checks all posts to see whether any of them was already retrieved from the same source permalink.

This process may not be advisable for large websites: although it is optimized to make only one query to the table, it has to read the entire database for each published entry, which takes time and can consume extra resources.
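The cost mentioned above is easier to see in a small Python model. This is a sketch only: the `source_permalink` meta key and the post structure are hypothetical, and the real check runs as a database query rather than a loop in memory.

```python
def is_duplicate_by_source(item_link, posts, meta_key="source_permalink"):
    """Look through every post for one fetched from the same permalink."""
    scanned = 0
    for post in posts:
        scanned += 1
        if post["meta"].get(meta_key) == item_link:
            return True, scanned
    return False, scanned  # a new item forces a full scan


# A site with 10,000 posts, each remembering where it came from.
posts = [
    {"title": f"Post {n}", "meta": {"source_permalink": f"https://example.com/{n}"}}
    for n in range(10_000)
]

found, scanned_for_known = is_duplicate_by_source("https://example.com/42", posts)
missing, scanned_for_new = is_duplicate_by_source("https://example.com/new", posts)
print(found, scanned_for_known)
print(missing, scanned_for_new)
```

A genuinely new item is the expensive case: every stored permalink has to be examined before the item can be accepted, and that repeats for each item in the feed.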

 

4- Extra Option: Continue Fetching if found duplicated items.

Unless it is the first time a feed is fetched, finding a duplicate means that all the following items were already obtained.

This option prevents the fetch from being interrupted when a duplicate post is found: the campaign skips each duplicate and keeps reading the feed, looking for new items, until the end of the feed, every time. This is very bad for performance and is not recommended, because most of the time the following posts were already fetched.

How it works: in almost all cases, feed items are ordered by date and time. When the campaign runs, it goes item by item from newest to oldest and stops at the first duplicated item, because that means all the following (older) items are also duplicates.

See more info in this FAQ.
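The difference between the default behavior and the "continue fetching" option can be sketched in Python as follows. The function and the `continue_on_duplicate` flag are hypothetical names for illustration and do not come from the plugin.

```python
def fetch(feed_items, is_duplicate, continue_on_duplicate=False):
    """Walk the feed newest-first; return new items and how many were checked."""
    new_items = []
    checked = 0
    for item in feed_items:
        checked += 1
        if is_duplicate(item):
            if continue_on_duplicate:
                continue  # skip this one and keep reading the feed
            break  # default: assume everything older was fetched already
        new_items.append(item)
    return new_items, checked


already_fetched = {"post-3", "post-2", "post-1"}
feed = ["post-5", "post-4", "post-3", "post-2", "post-1"]  # newest first

default_new, default_checked = fetch(feed, already_fetched.__contains__)
full_new, full_checked = fetch(
    feed, already_fetched.__contains__, continue_on_duplicate=True
)
print(default_new, default_checked)
print(full_new, full_checked)
```

Both runs find the same two new items, but the second one checks all five feed items instead of three. On a date-ordered feed that extra work is repeated on every run without finding anything more.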

 

5- Still duplicating?

This can have many different causes, and each case must be looked at individually.

Some things to check are the following:

  • It is likely that the feed is not entirely standard. You can check the feed for errors with this link or any other feed validator.
  • Many feeds that fail are missing the <guid> tag, which contains the URL used to compare duplicates and hashes.
  • If you are using the Custom Titles feature from the Professional extension, you’ll also lose Duplicate Titles checking, because the original title will always be saved as a different title.
  • Some caching systems and cache plugins affect the duplicate controls, because the URLs of new posts are not compared in real time but against previously cached versions of the database. See more info in this FAQ.
  • Systems such as CloudFlare or firewalls should not affect duplicate control, but it might happen with some configurations that we are not aware of.
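As a quick way to test the <guid> point from the list above, the following Python sketch lists the items of an RSS 2.0 feed that have no <guid>. It is only a rough helper for a manual check and does not replace a feed validator. The sample feed is invented for the example, and Atom feeds, which use <id> instead, are not handled.

```python
import xml.etree.ElementTree as ET


def items_missing_guid(rss_xml):
    """Return the titles of RSS items with no <guid> or an empty one."""
    root = ET.fromstring(rss_xml)
    missing = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None or not guid.strip():
            missing.append(item.findtext("title", default="(untitled)"))
    return missing


# Invented sample feed: the second item lacks a <guid>.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>With guid</title><guid>https://example.com/a</guid></item>
<item><title>Without guid</title><link>https://example.com/b</link></item>
</channel></rss>"""

missing_titles = items_missing_guid(SAMPLE_FEED)
print(missing_titles)
```

To try it on a real feed, save the feed to a file and pass the file contents to `items_missing_guid`.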

After checking these cases, remember that we also have a support ticket system, so you can open a case for free and we will be very happy to help you solve the problem.