How to perform Data Anonymization for Sugar development

Post originally written by Emmanuel iNet.

Here is another guest post from Emmanuel Dyan from the Elite SugarCRM Partner iNET Process. In it he addresses a common Sugar project requirement using an open source tool developed by iNET Process.

The problem that we will try to solve with this post is:

How do we make sure that we are never developing using actual customer data but, at the same time, work with data that reflects reality?

Data Anonymization

Usually, when we work on a customization project for a customer you have a minimum of 3 different environments: Development,  User Acceptance Testing (UAT), and Production. To make sure that we work in optimal and consistent conditions, we usually copy the database from one environment to another (preferably from production to other environments). Doing this type of manipulation has multiple drawbacks, including:

  • We have to collect a Database Dump which means that it contains unencrypted raw data. What would happen if we mistakenly expose this dump to someone who is unauthorized?
  • We have to test some functionality to make sure that it works. What would happen if we test a Campaign that sends thousand of e-mails ... to the ... actual customers of our customer?

Anonymizing the data is the best practice to avoid "playing" with customer data and to keep their trust in us.

The challenge with anonymizing data is figuring out how to overwrite the data with something that is completely unrecognizable. For example: "John Doe" will become "voluptatem accusantium". His job title becomes "doloremque" and his country "magnam". His phone number will become "569898520114457878744778" instead of "+123456789".

Anonymization Tool

Now how do we work in realistic conditions with that kind of anonymization? Indeed, we need another solution that works with any Sugar instance. The solution we are demonstrating below is implemented in our open source CLI tool (sugarcli).

The anonymization architecture follows:

  • It uses another independent tool, called "neuralyzer" (by iNet Process) that is a command line tool to anonymize a database.
  • Neuralyzer uses a library called "Faker" to generate realistic data.
  • It is composed of two subcommands:
    • A configuration generator.  This is used to generate a configuration file automatically without destroying the system tables (config, relationships, etc).
    • The anonymizer that connects directly to the Sugar Database to perform the job. It uses iNET Process libsugarcrm to performs SQL queries (via a PDO connection). It also purges deleted records and cleans _cstm tables of deleted records. It finishes by then emptying all _audit tables.

Configuration Generator

The configuration generator reads all the Sugar database tables and tries to guess, from the field type or the field name what type of fake data needs to be generated. For example a field containing the string city will receive a random city.  If a field contains _user_id then it will be ignored in order to preserve unique IDs.

To use it, first download SugarCLI from https://github.com/inetprocess/sugarcli.

Then run:

./sugarcli.phar anonymize:config --path <sugarcrm_path>

The parameters are the following:

    --file=FILE                   Path to the configuration file [default: "../db/anonymization.yml"]    --ignore-table=IGNORE-TABLE   Table to ignore. Can be repeated (multiple values allowed)    --ignore-field=IGNORE-FIELD   Field to ignore. Can be repeated (multiple values allowed)-p, --path=PATH                   Path to SugarCRM installation.

Example

The command ./sugarcli.phar anonymize:config creates a file that looks like:

guesser_version: 1.0.0 entities:     accounts:         cols:             name: { method: company }             description: { method: sentence, params: [20] }             facebook: { method: url }             twitter: { method: url }             googleplus: { method: url }             account_type: { method: randomElement, params: [['', Analyst, Competitor, Customer, Integrator]] }             industry: { method: randomElement, params: [['', Apparel, Banking, Biotechnology, Chemicals]] }             annual_revenue: { method: randomNumber, params: [4] }             phone_fax: { method: phoneNumber }             billing_address_street: { method: streetAddress }             billing_address_city: { method: city }             billing_address_state: { method: state }             billing_address_postalcode: { method: postcode }             billing_address_country: { method: country }             rating: { method: sentence, params: [8] }             phone_office: { method: phoneNumber }             phone_alternate: { method: phoneNumber }             website: { method: url }

As you can see, the commands uses different methods to guess the type of faker methods to use:

  • If it is a dropdown then it gets the list of values from vardefs
  • If the field contains a known word it uses a pre-defined method (example .*_city = city)
  • Otherwise it will use the dbType (varchar = sentence)

You can change the content of the file once generated to match your criteria.

Run the anonymization

Warning! Do not run this on a production instance! This command will overwrite data in the target database. 
./sugarcli.phar anonymize:run --path <sugarcrm_path>

The parameters are the following:

    --file=FILE        Path to the configuration file [default: "../db/anonymization.yml"]    --force            Run the queries    --remove-deleted   Remove all records with deleted = 1. Won't be launched if --force is not set    --clean-cstm       Clean all records in _cstm that are not in the main table. Won't be launched if --force is not set    --sql              Display the SQL of UPDATE queries    --table=TABLE      Anonymize only that table (repeat for multiple values) (multiple values allowed)-p, --path=PATH        Path to SugarCRM installation.

Example

The command ./sugarcli.phar anonymize:run --table=accounts --force gives the following output:

Be careful, the anonymization is going to start That will overwrite every data in the Database !
If you are sure, please type "yes" in uppercase YES
Anonymizing accounts50/50 [============================] 100%
Emptying accounts_auditEmptying bugs_auditEmptying campaigns_auditEmptying cases_auditEmptying contacts_auditEmptying contracts_audit....
Done in 0.42 sec (consuming 40.5Mb)
Now you can dump your database and send it to your development team! If you use the parameters --remove-deleted and --clean-cstm, it will be smaller too.
  • Hi All,

    I would like to ask whether this tool is still works with PHP 7.4, SugarCRM 12.0+ and 5.7

    I tried https://support.sugarcrm.com/Documentation/Sugar_Developer/Sugar_Developer_Guide_12.0/Integration/Migration/Migrating_From_On-Demand_to_On-Site/ 

    and sugar works fine on our development linux server. And want to do "Date Anonymization".

    But getting following error when the sugarcli command runs (sugarcli anonymize:config --path /var/www/html/sugarcrm/) ....

    [Inet\SugarCrm\Exception\SugarException]
    '/var/www/html/sugarcrm' is not a valid sugar installation.

    Please advice.

    Regards

    Sudipta

  • Hello, I would like to ask whether this tool is compatible with PHP 7.3 and SugarCRM 10 

  • Comment originally made by edyaninet.

    Hi Anthony,

    Thanks for your message.

    We know the “system-data-level” tables by experience, but the list could evolve as you have noticed and we have no warranty that our tool will not break something on future versions. But the config generator is there only to help not to provide a final config file that you'll use blindly. It helps because with dozens of fields in dozens of tables, it's almost impossible to generate the config yourself.

    Also, there are some fields that you don't want to anonymize sometimes, because it helps the developers. In that case you need to edit and change the config.

    Conclusion: read the config and adapt it to your needs :)

    Thanks for your link, that'll help us to make sure that we maintain our tables list up to date ! If you need help please post a message there, we'll be pleased to answer.

    Have a nice week end

    Emmanuel

  • Comment originally made by Anthony.

    This is really impressive. My question (not having tested this out yet): how does it know which tables are "system-data-level" versus which ones are "business-data-level"? For example, what part of the config generator knows how to properly handle the 'config' table or the 'user_preferences' table, etc? I ask not only out of general curiosity, but also because I'm wondering how "intelligent" this part of the process is, eg "future-proof". If Sugar adds in a new system-level table (like metadata_cache was added in 7.6 or field_value_cache in 7.7) or removes a once-essential table (like 'relationships' was removed in 7.7), is this something that the sugarcli tool (or config-generator) will handle gracefully, or if that's not reasonable to expect (too magical), then are such changes something that can be easily "massaged" into working by making version-specific tweaks to the generated anonymize:config file (as needed)?

    I would very much like to dive into playing with this tool and--if its as awesome as it sounds like it might be--try to get it folded into Sugar's KB article that overlaps with this topic:

    https://support.sugarcrm.com/Knowledge_Base/Platform_Management/Providing_a_Backup_Without_Sensitive_Data/

    But as you can see in the above article, the challenge is ensuring that certain tables be left alone and intact to preserve parity from the source instance, while other tables might still need to be mashed up a bit to ensure the anonymized instance is neutered from certain external behavior (like sending out a pending campaign, firing off workflow alerts, pulling in emails from external accounts and thus marking the messages as read, etc).

    Again, very pleased with this idea/tool as a concept and hope it meets the needs any developer or support agent would need to request a sanitized yet functionally-equivalent version of a client instance.

  • Comment originally made by edyaninet.

    Hi rszczypka and sorry for the delay, could you download the most recent version of sugarcli and try again ?

    Thanks