Back by popular demand, it is Windows Server 2022 drift control with Ansible! We held a LinkedIn poll to see what people in the AwesomeOps community wanted to read about most, and a deep dive into Windows Server drift control finished in 2nd place (1st place can be found here). So, in this post we are going to dive deep into the ocean of configuration management and cover everything you and your team will need to know about drift control on Windows Servers. If you have not been following along, check out our other posts here to see how our platform is set up. The other posts will give you a solid overview of all of the components used to control drift. With all of that said, let's get into it!
The Setup
In order to understand drift control within Windows ecosystems you must build things from the ground up and approach your infrastructure and operating system configuration as layers. We lifted the images below from our previous blog posts to illustrate the foundation of our drift control system. The image titled "Packer + Ansible + ADO" is the architecture for creating a scalable image building and distribution platform. This is necessary so that each build you create produces identical images across all of your compute platforms. Starting all builds from an identical place will both drastically reduce your mean time to resolution when you run into Ansible drift errors and eliminate the initial drift that creeps in during the honeymoon period right after a build.
The image titled "Ansible + JFrog + K8s" depicts the layer in the drift control platform that is responsible for connecting to any number of Windows hosts in parallel to determine whether the binary checksum on each remote host matches the checksum of the same binary held in our source of truth, JFrog. If the checksums do not match, we kick off a download and installation of the given application.
OK! We have an image layer, a centralized location to store and version control binaries, and a mechanism to retrieve and deploy those binaries to any number of Windows Server hosts, but what about the other layers? We are glad you asked. Here is a rough approximation of the layers of drift control planning that we consider when building automation platforms:
Automation Platform Tooling Layers
Code repository with git-flow strategy
Binary storage, retrieval, and version control
Secrets management solution
Compute destinations
Virtual machine image creation and distribution
Build management tooling
Configuration management tooling
Orchestration
Telemetry - centralized logs and visualizations
Notification engine
Operating System Layers
Baseline config - The hard requirements. Think: NTP, DNS, Windows Licensing, Active Directory, WinRM policy, RDP policy, Proxy settings, Certificates, Timezone, etc.
Security config - UAC policies, required users and groups, disabling guest account and rotating the password automatically, renaming the administrator account, ensuring your Ansible user accounts are in the correct groups, Firewall policies, logging locations and configs.
Enterprise tooling config - What agents need to be installed AND running on each system. Examples: Antivirus, MFA, Security, etc.
Drift config intervals - How often you want to verify that all of the layers above are in their desired state.
Presentation config - How users interact with the Windows Server desktop. You need to make sure that the user experience is identical whether a user logs into server 1 or server 10,000. This can be things like: desktop shortcuts, default applications that are not enterprise reporting and protection agents, BGInfo, backgrounds, etc.
Application config - Dependencies installed and configured correctly to suit the needs of your most prominent and/or widespread applications.
Phew, now that all of the architecture is out of the way we can have some fun with the low level engineering details.
In-Depth
Layers, layers, layers! As you may have guessed by this point, we like to think about things in layers, and Windows Server 2022 drift control is no different. When planning your Windows Server drift control system you must break your enterprise down into a manageable set of layers. Yes, the operative word here is manageable. It can be a hard line to walk, but it is necessary. Doing this will allow you and your team to design the Ansible roles and ad-hoc PowerShell scripts needed to effectively control drift within your ecosystem. First up is the baseline. We naturally start with the baseline because these settings give you a minimum viable machine (MVM) allowed and capable of running within your environment. Starting from the bottom will not only ensure that all of your Windows Servers have a minimum level of compliance, it is also the strongest part of your automation defense against the dark arts of manual intervention. That is to say, you may eventually have to give up some of the other layers on certain systems (no one wants to, but we have all been there when the application developers say that this agent or that config is breaking their app!), but you can never give up the baseline layer, or nothing will work. So what is in a baseline config? To the code!
Now that you have defined your layers, you will want to stub out a main.yml file under the tasks directory of your main role. This main file looks a little something like this:
---
- name: Gather facts about host
  ansible.builtin.setup:
  tags:
    - always
    - gather_facts

- name: Print facts about host
  when: "'gather_facts' in ansible_run_tags"
  ansible.builtin.debug:
    msg: "{{ ansible_facts }}"
  tags: gather_facts

- name: Windows configuration tasks
  block:
    - name: Include Baseline Windows Config
      import_tasks: baseline_config.yml
      tags:
        - packer
        - drift_control
        - baseline_config

    - name: Include tasks for user accounts
      import_tasks: users.yml
      tags:
        - packer
        - drift_control
        - users

    - name: Include Security Tasks
      import_tasks: security_config.yml
      when:
        - "'10.0.20348' in ansible_distribution_version"
        - ansible_os_family == 'Windows'
      tags:
        - drift_control
        - version2022
As we talked about in our previous drift control blog here, we utilize this main file to call all of our other layers, use tags to target specific layers, and gather facts up front so we can pass facts to the tasks embedded within each layer. Let's take a look at some of our baseline config:
- name: Retrieve NTP server list
  win_shell: |
    $ntpServers = Get-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\W32Time\Parameters" -Name "NtpServer"
    $ntpServers.NtpServer -split ',' | ForEach-Object { $_.Trim() }
  register: __ntp_servers
  changed_when: false

- name: Set NTP on Host
  win_shell: |
    Set-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Services\W32Time\Parameters -Name NtpServer -Value "{{ your_ntp_server }}"
    Stop-Service w32time
    Start-Service w32time
  when: __ntp_servers.stdout_lines[0] != your_ntp_server
What we are doing here is checking the remote system to get a list of NTP servers configured, registering the result in __ntp_servers, and then setting the NTP server in the registry only when the returned value from __ntp_servers does not match the variable your_ntp_server. Time may not seem all that important; however, consider this: when the clocks of the Kerberos server and your servers are too far out of synchronization, you cannot authenticate properly! Uh-oh!! It turns out time is actually very important. :) Not only that, but there are exploits out there that can compromise systems via man-in-the-middle attacks against the NTP protocol (check out more here). With this little set of tasks in place you can ensure that when your Ansible drift control pipeline runs daily, NTP will always be configured correctly.
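If you want extra assurance that the time service actually picked up the change, you can add a verification step after the restart. The tasks below are an illustrative sketch, not part of our production role; they lean on the w32tm utility that ships with Windows:

```yaml
- name: Force a time resync and report the active source
  win_shell: |
    w32tm /resync /nowait
    w32tm /query /source
  register: __time_source
  changed_when: false

- name: Flag the host when it is not syncing from our NTP server
  ansible.builtin.fail:
    msg: "Time source is '{{ __time_source.stdout | trim }}', expected {{ your_ntp_server }}"
  when: your_ntp_server not in __time_source.stdout
```

Failing loudly here makes NTP drift show up in your pipeline logs and dashboards instead of silently passing.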
Next up, we set the power plan to high performance and set the timezone. Ansible has fast and easy modules to make quick work of drift here: the win_power_plan module just needs a plan name, and the win_timezone module simply needs the name of the timezone you wish to set.
- name: Set Power plan if not properly configured
  win_power_plan:
    name: high performance

- name: Update Timezone configuration
  win_timezone:
    timezone: Eastern Standard Time
Below we apply a few Remote Desktop Services configurations via the win_regedit module.
- name: Configure Remote Desktop Services Local Policy
  win_regedit:
    path: HKLM:\Software\Policies\Microsoft\Windows NT\Terminal Services
    name: LicensingMode
    data: 4
    type: dword
    state: present

- name: Configure Remote Desktop Services Settings
  win_regedit:
    path: HKLM:\Software\Policies\Microsoft\Windows NT\Terminal Services
    name: LicenseServers
    type: string
    data: localhost
    state: present
Most enterprises have a centralized way of managing Windows patching, so adding the settings below to your drift control Ansible arsenal is a quick win. Patches are typically rolled out in a controlled way, so setting updates to manual is important.
- name: Set Windows update to manual
  win_regedit:
    path: HKLM:\Software\policies\microsoft\windows\windowsupdate\AU
    name: AUOptions
    data: 0x1
    type: dword
    state: present

- name: Set Windows update to manual - update AU
  win_regedit:
    path: HKLM:\Software\policies\microsoft\windows\windowsupdate\AU
    name: NoAutoUpdate
    data: 0x1
    type: dword
    state: present
After that we configure the proxy settings, which are usually a must within the enterprise.
- name: Configure IE proxy settings to apply to all users
  win_regedit:
    path: HKLM:\SOFTWARE\Policies\Microsoft\Windows\CurrentVersion\Internet Settings
    name: ProxySettingsPerUser
    data: 0
    type: dword
    state: present

- name: Configure IE to use explicit proxy "{{ site_proxy }}" host with port and without auto detection
  win_inet_proxy:
    auto_detect: no
    proxy: "{{ site_proxy }}"

- name: Set proxy override to local
  win_regedit:
    path: HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings
    name: ProxyOverride
    type: string
    data: <local>
    state: present

- name: Set proxy enable to persistent
  win_regedit:
    path: HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings
    name: ProxyEnable
    type: dword
    data: 1
    state: present
After setting up a few things, we create our automation logging and drift directories.
# Create Directory Drift Control Root
- name: Create Drift Control Directory
  ansible.windows.win_file:
    path: C:\Windows\driftcontrol
    state: directory

# Create Directory Drift Control Logs
- name: Create Drift Control Directory Logs
  ansible.windows.win_file:
    path: C:\Windows\driftcontrol\logs
    state: directory

- name: Create Drift Control Binary Directory
  ansible.windows.win_file:
    path: "{{ drift_binaries_dir }}"
    state: directory
Using the ansible.windows.win_file module we create a set of directories to hold all of our logs and binary files that we want to track.
- name: Remove Windows telnet client feature
  win_feature:
    name: Telnet-Client
    state: absent
    include_management_tools: no
    include_sub_features: no

## Host Activation
- name: Configure {{ ansible_fqdn }} system for KMS server IP
  win_shell: cscript slmgr.vbs /skms "{{ kms_server }}"
  args:
    chdir: C:\Windows\System32\

- name: Activate {{ ansible_fqdn }} Windows License
  win_shell: cscript slmgr.vbs /ipk "{{ kms_2022_key }}"
  args:
    chdir: C:\Windows\System32\
  failed_when: false
In this block of tasks we make use of the Ansible win_feature module. This module is super helpful because it allows you to install or remove Windows Server features with ease. If you would like to learn more about this module, check out the readme here. The following two tasks leverage the win_shell module so that we can license the Windows hosts we connect to. This is critical in the enterprise, especially if you have an ELA with Microsoft.
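One hedged note: slmgr /skms and /ipk only point the host at the KMS server and install the client key. Depending on your KMS setup, you may also want to force activation explicitly and record the license state for your logs. The tasks below are a sketch of that idea, not part of our role:

```yaml
- name: Activate {{ ansible_fqdn }} against the KMS server
  win_shell: cscript slmgr.vbs /ato
  args:
    chdir: C:\Windows\System32\
  failed_when: false

- name: Record license state for the drift logs
  win_shell: cscript slmgr.vbs /dli
  args:
    chdir: C:\Windows\System32\
  changed_when: false
```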
Once you have completed your baseline configuration, move on to your next logical layer. Since we baked the vast majority of security into our image with the Ansible Lockdown code here, we will only cover a few additional tasks we added to our drift control code.
- name: Generate Complex Password for the Guest Account
  set_fact:
    random_guest: "{{ lookup('ansible.builtin.password', '/dev/null', length=16, chars=['ascii_letters', 'digits', 'punctuation']) }}"
  no_log: true

# Set Random Password
- name: Set Random Complex Creds On Guest Account
  win_user:
    name: Guest
    password: "{{ random_guest }}"
    state: present
    account_disabled: yes
    password_never_expires: no
    update_password: always
We added this task so that every time our drift control runs, we create a new password for the guest account. This way, when the machines are handed off to client application developers and infrastructure administrators and someone purposefully or accidentally enables the guest account, we reset the password and disable the account again. We have a few more tasks that we add to drift control to keep systems as secure as possible; however, we wanted to get to a great, simple pattern for installing enterprise agents and other binaries that you want to ensure remain on your Windows Server systems. Below is a pattern that we use and extend for all of our Windows installs:
---
- name: BgInfo Download Binary from Artifactory
  win_get_url:
    url: "{{ artifactory_url }}/BgInfo.zip"
    dest: "{{ drift_binaries_dir }}"
    validate_certs: false
    force: false
    use_proxy: false
  register: __artifactory

- name: Setup BgInfo if Checksums Differ OR if Packer
  when: (__artifactory.changed | bool) or (packer | bool)
  block:
    - name: BgInfo Unzip Package Into {{ drift_binaries_dir }}
      community.windows.win_unzip:
        src: "{{ drift_binaries_dir }}\\BgInfo.zip"
        dest: "{{ drift_binaries_dir }}"

    - name: Configure BgInfo folder
      win_file:
        path: C:\BgInfo
        state: directory

    - name: Copy BgInfo
      win_copy:
        src: 'C:\Windows\drift\services\BgInfo\'
        dest: C:\BgInfo\
        force: no
        remote_src: yes
      register: copy_result

    - name: Configure BgInfo registry
      win_regedit:
        path: HKLM:\software\microsoft\windows\currentversion\run
        name: BgInfo
        type: string
        data: c:\bginfo\bginfo.exe /NOLICPROMPT /TIMER:0
        state: present
First we use the Ansible win_get_url module to query JFrog Artifactory and register the return as __artifactory. When you query Artifactory using this method you get a lot of return values; one of them is the checksum of the binary. Next, we only execute the following block when one of two conditions is met, like this: __artifactory.changed | bool. A word of caution here: think through each enterprise service being installed on your hosts, because there will be applications where you also want to perform a service check to determine not only that the application is installed, but that its service is actually running. Depending on the circumstance, you will want to combine both strategies: checksum compare AND service running. Wow, that was a lot of Ansible!
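To make that caution concrete, here is a hedged sketch of the combined pattern. The service name ExampleAgentSvc and the task file install_example_agent.yml are hypothetical placeholders you would swap for your real agent:

```yaml
- name: Check the state of the agent service
  ansible.windows.win_service_info:
    name: ExampleAgentSvc   # hypothetical service name
  register: __agent_svc

- name: Reinstall when the binary drifted or the service is missing or stopped
  import_tasks: install_example_agent.yml   # hypothetical task file
  when: >-
    (__artifactory.changed | bool) or
    (not __agent_svc.exists) or
    (__agent_svc.services[0].state != 'started')
```

This way a drifted checksum, a deleted service, or a stopped service all converge back to the desired state on the next run.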
The Ansible code is great and all, but how do you control drift? Great question. Once you have a few of your Ansible layers in place, you will want to start testing your end-to-end drift control automation. First, you need to set up a pipeline on a schedule to execute your playbook on an interval that you define. Any system with pipelines that can be scheduled is perfectly fine; we use Azure DevOps as our orchestration layer to execute Ansible. In our previous drift control post we covered a few notes about pipelines and schedules here. Below is an example pipeline that we have running daily:
---
variables:
  - group: drift-variables
  - name: pool # Required - The agent pool
    value: the-name-of-your-agent-pool
  - name: ansible_project # Required - The project to look in to
    value: ansible
  - name: ansible_repo # Required - Ansible repository to clone
    value: ansible-windows-drift-common
  - name: repo_branch # Required - Ansible repository branch to checkout. Default should be develop
    value: ${{ replace( variables['Build.SourceBranch'], 'refs/heads/', '' ) }}
  - name: ansible_playbook # Required - Launch playbook in Ansible command
    value: ./roles/drift/tasks/drift_control.yml
  - name: inventory_input_file # Optional - Only used to generate an inventory from an already existing template in the repo. Leave blank if unsure.
    value: ""
  - name: ansible_inventory_path # Optional - Inventory flag passed to the Ansible command
    value: "-i inventory.yml"
  - name: ansible_inventory_repo # Optional - If your Ansible inventory exists in a separate repo
    value: ""
  - name: ansible_inventory_repo_branch # Optional - Branch to check out when using ansible_inventory_repo
    value: ""
  - name: ansible_inventory_repo_path # Optional - If your Ansible inventory exists in a separate repo
    value: ""
  - name: limit_var # Required - Ansible group or host var to target. Include the --limit parameter with var name when using this
    value: -l localhost,10.0.0.0
  - name: extra_var # Optional - Ansible extra vars passed to playbook. Format must be key=value with a space between additional vars
    value: "${{ parameters.action }}=true '$(ansible-drift-username)'"
  - name: tag # Optional - Ansible parameter for CLI command if using Ansible tags. Include the --tag parameter with tag name when using
    value: ""
  - name: arguments # Optional - Ansible arguments for CLI command. This can have as many additional arguments passed as desired
    value: "${{ parameters.verbosity }}"
  - name: provider # Optional - provider options are aws, azure, vmware, or "".
    value: "vmware"
  - name: ansible_python_interpreter # Optional - python3 interpreter if the Ansible module requires python3. If not needed put null "" as the value
    value: ""
  - name: developer_mode
    value: true
  - name: scheduled_pipeline
    value: true
  - name: apply_environment
    value: drift_ansible_apply_scheduled

parameters:
  - name: action
    displayName: Action to perform.
    default: report
    values:
      - report
      - execute
  - name: verbosity
    displayName: "Please set the verbosity level:"
    type: string
    default: "-v"
    values:
      - "-v"
      - "-vv"
      - "-vvv"
      - "-vvvv"
      - "-vvvvv"

resources:
  repositories:
    - repository: templates
      type: git
      name: the-project-name/pipeline-templates
      ref: develop

trigger:
  branches:
    exclude:
      - '*'

pool: $(pool)

schedules:
  - cron: '0 13 * * *'
    displayName: Daily at 8am EST
    branches:
      include:
        - develop
    always: true

stages:
  - template: ansible/ansible-preflight-check-v2.yaml@templates
  - template: ansible/ansible-plan.yaml@templates
  - template: ansible/ansible-apply.yaml@templates
There is a lot going on here, and we do not have time to get into all of it. So we will summarize the sections below and then call out a few key areas of interest.
ADO pipelines have a few key sections you will need to understand to get moving. Below is an outline of the anatomy of an ADO pipeline.
Variables - check out more here.
Parameters - handy for passing values from the pipeline UI to variables and/or other parts of your pipeline. Check out more here.
Resources - a simple way to pull additional repos and other artifacts into your pipeline. More on that here.
Trigger - runs the pipeline when a condition is met. Triggers are helpful when you are rapidly developing code so each commit runs the pipeline. We also make use of these for PRs/MRs.
Pool - the agent pool where your code will be executed.
Schedule - set up a schedule to run your pipe.
Stages - the order in which your code executes.
The first key item in the pipeline above is the variables section, which specifies all of the variables that get exported into the memory of the agent container so they can be picked up and reused in our stage templates. The second key takeaway is the schedules section, which specifies when our drift control code runs.
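To make the hand-off concrete, a consuming stage template might look roughly like the hypothetical sketch below. Our real ansible-plan.yaml and ansible-apply.yaml templates do more (preflight checks, secrets injection, artifact publishing), so treat this as an illustration only:

```yaml
# hypothetical minimal stage template consuming the exported variables
stages:
  - stage: ansible_apply
    jobs:
      - job: run_playbook
        steps:
          - checkout: self
          - script: >
              ansible-playbook
              $(ansible_inventory_path)
              $(limit_var)
              $(tag)
              $(arguments)
              -e "$(extra_var)"
              $(ansible_playbook)
            displayName: Run drift control playbook
```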
Having your drift control run on a schedule is great, but what about visibility? While we personally love reading through tens of thousands of lines of pipeline logs, most organizations will need visualizations to show drift and drift corrections. To do this we implemented our KELK stack.
K = Kafka - durable queue
E = Elasticsearch - persistent data store
L = Logstash - log parser
K = Kibana - visualization engine
All pipeline logs are shipped to our KELK cluster where we have dashboards to show various aspects of drift.
Last drift run
Unreachable machines
Percent compliance per layer
Repeat offenders
Number of settings changed
Enterprise agent versions
And many more. Not only does Elastic provide the ability to store and visualize automation actions, it also sets your organization up to handle event driven automation (EDA). EDA is essentially what Kubernetes does on a day-to-day basis to orchestrate container workloads, but EDA is usually geared toward virtual machines and the systems and services being monitored. One of the super helpful and simple features in Elastic that enables EDA is Watcher. A watch is kicked off by a trigger, runs a search against your index patterns, and executes actions when the criteria of the search are met. Below is an example of a trigger schedule in Elastic.
{
  "trigger" : {
    "schedule" : {
      "daily" : {
        "at" : [ "midnight", "noon", "17:00" ]
      }
    }
  }
}
This is similar to a cron pipeline schedule. In the example above we have set up this trigger to execute 3 times a day. The watch condition is also very flexible, providing always, never, compare, array_compare, and script conditions. Below is an example:
{
  "condition" : {
    "compare" : {
      "ctx.payload.hits.total" : {
        "gte" : 5
      }
    }
  }
}
With the above condition (evaluated against the watch's search over our drift indices, in this case the Crowdstrike service data), we initiate a trigger to run our Ansible drift control playbook with the tag Crowdstrike against the 5 nodes that have drifted. Because of this flexibility we can easily set the time of day to execute and use one or many metrics to drive a trigger, so we get the benefit of both drift control and EDA. Here is the end-to-end drift control workflow:
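The piece that connects the two systems is the watch's action. Below is a hypothetical sketch of a webhook action that queues the ADO drift pipeline via the Azure DevOps REST API; the organization, project, and pipeline id are placeholders, and the authentication header is omitted for brevity:

```json
"actions" : {
  "run_drift_pipeline" : {
    "webhook" : {
      "scheme" : "https",
      "host" : "dev.azure.com",
      "port" : 443,
      "method" : "post",
      "path" : "/your-org/your-project/_apis/pipelines/42/runs",
      "params" : { "api-version" : "7.0" },
      "headers" : { "Content-Type" : "application/json" },
      "body" : "{ \"templateParameters\" : { \"action\" : \"execute\" } }"
    }
  }
}
```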
Conclusion
And that is it for our Windows Server 2022 deep dive on drift control with Ansible! We hope you enjoyed this post and found it helpful on your AwesomeOps journey.
Next time on AwesomeOps, we will delve into the quantum realm by training a security focused machine learning (ML) model on the IBM quantum computing platform.