Screen OS

  • 1.  Site to Site VPN blocking Active Directory Replication (sporadically) - Possibly RPC/DNS blocking

    Posted 04-20-2015 14:02

    We have been fighting this issue for several months now and have narrowed it down somewhat.

     

    We have an Active Directory setup with 3 sites (HQ, BRANCH, COLO). The HQ and BRANCH sites use SonicWall firewalls running 2.9.1 firmware. The COLO site uses a Juniper 350M running 6.3.0r18.0 firmware. All the sites are cross-connected using site to site VPN connections such that all LAN addresses at each site are accessible by all other sites.

     

    The initial symptom we discovered is that at various periods of time the Active Directory servers in HQ and BRANCH would cease to be able to replicate to COLO - but they would continue to replicate with each other. Typically around 6:30-7:30am each day whatever was causing the replication to fail would release and all the servers would resync up for a few hours and replication would work perfectly well. Then between 10:30am and 11:30am the replication would start to error out again and would often remain that way until the next morning.


    In the process we tried promoting new hardware, changing the MTU on the AD servers, upgrading all the firewall firmware to the latest releases. None of these things has fixed the issue - although upgrading the firmware does seem to have caused the predictability of the outage to be less obvious (vs reliably working 3-5 hours each morning - it sometimes works later in the day for a short while).

     

    This appears to be something related to the firewall blocking traffic - possibly just the DNS portion of the AD replication that is causing larger issues. And here is what we see.

     

    * When replication is working - DNS traffic is working perfectly well. From any AD admin console I can connect to any AD server's DNS at any site location. Additionally all the command line tools for repadmin can freely connect between domain servers.

     

    * When replication stops working - the command line repadmin tools are able to talk freely between Domain Controllers at the same site - and also freely between HQ and BRANCH. But HQ <-> COLO and BRANCH <-> COLO are locked out. Additionally, the DNS management tool cannot seem to bridge that divide.

     

    * BUT all other traffic seems to be functional over the site to site VPN. TCP stream socket things like Remote Desktop, FTP, Telnet, etc. work without issue. Windows file shares and DFS replication continue to work fine. PING traffic also flows freely. So the site to site VPN is functional.

     

    * If I take my machine in the HQ location and VPN directly into a machine in the COLO location using PPTP it is able to connect to the remote DNS servers without issues.

     

    So there doesn't appear to be any specific issue with the AD servers - the issue appears to be that somehow the site to site VPN is blocking some level of traffic, causing the AD servers to fail to replicate. At a minimum DNS is important to AD replication and DNS traffic is clearly being blocked. But as was stated - all other traffic appears to be flowing perfectly fine during the "outage". And then it will magically correct itself (typically early in the morning) without any user intervention.
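The DNS observation above could be checked directly during an outage by sending a raw query to a COLO DNS server over both UDP and TCP and seeing which, if either, gets an answer. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and any server address or host name passed in would be your own.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a minimal DNS query (default: A record, class IN, recursion desired)."""
    # Header: ID, flags (RD bit set), 1 question, no answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def probe_dns(server: str, name: str, use_tcp: bool = False, timeout: float = 3.0) -> bool:
    """Return True if the server sends back a response carrying our query ID."""
    query = build_dns_query(name)
    try:
        if use_tcp:
            with socket.create_connection((server, 53), timeout=timeout) as sock:
                # DNS over TCP prefixes each message with a 2-byte length.
                sock.sendall(struct.pack("!H", len(query)) + query)
                reply = sock.recv(4096)[2:]
        else:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
                sock.settimeout(timeout)
                sock.sendto(query, (server, 53))
                reply, _ = sock.recvfrom(4096)
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Calling `probe_dns(server, name)` and `probe_dns(server, name, use_tcp=True)` from an HQ machine while replication is failing, and again while it is working, would show whether DNS itself is being dropped or only the tools that reach DNS management over RPC.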

     

    Obviously I am using 2 different firewalls here so you could point the issue at either side. But the compelling factor that makes me believe it is on the Juniper side is that once replication and/or DNS traffic is blocked from COLO to HQ - it also fails from COLO to BRANCH. If this were as simple as a state problem within the HQ SonicWall firewall it shouldn't prohibit the COLO from talking to the BRANCH using their own direct site to site VPN. And the HQ to BRANCH replication continues to work without issue.

     

    What is also odd about this is rebooting the HQ firewall seems to release the lockup whatever it is for a while. This would point to a potential issue with the SonicWall but once again - I only have to reboot the one to get traffic to flow to the BRANCH site from the COLO and when the outage happens it happens in unison between the sites.

     

    This leads me to think there is something in the Juniper that is actually blocking the traffic - although the site to site policy is configured to permit any.

     

    Has anyone encountered anything even remotely similar to this or have any guidance? As I have said this has been an ongoing headache for months and I believe that we have narrowed it down to the firewall, but I don't see anything obvious that would be causing it to work some of the time and then fail the rest of the time.



  • 2.  RE: Site to Site VPN blocking Active Directory Replication (sporadically) - Possibly RPC/DNS blocking

     
    Posted 04-20-2015 19:43

    Hi,

     

    That is a very nice problem description and looks like you have looked into this from all possible angles 

     

    Does the AD replication use MS-RPC? If yes, I have come across a similar problem, where the RPC data session on the firewall would just freeze. I would have suggested an upgrade, but r18 is the latest and good.

     

    Is it possible for you to collect a PCAP on either of the servers when the connection is stuck and check if there is any kind of RPC traffic flowing between them? I would assume that one of the servers keeps pushing data, which does not make it to the other side.

     

    Also, if you know the data ports used for AD sync, I would suggest creating a custom policy, to permit this traffic, along with RPC control port between the servers and set application to 'IGNORE'. This will bypass the RPC ALG, effectively skipping the TCP-Proxy mechanism of the SSG as well.
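For the capture suggested above, one quick check is whether packets between the two domain controllers flow in both directions or only one. The sketch below is a rough illustration, not a replacement for Wireshark: it assumes the capture was saved in the classic pcap format (not pcapng) on an Ethernet interface, handles only IPv4/TCP, and uses function names invented for this example. Port 135 is the well-known MS-RPC endpoint mapper port; the RPC data sessions themselves use dynamically assigned ports, so the port filter is optional.

```python
import struct
from collections import Counter


def tcp_flows(pcap_bytes: bytes) -> Counter:
    """Count TCP packets per (src_ip, src_port, dst_ip, dst_port) in a classic pcap."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written by a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written by a big-endian host
    else:
        raise ValueError("not a classic (microsecond) pcap file")
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")

    flows = Counter()
    offset = 24  # size of the pcap global header
    while offset + 16 <= len(pcap_bytes):
        # Per-packet record header: ts_sec, ts_usec, captured length, original length.
        _, _, incl_len, _ = struct.unpack(endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or not IPv4 over Ethernet
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
        if ip[9] != 6 or len(ip) < ihl + 4:
            continue  # not TCP, or truncated before the port fields
        src_ip = ".".join(str(b) for b in ip[12:16])
        dst_ip = ".".join(str(b) for b in ip[16:20])
        sport, dport = struct.unpack("!HH", ip[ihl:ihl + 4])
        flows[(src_ip, sport, dst_ip, dport)] += 1
    return flows


def direction_counts(flows: Counter, host_a: str, host_b: str, port=None):
    """Return (packets a->b, packets b->a), optionally limited to one TCP port."""
    def total(src, dst):
        return sum(
            n for (s, sp, d, dp), n in flows.items()
            if s == src and d == dst and (port is None or port in (sp, dp))
        )
    return total(host_a, host_b), total(host_b, host_a)
```

A result such as many packets one way and none the other, taken from captures on both servers during an outage, would fit the "one server keeps pushing data that never arrives" picture.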



  • 3.  RE: Site to Site VPN blocking Active Directory Replication (sporadically) - Possibly RPC/DNS blocking

    Posted 04-21-2015 21:41

    AD definitely does use MS-RPC and the error presented to the Event Logs is an RPC time out error. And the description of an RPC freeze seems exactly what we are talking about here.

     

    I will work to get a PCAP - but related to your response I have some questions.

     

    Right now there is a policy that is set up...

     

    - Source: HQ LAN

    - Destination: Trust LAN

    - Service: Any

    - Application: None

    - Tunnel: HQ to COLO VPN

     

    ... which should permit any traffic through.

     

    Your response got me looking, though: you indicated setting Application to IGNORE to bypass the RPC ALG. I am wondering if there is a difference between None and IGNORE in the Application drop-down list box.

     

    I am not an expert on the Juniper by any stretch but what I am reading is that the Application is inferred from the Service (unless you define a custom service).

     

    So in this context even though I am allowing ANY service - the "None" might still be allowing the ALG to interrogate and process things. And perhaps there is a bug in the MS-RPC ALG that causes the freeze.

     

    And if I were to set it to IGNORE it would effectively tell the firewall to stop interrogating any of the packets and just allow them all through unencumbered? I don't need to be too specific about the ports because I want ALL ports to pass freely between these two sites - I don't need any filtering to be in place.

     

    Does that make sense and is that essentially what you were recommending?

     

    Addendum: It doesn't seem to be too happy when I put in IGNORE with Service: Any - the UI fails with an error. I suppose I could define a service that includes every possible TCP and UDP port range?

     



  • 4.  RE: Site to Site VPN blocking Active Directory Replication (sporadically) - Possibly RPC/DNS blocking
    Best Answer

    Posted 04-21-2015 22:17

    We may have a winner.

     

    I created a custom service as...

     

    All Ports
    TCP src port: 0-65535, dst port: 0-65535
    UDP src port: 0-65535, dst port: 0-65535
    ICMP type = 8, code = 0

     

    ...and changed the policies to do Service: All Ports and Application: IGNORE on both the COLO<->HQ link and the COLO<->BRANCH link. As soon as I did this I was able to invoke the Active Directory repadmin commands (which previously would hang across sites) and force replication to begin again.

     

    So it does appear that the ALG was freezing the MS-RPC. Does anyone see any issues that may arise from reconfiguring it this way? I added ICMP type 8 to allow ping traffic to go through. I don't know if there is any other ICMP stuff that I would care about.
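To make the ICMP question concrete, here is a small model of what the "All Ports" service definition matches when read literally. It is only an illustration of the three entries listed in this post, written with made-up names; it says nothing about how the firewall treats replies or ICMP errors belonging to sessions it already tracks, which would need to be confirmed on the device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """One line of a custom service definition (hypothetical model)."""
    protocol: str
    dst_ports: range = range(0)  # empty range for non-port protocols
    icmp_type: int = -1
    icmp_code: int = -1


# The three entries of the "All Ports" service as listed in the post.
ALL_PORTS = (
    ServiceEntry("tcp", dst_ports=range(0, 65536)),
    ServiceEntry("udp", dst_ports=range(0, 65536)),
    ServiceEntry("icmp", icmp_type=8, icmp_code=0),
)


def matches(service, protocol, dst_port=None, icmp_type=None, icmp_code=0) -> bool:
    """Return True if a packet with these fields matches any entry of the service."""
    for entry in service:
        if entry.protocol != protocol:
            continue
        if protocol == "icmp":
            if (entry.icmp_type, entry.icmp_code) == (icmp_type, icmp_code):
                return True
        elif dst_port is not None and dst_port in entry.dst_ports:
            return True
    return False
```

Read this way, TCP 135 and UDP 53 match, as does an echo request (type 8), but an ICMP "fragmentation needed" message (type 3, code 4, used by path MTU discovery) and any protocol other than TCP, UDP, or ICMP do not match the definition itself.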

     

    Or would we be better off simply disabling the MS-RPC ALG (and possibly the DNS ALG), as we have no MS-RPC traffic actually originating from anywhere except the site to site VPNs?

     

    I will have to run it for a couple of days to make sure it is stable - but based on the descriptions available - Gokul, you have possibly finally given me the missing information that we have been looking for for quite some time.

     

    Update: 2 days running with no replication errors - prior to this change 5 hours was the longest period of continuous replication that would work. So I am declaring victory on this workaround.
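A multi-day stability check like the one above could be partly automated by saving the output of `repadmin /showrepl * /csv` on a schedule and flagging any replication link that reports failures. The sketch below assumes the CSV contains the columns "Source DSA", "Destination DSA" and "Number of Failures"; verify those names against your own output, since they are written here from memory rather than taken from this thread. The function name is invented for the example.

```python
import csv
import io


def failing_links(showrepl_csv: str):
    """Return (source, destination, failure_count) for links with failures.

    Expects the text produced by `repadmin /showrepl * /csv`. The column
    names used below are an assumption and should be checked against
    real output before relying on this.
    """
    reader = csv.DictReader(io.StringIO(showrepl_csv))
    failures = []
    for row in reader:
        count = (row.get("Number of Failures") or "0").strip()
        if count.isdigit() and int(count) > 0:
            failures.append(
                (row.get("Source DSA"), row.get("Destination DSA"), int(count))
            )
    return failures
```

An empty list over a couple of days of samples would back up the "no replication errors" result, while any HQ <-> COLO or BRANCH <-> COLO pair showing up would indicate the freeze has returned.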