Since this is ESP traffic, there is no TCP handshake involved. For the packet capture, I would suggest using the packet-capture functionality on the firewall itself.
Which firewall model do you have? Depending on the model, you can take the pcap via datapath-debug (on SRX-HE, the high-end SRX) or forwarding-options (on SRX-Branch). The source and destination IPs in the filter should be the two VPN endpoints; this will capture the ESP traffic between them.
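Once you have the capture, a quick way to confirm that it really contains ESP between the two endpoints is to look at the IP protocol field (50 for ESP) and read the SPI and sequence number from the first 8 bytes of the ESP header. Below is a minimal Python sketch of that check, not a finished tool. It assumes each packet is handed over as raw bytes starting at the IPv4 header (link-layer header already stripped) and that NAT-T (ESP inside UDP 4500) is not in use. The function names and the addresses 192.0.2.1 and 198.51.100.1 are made up for illustration; substitute your own VPN endpoints.

```python
import ipaddress
import struct

ESP_PROTOCOL = 50  # IANA-assigned IP protocol number for ESP


def parse_esp(packet: bytes, endpoint_a: str, endpoint_b: str):
    """Return (spi, sequence) if `packet` is an IPv4 ESP packet exchanged
    between the two endpoints (either direction), otherwise None.

    Hypothetical helper: `packet` is assumed to start at the IPv4 header.
    """
    if len(packet) < 20:
        return None
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len + 8:
        return None
    if packet[9] != ESP_PROTOCOL:  # byte 9 of the IPv4 header is the protocol
        return None
    src = str(ipaddress.IPv4Address(packet[12:16]))
    dst = str(ipaddress.IPv4Address(packet[16:20]))
    if {src, dst} != {endpoint_a, endpoint_b}:
        return None
    # ESP header: 32-bit SPI followed by a 32-bit sequence number
    spi, seq = struct.unpack("!II", packet[header_len:header_len + 8])
    return spi, seq


def build_sample(src: str, dst: str, proto: int, payload: bytes) -> bytes:
    """Build a bare IPv4 packet for testing (checksum left at 0, not validated)."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 0, 0, 64, proto, 0,
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
    )
    return header + payload


if __name__ == "__main__":
    esp_payload = struct.pack("!II", 0x1234ABCD, 7) + b"\x00" * 16
    pkt = build_sample("192.0.2.1", "198.51.100.1", ESP_PROTOCOL, esp_payload)
    print(parse_esp(pkt, "192.0.2.1", "198.51.100.1"))
```

If the sequence numbers keep increasing in one direction but you see nothing coming back from the peer, that usually narrows the problem down to one side of the tunnel.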