# eclecticiq-polylogyx-extension
m
Hi, OpenPlgx! 1. We already filter everything we can. There is one more problem: new agents are not registering and old ones are not receiving the new config from the server. 2. We use the shallow config.
m
Can you please share the server specs and the server resource utilization (average CPU and RAM usage)? Also, how many endpoints do you have enrolled?
Can you also share the last seen/activity details for the hosts that are not refreshing the config? To get the details, click on the host from the hosts page and hover on
Last Seen
m
The agent runs on an Intel Xeon with 2 virtual cores and 4 GB of RAM. CPU is about 11% on average, RAM about 60%. 9 agents were deployed, but only 3 were left after the control server reinstall. Unfortunately I cannot provide
Last seen
because 6 of the 9 agents have not appeared.
m
I mean the ESP application server specs and resource usage. Last seen details for any one of the systems would be fine.
m
Oh, okay. The server also has 2 virtual cores of an Intel Xeon Platinum 8000 and 16 GB of RAM. One of the agents is in the screenshot.
k
Can you please confirm a few things so we can identify why the newly enrolled agents are not showing in the ESP UI? 1. In the agent machine's osquery flags file (C:\Program Files\plgx_osquery\osquery.flags), is the server IP given by the flag
--tls_hostname
the same as your control server (ESP)? 2. Does the server's certificate (plgx-esp/nginx/certificate.crt) match the certificate you are using to enroll the agent?
👍 1
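A minimal Python sketch of the first check above, for readers who want to script it. It only assumes that the flags file holds one `--name=value` flag per line; the helper names and the sample host and port are hypothetical, not part of the PolyLogyx tooling.

```python
from typing import Optional


def read_flag(flags_text: str, name: str) -> Optional[str]:
    """Return the value of the first --name=value line, or None if absent."""
    prefix = f"--{name}="
    for raw in flags_text.splitlines():
        line = raw.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def points_to_localhost(flags_text: str) -> bool:
    """True when --tls_hostname still targets the local machine."""
    host = read_flag(flags_text, "tls_hostname")
    if host is None:
        return False
    # Drop an optional ":port" suffix (IPv6 literals are not handled here).
    hostname = host.rsplit(":", 1)[0] if ":" in host else host
    return hostname.lower() in {"localhost", "127.0.0.1"}


# Hypothetical flags content; the port number is made up for illustration.
sample = (
    "--tls_hostname=localhost:9000\n"
    "--database_path=C:\\Program Files\\plgx_osquery\\osquery.db\n"
)
print(read_flag(sample, "tls_hostname"))
print(points_to_localhost(sample))
```

On an agent machine the text would come from reading C:\Program Files\plgx_osquery\osquery.flags instead of the inline sample.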
m
Hi, @Kishore Arava! 1. It's very embarrassing, but no, it was set to localhost. Now it works, thanks. One more question is left: memory leaks or something like that are causing RDP failures.
k
Yes, might be. Based on the last screenshot you shared, I hope the agents are reading the updated config from the server now.
o
Is the memory usage coming down over time? I mean, is it a spike? I don't think there is a leak here; it could be due to high event load on the system.
m
Memory usage goes like this over time, and sometimes the server becomes unresponsive. Unfortunately I couldn't look at the server in the failed state because RDP had failed. We now exclude everything we can while still staying secure.
Current state of one of the agents, if you are interested.
I'm using Process Explorer to monitor the situation.
o
I agree that 120 MB is a little high (and that also seemed like a spike rather than being consistent), but it's not terrible for a server workload. How much physical RAM do you have on the server machines running the agent? I am surprised that 120 MB of RAM is causing RDP failures.
m
Unfortunately I can't say anything about physical RAM because all our servers are in a virtual environment. 120 MB is not the maximum. Two days ago I saw 486 MB.
One of the test servers right now
o
I understand. I was curious to know whether these were transient conditions or steady state. For example, as I read, a few of your servers (endpoints) were not connected to the ESP server due to an error in the flags file. But this wouldn't stop the agent from collecting and caching the data in the meantime, and the moment connectivity resumed, it would try to pump all the data back to the ESP server, causing a temporary spike. Is that (or some similar transient condition) what we are seeing here? That is what I am trying to understand. The other reason could be the volume of activity, which can be controlled by tuning event filters specific to your environment, but for us to guide you on that, you would need to share some data.
m
What kind of data do you need?
o
Recent Activity data on the system exhibiting the persistent high RAM usage + the plgx_event_filters config
m
There is almost no activity on the most RAM-consuming machine: no user activity except for the constantly running Process Explorer, which is excluded in the agent configuration. Can the number of exclusions affect the load?
I can't provide the Plgx event filter config for security reasons, sorry. We exclude our antivirus, SIEM, Plgx itself, and some Microsoft utilities and frameworks. The excluded ports are the standard ones.
I constantly monitor events and add exclusions for non-critical mass events.
Right now
plgx-osqueryd.exe
has failed with the following error and can't start. RAM is at 98%; apart from this the server is fine.
Maybe there is a dump or something like that? I mostly work with Linux, and debugging Windows services is new for me 😅
o
This error is usually benign (it comes from osquery) but I think what might be happening on this system is for some reason, the benign error is acting up...are you seeing many such errors, as if it were in a loop?
m
Yes, there is a bunch of such errors. Procdump didn't work, by the way 😞
Hi, guys! Is there any news? We still need a solution.
o
Without having access to data, it is indeed difficult to suggest options. Here is one thing you could try to get rid of this restart loop (which I believe is happening because osquery is acting up on this system for some reason).
From an admin command prompt, run the following commands to stop the services:
1. sc stop plgx_cpt (wait for 15-30 seconds & run sc query plgx_cpt to make sure it has stopped)
2. sc stop plgx_osqueryd (wait for 15-30 seconds and ensure it has stopped as well)
3. sc stop vast, then sc delete vast
4. sc stop vastnw, then sc delete vastnw
5. sc start plgx_osqueryd
do not start plgx_cpt (that's an outside monitoring service causing the trigger)
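The "wait and make sure it has stopped" part of steps 1 and 2 can be scripted. Below is a rough Python sketch, assuming the usual `sc query` output that contains a line such as `STATE : 1  STOPPED`; the function names are hypothetical, and the polling limits simply mirror the 15-30 seconds suggested above.

```python
import re
import subprocess
import time
from typing import Callable, Optional

# Matches e.g. "        STATE              : 1  STOPPED" in `sc query` output.
STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_sc_state(output: str) -> Optional[str]:
    """Extract the state name (STOPPED, RUNNING, ...) or None if not found."""
    match = STATE_RE.search(output)
    return match.group(1) if match else None


def sc_query(service: str) -> str:
    """Run `sc query <service>` (Windows only) and return its text output."""
    result = subprocess.run(["sc", "query", service],
                            capture_output=True, text=True)
    return result.stdout


def wait_until_stopped(service: str,
                       query: Callable[[str], str] = sc_query,
                       timeout: float = 30.0,
                       interval: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the service state until it is STOPPED or the timeout elapses."""
    waited = 0.0
    while True:
        if parse_sc_state(query(service)) == "STOPPED":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

After `sc stop plgx_cpt`, calling `wait_until_stopped("plgx_cpt")` from an elevated session would return True once the service reports STOPPED. The `query` and `sleep` parameters exist only so the logic can be exercised without a Windows host.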
m
Hi, @OpenPlgx! Thanks, i will check this now
Okay, I've done this. So I shouldn't start plgx_cpt at all? Now we can only wait for failures. Also, I got this; is it expected?
h
this means the agent hasn't received the server port from the Options UI
custom_plgx_ServerPort
. Can you run plgx_osqueryd from the command line as below and share any errors/warnings you see on the CLI from the polylogyx osquery extension? 1. run 'sc stop plgx_osqueryd' 2. in the osquery.flags file, change the osquery db file name to something else, say
--database_path=C:\Program Files\plgx_osquery\osquery_temp.db
3. from CLI, run 'plgx_osqueryd.exe --flagfile osquery.flags --verbose' 4. Note any warnings/errors.
o
@Michael, these errors shouldn't cause a functional issue. We can bury them later, but let's see if you still get the high RAM/CPU usage that you mentioned (or the osquery agent start/stop situation).
m
Hi, guys! Thanks for the replies, I really appreciate it. So, on the server where I manually stopped everything, removed vast and vastnw, and then started plgx_osqueryd, the service failed after about 2 days of work. There was no activity on the server at all. In its current state it still doesn't work. Memory is 97% consumed. It looks like a memory leak, but I can't catch anything. All I've got is that without Polylogyx ESP or Polymon everything works well.
Ah, yes, we also tried Polymon and it causes the same problems.
@himanshu Here it is.
o
@Michael, the images you have added seem to suggest the memory usage is in Windows Defender and/or Firefox.
The agent is taking a bare minimum of 7 MB.
That said, I can understand that the issue is appearing only when you deploy the agent..& I am trying to think how else we can support you given that there isn't a lot of data you can share
m
Hi, OpenPlgx! Yes, the agent is taking 7 MB because it's dead. This is only one case; I haven't seen such an error on DC servers, for example, which are Core servers without Firefox 🙂 I can clean the sensitive data from the config and share it, if you need it.
o
what do you mean it's dead? It's showing in the taskmgr, right?
Are you saying such errors are not seen on servers that don't have firefox?
Also a bit confused; if the agent is dead, and then also the RAM usage is going to ~97%??
m
I think it's dead because it doesn't collect any information: the last log events were about two days ago.
o
ok, so no queries are active..
m
It's confusing me too. From my point of view it looks like memory leak.
o
leak where? if it were a leak in the agent, taskmgr would show it in the agent..so it's not a leak
m
No, errors about VRAM do not appear anywhere else; this is the only case
So, do you need a config?
o
yes, but can I request something simpler before?
lets start fresh from very basics of osquery (should take about 15-20 mins) if that's ok with you
m
Sure, np. What should i do?
o
uninstall the agent completely from the server showing the issue
you can do that from an admin command line by running "plgx_cpt -u d";
m
Okay, wait a minute, please
o
(plgx_cpt being the tool you must have downloaded to install the agent, from the server)
m
Done. Do you need the log?
o
no, looks good
just give me the output of following commands: 1. sc query plgx_osqueryd 2. sc query vast 3. sc query vastnw
m
Done
o
oh, you are running from powershell, then perhaps: 1. Get-Service plgx_osqueryd 2. Get-Service vast 3. Get-Service vastnw
expected output "Get-Service : Cannot find any service with service name <service name>"
after that please install osquery from its website: https://osquery.io/downloads/official/4.7.0
m
done
o
can you adjust the osquery.conf to queries of your interest? (you won't be able to run any query for the win_*_events tables yet.)
I want to see if the base osquery and its queries work fine on your system first, before we go with PolyLogyx additional stuff
You might have to stop the osqueryd first, then edit the .conf file and then start it again
m
Okay. It will take some time because we have several packs that contain both standard osquery and plgx stuff.
I will report as soon as i can, thanks 🙂
o
cool
once we see this working, then we will manually add PolyLogyx Extension as well, then let it run and then add the packs..
in the mean time, can you share your packs/configs?
m
Let me review them for sensitive data and i will share
o
sure
m
I've added and run all the packs we have for osquery
Hi, OpenPlgx! Looks like osqueryd works well
o
Great. So along with the .conf file, can you also share what queries/query packs you were running?
m
Sure. Here it is
o
it doesn't have any queries/packs for polylogyx tables?
m
no, it doesn't
o
but you get into this situation when you enable queries on PolyLogyx tables, right?
m
Looks like it. The most resource-consuming process was
plgx_osqueryd.exe
So what is our next step? Extension?
o
that is just vanilla osquery renamed to plgx_osqueryd
i wanted to see the queries you have for PolyLogyx tables
and yes, the next step would be to first load the extension (without any queries)
m
Do you mean these?
How should I apply the extension?
o
1. Download the latest extension binary by cloning: https://github.com/polylogyx/osq-ext-bin
2. Stop the osquery service on your system
3. Move the files plgx_win_extension.ext.exe, extensions.load, osquery.flags into c:\program files\osquery
4. Adjust your osquery.conf to apply all the filters (based on the above)
5. From an admin command prompt, run notepad and change the following line in the osquery.flags file: --database_path=C:\Program Files\osquery\osquery.db to --database_path=C:\Program Files\osquery\osquery1.db
6. Restart the osquery service
7. From the admin prompt, check that the vast/vastnw services are running. If not, stop & start the osquery service again
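The flags-file edit in step 5 can also be done programmatically instead of in notepad. This is only a sketch in Python, assuming one `--name=value` flag per line; `set_database_path` is a hypothetical helper, and the `--tls_hostname` value in the sample is made up.

```python
def set_database_path(flags_text: str, new_path: str) -> str:
    """Return flags text with --database_path pointing at new_path.

    Existing --database_path lines are rewritten in place; if none is
    present, the flag is appended. All other lines are left untouched.
    """
    prefix = "--database_path="
    lines = flags_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(prefix):
            lines[i] = prefix + new_path
            replaced = True
    if not replaced:
        lines.append(prefix + new_path)
    return "\n".join(lines) + "\n"


# Hypothetical flags content; only the database_path values come from step 5.
before = (
    "--tls_hostname=10.0.0.5:9000\n"
    "--database_path=C:\\Program Files\\osquery\\osquery.db\n"
)
after = set_database_path(before, r"C:\Program Files\osquery\osquery1.db")
print(after)
```

Writing the result back to osquery.flags needs the same admin rights as the notepad route, and the osquery service should be stopped while the file is changed, as in step 2.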
m
Thanks, i will
Looks like it works.
o
great, lets see if you hit the memory usage issue
(also, can you make an exception for plgx_win_extension.ext.exe and osqueryd.exe in Windows Defender)?
m
We do not use Defender, and exclusions in our anti-virus are implemented
o
Your earlier images/screen shots showed MsMpEng.exe running and consuming the highest RAM in the system ...MsMpEng.exe is the Windows Defender Engine
From your earlier post ☝️
m
Oh, yes, my bad. Defender is running on one of the test servers. Now we are testing on servers without Defender.
Hi, OpenPlgx! Looks like manually deployed extension works fine. I'm extremely surprised
What is our next step?
o
that is indeed surprising ..there is really no diff in what we did manually vs what would have happened thru the CPT tool
m
I understand, but nothing that is happening makes any sense. In tests before the mass deployment everything worked perfectly. Then came the deployment, and the problems with RAM and RDP appeared. Two freaking weeks of collecting information and trying to understand what's going on, and now this: everything just works again. I am thinking about vast and vastnw. They are somewhat related to the Sysmon driver, and Sysmon once caused the same problems, though long ago. Could that be it?
Right now one of the Plgx test servers has failed. At that moment there were errors related to SSL, and only those. This is not the first time I have seen them, but I did not pay attention before. Could they be somehow related to the failure? The Plgx processes are alive, just not collecting anything.
o
vast/vastnw are indeed sysmon-ish drivers....
PolyLogyx processes are collecting..go to the 'Details' tab
I don't think the SSL certs are related ..although what is the issue in these 3 images?? the resource usage seems fairly moderate to me
m
No, it has stopped collecting; I know the details are in the other tab. When I sent the message, it had not been working for about 1.5 hours. The process is alive, but there are no new events.
o
The events are getting collected in the event log all the time (as long as the process is alive in memory). Certain events might get dropped (depending on the filter conditions) but there is no documented way to stop collecting the events...