    The only thing I would add is to be aware that the token you receive expires after 3600 seconds (AFAIK -- this hasn't changed), so you will need to check for 401 (Unauthorized) responses and grab a new token when you get one.
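    A minimal sketch of that refresh-and-retry pattern, in case it helps. Everything named here is hypothetical: fetch_token stands in for whatever call you make to obtain a token, send stands in for one API request, and the status codes that trigger a refresh are a parameter (401 by default) so you can match whatever you actually observe.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

TOKEN_LIFETIME_SECONDS = 3600  # lifetime mentioned in the comment above


class TokenSession:
    """Caches a token, refreshing it shortly before expiry or after an auth failure."""

    def __init__(
        self,
        fetch_token: Callable[[], str],          # hypothetical: requests a fresh token
        refresh_on: Iterable[int] = (401,),      # assumption: statuses treated as "token rejected"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._refresh_on = set(refresh_on)
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def _refresh(self) -> None:
        self._token = self._fetch_token()
        # Refresh one minute early so a request never starts with a nearly dead token.
        self._expires_at = self._clock() + TOKEN_LIFETIME_SECONDS - 60

    def call(self, send: Callable[[str], Tuple[int, object]]) -> Tuple[int, object]:
        """send(token) performs one request and returns (status_code, body)."""
        if self._token is None or self._clock() >= self._expires_at:
            self._refresh()
        status, body = send(self._token)
        if status in self._refresh_on:
            # The token was rejected: get a new one and retry exactly once.
            self._refresh()
            status, body = send(self._token)
        return status, body
```

    The retry is capped at one attempt on purpose, so a genuinely bad credential does not turn into an endless loop of token requests.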

    You and I can laugh at this joke -- but remember all the nutters out there who will actually take this suggestion seriously. That's what really boggles my mind.

    Issue should be resolved

    I learned a few things along the way and one is to get a more robust monitoring system in place so I will be working on that over the weekend.

    Technically, Pushshift was still ingesting, but there was a problem in the pipeline where Pushshift pulls from the ingest and processes it. That's why it was showing ingest data but the API lacked the data itself (it wasn't indexed).

    Yeah I got some really weird corrupted data from Reddit's API earlier and I'm trying to figure out why it's corrupted.

    Still working on it.

    There will eventually be a quota for the free tier, but the quota will be high enough to download the monthly updates as they come out. That said, people are still welcome to host the files on other sites.

    I have a lot of the code written so far to handle account tracking, rate-limits, etc. -- right now the switch is still a couple of months out. The 500-1000 was just an estimation at the time, but the free tier will probably be more than that by a factor of ten.

    Also everyone who is signed up for Patreon and supporting the project today (I believe you are one of those -- thank you!) will automatically get a bump up to whatever the first pay tier ends up being (probably at least one call a second).

    I want the first pay tier to be something really reasonable, so I'm thinking somewhere between 3 and 5 dollars a month (once the Beta API officially launches, it will include more than just Reddit data as well).

    This project started out as a labor of love for the data enthusiast / scientist community -- my main goal is to recoup my initial investments and make enough that I can continue to build Pushshift full-time while also being as reasonable as possible for students and others who want to use the service.

    I hope that clarifies things -- I'm going to try to do this without disrupting anyone or any ongoing projects. Ultimately the ideal situation would be to have a few for-profit companies that need heavy access to the API end up absorbing a majority of the costs so that I can shift more free stuff towards the lower tiers.

    Unfortunately no -- regex isn't implemented right now (it's too expensive computationally for me to support).

    That said, you have two main options:

    1) You can still query Pushshift with exact phrases using quotation marks and just run through a bunch of permutations yourself. The q parameter supports multiple words and phrases when separated with a "|". For instance, if you wanted to find comments with the exact phrase "donald trump" but also the two words president and idiot, you could do:

    q="donald trump"|president|idiot

    2) The second option (one that I'd highly recommend you learn if you haven't played with it yet) is to use Google's BigQuery. BigQuery does allow you to run all types of simple and advanced regex queries. /u/fhoffa has provided some examples in the past in his subreddit /r/bigquery.

    BigQuery gives you one terabyte of free queries each month -- my recommendation would be to start by practicing on one of the monthly tables before you run searches against the entire corpus. You can even do searches for emojis with BQ.

    Even though my current implementation of Lucene lacks regex, there are still a lot of powerful search options you can use. You can view various ways to run more advanced search queries here.
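    If you end up generating a lot of those permutations for the q parameter, a small helper can build the query strings for you. This is only a sketch: the endpoint URL below is an assumption (check the current API docs), and build_query / build_search_url are made-up helper names, not part of any Pushshift client.

```python
from urllib.parse import urlencode

# Assumed comment-search endpoint; verify against the current API documentation.
COMMENT_SEARCH_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_query(terms):
    """Join terms with "|"; multi-word terms are wrapped in double quotes as exact phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        parts.append(f'"{term}"' if " " in term else term)
    return "|".join(parts)


def build_search_url(terms, base_url=COMMENT_SEARCH_URL):
    """Return a URL with the q parameter percent-encoded for use in a GET request."""
    return f"{base_url}?{urlencode({'q': build_query(terms)})}"


print(build_query(["donald trump", "president", "idiot"]))
print(build_search_url(["donald trump", "president", "idiot"]))
```

    Letting urlencode handle the quotes, spaces, and "|" avoids hand-escaping mistakes when the phrases come from a generated list.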

    I have adjusted this to increase the size. Apparently a config change didn't stick when I made some upgrades.

    Please let me know if it works out better for you.

    Can you give me an example of what you are seeing? What call are you making to the API? Which endpoint are you using?

    created_utc is the number of seconds since unix epoch. The resolution is down to the second.

    When I modify the redditextractoR plugin to print the raw Unix timestamp, I get the same exact timestamp for every post of the day

    It sounds like something is going wrong here, because if you look at the raw JSON, you can see it increments by second.
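    For a quick sanity check outside the plugin, here is a small sketch that converts created_utc values to UTC datetimes and counts how many distinct values there are. The sample timestamps are made up for illustration; substitute the created_utc values from the JSON you actually get back.

```python
from datetime import datetime, timezone


def to_utc(created_utc):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)


# Made-up sample values, one second apart.
sample = [1500000000, 1500000001, 1500000002]

for ts in sample:
    print(ts, to_utc(ts).isoformat())

# If every post of the day really shared one timestamp, this would print 1.
print(len(set(sample)))
```

    If the raw values differ here but come out identical in your tool, the timestamps are most likely being altered somewhere after the API response is parsed.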

    Thankfully I got some really good info from some generous people in this subreddit and a couple reached out for assistance. Sometimes people will have knowledge that isn't easily available via Google.

    I reached out to this subreddit because I know a lot of wonderful people on here who are very good at what they do -- but it's a shame that one has to also deal with responses like yours which are just a waste of everyone's time. This is a place to share knowledge and this will eventually get indexed on Google so that the next person that has a similar question will see a lot of solid responses -- that's why I came to this community for assistance.

    Yes my Google works -- I asked here because there are experts like you that could help with this research.