The goal of this document is to provide the Company team with actionable suggestions regarding their current engineering team, practices, and infrastructure in order to maximize the success of the business.
To build this report, I interviewed Alice (CEO), Bob (lead engineer), Carla (lead engineer), and Dennis (product manager), who are the people most responsible for engineering productivity at Company.
Below I have listed recommendations with the following notation:
✅ Good / great / keep going
⚠️ Suggestion / watch out / something to keep in mind
🛑 Stop / warning / a strong recommendation
🔵 Additional information / future considerations
The Company engineering team consists of about eight developers along with Alice and Dennis. Leading the engineering team are Bob and Carla, two full-stack developers. The remaining engineers, some of them remote, round out the team’s development capacity. Alice sets much of the top-level product direction and feature requests. Dennis bridges this direction to the engineering team by providing research and detailed user stories.
✅ The current engineering organization structure works well. Dividing product development between customer-centric roles and hands-on developers is a common and effective pattern. My interviews revealed that this team communicates well and efficiently, and the team seems to possess a reasonable level of mutual trust.
Engineers seem to have a clear understanding of user requirements, and the product team seems to understand engineering constraints at a basic level.
✅ The split between frontend and backend engineers works, and experienced full-stack engineers are in lead roles. As frontend and mobile development becomes more complicated, developer skills are becoming increasingly niche. The product has clear backend (server-side API) and frontend (user-facing web app) needs, and the current team is equipped to implement the product as well as the necessary supporting infrastructure.
Neither lead engineer appears to be shoehorned into areas outside their expertise. The technology choices used are smart and even allow frontend-leaning engineers to understand and make reasonable changes to the backend API.
⚠️ As the team grows, responsibilities may become amorphous, so ownership areas might need to be made clearer. Technical authorities over various systems will naturally emerge, and it’s a good idea to align these people with the areas they are most interested in or know best. For example, if you have an expert on the X app, an expert on the Y app, and an API/backend expert, then nominate each as responsible for the development and performance of that area of the codebase.
That’s not to say that people can’t work on different areas; engineers may tire of working on the same thing. But some engineers respond well to responsibility and autonomy, and clear delineation makes code review ownership more obvious.
⚠️ Foster and grow your engineering leaders. Since Bob and Carla are effectively in leadership positions, make sure they are given adequate resources and opportunities to continue to grow as leaders. I would suggest that they sign up for the Software Lead Weekly email newsletter and read Michael Lopp’s Managing Humans to best understand the engineering manager mindset. They might find a mentoring service such as Plato useful.
Engineering Process Review
The Company dev team makes heavy use of agile-style development and uses the best tools available for following this kind of process. Stories and epics are stored in Jira, which is linked to GitHub. Stories are broken down into tasks for engineers to complete. Sprints are currently one week long. Sometimes there is more work in the sprint than the engineers can complete, and other times the engineers finish work early.
As engineers complete features they create pull requests (PRs) on GitHub. Most or all engineers are then assigned to review the pull request before it is merged. After two approvals, the branch is merged. On Mondays, Bob releases the latest code to a staging environment by updating the code on long-lived servers, which results in a few minutes of downtime. Afterwards, Alice tests all new features manually, and if all goes well, the staging code is promoted to production.
Following agile practices generally results in effective software development teams. Agile, when done correctly, allows product teams to effectively plan features and measure engineering productivity. Not all engineers are a fit for the agile style, but I saw no evidence of any misfits on the team during my interviews.
⚠️ Encourage engineers to write design documents before implementing large features. These documents don’t have to be large or terribly in-depth, but they should force the team to have at least a small discussion about the architecture of a new feature or refactoring before work starts. Time spent in up-front planning of implementations will save time in code reviews later. Also, design docs will provide crucial documentation for new engineering hires.
At the very least, design docs are about telling team members, “Hey, I want to do X, and I want to do it like this. Do you have any comments before I start?” Since the team already uses Notion, the best home for these documents is probably an “Engineering” section in Notion where the team can comment.
A design doc should have the following sections:
Date (exact or quarter)
Simple one-sentence goals and any non-goals
Background of the current solution and why it isn’t sufficient
Implementation, which includes sufficient database, code, and infrastructure details
Additional sections on any testing, logging, or security implications if necessary
✅ Continue using the agile methodology. This is a good way to organize a development team and quantify its output. Continue to maximize the high level of detail in tasks in Jira so that engineers have as much context as possible. Continuing to use Loom (pre-recorded screencasts with voiceover) is a great way to provide context for tasks, especially with a distributed team.
✅ Continue to let engineering leads do task estimations, and make sure they’re involved early in the planning process. Engineers should be doing the point estimations on tasks since they’ll be the most accurate. However, estimations are still just estimations, not contracts etched into stone. Sometimes a task will take much longer or shorter than expected, so engineers should be encouraged to take time to investigate issues to improve their estimation accuracy. It should be acceptable for an engineer to say, “This might take me 30 minutes or 3 days, and it’ll take me 10 minutes to tell you which.”
✅ Continue to keep the sprint tasks locked, and prevent engineers from being interrupted during sprints. This doesn’t seem to be a problem currently, which is great to see. Interruptions can certainly occur for P0 issues such as critical bugs or downtime, but for the most part it seems that engineers are left to concentrate on completing the assigned tasks. Non-P0 bugs can go into the next sprint or into the backlog.
🛑 Strongly consider two-week sprints. One week is almost certainly too short for a sprint, and this is evidenced by engineers not finishing tasks. It’s also difficult to estimate the perfect amount of work that fits in a one-week sprint, so longer sprints might result in more accuracy. For example, you can’t easily fit two 3-day tasks into one week, but you can fit three 3-day tasks into a two-week sprint with room to spare. Also, a one-week sprint can stress the engineering team, especially for larger projects which might span multiple days.
Note: Longer sprints don’t mean slower feature feedback. With an automated continuous integration system, such as the one Bob is developing with AWS Fargate, changes could be automatically pushed to staging and become testable as soon as they’re merged.
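Since the team already uses GitHub Actions, the trigger for such a push could be a small workflow. The sketch below is hypothetical: the branch name and deploy script are placeholders, and the real deploy step depends on how the Fargate setup lands.

```yaml
# Hypothetical workflow: branch name and deploy script are placeholders.
name: Deploy to staging
on:
  push:
    branches: [staging]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Deploy to Fargate
        run: ./scripts/deploy-staging.sh
```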
🛑 Make sure tasks are sufficiently broken down. If a task takes longer than a sprint or is hard to estimate, it needs to be broken down. Watch out for “hey, while we’re here” additions to tasks since they expand the scope of tickets, and those tasks are a better fit for the backlog.
🛑 Separate the backlog from the icebox. The icebox is a list of tasks that don’t need to be done anytime soon but shouldn’t be forgotten about. This should be separate from the backlog, which should be a prioritized list of issues that engineers can pick up if there is slack in the sprint.
⚠️ Build refactoring into the sprint process. Refactoring is a normal part of software development, and specific time should be carved out in sprints for fixing areas of the codebase that “smell bad.” However, choosing what to refactor is a careful balance between developer happiness and productivity. While clunky or painful parts of the codebase should be refactored, the most important directive is to find product-market fit, not polish code with fine oils. When evaluating a refactoring project, make sure that the product of refactoring will definitely accelerate productivity and not simply be “nice to have.”
⚠️ Optionally, give sprints a theme. Make all the tasks in a sprint about a common goal and further team cohesion, knowledge-sharing, and camaraderie.
✅ Keep the current culture of code review via GitHub pull requests. Code reviews are an important way to facilitate knowledge-sharing in an engineering organization as well as ensure high code quality.
⚠️ Familiarize the team with good code review practices. It is important to strive for code review best practices because, despite all efforts, code reviews will still have an element of ego involved. It can be easy for a code review comment to affect an author or reviewer personally, especially less-experienced engineers. To mitigate this, decide on the precise goals of code reviews and make sure they are stated loudly and clearly.
Understand that code reviews exist to improve code quality and knowledge sharing. They do not exist to enforce a reviewer’s perfect ideals upon the engineer they’re reviewing. I didn’t get a sense that this was happening with the Company team, other than seeing one engineer who was a little more detail-oriented than probably necessary. However, it’s important to recognize that code reviews are where egos and feelings can start to interfere if left unchecked. Be on alert for any overzealous pedantry or exhaustion.
I strongly recommend the engineering team reads Michael Lynch’s How to Do Code Reviews Like a Human, parts one and two. These documents cover the interpersonal aspects of code reviews while still focusing on the business goals.
It may also be useful to read Google’s Code Review Best Practices simply for context. Before you adopt any of their best practices, however, understand that Google is an organization with tens of thousands of engineers and has infinite time to perfect every line of code. Google’s best practices may not be the best practices for a small startup.
⚠️ Possibly stop assigning every engineer to every PR. Instead, manually assign one or two reviewers who are generally responsible for the area of code that’s under change. This can help reinforce responsibility areas and expertise within the codebase. If an engineer gets overwhelmed with code reviews, then revisit the process to make sure effort is distributed fairly.
One way to enforce this is by using GitHub’s CODEOWNERS files, but this approach might be too heavyweight for this stage of the company.
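For reference, a CODEOWNERS file is only a few lines. The paths and usernames below are hypothetical and would need to match the actual repository layout:

```
# .github/CODEOWNERS — hypothetical paths and reviewers
/frontend/  @frontend-lead
/backend/   @backend-lead
```

GitHub then automatically requests a review from the matching owner whenever a PR touches those paths.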
🛑 Stop making PR comments that automated tooling could make instead. Commenting on style during pull requests is usually a waste of time and can exhaust both reviewers and authors.
The team is already using Prettier and ESLint, so those tools should be treated as the authorities on style. If the code passes Prettier and ESLint, whether in CI or preferably in a Git pre-commit hook, then the style is fine and further comments in a PR are unnecessary. If improper style makes it into a pull request, fix the linting infrastructure for future commits.
Encourage engineers to add auto-fixing tools to their editors. If the team is using VS Code, for example, look into the “fix on save” features for both Prettier and ESLint.
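One common way to wire up the pre-commit hook is husky plus lint-staged (I did not confirm either is installed, so treat this as a sketch). A package.json fragment like this auto-fixes staged files before every commit, so style issues never reach review:

```json
{
  "husky": {
    "hooks": {
      "pre-commit": "lint-staged"
    }
  },
  "lint-staged": {
    "*.{ts,tsx}": ["eslint --fix", "prettier --write"]
  }
}
```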
Remember that code polish doesn’t matter if the company fails and years of your hard work end up on a hard drive on someone’s shelf.
🛑 Add a testing workflow to the release process as soon as possible. Company’s products are proof that comprehensive tests aren’t required for early product development. However, as the codebase grows and the products change, the number of bugs and regressions will increase. Adding tests will save time by catching bugs earlier, reducing manual QA, and enabling confident releases.
Given the React and Node.js codebase, I would recommend the team use a combination of Jest, Testing Library, and Cypress: the three most common testing technologies in the State of JS 2020 survey. The test suite should run on every pull request and on every merge. For pull requests, test results should be reported next to the approvals, and PRs should not be merged until the tests pass. This should be easy to set up since the team is already using GitHub Actions.
Begin by adding tests to any known-problematic or important areas of the code, like payments. Then start the practice of adding tests along with every new feature. If bugs are uncovered in production, use them as an opportunity to add a test.
🛑 Automate deployments to staging and production with zero-downtime deployment. As soon as code is merged to the staging or production branches, the new codebase should be deployed immediately. This is already partially underway thanks to Bob’s migration from in-place deployments to AWS Fargate, which will handle this automation and allow zero-downtime deployments as new servers are spun up and traffic is migrated automatically.
The testing and feedback cycle for new features could be shortened by pushing continuously to the staging environment. If staging is pushed after each merge to the staging branch, Alice and Dennis may be able to test new features as they’re merged instead of waiting for Monday.
Avoid Kubernetes at this stage of the company. The added complication and operational overhead will only make deployments more difficult.
⚠️ Consider a monitoring solution for backend services and any other critical features. A monitoring and alerting system will become essential to accurately monitor the application’s availability and report when there’s downtime.
I would recommend starting with a simple monitoring system based on CloudWatch to ensure that backend and frontend systems are up and reliable. I would have these systems send alerts to a dedicated Slack channel. This is easy to set up, and sending everything to a Slack channel means it’s easy to build a timeline of events when outages occur.
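As one illustrative example (every name and ARN below is a placeholder), a CloudWatch alarm on backend 5xx errors can be defined with a single CLI call, with the alarm action pointing at an SNS topic that forwards to Slack:

```
aws cloudwatch put-metric-alarm \
  --alarm-name backend-5xx-errors \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:111111111111:engineering-alerts
```

The exact metric and thresholds will depend on the infrastructure that comes out of the Fargate migration.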
Of course, the use of reliable systems such as Fargate and S3 should mean that downtime is minimal. However, monitoring can catch things the team might never expect, such as DNS changes or service misconfigurations, so there’s still value in monitoring critical systems.
The team may wish to use StatusGator, a service that monitors over a thousand third-party systems, such as X, and alerts you when they’re down. These alerts can be added to Slack as well.
Since the team is so small and responsive, it might be too early to implement an escalation policy and on-call rotation. When that time comes and 24/7 coverage is needed, consider PagerDuty (preferred) or OpsGenie.
Technical Architecture Review
✅ The current tech stack choices are terrific and scalable. The choices should be good for near-future work and should allow for easy and effective hiring of additional engineers.
TypeScript for both backend and frontend is a solid choice that allows engineers to work on both sides of the application.
Postgres is an ideal database choice for application data, especially when managed through Amazon RDS. (However, note the Data Warehouse discussion below.)
GraphQL + Knex + Dataloader is one of the best solutions to build a TypeScript/Node/GraphQL API at the moment.
The use of GraphQL Code Generator to generate TypeScript types is great and allows rapid development of new GraphQL queries and mutations.
React is one of the best frontend frameworks to choose at the moment and will be for a long time. When the team is ready for mobile, it should be easy to build progressive web apps (PWAs) or native apps with React Native.
I was pleased to see reliance on Apollo Client and its caching mechanism, since it can often eliminate the need for a shared state management library such as Redux. However, be aware that Apollo has a habit of breaking APIs and leaving behind incomplete documentation. Also, mutations with Apollo can get tricky, so make sure the team is well-versed in Apollo Client’s best practices for mutations.
✅ Source code is organized and appears to be written well. During my reviews with Carla I noticed a sensible directory hierarchy and a reasonable number of comments. The code appeared to be easy to follow.
🛑 Strongly consider making a monorepo (single repository). Currently the code is split across multiple repositories. Putting all of the code into a single repository will allow single changes to change all affected parts of the code at once, will make searchability and discoverability much easier, and will make releases simpler. Consider using Yarn workspaces for a simple solution to share common code between components in a monorepo.
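The root package.json for a Yarn-workspaces monorepo is small; the packages/* layout below is just one possible arrangement:

```json
{
  "private": true,
  "name": "company-monorepo",
  "workspaces": ["packages/*"]
}
```

Shared code (types, utilities) then lives in its own workspace package that both the frontend and backend depend on, with no publishing step required.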
✅ Keep the monolithic service design for now. Avoid separating out services into microservices to minimize complication. At the moment, scalability problems can be solved by simply running more instances of the backend. Only consider microservices once there is a single-purpose component which absolutely must scale independently of the other services to meet customer needs.
✅ Keep leveraging no-code or low-code tools for easy tweaking. Retool is a fantastic choice to make low-code administrative or internal tools in much less time than building a standalone app.
🔵 There is a need to unify data across the application’s database, Mixpanel, and other sources. The solution is to build a data warehouse: an inexpensive database that is optimized for reporting.
Postgres, while being a wonderful piece of technology, should not be used for event logs and reporting in the medium or long term. As data grows, building reports will have drastically different data access patterns than the rest of the application, which will strain the database as it struggles to quickly serve data to the running application. For this reason, as well as the cost savings, it is best to move all reporting data to a database that’s a better fit for reporting and analytics. Since the infrastructure is already built on AWS, Redshift is probably the best choice for such a database.
Once the data needed for reports is in one database, it can be joined easily and presented using a BI/analytics software solution to build graphs and dashboards.
Copy all relevant application data needed for reports (e.g., X, Y, Z, payments) from Postgres to Redshift daily using an ETL provider, preferably Stitch Data. Stitch makes it trivially easy to efficiently clone data from one data source to another on a schedule, and I’ve used it in prior businesses with great success. Fivetran is another alternative.
Use Stitch Data to bring in third-party data, such as Mixpanel, to the warehouse. See their list of data sources to see where data can be pulled from.
If there are data sources that Stitch Data doesn’t support, consider using their Singer.io framework for building custom ETL pipelines. If the data is extremely large, consider putting it into S3 as JSON, CSV, Avro, or Parquet and using Redshift Spectrum to query that data from within Redshift, which will lower costs further.
Use a collaborative BI/analytics platform that is known to work well with small teams and allows self-service, such as Metabase (free, self-hosted), Apache Superset (free, self-hosted), or Chartio (SaaS). Whichever solution you pick should allow Alice and Dennis to build dashboards themselves with minimal work from the engineering team. Avoid platforms that are built for larger teams and require a dedicated data engineer, such as Looker and Tableau.
If you need help building any parts of this pipeline, I can refer you to a good consulting team that specializes in building data pipelines and BI/analytics deployments.
🔵 Custom authentication: If the team moves off of Auth0 and builds custom authentication, make sure to follow the practices in the OWASP Authentication Cheat Sheet. These will help prevent the team from making mistakes that would jeopardize customer data.
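As one concrete example of that guidance, password storage should use a per-password random salt, a memory-hard key derivation function, and a constant-time comparison. Node’s built-in crypto module covers all three; the parameters below are illustrative, and the OWASP cheat sheet should drive the final choices.

```typescript
// Sketch of salted password hashing with Node's built-in scrypt.
// Key length and salt size here are illustrative defaults.
import { scryptSync, randomBytes, timingSafeEqual } from "node:crypto";

function hashPassword(password: string): string {
  const salt = randomBytes(16); // unique random salt per password
  const hash = scryptSync(password, salt, 64); // memory-hard KDF
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  const hash = scryptSync(password, Buffer.from(saltHex, "hex"), 64);
  // Constant-time comparison prevents timing attacks.
  return timingSafeEqual(hash, Buffer.from(hashHex, "hex"));
}
```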
🔵 Deprecated NPM libraries: Watch out for deprecated libraries in the NPM ecosystem, such as Moment.js, and replace them if necessary.
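For instance, many common Moment.js uses can be replaced with zero dependencies via the built-in Intl API (date-fns and Luxon are good options for the rest):

```typescript
// Formatting with the built-in Intl API instead of Moment.js.
const formatter = new Intl.DateTimeFormat("en-US", {
  year: "numeric",
  month: "long",
  day: "numeric",
  timeZone: "UTC",
});

const label = formatter.format(new Date(Date.UTC(2021, 0, 15)));
// label is "January 15, 2021"
```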
🔵 Rate limiting: Consider adding rudimentary, in-memory rate limiting on all backend operations to prevent simple denial of service attacks, data enumeration, and surprise costs. An in-memory solution that works with Express should be sufficient. Make sure to set the rates higher than you would ever expect on your API endpoints. Make sure that alerts are triggered when users hit these limits.
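A fixed-window in-memory limiter fits in a few lines; the sketch below is dependency-free and illustrative (the limits are placeholders, and a maintained package such as express-rate-limit is a reasonable alternative):

```typescript
// Minimal in-memory fixed-window rate limiter sketch. Limits are placeholders.
type Window = { count: number; resetAt: number };

class RateLimiter {
  private windows = new Map<string, Window>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed, false once the client exceeds
  // the limit within the current window.
  allow(clientKey: string, now: number = Date.now()): boolean {
    const win = this.windows.get(clientKey);
    if (!win || now >= win.resetAt) {
      this.windows.set(clientKey, { count: 1, resetAt: now + this.windowMs });
      return true;
    }
    win.count += 1;
    return win.count <= this.limit;
  }
}
```

In Express this would run as middleware keyed on req.ip (or the authenticated user ID), returning HTTP 429 and firing an alert whenever allow() returns false.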
🔵 Next.js: Consider Next.js as an alternative to Create React App for new applications. The Vercel team has made great progress in the framework and using Next.js provides many advantages, such as reducing code and minimizing bundler configurations. A previous company I worked with switched from CRA to Next.js and was able to delete thousands of lines of code dedicated to URL routing and server-side rendering (SSR).
🔵 Consider Vercel (preferred) or Netlify for the static frontends. Continuous deployment is trivial to achieve with either platform, and each pull request will get its own URL for easy testing.
🔵 Consider Heroku if the Fargate migration doesn’t go smoothly. Heroku tends to be expensive over time, but deployments are simple and easy, and small and medium teams often find that Heroku’s price is a fraction of the cost of operational work from a full-time or even part-time engineer.
Overall, Company is in a great position technologically. The team has already made strong technology choices, and it has recognized deficiencies in its process and sought to improve them. Hopefully the above suggestions are useful; by implementing some of them, the Company team should be set up for success for years to come.